This is an updated version of my previous cpuset patch on top of the latest mainline git. The patch fixes CPU hotplug handling issues in the current cpusets code. Namely circular locking in rebuild_sched_domains() and unsafe access to the cpu_online_map in the cpuset cpu hotplug handler. This version includes changes suggested by Paul Jackson (naming, comments, style, etc). I also got rid of the separate workqueue thread because it is now safe to call get_online_cpus() from workqueue callbacks. -- Here are some more details. rebuild_sched_domains() is the only way to rebuild sched domains correctly based on the current cpuset settings. What this means is that we need to be able to call it from different contexts, like cpu hotplug for example. Also latest scheduler code in -tip now calls rebuild_sched_domains() directly from functions like arch_reinit_sched_domains(). In order to support that properly we need to rework cpuset locking rules to avoid circular dependencies, which is what this patch does. New lock nesting rules are explained in the comments. We can now safely call rebuild_sched_domains() from virtually any context. The only requirement is that it needs to be called under get_online_cpus(). This allows cpu hotplug handlers and the scheduler to call rebuild_sched_domains() directly. The rest of the cpuset code now offloads sched domains rebuilds to a workqueue (async_rebuild_sched_domains()). This version of the patch addresses comments from the previous review. I fixed all miss-formated comments and trailing spaces. I also factored out the code that builds domain masks and split up CPU and memory hotplug handling. This was needed to simplify locking, to avoid unsafe access to the cpu_online_map from mem hotplug handler, and in general to make things cleaner. The patch passes moderate testing (building kernel with -j 16, creating & removing domains and bringing cpus off/online at the same time) on the quad-core2 based machine. It passes lockdep checks, ...
Minor correction. I meant unsafe access to the cpu_online_map in the _memory_ --
One more thing Max, while you are allocating memory for "dattr" I think it needs to check that it is successful or not . I've shown it on one of my previous patch on 7th --
I replied to your patch saying that we do not need to check it. And Paul suggested to add a comment to explain why. Max --
Best not to quote an entire long message when only
a little bit of it is needed to provide context for
the reply. It forces all readers to scan down many
pages trying to find the useful reply.
I agree with Max's reply to this ... his comment
explaining why checking dattr is not NULL looks good.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Acked-by: Paul Jackson <pj@sgi.com>
... based on code reading and comparing with the
previous version - looks good. Nice work, Max.
Thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
applied to tip/sched/cpuset, thanks. (will show up in tip/sched/urgent as well soon, for v2.6.27 merging.) Ingo --
FYI, this new lockdep warning showed up in -tip testing, after i added
this patch.
Ingo
[ 59.750200] =======================================================
[ 59.750200] [ INFO: possible circular locking dependency detected ]
[ 59.750200] 2.6.27-rc3-tip-00076-g75f9a29-dirty #1
[ 59.750200] -------------------------------------------------------
[ 59.750200] Xorg/6623 is trying to acquire lock:
[ 59.750200] (cpu_add_remove_lock){--..}, at: [<c0147dae>] cpu_maps_update_begin+0x14/0x16
[ 59.750200]
[ 59.750200] but task is already holding lock:
[ 59.750200] (polldev_mutex){--..}, at: [<c07eb66f>] input_close_polled_device+0x22/0x47
[ 59.750200]
[ 59.750200] which lock already depends on the new lock.
[ 59.750200]
[ 59.750200]
[ 59.750200] the existing dependency chain (in reverse order) is:
[ 59.750200]
[ 59.750200] -> #5 (polldev_mutex){--..}:
[ 59.750200] [<c01632e3>] __lock_acquire+0x848/0x9ab
[ 59.750200] [<c01634b6>] lock_acquire+0x70/0x97
[ 59.750200] [<c09eaf41>] __mutex_lock_common+0x8a/0x278
[ 59.750200] [<c09eb15d>] mutex_lock_interruptible_nested+0x2e/0x35
[ 59.750200] [<c07eb745>] input_open_polled_device+0x1c/0xa3
[ 59.750200] [<c07e8f84>] input_open_device+0x5a/0x86
[ 59.750200] [<c07eded0>] evdev_open+0x103/0x14e
[ 59.750200] [<c07ea620>] input_open_file+0x44/0x60
[ 59.750200] [<c019ff1c>] chrdev_open+0x106/0x11d
[ 59.750200] [<c019c269>] __dentry_open+0x119/0x1f0
[ 59.750200] [<c019c364>] nameidata_to_filp+0x24/0x38
[ 59.750200] [<c01a6952>] do_filp_open+0x309/0x5b2
[ 59.750200] [<c019c04e>] do_sys_open+0x47/0xc1
[ 59.750200] [<c019c114>] sys_open+0x23/0x2b
[ 59.750200] [<c011bb2f>] sysenter_do_call+0x12/0x43
[ 59.750200] [<ffffffff>] 0xffffffff
[ 59.750200]
[ 59.750200] -> #4 (&dev->mutex){--..}:
[ 59.750200] [<c01632e3>] __lock_acquire+0x848/0x9ab
[ ...Hmm, unless I'm missing something this one is unrelated. There are no cpu hotplug, sched or cpusets paths in the trace besides cpu_maps_update_begin(). But that one is taken in the regular destroy_workqueue() path. The issues is polldev_mutex and cpu_add_remove_lock nesting. I bet you can trigger that without cpusets. CC'ing Dmitry Torokhov. btw It seems to be triggered when X closes polled input device. There aren't that many of them $ git grep input_register_polled_device | wc -l 16 What do you have hooked up to the test box ? --
Dmitry, See the bootlog below. Upstream base of the tree below is 1fca254274823876. The first warnings came with b635acec48. Ingo ------------------> Initializing cgroup subsys cpuset Linux version 2.6.27-rc3-tip-00607-g3c311a4-dirty (mingo@europe) (gcc version 4.2.2) #1 SMP Mon Aug 18 21:31:41 CEST 2008 BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009f000 (usable) BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved) BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved) BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 000000007f6e0000 (usable) BIOS-e820: 000000007f6e0000 - 000000007f6f3000 (ACPI data) BIOS-e820: 000000007f6f3000 - 000000007f700000 (ACPI NVS) BIOS-e820: 000000007f700000 - 0000000080000000 (reserved) BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved) BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved) BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved) BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved) BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved) BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved) last_pfn = 0x7f6e0 max_arch_pfn = 0x100000 kernel direct mapping tables up to 38000000 @ 7000-c000 DMI present. ACPI: RSDP 000F68A0, 0024 (r2 LENOVO) ACPI: XSDT 7F6E631F, 0074 (r1 LENOVO TP-79 1020 LTP 0) ACPI: FACP 7F6E6400, 00F4 (r3 LENOVO TP-79 1020 LNVO 1) ACPI Warning (tbfadt-0442): Optional field "Gpe1Block" has zero address or length: 000000000000102C/0 [20080609] ACPI: DSDT 7F6E65E7, C765 (r1 LENOVO TP-79 1020 MSFT 100000E) ACPI: FACS 7F6F4000, 0040 ACPI: SSDT 7F6E65B4, 0033 (r1 LENOVO TP-79 1020 MSFT 100000E) ACPI: ECDT 7F6F2D4C, 0052 (r1 LENOVO TP-79 1020 LNVO 1) ACPI: TCPA 7F6F2D9E, 0032 (r2 LENOVO TP-79 1020 LNVO 1) ACPI: APIC 7F6F2DD0, 0068 (r1 LENOVO TP-79 1020 LNVO ...
Ingo, I am not quite sure why lockdep complains. From the input POV everything is pretty simple - we use polldev_mutex to start ipolldevd workqueue on demand, when first user appears, and also to destroy workqueue after last iser is gone. Is it reporoducible for you? Is it only with your tree or mainline as well? -- Dmitry --
