Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (take 4)

Previous thread: p54usb not working on DWL-G120 Spinnaker wireless card by d i on Monday, August 11, 2008 - 2:14 pm. (3 messages)

Next thread: [PATCH 1/1] iwlwifi: fix printk newlines by Jiri Slaby on Monday, August 11, 2008 - 2:49 pm. (7 messages)
From: Max Krasnyansky
Date: Monday, August 11, 2008 - 2:33 pm

This is an updated version of my previous cpuset patch on top of
the latest mainline git.
The patch fixes CPU hotplug handling issues in the current cpusets code.
Namely circular locking in rebuild_sched_domains() and unsafe access to
the cpu_online_map in the cpuset cpu hotplug handler.

This version includes changes suggested by Paul Jackson (naming, comments,
style, etc). I also got rid of the separate workqueue thread because it is
now safe to call get_online_cpus() from workqueue callbacks.

--
Here are some more details.

rebuild_sched_domains() is the only way to rebuild sched domains
correctly based on the current cpuset settings. What this means
is that we need to be able to call it from different contexts,
like cpu hotplug for example.
Also latest scheduler code in -tip now calls rebuild_sched_domains()
directly from functions like arch_reinit_sched_domains().

In order to support that properly we need to rework cpuset locking
rules to avoid circular dependencies, which is what this patch does.
New lock nesting rules are explained in the comments.
We can now safely call rebuild_sched_domains() from virtually any
context. The only requirement is that it needs to be called under
get_online_cpus(). This allows cpu hotplug handlers and the scheduler
to call rebuild_sched_domains() directly.
The rest of the cpuset code now offloads sched domains rebuilds to
a workqueue (async_rebuild_sched_domains()).

This version of the patch addresses comments from the previous review.
I fixed all miss-formated comments and trailing spaces.

I also factored out the code that builds domain masks and split up CPU and
memory hotplug handling. This was needed to simplify locking, to avoid unsafe
access to the cpu_online_map from mem hotplug handler, and in general to make
things cleaner.

The patch passes moderate testing (building kernel with -j 16, creating &
removing domains and bringing cpus off/online at the same time) on the
quad-core2 based machine.
It passes lockdep checks, ...
From: Max Krasnyansky
Date: Monday, August 11, 2008 - 2:38 pm

Minor correction. I meant unsafe access to the cpu_online_map in the _memory_
--

From: Rakib Mullick
Date: Monday, August 11, 2008 - 8:21 pm

One more thing Max, while you are allocating memory for "dattr"  I
think it  needs to check that it is successful or not . I've shown it
on one of my previous patch on 7th
--

From: Max Krasnyansky
Date: Monday, August 11, 2008 - 8:27 pm

I replied to your patch saying that we do not need to check it. And Paul
suggested to add a comment to explain why.

Max


--

From: Paul Jackson
Date: Monday, August 11, 2008 - 8:33 pm

Best not to quote an entire long message when only
a little bit of it is needed to provide context for
the reply.  It forces all readers to scan down many
pages trying to find the useful reply.

I agree with Max's reply to this ... his comment
explaining why checking dattr is not NULL looks good.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Jackson
Date: Monday, August 11, 2008 - 4:31 pm

Acked-by: Paul Jackson <pj@sgi.com>

... based on code reading and comparing with the
previous version - looks good.  Nice work, Max.
Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Ingo Molnar
Date: Thursday, August 14, 2008 - 2:24 am

applied to tip/sched/cpuset, thanks. (will show up in tip/sched/urgent 
as well soon, for v2.6.27 merging.)

	Ingo
--

From: Ingo Molnar
Date: Thursday, August 14, 2008 - 4:16 am

FYI, this new lockdep warning showed up in -tip testing, after i added 
this patch.

	Ingo

[   59.750200] =======================================================
[   59.750200] [ INFO: possible circular locking dependency detected ]
[   59.750200] 2.6.27-rc3-tip-00076-g75f9a29-dirty #1
[   59.750200] -------------------------------------------------------
[   59.750200] Xorg/6623 is trying to acquire lock:
[   59.750200]  (cpu_add_remove_lock){--..}, at: [<c0147dae>] cpu_maps_update_begin+0x14/0x16
[   59.750200] 
[   59.750200] but task is already holding lock:
[   59.750200]  (polldev_mutex){--..}, at: [<c07eb66f>] input_close_polled_device+0x22/0x47
[   59.750200] 
[   59.750200] which lock already depends on the new lock.
[   59.750200] 
[   59.750200] 
[   59.750200] the existing dependency chain (in reverse order) is:
[   59.750200] 
[   59.750200] -> #5 (polldev_mutex){--..}:
[   59.750200]        [<c01632e3>] __lock_acquire+0x848/0x9ab
[   59.750200]        [<c01634b6>] lock_acquire+0x70/0x97
[   59.750200]        [<c09eaf41>] __mutex_lock_common+0x8a/0x278
[   59.750200]        [<c09eb15d>] mutex_lock_interruptible_nested+0x2e/0x35
[   59.750200]        [<c07eb745>] input_open_polled_device+0x1c/0xa3
[   59.750200]        [<c07e8f84>] input_open_device+0x5a/0x86
[   59.750200]        [<c07eded0>] evdev_open+0x103/0x14e
[   59.750200]        [<c07ea620>] input_open_file+0x44/0x60
[   59.750200]        [<c019ff1c>] chrdev_open+0x106/0x11d
[   59.750200]        [<c019c269>] __dentry_open+0x119/0x1f0
[   59.750200]        [<c019c364>] nameidata_to_filp+0x24/0x38
[   59.750200]        [<c01a6952>] do_filp_open+0x309/0x5b2
[   59.750200]        [<c019c04e>] do_sys_open+0x47/0xc1
[   59.750200]        [<c019c114>] sys_open+0x23/0x2b
[   59.750200]        [<c011bb2f>] sysenter_do_call+0x12/0x43
[   59.750200]        [<ffffffff>] 0xffffffff
[   59.750200] 
[   59.750200] -> #4 (&dev->mutex){--..}:
[   59.750200]        [<c01632e3>] __lock_acquire+0x848/0x9ab
[   ...
From: Max Krasnyansky
Date: Thursday, August 14, 2008 - 11:21 am

Hmm, unless I'm missing something this one is unrelated. There are no cpu
hotplug, sched or cpusets paths in the trace besides cpu_maps_update_begin().
But that one is taken in the regular destroy_workqueue() path.

The issues is polldev_mutex and cpu_add_remove_lock nesting. I bet you can
trigger that without cpusets. CC'ing Dmitry Torokhov.

btw It seems to be triggered when X closes polled input device.
There aren't that many of them
   $ git grep input_register_polled_device | wc -l
   16
What do you have hooked up to the test box ?

--

From: Ingo Molnar
Date: Tuesday, August 19, 2008 - 11:44 pm

Dmitry,


See the bootlog below. Upstream base of the tree below is 
1fca254274823876. The first warnings came with b635acec48.

	Ingo

------------------>
Initializing cgroup subsys cpuset
Linux version 2.6.27-rc3-tip-00607-g3c311a4-dirty (mingo@europe) (gcc version 4.2.2) #1 SMP Mon Aug 18 21:31:41 CEST 2008
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved)
 BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007f6e0000 (usable)
 BIOS-e820: 000000007f6e0000 - 000000007f6f3000 (ACPI data)
 BIOS-e820: 000000007f6f3000 - 000000007f700000 (ACPI NVS)
 BIOS-e820: 000000007f700000 - 0000000080000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
 BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
 BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved)
last_pfn = 0x7f6e0 max_arch_pfn = 0x100000
kernel direct mapping tables up to 38000000 @ 7000-c000
DMI present.
ACPI: RSDP 000F68A0, 0024 (r2 LENOVO)
ACPI: XSDT 7F6E631F, 0074 (r1 LENOVO TP-79        1020  LTP        0)
ACPI: FACP 7F6E6400, 00F4 (r3 LENOVO TP-79        1020 LNVO        1)
ACPI Warning (tbfadt-0442): Optional field "Gpe1Block" has zero address or length: 000000000000102C/0 [20080609]
ACPI: DSDT 7F6E65E7, C765 (r1 LENOVO TP-79        1020 MSFT  100000E)
ACPI: FACS 7F6F4000, 0040
ACPI: SSDT 7F6E65B4, 0033 (r1 LENOVO TP-79        1020 MSFT  100000E)
ACPI: ECDT 7F6F2D4C, 0052 (r1 LENOVO TP-79        1020 LNVO        1)
ACPI: TCPA 7F6F2D9E, 0032 (r2 LENOVO TP-79        1020 LNVO        1)
ACPI: APIC 7F6F2DD0, 0068 (r1 LENOVO TP-79        1020 LNVO        ...
From: Dmitry Torokhov
Date: Wednesday, August 20, 2008 - 6:08 am

Ingo,

I am not quite sure why lockdep complains. From the input POV
everything is pretty simple - we use polldev_mutex to start ipolldevd
workqueue on demand, when first user appears, and also to destroy
workqueue after last iser is gone.

Is it reporoducible for you? Is it only with your tree or mainline as
well?

-- 
Dmitry
--

Previous thread: p54usb not working on DWL-G120 Spinnaker wireless card by d i on Monday, August 11, 2008 - 2:14 pm. (3 messages)

Next thread: [PATCH 1/1] iwlwifi: fix printk newlines by Jiri Slaby on Monday, August 11, 2008 - 2:49 pm. (7 messages)