Re: [PATCH 7/8] Clean up workqueue.c with respect to the freezer based cpu-hotplug

Previous thread: none

Next thread: [patch 02/17] Remove CONFIG_DEBUG_PARAVIRT by Jeremy Fitzhardinge on Sunday, April 1, 2007 - 10:56 pm. (1 message)
From: Gautham R Shenoy
Date: Sunday, April 1, 2007 - 10:34 pm

Hello Everybody, 

This is another attempt towards process-freezer based cpu-hotplug.
This patchset covers just about everything that was discussed on
the LKML with respect to the freezer-based cpu-hotplug.

Following are new features from the version I last posted:

- Enhancements to the freezer interface so that
  it can be used for events other than Software suspend.

- A reentrant process freezer. Software suspend needs since it uses
  cpu-hotplug.

- All non-singlethreaded workqueues freezeable by default.

And yes, this time around, I have some numbers to give an idea 
of the performance.

Test m/c configuration: 4-way i386 Xeon Box with 2.5G Memory.
Test case: 'make -jN' running parallely with cpu-offline/online operations
	    in a tight loop.

With N=256.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Kernel: linux-2.6.21-rc5-mm3 vanilla
---------------------------------------------------------
Maximum time taken for a hotplug operation	:  1m5.357s
Minimum time taken for a hotplug operation	:  0m0.288s
Average number of attempts to offline AND Online
cpu1, cpu2 and cpu3				: 6
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Kernel: linux-2.6.21-rc5-mm3 + the patchset.
---------------------------------------------------------
Maximum time taken for a hotplug operation	: 0m1s.261s
Minimum time taken for a hotplug operation	: 0m0s.808s
Average number of attempts to offline and online
cpu1, cpu2 and cpu3				: 14
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The numbers look ok.

However, on an average it took around 14 attempts to offline AND online
cpu1,cpu2, cpu3 when otherwise it should have taken only 6.

The failures were due to the freezer failing to freeze all the tasks
within the timeout period (20sec).

As N increased (512, 1024, 2048 and 8192), even the average number
of attempts to online AND offline the 3 cpus increased.

At this point of time, I worked towards ensuring that cpu-hotplug works
reliably ...
From: Gautham R Shenoy
Date: Sunday, April 1, 2007 - 10:37 pm

This patch provides an interface to extend the use of the process
 freezer beyond Suspend.

The tasks can selectively mark themselves to be exempted from specific
freeze events like SUSPEND /KPROBES/CPU_HOTPLUG.

This patch however, *does not* sort non freezable threads into
different categories based on the freeze events. Thus all 
tasks which were previously marked PF_NOFREEZE are now
exempted from freezer using 
	freezer_exempt(FE_ALL);
which means exempt from all kinds of freezes.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>

--
 arch/i386/kernel/apm.c              |    2 -
 drivers/block/loop.c                |    2 -
 drivers/char/apm-emulation.c        |    6 ++--
 drivers/ieee1394/ieee1394_core.c    |    2 -
 drivers/md/md.c                     |    2 -
 drivers/mmc/card/queue.c            |    3 +-
 drivers/mtd/mtd_blkdevs.c           |    3 +-
 drivers/scsi/libsas/sas_scsi_host.c |    2 -
 drivers/scsi/scsi_error.c           |    2 -
 drivers/usb/storage/usb.c           |    2 -
 include/linux/freezer.h             |   52 ++++++++++++++++++++++++++++++++----
 include/linux/sched.h               |   15 +++++++++-
 init/do_mounts_initrd.c             |    2 -
 kernel/fork.c                       |    2 -
 kernel/kprobes.c                    |    4 +-
 kernel/power/disk.c                 |    4 +-
 kernel/power/main.c                 |    6 ++--
 kernel/power/process.c              |   29 +++++++++++---------
 kernel/power/user.c                 |    8 ++---
 kernel/rcutorture.c                 |    4 +-
 kernel/sched.c                      |    2 -
 kernel/softirq.c                    |    2 -
 kernel/softlockup.c                 |    2 -
 kernel/workqueue.c                  |    2 -
 24 files changed, 110 insertions(+), 50 deletions(-)

Index: ...
From: Pavel Machek
Date: Monday, April 2, 2007 - 6:56 am

Hmmm, I do not really like softlockup watchdog running during suspend.
Can we make this freezeable and make watchdog shut itself off while



Same here.


Does this mean we have userland /linuxrc running with PF_NOFREEZE?

Eh? Why does kprobes code depend on config_pm?

						Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Rafael J. Wysocki
Date: Monday, April 2, 2007 - 1:48 pm

Why do you think so?  I think the freezer should be compiled automatically

Generally, I agree, but this patch only replaces the existing instances
of PF_NOFREEZE with the new mechanism.  The changes you're talking about
require a separate patch series (or at least one separate patch), I think, and

No, actually it is _required_ for the userland resume to work.  Well, perhaps
I should place a comment in there so that I don't have to explain this again

Because it uses the freezer? ;-)

Greetings,
Rafael
-

From: Gautham R Shenoy
Date: Tuesday, April 3, 2007 - 12:59 am

Is that why?! Then I guess we can remove it. Because the freezer is

thanks
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Pavel Machek
Date: Monday, April 2, 2007 - 1:51 pm

(Actually it would be nice to say

/* This thread should not be frozen for suspend, becuase it is needed

Kconfig can do that. ("select statement"). If we have one such ifdef,


Can you put big bold comment there?


That is no longer true after this patch... Ugly ifdef above makes sure
freezer is there for kprobes. I'm trying to say that #if above is
now broken. Actually it was probably always broken, but it just became
more so.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Rafael J. Wysocki
Date: Friday, April 6, 2007 - 7:34 am

No, it doesn't, but this task spawns linuxrc and then calls sys_wait4() in a
loop.

Well, actually, I'll try to plant try_to_freeze() in this loop and see if that

Agreed.

Greetings,
Rafael
-

From: Nigel Cunningham
Date: Friday, April 6, 2007 - 3:20 pm

Hi.


It works. I've had:

                while (pid != sys_wait4(-1, NULL, 0, NULL)) {
                        yield();
                        try_to_freeze();
                }

there for ages for Suspend2.

Regards,

Nigel

-

From: Rafael J. Wysocki
Date: Saturday, April 7, 2007 - 2:33 am

OK, thanks.  Is there any particular reason to place try_to_freeze() after
yield()?

Greetings,
Rafael
-

From: Nigel Cunningham
Date: Saturday, April 7, 2007 - 2:47 am

Hi.


Not that I remember. I haven't touched that for years :)

Nigel

-

From: Gautham R Shenoy
Date: Sunday, April 8, 2007 - 8:04 pm

I have already removed it from in my version 3.

Thanks and Regards
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Oleg Nesterov
Date: Thursday, April 5, 2007 - 2:46 am

So, a kernel_thread should call freezer_exempt(FE_XXX) somewhere at the
beginning if it doesn't want to be considered as freezeable. But what if
the freezing is already in progress? In that case freezer_exempt() should
somehow clear TIF_FREEZE (if exempt_freeze_event doesn't match freeze_event
parameter of freeze_processes()), otherwise we may hit a nasty bug, much
worse than a freezing failure (which could be restarted).

try_to_freeze_tasks() succeeds because the task is !freezeable(), the
task goes to refrigerator (while it should not), thaw_tasks() ignores
process and it stays frozen.

Alternatively, we can do a re-check in refrigerator() to fix this race.

This is a real nitpick, but it was hard to me to understand this change.
Because it looks as if we have a subtle reason to set TASK_INTERRUPTIBLE
before freezer_exempt(). Unless I missed something, I'd suggest to move
freezer_exempt() up, before set_current_state().

The same for apm_mainloop().

Oleg.

-

From: Gautham R Shenoy
Date: Thursday, April 5, 2007 - 3:59 am

That's a nice catch! The problem still exists in the current -mm with


In the patch 2/8, I am maintaining a global system_freeze_event_mask.
The reads and writes to this mask are serialized by freezer_mutex.
But the mutex is held only in the freeze_processes() and
thaw_processes() path.

So the check in refrigerator() which is not in the freeze_processes()/
thaw_processes() path might need some more thought. Will cook up

Ok, no subtle reasons for freezer_exempt()ing after set_current_state().
So no problems changing the order. But (just curious), is there any specific

Thanks for the review.
Regards
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Oleg Nesterov
Date: Thursday, April 5, 2007 - 4:30 am

No, no, it was just a nitpick :) May be this is just me, but when I see the
code like

	set_current_state(TASK_XXX);
	something_which_doesnt_need_TASK_XXX();

, I can't read the code further, trying to understand where I was wrong
and why do we need to change task->state here.

Oleg.

-

From: Gautham R Shenoy
Date: Sunday, April 1, 2007 - 10:37 pm

This patch adds provision to make the process freezer reentrant 
for different kinds of freeze events.

Credit to Rafael Wysocki for the system_freeze_event_mask idea.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rjw@sisk.pl>

--
 kernel/power/process.c |   57 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 50 insertions(+), 7 deletions(-)

Index: linux-2.6.21-rc3/kernel/power/process.c
===================================================================
--- linux-2.6.21-rc3.orig/kernel/power/process.c
+++ linux-2.6.21-rc3/kernel/power/process.c
@@ -23,6 +23,13 @@
 
 #define FREEZER_KERNEL_THREADS 0
 #define FREEZER_USER_SPACE 1
+/*
+ * Global mask of the events (think suspend, cpu-hotplug, etc) for
+ * which the services of the processes freezer have been
+ * currently employed.
+ */
+static unsigned long system_freeze_event_mask;
+static DEFINE_MUTEX(freezer_mutex);
 
 static inline int freezeable(struct task_struct * p, unsigned long freeze_event)
 {
@@ -33,6 +40,25 @@ static inline int freezeable(struct task
 	return 1;
 }
 
+/* Returns the mask of the events for which p is currently frozen */
+static inline int process_frozen_event_mask(struct task_struct *p)
+{
+	return freezeable_event_mask(p) & system_freeze_event_mask;
+}
+
+static inline int thawable(struct task_struct *p, unsigned long freeze_event)
+{
+	if (!freezeable(p, freeze_event))
+		return 0;
+
+	/* p can be thawed iff the process p has been
+	 * frozen for freeze_event alone.
+	 */
+	if (process_frozen_event_mask(p) & ~freeze_event)
+		return 0;
+
+	return 1;
+}
 /* Refrigerator is place where frozen processes are stored :-). */
 void refrigerator(void)
 {
@@ -183,20 +209,30 @@ static unsigned int try_to_freeze_tasks(
 int freeze_processes(unsigned long freeze_event)
 {
 	unsigned int nr_unfrozen;
-
+	int ret = 0;
+	mutex_lock(&freezer_mutex);
+	if (system_freeze_event_mask & freeze_event)
+		goto ...
From: Oleg Nesterov
Date: Thursday, April 5, 2007 - 2:53 am

I am not sure this is correct. Suppose that system_freeze_event_mask == FE_SUSPEND,
now freeze_processes(FE_ALL) returns success. Shouldn't we return an error?

Oleg.

-

From: Gautham R Shenoy
Date: Thursday, April 5, 2007 - 3:19 am

Hmm, may be we are getting confused with the word 'reentrant' here.
The idea behind this API was to freeze the system for one event
only at any given time.

However, we can have nested freeze_processes() call for different events.

Something like 
freeze_processes(FE_SUSPEND);

/* Do something */

	freeze_processes(FE_HOTPLUG_CPU);

	/* hotplug cpus */

	thaw_processes(FE_HOTPLUG_CPU);

/* Do something more */

thaw_processes(FE_SUSPEND);

So ideally no one is supposed to make a call like
freeze_processes(FE_SUSPEND | FE_HOTPLUG_CPU);
OR
freeze_processes(FE_ALL);


Thanks
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Gautham R Shenoy
Date: Sunday, April 1, 2007 - 10:38 pm

This patch implements process_freezer based cpu-hotplug
core. 
The sailent features are:
o No more (un)lock_cpu_hotplug.
o No more CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE. Hence no per-subsystem
  hotcpu mutexes.
o Calls freeze_process/thaw_processes at the beginning/end of
  the hotplug operation.

Lessons Learnt:
o CPU_UP_PREPARE has to be called with the processes frozen.
  If implemented otherwise, the kernel threads which we create
  during UP_PREPARE will never be frozen during freeze_processes
  since we them up only during CPU_ONLIN.

Points to ponder: 
o Can the SYSTEM_RUNNING hack in _cpu_up be avoided by some cleaner means.

Signed-off-by : Srivatsa Vaddagiri <vatsa@in.ibm.com>
Signed-off-by : Gautham R Shenoy   <ego@in.ibm.com>
Signed-off-by : Rafael J. Wysocki  <rjw@sisk.pl> 
--
 include/linux/freezer.h  |   10 ++++--
 include/linux/notifier.h |    2 -
 include/linux/sched.h    |    5 ++-
 kernel/cpu.c             |   70 ++++++++++++++++++-----------------------------
 kernel/sched.c           |   20 +------------
 kernel/softirq.c         |   13 ++++----
 kernel/softlockup.c      |    1 
 kernel/workqueue.c       |   33 ++++++++++------------
 mm/slab.c                |    6 ----
 9 files changed, 63 insertions(+), 97 deletions(-)

Index: linux-2.6.21-rc5/include/linux/sched.h
===================================================================
--- linux-2.6.21-rc5.orig/include/linux/sched.h
+++ linux-2.6.21-rc5/include/linux/sched.h
@@ -1195,10 +1195,13 @@ static inline void put_task_struct(struc
 #define PF_FE_KPROBES	0x00000010	/* This thread should not be frozen
 					 * for Kprobes
 					 */
+#define PF_FE_HOTPLUG_CPU 0x00000020	/* Thread should not be frozen for
+					 * cpu hotplug.
+					 */
 
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 
-#define PF_FE_ALL	(PF_FE_SUSPEND | PF_FE_KPROBES)
+#define PF_FE_ALL	(PF_FE_SUSPEND | PF_FE_KPROBES | PF_FE_HOTPLUG_CPU)
 					/* Exempt from all kinds freeze chills*/
 
 #define ...
From: Oleg Nesterov
Date: Thursday, April 5, 2007 - 3:53 am

Off-topic. This is a common pattern. Perhaps it makes sense to

Why this change? We are doing kthread_stop() + set_cpus_allowed() on
return. Imho,

		if (IS_ERR(p))
			goto out_allowed;
		goto out_thread;


As it was already discussed, this is racy. As Srivatsa (imho rightly)
suggested, kthread_stop(p) should thaw process itself. This also allows
us to kill at least some of wait_for_die loops.

However, the change in kthread_stop(p) in not enough to close the race.
We can check kthread_should_stop() in refrigerator(), this looks like
a most simple approach for now.

Alternatively, Srivatsa suggests to introduce a new task_lock() protected
task_struct->freezer_state (so we can reliably set FE_ALL). Surely this is
more poweful, but needs more changes. I am not sure. Perhaps we can do
this later.

In any case, imho "try3" should add thaw_process() to kthread_stop().
Gautham, Srivatsa, do you agree?

Oleg.

-

From: Gautham R Shenoy
Date: Thursday, April 5, 2007 - 5:14 am

Yes, that looks feasible and nice. But I remember making this change for

Well, in this case this is not racy. Remember, we're doing a
thaw_process(p) in CPU_DEAD where p *is* frozen for cpu hotplug. So
the where we might call a freeze_process(p) after we do a thaw_process
doesn't seem to be feasible.


Why the check kthread_should_stop() refrigerator() ?
As vatsa mentioned, we would be doing 

task_lock(p);
freezer_exempt(p, FE_ALL); /* Doesn't exist as of now, but we can work
				it out */
thaw_process(p);
task_unlock(p);

wait_for_completion();


This needs an extra field! We're supposed to be miserly when it comes to


-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Oleg Nesterov
Date: Thursday, April 5, 2007 - 6:34 am

From: Nathan Lynch
Date: Friday, April 6, 2007 - 10:27 am

If I'm understanding correctly, this will cause

# echo 0 > /sys/devices/system/cpu/cpuX/online

to sometimes fail, and userspace is expected to try again?  This will
break existing applications.

Perhaps drivers/base/cpu.c:store_online should retry as long as
cpu_up/down return -EBUSY.  That would avoid a userspace-visible
interface change.
-

From: Ingo Molnar
Date: Friday, April 6, 2007 - 10:34 am

yeah. I'd even suggest a freeze_processes_nofail() API instead, that 
does this internally, without burdening the callsites. (and once the 
freezer becomes complete then freeze_processes_nofail() == 
freeze_processes())

	Ingo
-

From: Nathan Lynch
Date: Friday, April 6, 2007 - 10:47 am

Yeah, I just realized that an implementation of my proposal would busy
loop in the kernel forever if a silly admin tried to offline the last
cpu (we're already using -EBUSY for that case), so
freeze_processes_nofail is a better idea :-)

-

From: Nigel Cunningham
Date: Friday, April 6, 2007 - 3:22 pm

Hi.


If there's only one online cpu, shouldn't it return -EINVAL?

Regards,

Nigel

-

From: Pavel Machek
Date: Saturday, April 14, 2007 - 11:48 am

Not sure if we _can_ do freeze_processes_nofail(). If something is
wrong (process in D state forever because of driver bug?), it looks
better to return error to userspace than looping forever.

You may want to pass higher timeout than 20sec. But if you can't
freeze everything in 1hour, it is unlikely to ever succeed.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Gautham R Shenoy
Date: Sunday, April 1, 2007 - 10:39 pm

This patch rips out lock_cpu_hotplug from the kernel.
Good Riddance!! (hopefully :) )

Signed-off-by : Gautham R Shenoy <ego@in.ibm.com>
--
 arch/i386/kernel/cpu/mtrr/main.c             |    6 ------
 arch/i386/kernel/microcode.c                 |    8 --------
 arch/mips/kernel/mips-mt.c                   |    5 -----
 arch/powerpc/platforms/pseries/hotplug-cpu.c |    5 -----
 arch/powerpc/platforms/pseries/rtasd.c       |    4 ----
 include/linux/cpu.h                          |   20 --------------------
 kernel/cpu.c                                 |   17 -----------------
 kernel/rcutorture.c                          |    3 ---
 kernel/stop_machine.c                        |    3 ---
 net/core/flow.c                              |    2 --
 10 files changed, 73 deletions(-)

Index: linux-2.6.21-rc4/include/linux/cpu.h
===================================================================
--- linux-2.6.21-rc4.orig/include/linux/cpu.h
+++ linux-2.6.21-rc4/include/linux/cpu.h
@@ -89,18 +89,6 @@ extern struct sysdev_class cpu_sysdev_cl
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
-static inline void cpuhotplug_mutex_lock(struct mutex *cpu_hp_mutex)
-{
-	mutex_lock(cpu_hp_mutex);
-}
-
-static inline void cpuhotplug_mutex_unlock(struct mutex *cpu_hp_mutex)
-{
-	mutex_unlock(cpu_hp_mutex);
-}
-
-extern void lock_cpu_hotplug(void);
-extern void unlock_cpu_hotplug(void);
 #define hotcpu_notifier(fn, pri) {				\
 	static struct notifier_block fn##_nb =			\
 		{ .notifier_call = fn, .priority = pri };	\
@@ -113,14 +101,6 @@ int cpu_down(unsigned int cpu);
 
 #else		/* CONFIG_HOTPLUG_CPU */
 
-static inline void cpuhotplug_mutex_lock(struct mutex *cpu_hp_mutex)
-{ }
-static inline void cpuhotplug_mutex_unlock(struct mutex *cpu_hp_mutex)
-{ }
-
-#define lock_cpu_hotplug()	do { } while (0)
-#define unlock_cpu_hotplug()	do { } while (0)
-#define lock_cpu_hotplug_interruptible() 0
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 #define ...
From: Gautham R Shenoy
Date: Sunday, April 1, 2007 - 10:40 pm

Currently i386 and x86_64 __cpu_up uses the services of the kevents
workqueue to bring the cpu up. Change this and use kthread workqueue
instead which is single_threaded and won't be frozen during 
CPU_HOTPLUG.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
--
 arch/i386/kernel/smpboot.c   |    3 ++-
 arch/x86_64/kernel/smpboot.c |    5 +++--
 include/linux/kthread.h      |    5 +++--
 kernel/kthread.c             |   16 +++++++++++++---
 4 files changed, 21 insertions(+), 8 deletions(-)

Index: linux-2.6.21-rc5/include/linux/kthread.h
===================================================================
--- linux-2.6.21-rc5.orig/include/linux/kthread.h
+++ linux-2.6.21-rc5/include/linux/kthread.h
@@ -3,7 +3,7 @@
 /* Simple interface for creating and stopping kernel threads without mess. */
 #include <linux/err.h>
 #include <linux/sched.h>
-
+#include <linux/workqueue.h>
 struct task_struct *kthread_create(int (*threadfn)(void *data),
 				   void *data,
 				   const char namefmt[], ...);
@@ -29,5 +29,6 @@ struct task_struct *kthread_create(int (
 void kthread_bind(struct task_struct *k, unsigned int cpu);
 int kthread_stop(struct task_struct *k);
 int kthread_should_stop(void);
-
+extern void kthreadwq_queue_work(struct work_struct *w);
+extern int kthreadwq_up(void);
 #endif /* _LINUX_KTHREAD_H */
Index: linux-2.6.21-rc5/kernel/kthread.c
===================================================================
--- linux-2.6.21-rc5.orig/kernel/kthread.c
+++ linux-2.6.21-rc5/kernel/kthread.c
@@ -112,7 +112,7 @@ static int kthread(void *_create)
 	return 0;
 }
 
-/* We are keventd: create a thread. */
+/* We are keventd: create a thread. Hmm, Are we?? */
 static void keventd_create_kthread(struct work_struct *work)
 {
 	struct kthread_create_info *create =
@@ -132,6 +132,16 @@ static void keventd_create_kthread(struc
 	complete(&create->done);
 }
 
+void kthreadwq_queue_work(struct work_struct ...
From: Oleg Nesterov
Date: Thursday, April 5, 2007 - 5:08 am

Off-topic question: can't kernel/kmod.c use this workqueue too instead
of its own khelper_wq?

Oleg.

-

From: Gautham R Shenoy
Date: Sunday, April 1, 2007 - 10:41 pm

This patch
o Makes all non-singlethreaded workqueues freezeable by default.
o Introduces a new API for creating freeze_exempted workqueues.
o Uses the combination of cancel_delayed_work and cancel_work_sync
  in Slab during DOWN_PREPARE instead of cancel_rearming_delayed work, 
  which tries to flush_workqueue in vain, in a frozen context.

Note:
o The singlethreaded workqueues will not be frozen.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Johannes Berg <johannes@sipsolutions.net>
--
 include/linux/workqueue.h |   12 +++++-------
 kernel/workqueue.c        |   35 ++++++++++++++++++++++++++---------
 mm/slab.c                 |   10 +++-------
 3 files changed, 34 insertions(+), 23 deletions(-)

Index: linux-2.6.21-rc5/include/linux/workqueue.h
===================================================================
--- linux-2.6.21-rc5.orig/include/linux/workqueue.h
+++ linux-2.6.21-rc5/include/linux/workqueue.h
@@ -9,7 +9,6 @@
 #include <linux/linkage.h>
 #include <linux/bitops.h>
 #include <asm/atomic.h>
-
 struct workqueue_struct;
 
 struct work_struct;
@@ -112,12 +111,11 @@ struct execute_work {
 	clear_bit(WORK_STRUCT_PENDING, work_data_bits(work))
 
 
-extern struct workqueue_struct *__create_workqueue(const char *name,
-						    int singlethread,
-						    int freezeable);
-#define create_workqueue(name) __create_workqueue((name), 0, 0)
-#define create_freezeable_workqueue(name) __create_workqueue((name), 0, 1)
-#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0)
+extern struct workqueue_struct *create_workqueue(const char *name);
+
+extern struct workqueue_struct *create_freeze_exempted_workqueue(const char *name, int freeze_exempt_events);
+
+extern struct workqueue_struct *create_singlethread_workqueue(const char *name);
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
Index: ...
From: Oleg Nesterov
Date: Thursday, April 5, 2007 - 4:57 am

EXPORT_SYMBOL() has a cost, and these functions are simple wrappers. How about
static inline?

Oleg.

-

From: Andrew Morton
Date: Thursday, April 5, 2007 - 1:06 pm

On Thu, 5 Apr 2007 15:57:14 +0400

something.
-

From: Gautham R Shenoy
Date: Sunday, April 1, 2007 - 10:42 pm

Clean up workqueue.c from the perspective of freezer-based cpu-hotplug.
This patch
o Removes cpu_populated_map as cpu_online_map is safe to use.
o removes cwq_should_stop and uses kthread_should_stop instead.
o Reintroduces take_over_work from the worker_thread of a downed
  cpu. This means that all non-singlethreaded workqueues *have* to
  be frozen to avoid any races.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
--
 kernel/workqueue.c |  118 ++++++++++++++++++++++-------------------------------
 1 files changed, 50 insertions(+), 68 deletions(-)

Index: linux-2.6.21-rc4/kernel/workqueue.c
===================================================================
--- linux-2.6.21-rc4.orig/kernel/workqueue.c
+++ linux-2.6.21-rc4/kernel/workqueue.c
@@ -47,10 +47,6 @@ struct cpu_workqueue_struct {
 
 	struct workqueue_struct *wq;
 	struct task_struct *thread;
-	int status;
-	#define CWQ_RUNNING	0
-	#define CWQ_SHOULD_STOP	1
-	#define CWQ_STOPPED	2
 
 	int run_depth;		/* Detect run_workqueue() recursion depth */
 } ____cacheline_aligned;
@@ -75,7 +71,6 @@ static LIST_HEAD(workqueues);
 static int singlethread_cpu __read_mostly;
 static cpumask_t cpu_singlethread_map __read_mostly;
 /* optimization, we could use cpu_possible_map */
-static cpumask_t cpu_populated_map __read_mostly;
 
 /* If it's single threaded, it isn't in the list of workqueues. */
 static inline int is_single_threaded(struct workqueue_struct *wq)
@@ -86,7 +81,7 @@ static inline int is_single_threaded(str
 static const cpumask_t *wq_cpu_map(struct workqueue_struct *wq)
 {
 	return is_single_threaded(wq)
-		? &cpu_singlethread_map : &cpu_populated_map;
+		? &cpu_singlethread_map : &cpu_online_map;
 }
 
 static
@@ -270,32 +265,16 @@ static void run_workqueue(struct cpu_wor
 	spin_unlock_irq(&cwq->lock);
 }
 
-/*
- * NOTE: the caller must not touch *cwq if this func returns true
- */
-static int ...
From: Oleg Nesterov
Date: Tuesday, April 3, 2007 - 4:47 am

We can't freeze workqueue if some work_struct continuously re-queues itself.
Perhaps this is not a problem (I don't see a good reason for such a behaviour),

I still think that wait_to_die + bind_cpu is unneeded complication.
Why can't we do the following:

	static int worker_thread(void *__cwq)
	{
		...

		for (;;) {
			try_to_freeze();

			prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
			if (!kthread_should_stop() && list_empty(&cwq->worklist))
				schedule();
			finish_wait(&cwq->more_work, &wait);

			if (kthread_should_stop(cwq))
				break;

			run_workqueue(cwq);
		}

		return 0;
	}



Well, this is a matter of taste, but I don't like this change. Now we
should add kthread_bind/wake_up_process calls to __create_workqueue()

This is wrong. CPU_UP_PREPARE doesn't call init_cpu_workqueue().
Easy to fix, but I personally think is is better to initialize

Deadlockable. Suppose that the freezing is in progress, cwq->thread is not
frozen yet. cleanup_workqueue_thread() calls thaw_process(cwq->thread),


This means that the work_struct on single_threaded wq can't use any of

	__create_workqueue()
	destroy_workqueue()
	flush_workqueue()
	cancel_work_sync()

, they are all racy wrt workqueue_cpu_callback(), and we don't freeze
single_threaded workqueues. This is bad.

Probaly we should:

	- freeze all workqueues, even the single_threaded ones.

	- helper_init() explicitely does __create_workqueue(FE_ALL).
	  this means that we should never use the functions above
	  with this workqueue.

Oleg.

-

From: Srivatsa Vaddagiri
Date: Tuesday, April 3, 2007 - 6:59 am

cleanup_workqueue_thread (in Gautham's patches) does this:

	thaw_process()
	kthread_stop()

There is a chance that after thaw_process() [but before we have posted
the kthread_stop], worker thread can come out of the refrigerator and start
running run_workqueue() - that will simply prolong the subsequent
kthread_stop() and the system freeze time.

We could do what you are suggesting if the thaw_process() part was
integrated into kthread_stop() code [basically thaw_process after




Good catch! Can cleanup_workqueue_thread take some mutex to serialize
with freezer here (say freezer_mutex)?

Or better, since this seems to be a general problem for anyone who wants to do a
kthread_stop, how abt modifying kthread_stop like below:

kthread_stop(p)
{
	int old_exempt_flags;

	task_lock(p);
	old_exempt_flags = p->flags;
	p->flags |= PFE_ALL;	/* Exempt 'p' from being frozen? */
	task_unlock(p);

	kthread_stop_info.k = p;
	thaw_process(p);

	wait_for_completion();

}

Marking 'p' as exempt shouldn't be a problem because freezer would wait
on the thread doing kthread_stop() anyway before declaring system as


The workqueue_mutex() should serialize these with workqueue_cpu_callback() to 

Yes I agree, we should target freezing everybody here. It feels much
safer that way!

The only dependency I have seen is stop_machine() called after processes
are frozen. It needs the services of a workqueue to create kthreads. We
need to atleast exempt that worker thread from being frozen. Hopefully
the list of such non-freezable singlethreaded workqueues will be tiny

Ok you seem to have covered what I went out to say above! 

Thx for your review so far ..

-- 
Regards,
vatsa
-

From: Oleg Nesterov
Date: Tuesday, April 3, 2007 - 8:03 am

I think it would be nice to do. I believe we can cleanup ksoftirqd()
and migration_thread() as well (kill wait_to_die: loop). Probably it
is better to introduce a new helper for that, kthread_thaw_stop() or

What if is_single_threaded(wq) == true? In that case we should call
flush_cpu_workqueue(cpu) only if cpu == singlethread_cpu, otherwise
this is unneeded and wrong, because per_cpu_ptr(wq->cpu_wq, cpu) was

I agree, we should mark this thread as non-freezable, but we can't modify
p->flags, this is racy. "current" owns its ->flags and it is not atomic.
Note that thaw_process() checks frozen(p) when it clears PF_FROZEN.

Actually, we should do this before destroy_workqueue() calls flush_workqueue().
Otherwise flush_cpu_workqueue() can hang forever in a similar manner.


No, no, workqueue_mutex can't help. Just for example: CPU_UP_PREPARE completes
and drops workqueue_mutex. __create_workqueue(wq) doesn't see the new cpu, it

Good!

Oleg.

-

From: Srivatsa Vaddagiri
Date: Tuesday, April 3, 2007 - 10:18 am

I doubt whether we can kill it in migration_thread, since that is
another thread which is unfrozen for hotplug (stop_machine relies on its



I suspected that we cannot modify p->flags just like that. How abt
moving freezer exemption bits to a separate field, which is protected by

Yep. I guess these are a class of freezer deadlocks very similar to vfork
parent waiting on child case. I get a feeling these should become common
outside of kthread too (A waits on B for something, B gets frozen, which
means A won't freeze causing freezer to fail). Can freezer detect this


Ok ..sure.

-- 
Regards,
vatsa
-

From: Oleg Nesterov
Date: Wednesday, April 4, 2007 - 8:28 am

I changed my mind :) The problem is general, I am starting to believe

Probably yes... In that case it makes sense to move PF_FREEZER_SKIP/PF_FROZEN
to the new field as well.

Perhaps we can ignore this problem for now. Freezer is not 100% reliable
anyway. For example,

	worker_thread:

		for (;;) {
			try_to_freeze();

			prepare_to_wait();
			if (...)
				schedule();
			finish_wait();
		}

This is racy, we can miss freeze_process()->signal_wake_up() if it happens
between try_to_freeze() and prepare_to_wait(). We have to check TIF_FREEZE
before entering schedule() if we want to fix this race.

Should we? I don't know. This will uglify the code, and the probability
of this race is very low.

Oleg.

-

From: Srivatsa Vaddagiri
Date: Wednesday, April 4, 2007 - 10:49 am

yes i agree. Although is some cases like destroy_workqueue, we need to
mark the target thread non-freezable way before we call kthread_stop (as

I wonder if there are some reserved fields in task_struct which we can


Would be nice to fix IMO. Atleast serves to show "how to make your code
freezer friendly".

-- 
Regards,
vatsa
-

From: Oleg Nesterov
Date: Thursday, April 5, 2007 - 5:20 am

This is funny. I "noticed" this race a long ago, when the ->freezeable flag
was introduced. However, looking at 2.6.20 I see that the patch was correct,
and this race was in fact introduced by me in

	[PATCH 1/1] workqueue: don't migrate pending works from the dead CPU
	http://marc.info/?l=linux-kernel&m=117062192709871

I'll send a fix on weekend.

Oleg.

-

From: Srivatsa Vaddagiri
Date: Wednesday, April 11, 2007 - 7:22 pm

I wonder if there is some value in "enforcing" an order in which
processes get frozen i.e freeze A first before B. That may solve the
deadlocks we have been discussing wrt kthread_stop and flush_workqueue
as well.

The idea is similar to how deadlock wrt multiple locks are solved -
where a ordering is enforced. Take Lock A first before Lock B. 

If process A waits on B (like in kthread_stop or flush_workqueue), then if we:

	1. Insert A and B in a list (freeze_me_first_list)
	2. Have freezer scan freeze_me_first_list before the master
	   task-list, so that it:
		2a. "freezes A and waits for A to get frozen" first
		2b. "freezes B and waits for B to get frozen" next

then we would avoid the nastiness of "B getting frozen first and A doesnt
freeze because of that" with lesser code changes?

-- 
Regards,
vatsa
-

From: Gautham R Shenoy
Date: Thursday, April 12, 2007 - 3:01 am

In that case, A should insert the dependency into the
freeze_me_first_list as B is unaware of the dependency yet.

What if by the time A has inserted the dependency B is already frozen?

Can very well happen right?

A:			B:			freezer
-------------------------------------------------------------------------------
						*Check the list. don't find  A/B
						*Mark A freezeable.
						*Mark B freezeable.
			*try_to_freeze();
*insert A, B		.		
 into the		.	
 list.			.
			.		
		  	.
* wait_for_		.
  completion(done);	.			/* Freezer fails at this
  			.			 *  point 
			.			 */
		  	.
		  	.
		  	*complete(done);

*try_to_freeze();

Example: A = a thread doing flush_workqueue.
	 B = worker thread.


Of course, we can always use the freezer_skip around this
wait_for_completion as well as long as the thread A is not marked
PF_NOFREEZE. But with multiple freeze events, it won't be as simple as

Thanks and Regards
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Oleg Nesterov
Date: Thursday, April 12, 2007 - 9:00 am

Perhaps we can add "atomic_t xxx" to task_struct.

	int freezing(struct task_struct *p)
	{
		return test_tsk_thread_flag(p, TIF_FREEZE)
			&& atomic_read(&p->xxx) == 0;
	}

	void xxx_start(struct task_struct *p)
	{
		atomic_inc(p->xxx);
		thaw_process(p);
	}

	xxx_end(struct task_struct *p)
	{
		atomic_dec(p->xxx);
	}

Now,

	xxx_start(p);
	... wait for something which depends on p...
	xxx_end(p);

Of course we need other changes, freeze_process() should check ->xxx, etc.
I am not sure this makes sense.

Oleg.

-

From: Gautham R Shenoy
Date: Friday, April 13, 2007 - 2:46 am

I think this is racy.
Say, if the task which does xxx_start(p) [let's call it task q] is not freezeable, and task p is
already frozen when q  calls xxx_start, then we might be in a situation
where 

- Freezer has declared the whole system to be frozen. Hence the thread
  issuing the 'freeze'(suspend/hotplug) can go ahead and do whatever it wants to.

- Task 'p' which was supposed to be frozen , is now running and
  *surprise* We have a thread running on a cpu which ain't there any more!

I feel we need some kind of a global rwlock. 


DEFINE_RWLOCK(freezer_status_lock);
int  xxx_start(struct task_struct *p)
{
	int ret = 0;
	ret = read_trylock(&freezer_status_lock);
	if(ret) 
	 /* We've succeeded. So lets thaw the chap */
	 thaw_process(p);
	/* If we've failed to acquire trylock, that means freezer doesn't 
	 * depend on us.
	 * So lets simply wait without thawing p
	 */
	
	return ret;

}


void xxx_end(int state)
{
	if(state)
		read_unlock(&freezer_status_lock);
}


int try_to_freeze_tasks()
{
	do {
		/* The regular freezer code */

		if (!todo && !write_trylock(&freezer_status_lock));
			continue;
		/* When the freezer goes back, it will find task 'p'
		 * woken up and hence wait for it to get frozen
		 */
	}while(todo);
}

void try_to_thaw_tasks()
{
	.
	.
	.
	write_unlock(&freezer_status_lock);
}


	int state = xxx_start(p);
	... wait for something which depends on p...
 	xxx_end(state);

However, now that I look at it again, I see that it will fail in our case
where we might need to nest the try_to_freeze_tasks call.

Hmm, we don't have a rwlock variant that allows multiple writers, now do

Thanks and Regards
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Gautham R Shenoy
Date: Sunday, April 1, 2007 - 10:42 pm

This patch makes all the kernel_threads (except the migration thread)
freezeable for cpu hotplug.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>

-- 
 arch/i386/kernel/apm.c              |    2 +-
 drivers/block/loop.c                |    2 +-
 drivers/char/apm-emulation.c        |    6 +++---
 drivers/ieee1394/ieee1394_core.c    |    2 +-
 drivers/md/md.c                     |    2 +-
 drivers/mmc/card/queue.c            |    2 +-
 drivers/mtd/mtd_blkdevs.c           |    2 +-
 drivers/scsi/libsas/sas_scsi_host.c |    2 +-
 drivers/scsi/scsi_error.c           |    2 +-
 drivers/usb/storage/usb.c           |    2 +-
 10 files changed, 12 insertions(+), 12 deletions(-)

Index: linux-2.6.21-rc5/arch/i386/kernel/apm.c
===================================================================
--- linux-2.6.21-rc5.orig/arch/i386/kernel/apm.c
+++ linux-2.6.21-rc5/arch/i386/kernel/apm.c
@@ -1395,7 +1395,7 @@ static void apm_mainloop(void)
 
 	add_wait_queue(&apm_waitqueue, &wait);
 	set_current_state(TASK_INTERRUPTIBLE);
-	freezer_exempt(FE_ALL);
+	freezer_exempt(FE_SUSPEND_KPROBES);
 	for (;;) {
 		try_to_freeze();
 		schedule_timeout(APM_CHECK_TIMEOUT);
Index: linux-2.6.21-rc5/drivers/block/loop.c
===================================================================
--- linux-2.6.21-rc5.orig/drivers/block/loop.c
+++ linux-2.6.21-rc5/drivers/block/loop.c
@@ -587,7 +587,7 @@ static int loop_thread(void *data)
 	 * hence, it mustn't be stopped at all
 	 * because it could be indirectly used during suspension
 	 */
-	freezer_exempt(FE_ALL);
+	freezer_exempt(FE_SUSPEND_KPROBES);
 
 	set_user_nice(current, -20);
 
Index: linux-2.6.21-rc5/drivers/char/apm-emulation.c
===================================================================
--- linux-2.6.21-rc5.orig/drivers/char/apm-emulation.c
+++ linux-2.6.21-rc5/drivers/char/apm-emulation.c
@@ -336,7 +336,7 @@ apm_ioctl(struct inode * inode, struct f
 			 * threads.
 			 */
 			flags = ...
From: Ingo Molnar
Date: Sunday, April 1, 2007 - 11:16 pm

hm, shouldnt the make be frozen immediately?

doesnt the 'please freeze ASAP' flag get propagated to all tasks, 
immediately? After that point any cloning activity should duplicate that 

we could definitely do that - but i think it should be unnecessary: if 
we mark all tasks as PF_FREEZING atomically, that should result in 
_every_ task immediately dropping dead (once they get back from 
TASK_UNINTERRUPTIBLE). No excuses. If there's some longer delay then 
that can only be explained by some new cloned task/thread slipping 
through the net somehow. (i.e. the PF_FREEZING flag not being duplicated 
across fork?)

i'm wondering about how TASK_UNINTERRUPTIBLE tasks are handled by the 
freezer: are they assumed frozen immediately, or do we wait until they 
notice their PF_FREEZING and go into try_to_freeze()? I'd expect 
TASK_UNINTERRUPTIBLE to be the largest source of latency. (and hence be 
the primary source for freezing 'failures')

	Ingo
-

From: Srivatsa Vaddagiri
Date: Monday, April 2, 2007 - 2:28 am

afaics, setting the 'please freeze asap' flag is racy wrt dup_task_struct
(where the child's tsk->thread_info->flags are copied from its parent?). 
Secondly, from what I understand, it takes a 'flag to be set + signal marked
pending' for the child task to be frozen. If that is the case, then
copy_process may not propogae the signal to the child, which could mean
mean that we can be in a catch-up game in freeze_processes, trying to
freeze processes we didnt see in earlier passes.

I think copy_process() can check for something like this:

	write_lock_irq(&tasklist_lock);

	...

	if (freezing(current))
		freeze_process(p);	/* function exported by freezer */

	...

	write_unlock_irq(&tasklist_lock);


-- 
Regards,
vatsa
-

From: Ingo Molnar
Date: Monday, April 2, 2007 - 4:18 am

yeah. (is that safe with tasklist_lock held?)

i'm wondering whether we could do even better than the signal approach. 
I _think_ the best approach would be to only wait for tasks that are _on 
the runqueue_. I.e. any task that has scheduled away with 
TASK_UNINTERRUPTIBLE (and might not be able to process signal events for 
a long time) is still freezable because it scheduled away.

the only freeze-unsafe task is one that is on the runqueue, executing 
some unknown kernel code. But the number of those is typically pretty 
low, even with very large make -j task-counts.

now, the current approach approximates that set of tasks, but not 
completely: in particular TASK_UNINTERRUPTIBLE sleeping threads can 
introduce arbitrary long delays (and hence freezing failures). [in 
addition to any fork-related 'leaks' of freeze-notification]

	Ingo
-

From: Srivatsa Vaddagiri
Date: Monday, April 2, 2007 - 5:42 am

I am slightly uncomfortable with "not waiting for tasks inside the
kernel to get out" part, even if it that is done only for
TASK_UNINTERRUPTIBLE tasks. For ex: consider this:

flush_workqueue() <- One of biggest offenders of lock_cpu_hotplug() to date
	for_each_online_cpu(cpu)
		flush_cpu_workqueue
			TASK_UNINTERRUPTIBLE sleep

If we don't wait for this thread from being frozen "voluntarily" (because it is 
in TASK_UNINTERRUPTIBLE sleep), then flush_workqueue is clearly racy wrt
cpu hotplug.

I would imagine other situations like this are possible where "not waiting
for everyone to /voluntarily/ quiece" can break cpu hotplug. In fact,
the biggest reason why we are moving to freezer based hotplug is the
fact that it quiesces everyone, leading to (hopefully) zero race conditions.

-- 
Regards,
vatsa
-

From: Gautham R Shenoy
Date: Monday, April 2, 2007 - 7:16 am

I quickly coded this up and ran my tests again. Unfortunately, the
results are negative. I printk'd the state of the unfrozen task and 
this is how the serial console output looks like:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stopping user space processes timed out after 20 seconds (7 tasks
refusing to freeze)
make:TASK_UNINTERRUPTIBLE
make:TASK_UNINTERRUPTIBLE
make:TASK_UNINTERRUPTIBLE
make:TASK_UNINTERRUPTIBLE
make:TASK_UNINTERRUPTIBLE
make:TASK_UNINTERRUPTIBLE
make:TASK_UNINTERRUPTIBLE
Restarting tasks ... done.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I couldn't see any reduction in the number of
freeze-failures. Thus Ingo right.  These failures are due
to the TASK_UNINTERRUPTIBLE tasks which fail to freeze within the

The other option would be to have another state equivalent to 
TASK_UNINTERRUPTIBLE_NOFREEZE, so that Ingo's solution can be applied
for regular TASK_UNINTERRUPTIBLE tasks. Thus TASK_UNINTERRUPTBLE tasks
from a for_each_online_cpu context should now do a
TASK_UNINTERRUPTIBLE_NOFREEZE instead. 


Thanks and Regards
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Ingo Molnar
Date: Monday, April 2, 2007 - 11:56 am

ok. But the only real problem would be for_each_online_cpu() loops that 
might sleep, correct? I did a quick audit and those seem to be in the 
minority by a factor of 1:10.

So ... to make the audit obviously safe, how about mechanically 
converting 100% of the for_each_online_cpu() loops to something like:

	mask = get_each_online_cpu_mask();
	for_each_cpu_mask(mask) {
		...
	}
	put_each_online_cpu_mask(mask);

where get_each_online_cpu_mask() also does a preempt_disable() 
implicitly, and put_each_online_cpu_mask() does a preempt_enable(). 
(Note that no locking is needed - only preemption-disabling.)

the 10% loops that _can_ schedule would trigger the __might_sleep() 
atomicity test in schedule()), and those would have to be converted a 
bit more cleverly, on a case by case basis. (for example a number of 
them might not even have to sleep on the for_each_online_cpu() loop)

hm?

	Ingo
-

From: Srivatsa Vaddagiri
Date: Tuesday, April 3, 2007 - 5:56 am

I would shy from saying that "that's the only problem". It could also be

	for_each_cpu_mask(some_mask)
		do_something_which_sleeps();

where some_mask is suppsoed to represent a subset of online cpus.

For ex: policy->cpus is a mask maintained by cpufreq, which it iterates thr' 
whenever changing cpu frequency. The cpus in that mask are supposed to be 
a subset of online cpus (with CPU_ONLINE/DEAD handlers in cpufreq.c adjusting 
the mask upon hotplug).

We could adopt a similar trick (get/put_each_cpu_mask) that you describe

The real question is how do we convert over those sleeping
for_each_cpu_mask users (for ex: flush_workqueue) such that they don't
block freezer/hotplug for long periods?

One option is to probably rewrite them to understand that
online[or a derived]_map could have changed everytime they come out of a 
(un)interruptible sleep and deal with arising races appropriately. That
would mean a bit of maintenance headache unfortunately (I was hoping
freezer will lead to zero maintenance headache :)

Besides, how problematic is this in practise (that threads sleep for
extended durations in TASK_INTERRUPTIBLE state breaking
freezer/suspend/hotplug)?

Should we ignore this for the timebeing and take up later as and when
users report problems?

-- 
Regards,
vatsa
-

From: Gautham R Shenoy
Date: Tuesday, April 3, 2007 - 7:15 am

In my case, the problem of freezer failing was due to the vfork freezer race
in do_fork. The parent task was waiting_for_completion() while the child was
frozen.

Rafael has already sent out the fix for that, but for some reason I
don't see it in the -mm.

With that fix, freezer and hence hotplug succeeds even when I am running

Thanks and Regards
gautham.

-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Srivatsa Vaddagiri
Date: Tuesday, April 3, 2007 - 8:15 pm

Good to know that!

So Ingo/Rafael, should we ignore the problem of "TASK_UNINTERRUPTIBLE
sleepers can break freezer" for the timebeing? Mainly because its not
trivial to solve and we need to tackle it case by case basis as and when
users report specific problems.

-- 
Regards,
vatsa
-

From: Ingo Molnar
Date: Wednesday, April 4, 2007 - 3:04 am

yeah, i think you are right - and the TASK_UNINTERRUPTIBLE thing is not 
only quite complex, it also breaks the symmetry of freezer use 
(sw-suspend obviously cannot freeze uninterruptible tasks). We should 
watch whether the current latency of freezing is good enough in 
practice.

	Ingo
-

From: Gautham R Shenoy
Date: Wednesday, April 4, 2007 - 3:41 am

Ok. Do you have any specific tests in mind which I can run and post the
numbers? 

As of now, I've been stressing the system with (kernbench + ondemand governor)

Thanks and Regards
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Ingo Molnar
Date: Wednesday, April 4, 2007 - 4:49 am

i suspect a fork-intensive application like kernbench should be close to 
the worst-case already. A more IO-intensive workload would maximize the 
uninterruptible-sleep latencies perhaps?

	Ingo
-

From: Gautham R Shenoy
Date: Wednesday, April 4, 2007 - 5:24 am

Ok. I hope the NFS read/write in a tight loop should do the trick.

Thanks and Regards
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Gautham R Shenoy
Date: Monday, April 2, 2007 - 4:19 am

I will try again  Vatsa's suggestion of having a 

if (freezing(current))
	freeze_process(p);

in copy processes() and check if we can do away with the fork race. 
the TASK_UNINTERRUPTIBLE state for more than the timeout period. 

The kernel threads have to call try_to_freeze() explicitly and for the
userspace tasks, try_to_freeze() is called in get_signal_to_deliver().
The system is considered frozen only when *all* the freezeable tasks
call try_to_freeze() one way or the other. This is unlikely in case of
a TASK_UNINTERRUPTIBLE task.

Question is can we have some task in TASK_UNINTERRUPTIBLE state for such

Thanks and Regards
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Ingo Molnar
Date: Monday, April 2, 2007 - 4:27 am

yes, easily so - just have a really long disk queue. Or really heavy 
mutex contention.

i really think we should add a freezing hook to schedule too (no need to 
change anything else - just add a PF_FREEZE check into the schedule() 
function) - and add a wakeup method that moves TASK_UNINTERRUPTIBLE 
tasks to the runqueue but does not touch their task->state.

( the copy_process() handling is still needed, so that no new tasks
  without PF_FREEZE get created that could slip out of control. )

Such a wakeup will cause them to execute again but without disturbing 
their task->state value, at which point a second hook in schedule() 
could catch and freeze them. (and can restart the sleep afterwards, if 
the task is still TASK_UNINTERRUPTIBLE)

i.e. two easy hooks in schedule() plus a try_to_wake_up() variant that 
does not touch p->state. In fact the second hook could be used instead 
of the first one so one might be enough. (I can code up the scheduler 
bits for you if that would be helpful.)

	Ingo
-

From: Rafael J. Wysocki
Date: Monday, April 2, 2007 - 3:12 pm

Yes, something like this would be necessary to handle TASK_UNINTERRUPTIBLE
tasks.  _Still_, currently we just fail the freezing if there are
uninterruptible tasks, because they can hold locks and may potentially
deadlock with device drivers' .suspend() or .resume() routines (please remember
that the freezer is used for suspending in the first place :-)).

I think we can freeze uninterruptible tasks for the CPU hotplug, but we should

We can catch them while we're freezing kernel threads (provided that the kernel
threads themselves aren't forking like mad, but this doesn't seem to happen in
practice).
 
Greetings,
Rafael
-

From: Pavel Machek
Date: Monday, April 2, 2007 - 6:22 am

They have to wait. We need all tasks frozen _with no locks held_.

							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Gautham R Shenoy
Date: Tuesday, April 3, 2007 - 7:01 am

Ok, are some luck. I panic()ed on freezer fail and checked
the stacktrace of the unfrozen tasks. The stacktrace of each one looks
like:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PID: 7697   TASK: cc354a70  CPU: 7   COMMAND: "make"
#0 [cc37fe50] schedule at c0431752
#1 [cc37fec4] wait_for_completion at c04318d0
#2 [cc37ff24] do_fork at c01249a6
#3 [cc37ff94] sys_vfork at c0103c1f
#4 [cc37ffb4] system_call at c0104d8d
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Rafael had sent out a patch to fix the vfork race, which can be found at
http://lkml.org/lkml/2007/3/1/212

However, the following hunk seems to be missing from the latest -mm.

@@ -1393,7 +1394,9 @@ long do_fork(unsigned long clone_flags,
		tracehook_report_clone_complete(clone_flags, nr, p);
		if (clone_flags & CLONE_VFORK) {
+			freezer_do_not_count();
			wait_for_completion(&vfork);
+			freezer_count();
			tracehook_report_vfork_done(p, nr);
		}
		} else {



Rafael / Andrew,
        Any reasons for leaving this hunk out?

I reran my tests with this hunk applied, and it work just fine.

Thanks and Regards
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Gautham R Shenoy
Date: Tuesday, April 3, 2007 - 5:01 am

Ok, we might be in some luck. I panic()ed on freezer fail and checked
the stacktrace of the unfrozen tasks. The stacktrace of each one looks
like:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PID: 7697   TASK: cc354a70  CPU: 7   COMMAND: "make"
#0 [cc37fe50] schedule at c0431752
#1 [cc37fec4] wait_for_completion at c04318d0
#2 [cc37ff24] do_fork at c01249a6
#3 [cc37ff94] sys_vfork at c0103c1f
#4 [cc37ffb4] system_call at c0104d8d
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Rafael had sent out a patch to fix the vfork race, which can be found at
http://lkml.org/lkml/2007/3/1/212

However, the hunk

@@ -1393,7 +1394,9 @@ long do_fork(unsigned long clone_flags,
		tracehook_report_clone_complete(clone_flags, nr, p);

		if (clone_flags & CLONE_VFORK) {
+			freezer_do_not_count();
			wait_for_completion(&vfork);
+			freezer_count();
			tracehook_report_vfork_done(p, nr);
		}
	} else {

Seems to be missing in the latest -mm's.

Rafael / Andrew, 
	Any reasons for leaving this hunk out?


Thanks and Regards
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Rafael J. Wysocki
Date: Tuesday, April 3, 2007 - 12:34 pm

No, absolutely not.  It's needed.

Moreover, freezer-fix-vfork-problem.patch from the broken-out -rc5-mm3
contains it, so some other patch must have reverted this change.

[looks]

Ah, it's utrace-prep-2.patch .  Andrew?
-

From: Andrew Morton
Date: Tuesday, April 3, 2007 - 1:24 pm

On Tue, 3 Apr 2007 21:34:28 +0200

urgh, I screwed up, sorry.

utrace-prep-2 reverts a bit of the underlying tree so that the utrace
patches (which are against mainline) don't throw a tremendous reject which
has to be fixed each time I pull Roland's tree.  I'm supposed to reapply
that change after the utrace patches but forgot.

--- a/kernel/fork.c~undo-utrace-prep-2
+++ a/kernel/fork.c
@@ -1387,7 +1387,9 @@ long do_fork(unsigned long clone_flags,
 			tracehook_report_clone_complete(clone_flags, nr, p);
 
 		if (clone_flags & CLONE_VFORK) {
+			freezer_do_not_count();
 			wait_for_completion(&vfork);
+			freezer_count();
 			if (likely(is_user))
 				tracehook_report_vfork_done(p, nr);
 		}
_

-

From: Ingo Molnar
Date: Wednesday, April 4, 2007 - 3:06 am

btw., how about pushing utrace to v2.6.22, is anything big holding that 
up? (other than Roland's modesty, which we should gently but firmly 
ignore in this case =B-)

	Ingo
-

From: Christoph Hellwig
Date: Wednesday, April 4, 2007 - 3:36 am

a) lots of architecture support
b) lots of overengineered and obsfucated code
c) missing public information

I've also not gotten any feedback to my ages old review and the issues
don't seem to be addressed either.



-

From: Andrew Morton