Re: floppy.c soft lockup

From: Mark Hounschell
Date: Tuesday, May 29, 2007 - 10:31 am

Changes in floppy.c from 2.6.17 and 2.6.18 have broken an application I have. I have tracked 
it down to a single line of code. When the following patch is applied to the version in 2.6.18
my application works.

--- linux-2.6.18/drivers/block/floppy.c 2006-09-19 23:42:06.000000000 -0400
+++ linux-2.6.18-crt/drivers/block/floppy.c     2007-05-29 09:12:20.000000000 -0400
@@ -893,7 +893,6 @@
                set_current_state(TASK_RUNNING);
                remove_wait_queue(&fdc_wait, &wait);

-               flush_scheduled_work();
        }
        command_status = FD_COMMAND_NONE;

I don't claim to understand the changes from 2.6.17 to 2.6.18 except for the devfs removal.
All I can say is this one line of code kills the application. I have tried to write a short pgm
that shows my problem but everything else I write seems to work. The application only runs
on SMP machines and uses process and irq affinities with real-time scheduling. When I turn
off process and irq affinities the application runs. 

I have tried kernels up through 2.6.21.1 with the same results. All kernels from 2.6.18 up
require that I remove this one line of code or my application does not work.
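
A minimal sketch of the kind of setup described above (a thread pinned to one CPU and run under a real-time policy), assuming the standard Linux calls sched_setaffinity() and sched_setscheduler(). The function names single_cpu_mask() and pin_and_make_rt() are hypothetical and do not come from the application. IRQ affinities are set separately through /proc/irq/N/smp_affinity and are not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <string.h>
#include <sys/types.h>

/* Build an affinity mask that contains only `cpu` (hypothetical helper). */
void single_cpu_mask(int cpu, cpu_set_t *set)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(set);
    CPU_SET(cpu, set);
}

/*
 * Pin the calling thread to `cpu`, then switch it to SCHED_FIFO at `prio`.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 * SCHED_FIFO normally requires root or CAP_SYS_NICE.
 */
int pin_and_make_rt(int cpu, int prio)
{
    cpu_set_t set;
    struct sched_param sp;

    single_cpu_mask(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = prio;
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

A SCHED_FIFO task pinned this way that never sleeps will keep every lower-priority task bound to the same CPU, kernel threads included, from running.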

Regards
Mark



-

From: Andrew Morton
Date: Wednesday, May 30, 2007 - 10:46 pm

Interesting.  I'd expect that the calling process is spinning, with realtime
policy and is expecting some other process to do something (ie: run a workqueue).

If you keep the process and irq affinities, and disable the realtime policy
does that also prevent the problem?

It would be interesting if you could capture a few task traces while it is stuck:
echo 1 > /proc/sys/kernel/sysrq then do ALT-SYSRQ-P a bunch of times and ALT-SYSRQ-T,
see if you can work out where the CPU is stuck.
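
If no console keyboard is at hand, the same dumps can usually be requested by writing the command character to /proc/sysrq-trigger as root ('p' for registers, 't' for the task list). A minimal sketch, assuming that file is available on the running kernel; sysrq_trigger() is a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write one SysRq command character to `path` (normally
 * "/proc/sysrq-trigger"). Returns 0 on success, -1 on error.
 */
int sysrq_trigger(const char *path, char cmd)
{
    FILE *f;
    int ok;

    assert(path != NULL);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;

    ok = (fputc(cmd, f) != EOF);
    if (fclose(f) != 0)    /* the write may only fail at flush time */
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling sysrq_trigger("/proc/sysrq-trigger", 't') would ask for the task dump; the output goes to the kernel log.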

Also, 2.6.22-rc3 might have accidentally fixed this.
-

From: Mark Hounschell
Date: Thursday, May 31, 2007 - 7:28 am

I've attached the syslog output as a result of doing the above. I can't really make any kind of

No. Same thing there.  The traces attached are using 2.6.22-rc3.

Basically the main RT-process (which is a CPU bound process on processor-2) signals a
thread to do some I/O. That RT-thread (running on the other processor) does a simple 

ioctl(Q->DevSpec1, FDSETPRM, &medprm)

and there is no return from the call. That thread is hung.
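
For reference, a minimal standalone sketch of such a call, assuming the definitions in <linux/fd.h>. It reads the current geometry with FDGETPRM and writes it back with FDSETPRM. The function name reset_floppy_params() and the device path passed in are illustrative; the application's Q->DevSpec1 and medprm are not reproduced here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/fd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Read the drive's current parameters and set them again.
 * Returns 0 on success, -1 if the device cannot be opened or an ioctl fails.
 */
int reset_floppy_params(const char *dev)
{
    struct floppy_struct prm;
    int fd, ret;

    assert(dev != NULL);
    /* O_NONBLOCK is an assumption: it avoids waiting on media at open. */
    fd = open(dev, O_RDWR | O_NONBLOCK);
    if (fd < 0)
        return -1;

    ret = ioctl(fd, FDGETPRM, &prm);
    if (ret == 0)
        ret = ioctl(fd, FDSETPRM, &prm);    /* the ioctl reported to hang */

    close(fd);
    return ret == 0 ? 0 : -1;
}
```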


Thanks 
Mark


From: Oleg Nesterov
Date: Thursday, May 31, 2007 - 10:06 am

Could you show the full output? There are no events/* or process doing ioctl()

If the main RT-process monopolizes processor-2, flush_workqueue() (or cancel_work_sync())

What happens if you kill the main RT-process?

Could you try the patch below? Just to see if it makes any difference.

Oleg.

(against 2.6.22-rcX)

--- OLD/drivers/block/floppy.c~	2007-04-03 13:04:58.000000000 +0400
+++ OLD/drivers/block/floppy.c	2007-05-31 20:50:18.000000000 +0400
@@ -862,6 +862,8 @@ static void set_fdc(int drive)
 		FDCS->reset = 1;
 }
 
+static DECLARE_WORK(floppy_work, NULL);
+
 /* locks the driver */
 static int _lock_fdc(int drive, int interruptible, int line)
 {
@@ -893,7 +895,7 @@ static int _lock_fdc(int drive, int inte
 		set_current_state(TASK_RUNNING);
 		remove_wait_queue(&fdc_wait, &wait);
 
-		flush_scheduled_work();
+		cancel_work_sync(&floppy_work);
 	}
 	command_status = FD_COMMAND_NONE;
 
@@ -992,8 +994,6 @@ static void empty(void)
 {
 }
 
-static DECLARE_WORK(floppy_work, NULL);
-
 static void schedule_bh(void (*handler) (void))
 {
 	PREPARE_WORK(&floppy_work, (work_func_t)handler);

-

From: Mark Hounschell
Date: Thursday, May 31, 2007 - 11:01 am

The patch does make it work. Would you like for me to try again to get a trace with
something meaningful in it?

Regards
Mark
-

From: Mark Hounschell
Date: Thursday, May 31, 2007 - 11:44 am

When I kill the main process all its threads also go away. Including the floppy thread.
Nothing notable happens with this kernel. On previous (2.6.18) I would get a dump

The patch does make it work.

Regards
Mark
-

From: Oleg Nesterov
Date: Thursday, May 31, 2007 - 12:22 pm

Aha, I missed the word "thread", this is the single process.

Still, this means that flush_workqueue() completes when other sub-threads go away,
otherwise the thread doing ioctl() couldn't exit.

Thank you very much.

So, the main question is: is it possible that one of RT processes/threads pins itself


I do not understand floppy.c, absolutely, so I am not sure this patch is correct.

Even if correct, this patch doesn't solve this problem (if we really understand
what's going on). cancel_work_sync() may still hang if floppy_work->func() runs
on the starved CPU. This is unlikely, but possible.

Thanks!

Oleg.

-

From: Mark Hounschell
Date: Thursday, May 31, 2007 - 1:18 pm

The main process is pinned to a processor(2) with all _non-kernel_  processes/threads forced over to processor 1.
Any already affinitized processes or kernel threads are left as is. Only user land stuff is moved. The main process
is for sure _not_ relinquishing its processor(2) intentionally. All the I/O threads, floppy included, are running
on the other processor(1). During this failure only 1 or 2 of the I/O threads are actually doing anything.
I assume that whatever is going on in the kernel/floppy driver on behalf of the floppy thread is being done on processor 1? 

Today, 2.6.18 is doing the same as 2.6.22-rc3. I hate it when that happens. Maybe it was

-

From: Mark Hounschell
Date: Friday, June 1, 2007 - 2:51 am

Those syslog dumps must have been a result of something I was doing

Thanks and Regards
Mark
-

From: Oleg Nesterov
Date: Friday, June 1, 2007 - 4:00 am

I hope Ingo will correct me if I am wrong,


This means that a non-rt kernel thread bound to CPU 2 can't run. In particular,
events/2. This means that the problem is not directly connected to floppy.c,
any flush_scheduled_work() (or schedule_on_each_cpu()) can't succeed.

You can change irq/X/smp_affinity, but smp_apic_timer_interrupt() still can
queue work_struct on CPU 2 (for example, mm/slab.c uses per-cpu reap_work).
Since events/2 is blocked by the main RT thread, such a work_struct can't be

Yes, but see above. flush_scheduled_work() needs a cooperation from events/2
which is bound to CPU 2.
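
The dependency can be shown with a toy userspace model. This is not kernel code: struct cpu_queue and flush_would_complete() are made-up names, and the model only captures the one property discussed here, that the flush has to wait for every CPU that has work pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one per-CPU events/N worker. */
struct cpu_queue {
    int pending;     /* work items queued on this CPU */
    bool starved;    /* true if an RT task never lets the worker run */
};

/*
 * Model of flush_scheduled_work(): it finishes only when every CPU with
 * pending work has drained its queue. Returns true if the flush would
 * complete and false if it would wait forever.
 */
bool flush_would_complete(struct cpu_queue *q, size_t ncpus)
{
    assert(q != NULL || ncpus == 0);
    for (size_t i = 0; i < ncpus; i++) {
        if (q[i].pending == 0)
            continue;         /* nothing to wait for on this CPU */
        if (q[i].starved)
            return false;     /* worker never runs, so the wait never ends */
        q[i].pending = 0;     /* worker runs and drains its list */
    }
    return true;
}
```

In this model the caller's own CPU does not matter: one pending item on a starved CPU is enough to make the flush hang.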

If you changed irq/X/smp_affinity, the patch I sent should help, because
floppy_work can't be scheduled on CPU 2, but still I don't think it is right
to run 100% cpu-bound RT-process.

Oleg.

-

From: Mark Hounschell
Date: Friday, June 1, 2007 - 7:10 am

Well, I have multiple I/O threads for many other types of I/O that don't have any problems. 
And until these changes in 2.6.18 I didn't have any problems with the floppy. I have multiple
ethernet threads, multiple scsi (SG) device threads, multiple rs232 device threads, parallel port, 

I don't mean to sound stupid but why would a process running on processor 1 require anything

Again I don't understand why flush_scheduled_work() running on behalf of a process
affinitized to processor-1 requires cooperation from events/2 (affinitized to processor-2)
when there is an events/1 already affinitized to processor 1? Again though, Forgive my 

The patch you sent helps with no other intervention from me. But then so does 
the patch mentioned in the original post.  I am able to bang on the floppies pretty
hard doing all kinds of things with no trouble using either. 

As far as a 100% cpu-bound RT-process goes, well I say I don't intentionally relinquish
the processor but it's not really 100% cpu-bound. Running xosview I see some spare time. 

Thanks
Mark

-

From: Oleg Nesterov
Date: Friday, June 1, 2007 - 8:16 am

flush_workqueue() blocks until any scheduled work on any CPU has run to
completion. If we have some work_struct pending on CPU 2, it can be completed

This patch replaces flush_scheduled_work() with cancel_work_sync(). The latter
can still hang if the floppy interrupt happens on CPU 2 and does schedule_bh(),
events/2 starts running floppy_work->func() and is preempted by the RT-thread. This is
	>
	> The application only runs on SMP machines and uses process and irq affinities


Well, I don't know what is xosview, sorry :) so I don't understand what does
"spare time" precisely mean. If this thread does some i/o or something which
can sleep, then...

OK. In that case we may have another reason for deadlock, say a pending
floppy_work needs open_lock or test_and_set_bit(0, &fdc_busy).

Could you apply the trivial patch below, and change the i/o thread to do

		prctl(1234);				// hangs ???
		printf(something);
		ioctl(Q->DevSpec1, FDSETPRM, &medprm);	// this hangs

to see if prctl() hangs or not? This way we can narrow the problem.
(of course, you can just kill the above ioctl() if this is possible).

Thanks!

Oleg.

--- OLD/kernel/sys.c~	2007-04-03 13:05:02.000000000 +0400
+++ OLD/kernel/sys.c	2007-06-01 18:56:22.000000000 +0400
@@ -2147,6 +2147,11 @@ asmlinkage long sys_prctl(int option, un
 {
 	long error;
 
+	if (option == 1234) {
+		flush_scheduled_work();
+		return 0;
+	}
+
 	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
 	if (error)
 		return error;

-

From: Mark Hounschell
Date: Friday, June 1, 2007 - 10:11 am

All the irq affinities but one are set to processor-1. The only irq that is not
is from an rtom (Real-Time Option Module). Its irq is handled by

I don't understand the _real_ meaning of spare time either but xosview
is just a little graphical window showing information obtained from the


Ok the prctl never returned. I just replaced the ioctl with it and added
a printf before and after. I only get the one before. The thread is hung
at this point just as if I'd done the ioctl?

Regards
Mark
-

From: Oleg Nesterov
Date: Friday, June 1, 2007 - 11:36 am

Thanks. So we can rule out floppy.c. flush_scheduled_work/flush_workqueue
is broken by this RT application. Imho, this is not the kernel problem.

Now I am very sure that the initial suspect was correct: cpu starvation.
I can cook a debug patch to be 100% sure tomorrow, which kernel version is
most convenient to you?

Oleg.

-

From: Mark Hounschell
Date: Friday, June 1, 2007 - 12:52 pm

2.6.22-rc3 is fine thanks.

Regards
Mark

-

From: Oleg Nesterov
Date: Saturday, June 2, 2007 - 5:30 am

Please try this patch, it should dump some debug info when flush_workqueue()
hangs (after 30 seconds). You can use it with or without the previous patch
I sent. Please wait for a couple of minutes to collect more info.

Oleg.

--- OLD/kernel/sched.c~TST	2007-04-05 12:20:35.000000000 +0400
+++ OLD/kernel/sched.c	2007-06-02 15:41:53.000000000 +0400
@@ -4177,6 +4177,20 @@ struct task_struct *idle_task(int cpu)
 	return cpu_rq(cpu)->idle;
 }
 
+struct task_struct *get_cpu_curr(int cpu)
+{
+	unsigned long flags;
+	struct task_struct *curr;
+	struct rq *rq = cpu_rq(cpu);
+
+	spin_lock_irqsave(&rq->lock, flags);
+	curr = rq->curr;
+	get_task_struct(curr);
+	spin_unlock_irqrestore(&rq->lock, flags);
+
+	return curr;
+}
+
 /**
  * find_process_by_pid - find a process with a matching PID value.
  * @pid: the pid in question.
--- OLD/kernel/workqueue.c~TST	2007-06-02 13:34:57.000000000 +0400
+++ OLD/kernel/workqueue.c	2007-06-02 16:18:02.000000000 +0400
@@ -49,6 +49,7 @@ struct cpu_workqueue_struct {
 	struct task_struct *thread;
 
 	int run_depth;		/* Detect run_workqueue() recursion depth */
+	int jobs;
 } ____cacheline_aligned;
 
 /*
@@ -253,6 +254,7 @@ static void run_workqueue(struct cpu_wor
 
 		cwq->current_work = work;
 		list_del_init(cwq->worklist.next);
+		cwq->jobs++;
 		spin_unlock_irq(&cwq->lock);
 
 		BUG_ON(get_wq_data(work) != cwq);
@@ -328,7 +330,48 @@ static void insert_wq_barrier(struct cpu
 	insert_work(cwq, &barr->work, tail);
 }
 
-static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
+extern struct task_struct *get_cpu_curr(int cpu);
+
+static void flush_wait(struct cpu_workqueue_struct *cwq, int cpu, struct completion *done)
+{
+	struct task_struct *curr;
+	struct work_struct *work;
+	int old_pid, jobs;
+
+	if (is_single_threaded(cwq->wq))
+		cpu = raw_smp_processor_id();
+
+again:
+	work = cwq->current_work;
+	jobs = cwq->jobs;
+
+	curr = get_cpu_curr(cpu);
+	old_pid = curr->pid;
+	put_task_struct(curr);
+
+	if ...

-

From: Mark Hounschell
Date: Saturday, June 2, 2007 - 1:44 pm

Jun  2 16:36:11 harley kernel: ERR!! events/1 flush hang: c201dbc0
c201dbc0 10012 10012
Jun  2 16:36:11 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun  2 16:36:11 harley kernel:     wq_barrier_func+0x0/0x8
Jun  2 16:36:11 harley kernel:     vmstat_update+0x0/0x24
Jun  2 16:36:11 harley kernel:     ----
Jun  2 16:36:11 harley kernel:     cache_reap+0x0/0xf4
Jun  2 16:36:41 harley kernel: ERR!! events/1 flush hang: c201dbc0
c201dbc0 10012 10012
Jun  2 16:36:41 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun  2 16:36:41 harley kernel:     wq_barrier_func+0x0/0x8
Jun  2 16:36:41 harley kernel:     vmstat_update+0x0/0x24
Jun  2 16:36:41 harley kernel:     ----
Jun  2 16:36:41 harley kernel:     cache_reap+0x0/0xf4
Jun  2 16:37:11 harley kernel: ERR!! events/1 flush hang: c201dbc0
c201dbc0 10012 10012
Jun  2 16:37:11 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun  2 16:37:11 harley kernel:     wq_barrier_func+0x0/0x8
Jun  2 16:37:11 harley kernel:     vmstat_update+0x0/0x24
Jun  2 16:37:11 harley kernel:     ----
Jun  2 16:37:11 harley kernel:     cache_reap+0x0/0xf4
Jun  2 16:37:41 harley kernel: ERR!! events/1 flush hang: c201dbc0
c201dbc0 10012 10012
Jun  2 16:37:41 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun  2 16:37:41 harley kernel:     wq_barrier_func+0x0/0x8
Jun  2 16:37:41 harley kernel:     vmstat_update+0x0/0x24
Jun  2 16:37:41 harley kernel:     ----
Jun  2 16:37:41 harley kernel:     cache_reap+0x0/0xf4
Jun  2 16:37:51 harley kernel: RTOM: In int handler for 12 usec.
Jun  2 16:38:11 harley kernel: ERR!! events/1 flush hang: c201dbc0
c201dbc0 10012 10012
Jun  2 16:38:11 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun  2 16:38:11 harley kernel:     wq_barrier_func+0x0/0x8
Jun  2 16:38:11 harley kernel:     vmstat_update+0x0/0x24
Jun  2 16:38:11 harley kernel:     ----
Jun  2 16:38:11 harley kernel:     cache_reap+0x0/0xf4
Jun  2 16:38:41 harley kernel: ERR!! events/1 flush hang: c201dbc0
c201dbc0 10012 10012
Jun  2 16:38:41 harley kernel: CURR: 7974 7974 vrsx 93 ...

-

From: Oleg Nesterov
Date: Sunday, June 3, 2007 - 1:14 am

As expected.

Note that ->nivcsw/->nvcsw doesn't change. There is no "spare time"
on CPU 1, "vrsx" monopolizes CPU. events/1->cache_reap() was preempted
by vrsx, it had no chance to run since then. Note that jobs == 7974
doesn't change either. I forgot to print cwq->thread->state, but it should
be TASK_RUNNING. It would not be possible to kill vrsx if cache_reap()
stalled.
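
Context-switch counters can also be watched from userspace on kernels that export voluntary_ctxt_switches and nonvoluntary_ctxt_switches in /proc/<pid>/status. To my knowledge that is newer than the 2.6.22-rc3 tested here, so treat this as an assumption. A minimal sketch of a parser for that text follows; status_field() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look for a line starting with "key:" in /proc/<pid>/status-style text
 * and return the number that follows, or -1 if the key is not found.
 */
long status_field(const char *text, const char *key)
{
    size_t klen;
    const char *p = text;

    assert(text != NULL && key != NULL);
    klen = strlen(key);

    while (p != NULL && *p != '\0') {
        if (strncmp(p, key, klen) == 0 && p[klen] == ':') {
            long v;
            if (sscanf(p + klen + 1, "%ld", &v) == 1)
                return v;
            return -1;
        }
        p = strchr(p, '\n');
        if (p != NULL)
            p++;    /* step past the newline to the next line */
    }
    return -1;
}
```

Sampling both counters twice, some seconds apart, and seeing no change for a task that is runnable the whole time is consistent with a task that never leaves the CPU.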

I don't think this is a kernel problem, vrsx breaks flush_workqueue().
Ingo can answer authoritatively, but I think SCHED_RR/SCHED_FIFO were
not designed to be 100% cpu-bound.

That said, I think it makes sense to get rid of flush_scheduled_work()
in floppy.c.

Thanks!

Oleg.

-

From: Mark Hounschell
Date: Monday, June 4, 2007 - 7:00 am

Oleg, thanks for your time in diagnosing this. 

As far as a 100% CPU bound task being a valid thing to do, it has been 
done for many years on SMP machines. Any kernel limitation on this 
surely must be considered a bug? 

Thanks again
Regards
Mark

-

From: Mark Hounschell
Date: Wednesday, June 6, 2007 - 6:12 am

Could someone authoritatively comment on this? Is a SCHED_RR/SCHED_FIFO
100% CPU-bound process supported in an SMP env on Linux? (vanilla or -rt)

Thanks and Regards
Mark
-

From: Andrew Morton
Date: Wednesday, June 6, 2007 - 10:28 am

It will kill the kernel, sorry.

The only way in which we can fix that is to allow kernel threads to preempt
rt-priority userspace threads.  But if we were to do that (to benefit the
few) it would cause _all_ people's rt-prio processes to experience glitches
due to kernel activity, which we believe to be worse.

So we're between a rock and a hard place here.

If we really did want to solve this then I guess the kernel would need some
new code to detect a 100%-busy rt-prio process and to then start permitting
preemption of it for kernel thread activity.  That detector would need to
be smart enough to detect a number of 100%-busy rt-prio processes which are
yielding to each other, and one rt-prio process which keeps forking others,
etc.  It might get tricky.
-

From: Matt Mackall
Date: Wednesday, June 6, 2007 - 6:31 pm

The usual alternative is to manually chrt the relevant kernel threads
to RT priority and adjust the priority scheme of their processes appropriately.
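
A minimal sketch of what a command like `chrt -f -p <prio> <pid>` amounts to, assuming sched_setscheduler() and run as root. The pid would be that of the kernel thread in question (for example the relevant events/N thread), and the priority would have to be above that of the CPU-bound application for this to help. make_task_fifo() is a hypothetical name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/types.h>

/*
 * Give task `pid` the SCHED_FIFO policy at priority `prio`.
 * Returns 0 on success, -1 on failure with errno set.
 */
int make_task_fifo(pid_t pid, int prio)
{
    struct sched_param sp;

    assert(pid >= 0);
    /* Reject priorities outside the range the system reports for SCHED_FIFO. */
    if (prio < sched_get_priority_min(SCHED_FIFO) ||
        prio > sched_get_priority_max(SCHED_FIFO)) {
        errno = EINVAL;
        return -1;
    }

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = prio;
    return sched_setscheduler(pid, SCHED_FIFO, &sp);
}
```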

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Mark Hounschell
Date: Thursday, June 7, 2007 - 3:18 am

Could flush_scheduled_work() not just follow the affinity mask of the
task that caused the call to begin with? If the calling task had a cpu-mask
of 3 then flush_scheduled_work() would do the events/0 and events/1
thing and if the calling task had an affinity mask of 1 then only
events/0 would be done?

In other words changing what Oleg says above just slightly:

flush_workqueue() blocks until any scheduled work on any CPU in the
calling task's affinity mask has run to completion?
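
To make the proposed rule concrete, here is a small sketch of the mask-to-CPU mapping it implies. This is only an illustration of the idea, not an existing kernel interface; cpus_to_flush() is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Under the proposal, only the events/N threads whose CPU bit is set in
 * the caller's affinity mask would be flushed. Store those CPU numbers in
 * `cpus` (at most `max` of them) and return how many were stored.
 */
size_t cpus_to_flush(unsigned long mask, int *cpus, size_t max)
{
    size_t n = 0;

    assert(cpus != NULL);
    for (int cpu = 0; mask != 0 && n < max; cpu++, mask >>= 1) {
        if (mask & 1UL)
            cpus[n++] = cpu;
    }
    return n;
}
```

With a mask of 3 this yields CPUs 0 and 1 (events/0 and events/1); with a mask of 1 it yields only CPU 0.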

Thanks
Mark

-

From: Matt Mackall
Date: Thursday, June 7, 2007 - 7:25 am

The kernel's internal event API doesn't track any of this stuff and
it's not clear we'd want it to. It'd be a bit simpler perhaps to
simply allow SIGSTOPing events/0. This might even work today from
userspace.

In general, it's considered a mistake to mark CPU hogs as RT precisely
because they present a starvation risk to everything else in the
system, not just kernel threads. We could add kernel infrastructure to
make events survive this sort of thing, but that will very likely just
expose another kernel or userspace livelock.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Mark Hounschell
Date: Friday, June 8, 2007 - 2:54 am

In general maybe, but we are not really talking in general terms here.
For everything other than kernel stuff, I (userland) would assume
responsibility. That is why I force all other userland tasks that I want
able to have 7 RT-HOGS running on it while all other processes are on the
8th. Yet in Linux I can't have even one because it breaks the
kernel. I'm sorry, this does not sound right to me. To me and people in
my world, this is clearly a kernel deficiency.

Regards
Mark
-

From: Oleg Nesterov
Date: Wednesday, June 13, 2007 - 9:17 am

Sorry for delay,


No, we can't do this, this makes flush_workqueue() meaningless.

Even if we could, this can't help. Suppose that a kernel thread takes some
global lock (for example, in our case cache_reap() takes cache_chain_mutex)
and then it is preempted by RT task which doesn't relinquish CPU.

So this problem is "wider", flush_workqueue() was just a random victim.

Oleg.

-
