Changes in floppy.c between 2.6.17 and 2.6.18 have broken an application I have. I have tracked
it down to a single line of code. When the following patch is applied to the version in 2.6.18,
my application works.
--- linux-2.6.18/drivers/block/floppy.c 2006-09-19 23:42:06.000000000 -0400
+++ linux-2.6.18-crt/drivers/block/floppy.c 2007-05-29 09:12:20.000000000 -0400
@@ -893,7 +893,6 @@
set_current_state(TASK_RUNNING);
remove_wait_queue(&fdc_wait, &wait);
- flush_scheduled_work();
}
command_status = FD_COMMAND_NONE;
I don't claim to understand the changes from 2.6.17 to 2.6.18 except for the devfs removal.
All I can say is this one line of code kills the application. I have tried to write a short program
that shows my problem but everything else I write seems to work. The application only runs
on SMP machines and uses process and irq affinities with real-time scheduling. When I turn
off process and irq affinities the application runs.
I have tried kernels up through 2.6.21.1 with the same results. All kernels from 2.6.18 up
require that I remove this one line of code or my application does not work.
Regards
Mark
-
Interesting. I'd expect that the calling process is spinning with realtime policy and is expecting some other process to do something (i.e. run a workqueue). If you keep the process and irq affinities and disable the realtime policy, does that also prevent the problem? It would be interesting if you could capture a few task traces while it is stuck: echo 1 > /proc/sys/kernel/sysrq, then do ALT-SYSRQ-P a bunch of times and ALT-SYSRQ-T, and see if you can work out where the CPU is stuck. Also, 2.6.22-rc3 might have accidentally fixed this. -
I've attached the syslog output as a result of doing the above. I can't really make any kind of No. Same thing there. The traces attached are using 2.6.22-rc3. Basically the main RT-process (which is a CPU bound process on processor-2) signals a thread to do some I/O. That RT-thread (running on the other processor) does a simple ioctl(Q->DevSpec1, FDSETPRM, &medprm) and there is no return from the call. That thread is hung. Thanks Mark
Could you show the full output? There are no events/* or process doing ioctl()
If the main RT-process monopolizes processor-2, flush_workqueue() (or cancel_work_sync())
What happens if you kill the main RT-process?
Could you try the patch below? Just to see if it makes any difference.
Oleg.
(against 2.6.22-rcX)
--- OLD/drivers/block/floppy.c~ 2007-04-03 13:04:58.000000000 +0400
+++ OLD/drivers/block/floppy.c 2007-05-31 20:50:18.000000000 +0400
@@ -862,6 +862,8 @@ static void set_fdc(int drive)
FDCS->reset = 1;
}
+static DECLARE_WORK(floppy_work, NULL);
+
/* locks the driver */
static int _lock_fdc(int drive, int interruptible, int line)
{
@@ -893,7 +895,7 @@ static int _lock_fdc(int drive, int inte
set_current_state(TASK_RUNNING);
remove_wait_queue(&fdc_wait, &wait);
- flush_scheduled_work();
+ cancel_work_sync(&floppy_work);
}
command_status = FD_COMMAND_NONE;
@@ -992,8 +994,6 @@ static void empty(void)
{
}
-static DECLARE_WORK(floppy_work, NULL);
-
static void schedule_bh(void (*handler) (void))
{
PREPARE_WORK(&floppy_work, (work_func_t)handler);
-
The patch does make it work. Would you like for me to try again to get a trace with something meaningful in it? Regards Mark -
When I kill the main process all its threads also go away. Including the floppy thread. Nothing notable happens with this kernel. On previous (2.6.18) I would get a dump The patch does make it work. Regards Mark -
Aha, I missed the word "thread", this is the single process. Still, this means that flush_workqueue() completes when other sub-threads go away, otherwise the thread doing ioctl() couldn't exit. Thank you very much. So, the main question is: is it possible that one of RT processes/threads pins itself I do not understand floppy.c, absolutely, so I am not sure this patch is correct. Even if correct, this patch doesn't solve this problem (if we really understand what's going on). cancel_work_sync() may still hang if floppy_work->func() runs on the starved CPU. This is unlikely, but possible. Thanks! Oleg. -
The main process is pinned to a processor(2) with all _non-kernel_ processes/threads forced over to processor 1. Any already affinitized processes or kernel threads are left as is. Only user land stuff is moved. The main process is for sure _not_ relinquishing its processor(2) intentionally. All the I/O threads, floppy included, are running on the other processor(1). During this failure only 1 or 2 of the I/O threads are actually doing anything. I assume that whatever is going on in the kernel/floppy driver on behalf of the floppy thread is being done on processor 1? Today, 2.6.18 is doing the same as 2.6.22-rc3. I hate it when that happens. Maybe it was -
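For readers unfamiliar with the setup described here, pinning a thread to one processor can be sketched in user space as follows. This is a minimal sketch, assuming Linux with glibc; the helper names (first_allowed_cpu, pin_self_to_cpu, pinned_only_to) are hypothetical and are not taken from Mark's application, whose source is not shown in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restrict the calling thread to one CPU; 0 on success, -errno on failure. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -EINVAL;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    assert(CPU_COUNT(&set) == 1);   /* the mask names exactly one CPU */
    return sched_setaffinity(0, sizeof(set), &set) == 0 ? 0 : -errno;
}

/* Return 1 if the calling thread is allowed on `cpu` and nowhere else. */
int pinned_only_to(int cpu)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return 0;
    return CPU_COUNT(&set) == 1 && CPU_ISSET(cpu, &set);
}
```

Note that this only moves the calling thread. Per-CPU kernel threads such as events/N keep their own binding, which is what the rest of the thread turns on.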
I hope Ingo will correct me if I am wrong, This means that a non-rt kernel thread bound to CPU 2 can't run. In particular, events/2. This means that the problem is not directly connected to floppy.c, any flush_scheduled_work() (or schedule_on_each_cpu()) can't succeed. You can change irq/X/smp_affinity, but smp_apic_timer_interrupt() still can queue work_struct on CPU 2 (for example, mm/slab.c uses per-cpu reap_work). Since events/2 is blocked by the main RT thread, such a work_struct can't be Yes, but see above. flush_scheduled_work() needs a cooperation from events/2 which is bound to CPU 2. If you changed irq/X/smp_affinity, the patch I sent should help, because floppy_work can't be scheduled on CPU 2, but still I don't think it is right to run 100% cpu-bound RT-process. Oleg. -
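The dependency on events/2 can be illustrated with a toy user-space model. This is only a sketch of the argument, not the kernel's workqueue code: struct cpu_queue and flush_all are hypothetical names, and the assumption that an empty queue never blocks the caller is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one per-CPU workqueue and its events/N thread. */
struct cpu_queue {
    int  pending;      /* queued work items */
    bool worker_runs;  /* false if a higher-priority task never yields this CPU */
};

/*
 * Model of a flush over all CPUs: every non-empty queue must drain.
 * Returns -1 if the flush would complete, otherwise the index of the
 * first CPU the caller would wait on forever.
 */
int flush_all(struct cpu_queue *q, size_t ncpu)
{
    assert(q != NULL);
    for (size_t cpu = 0; cpu < ncpu; cpu++) {
        if (q[cpu].pending == 0)
            continue;             /* nothing to wait for on this CPU */
        if (!q[cpu].worker_runs)
            return (int)cpu;      /* worker starved: the caller blocks here */
        q[cpu].pending = 0;       /* worker drains its queue */
    }
    return -1;
}
```

The CPU the caller runs on does not appear anywhere in the model. The flush waits on every queue with pending work, so one starved worker is enough to block a caller on any other CPU.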
Well, I have multiple I/O threads for many other types of I/O that don't have any problems. And until these changes in 2.6.18 I didn't have any problems with the floppy. I have multiple ethernet threads, multiple scsi (SG) device threads, multiple rs232 device threads, parallel port, I don't mean to sound stupid but why would a process running on processor 1 require anything Again I don't understand why flush_scheduled_work() running on behalf of a process affinitized to processor-1 requires cooperation from events/2 (affinitized to processor-2) when there is an events/1 already affinitized to processor 1? Again though, Forgive my The patch you sent helps with no other intervention from me. But then so does the patch mentioned in the original post. I am able to bang on the floppies pretty hard doing all kinds of things with no trouble using either. As far as a 100% cpu-bound RT-process goes, well I say I don't intentionally relinquish the processor but it's not really 100% cpu-bound. Running xosview I see some spare time. Thanks Mark -
flush_workqueue() blocks until any scheduled work on any CPU has run to
completion. If we have some work_struct pending on CPU 2, it can be completed
This patch replaces flush_scheduled_work() with cancel_work_sync(). The latter
can still hang if the floppy interrupt happens on CPU 2 and does schedule_bh(),
events/2 starts running floppy_work->func() and is preempted by the RT thread. This is
>
> The application only runs on SMP machines and uses process and irq affinities
Well, I don't know what xosview is, sorry :) so I don't understand what
"spare time" precisely means. If this thread does some i/o or something which
can sleep, then...
OK. In that case we may have another reason for deadlock, say a pending
floppy_work needs open_lock or test_and_set_bit(0, &fdc_busy).
Could you apply the trivial patch below, and change the i/o thread to do
prctl(1234); // hangs ???
printf(something);
ioctl(Q->DevSpec1, FDSETPRM, &medprm); // this hangs
to see if prctl() hangs or not? This way we can narrow the problem.
(of course, you can just kill the above ioctl() if this is possible).
Thanks!
Oleg.
--- OLD/kernel/sys.c~ 2007-04-03 13:05:02.000000000 +0400
+++ OLD/kernel/sys.c 2007-06-01 18:56:22.000000000 +0400
@@ -2147,6 +2147,11 @@ asmlinkage long sys_prctl(int option, un
{
long error;
+ if (option == 1234) {
+ flush_scheduled_work();
+ return 0;
+ }
+
error = security_task_prctl(option, arg2, arg3, arg4, arg5);
if (error)
return error;
-
All the irq affinities but one are set to processor-1. The only irq that is not is from an rtom (Real-Time Option Module). Its irq is handled by I don't understand the _real_ meaning of spare time either but xosview is just a little graphical window showing information obtained from the Ok the prctl never returned. I just replaced the ioctl with it and added a printf before and after. I only get the one before. The thread is hung at this point just as if I'd done the ioctl. Regards Mark -
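The probe described above can be sketched as a small user-space helper. This is a minimal sketch: the option number 1234 comes from the debug patch in this thread, while the names PR_TEST_FLUSH and probe_flush are hypothetical. On a kernel without that patch, prctl() should reject the unknown option with EINVAL instead of hanging.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/prctl.h>

/* Option number from the debug patch in this thread; stock kernels do not define it. */
#define PR_TEST_FLUSH 1234

/*
 * Print a marker, issue the probe, then print a second marker.
 * Returns prctl()'s result; errno is left as prctl() set it.
 */
int probe_flush(FILE *log)
{
    int ret, saved;

    assert(log != NULL);
    fprintf(log, "before prctl\n");
    fflush(log);    /* get the marker out before a possible hang */

    errno = 0;
    ret = prctl(PR_TEST_FLUSH, 0UL, 0UL, 0UL, 0UL);
    saved = errno;

    fprintf(log, "after prctl: ret=%d errno=%d\n", ret, saved);
    fflush(log);
    errno = saved;  /* restore: fprintf/fflush may have changed it */
    return ret;
}
```

Flushing the stream before the call matters here: if the first marker stayed in a stdio buffer, a hang inside prctl() would look the same as a hang before it.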
Thanks. So we can rule out floppy.c. flush_scheduled_work/flush_workqueue is broken by this RT application. Imho, this is not the kernel problem. Now I am very sure that the initial suspect was correct: cpu starvation. I can cook a debug patch to be 100% sure tomorrow, which kernel version is most convenient to you? Oleg. -
Please try this patch, it should dump some debug info when flush_workqueue()
hangs (after 30 seconds). You can use it with or without the previous patch
I sent. Please wait for a couple of minutes to collect more info.
Oleg.
--- OLD/kernel/sched.c~TST 2007-04-05 12:20:35.000000000 +0400
+++ OLD/kernel/sched.c 2007-06-02 15:41:53.000000000 +0400
@@ -4177,6 +4177,20 @@ struct task_struct *idle_task(int cpu)
return cpu_rq(cpu)->idle;
}
+struct task_struct *get_cpu_curr(int cpu)
+{
+ unsigned long flags;
+ struct task_struct *curr;
+ struct rq *rq = cpu_rq(cpu);
+
+ spin_lock_irqsave(&rq->lock, flags);
+ curr = rq->curr;
+ get_task_struct(curr);
+ spin_unlock_irqrestore(&rq->lock, flags);
+
+ return curr;
+}
+
/**
* find_process_by_pid - find a process with a matching PID value.
* @pid: the pid in question.
--- OLD/kernel/workqueue.c~TST 2007-06-02 13:34:57.000000000 +0400
+++ OLD/kernel/workqueue.c 2007-06-02 16:18:02.000000000 +0400
@@ -49,6 +49,7 @@ struct cpu_workqueue_struct {
struct task_struct *thread;
int run_depth; /* Detect run_workqueue() recursion depth */
+ int jobs;
} ____cacheline_aligned;
/*
@@ -253,6 +254,7 @@ static void run_workqueue(struct cpu_wor
cwq->current_work = work;
list_del_init(cwq->worklist.next);
+ cwq->jobs++;
spin_unlock_irq(&cwq->lock);
BUG_ON(get_wq_data(work) != cwq);
@@ -328,7 +330,48 @@ static void insert_wq_barrier(struct cpu
insert_work(cwq, &barr->work, tail);
}
-static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
+extern struct task_struct *get_cpu_curr(int cpu);
+
+static void flush_wait(struct cpu_workqueue_struct *cwq, int cpu, struct completion *done)
+{
+ struct task_struct *curr;
+ struct work_struct *work;
+ int old_pid, jobs;
+
+ if (is_single_threaded(cwq->wq))
+ cpu = raw_smp_processor_id();
+
+again:
+ work = cwq->current_work;
+ jobs = cwq->jobs;
+
+ curr = get_cpu_curr(cpu);
+ old_pid = curr->pid;
+ put_task_struct(curr);
+
+ if ...
Jun 2 16:36:11 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:36:11 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:36:11 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:36:11 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:36:11 harley kernel: ----
Jun 2 16:36:11 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:36:41 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:36:41 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:36:41 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:36:41 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:36:41 harley kernel: ----
Jun 2 16:36:41 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:37:11 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:37:11 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:37:11 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:37:11 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:37:11 harley kernel: ----
Jun 2 16:37:11 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:37:41 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:37:41 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:37:41 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:37:41 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:37:41 harley kernel: ----
Jun 2 16:37:41 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:37:51 harley kernel: RTOM: In int handler for 12 usec.
Jun 2 16:38:11 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:38:11 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:38:11 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:38:11 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:38:11 harley kernel: ----
Jun 2 16:38:11 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:38:41 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:38:41 harley kernel: CURR: 7974 7974 vrsx 93 ...
As expected. Note that ->nivcsw/->nvcsw doesn't change. There is no "spare time" on CPU 1, "vrsx" monopolizes the CPU. events/1->cache_reap() was preempted by vrsx, and it has had no chance to run since then. Note that jobs == 7974 doesn't change either. I forgot to print cwq->thread->state, but it should be TASK_RUNNING. It would not be possible to kill vrsx if cache_reap() stalled. I don't think this is a kernel problem, vrsx breaks flush_workqueue(). Ingo can answer authoritatively, but I think SCHED_RR/SCHED_FIFO were not designed to be 100% cpu-bound. That said, I think it makes sense to get rid of flush_scheduled_work() in floppy.c. Thanks! Oleg. -
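The difference between "almost always busy" and "100% cpu-bound" can be shown with a toy single-CPU model. This is a sketch under stated assumptions, not scheduler code: time is counted in abstract ticks, the RT task always wins the CPU while it is runnable, and ticks_until_done is a hypothetical name.

```c
#include <assert.h>

/*
 * Toy single-CPU model of strict priority scheduling. An RT task is
 * runnable for the first `busy` ticks of every `period`; a lower-priority
 * worker needs `work` ticks of CPU time to finish. Returns the number of
 * ticks elapsed when the worker finishes, or -1 if it has not finished
 * within `limit` ticks.
 */
int ticks_until_done(int busy, int period, int work, int limit)
{
    assert(period > 0 && busy >= 0 && busy <= period && work > 0);

    for (int t = 0; t < limit; t++) {
        if (t % period < busy)
            continue;       /* RT task runnable: it always gets the CPU */
        if (--work == 0)
            return t + 1;   /* worker just received its last tick */
    }
    return -1;
}
```

With busy equal to period the worker never runs, however large limit is, which corresponds to the unchanged counters in the log above. If the RT task sleeps for even one tick per period, the worker finishes eventually, only slowly.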
Oleg, thanks for your time in diagnosing this. As far as a 100% CPU bound task being a valid thing to do, it has been done for many years on SMP machines. Any kernel limitation on this surely must be considered a bug? Thanks again Regards Mark -
Could someone authoritatively comment on this? Is a SCHED_RR/SCHED_FIFO 100% Cpu bound process supported in an SMP env on Linux? (vanilla or -rt) Thanks and Regards Mark -
It will kill the kernel, sorry. The only way in which we can fix that is to allow kernel threads to preempt rt-priority userspace threads. But if we were to do that (to benefit the few) it would cause _all_ people's rt-prio processes to experience glitches due to kernel activity, which we believe to be worse. So we're between a rock and a hard place here. If we really did want to solve this then I guess the kernel would need some new code to detect a 100%-busy rt-prio process and to then start permitting preemption of it for kernel thread activity. That detector would need to be smart enough to detect a number of 100%-busy rt-prio processes which are yielding to each other, and one rt-prio process which keeps forking others, etc. It might get tricky. -
The usual alternative is to manually chrt the relevant kernel threads to RT priority and adjust the priority scheme of their processes appropriately. -- Mathematics is the supreme nostalgia of our time. -
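What chrt does for a single task can be sketched with sched_setscheduler(). This is a minimal sketch, assuming Linux; set_fifo_priority is a hypothetical name, and the pid of the relevant events/N thread has to be looked up separately (for example with ps). Which priority to choose relative to the application is left open here, as it is in the message above.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

/*
 * Give task `pid` (0 means the caller) SCHED_FIFO at priority `prio`.
 * Returns 0 on success or a negative errno value.
 */
int set_fifo_priority(pid_t pid, int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);

    assert(lo <= hi);
    if (prio < lo || prio > hi)
        return -EINVAL;     /* outside the valid SCHED_FIFO range */
    if (sched_setscheduler(pid, SCHED_FIFO, &sp) != 0)
        return -errno;      /* e.g. -EPERM without privilege, -ESRCH for a bad pid */
    return 0;
}
```

Changing the policy of another task normally requires root or CAP_SYS_NICE, so an unprivileged caller should expect -EPERM for a valid priority.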
Could flush_scheduled_work() not just follow the affinity mask of the task that caused the call to begin with? If the calling task had a cpu-mask of 3 then flush_scheduled_work() would do the events/0 and events/1 thing, and if the calling task had an affinity mask of 1 then only events/0 would be done? In other words, changing what Oleg says above just slightly: flush_workqueue() blocks until any scheduled work on any CPU in the calling task's affinity mask has run to completion? Thanks Mark -
The kernel's internal event API doesn't track any of this stuff and it's not clear we'd want it to. It'd be a bit simpler perhaps to simply allow SIGSTOPing events/0. This might even work today from userspace. In general, it's considered a mistake to mark CPU hogs as RT precisely because they present a starvation risk to everything else in the system, not just kernel threads. We could add kernel infrastructure to make events survive this sort of thing, but that will very likely just expose another kernel or userspace livelock. -- Mathematics is the supreme nostalgia of our time. -
In general maybe, but we are not really talking in general terms here. For everything other than kernel stuff, I (userland) would assume responsibility. That is why I force all other userland tasks that I want able to have 7 RT-HOGS running on it while all other processes are on the 8th. Yet in Linux that I can't have even one because it breaks the kernel. I'm sorry, this does not sound right to me. To me and people in my world, this is clearly a kernel deficiency. Regards Mark -
Sorry for delay, No, we can't do this, this makes flush_workqueue() meaningless. Even if we could, this can't help. Suppose that a kernel thread takes some global lock (for example, in our case cache_reap() takes cache_chain_mutex) and then it is preempted by RT task which doesn't relinquish CPU. So this problem is "wider", flush_workqueue() was just a random victim. Oleg. -
