In short: "while (flush_cpu_workqueue(cwq))" can livelock because a re-niced
caller can add a new barrier before the lower-priority cwq->thread clears
->current_work.
Yes, and the fix is very simple. In fact cleanup_workqueue_thread() doesn't
need the "while" loop. I did it that way to avoid a subtle dependency with
run_workqueue(), and because I failed to invent a good comment which explains
why it is safe to do flush_cpu_workqueue() once.
In short, if we have another barrier when flush_cpu_workqueue() returns,
cwq->thread must be "inside" run_workqueue() which can't return until
cwq->worklist becomes empty. This means we can do kthread_stop() right now,
kthread_should_stop() won't be checked until run_workqueue() returns.
(Another option is to clear cwq->current_work in wq_barrier_func(), before
complete(). This is possible because nobody can "see" this barrier except
flush_cpu_workqueue()).
I'll re-check my thinking and send a patch tomorrow.
Thanks a lot, Michal.
Oleg.
-