> On 2008.05.02 17:05:41 +0200, Petr Tesarik wrote:
> > Hello,
> >
> > there seems to be a possible livelock in the CPU timers (actually, we've
> > experienced it once already). The analysis lead to the conclusion that
> > fixing this properly will be a bit harder.
> >
> > Problem Statement
> >
> > 1. Any process can set ITIMER_PROF as short as 1 timer tick
> > 2. If the process is CPU-bound (e.g. a number-crunching application),
> > the timer will expire with every timer tick
> > 3. If the process has N threads and each thread runs on its own
> > CPU, the timer will expire N times per timer tick
> >
> > Now, this can become quite interesting on a larger system. For instance,
> > with 128 CPUs and a (conservative) setting of HZ 300, any process can
> > effectively ask to be sent a SIGPROF every 26 usec (1/300/128 s). Pretty
> > fast, but it can get worse if you consider the pitfalls of a NUMA
> > architecture.
> >
> > Whenever the timer expires, a new expiration time must be computed for
> > each thread in the thread group. The values are stored in task_struct of
> > each thread, and IIRC those are kept local to the CPU where the thread
> > is running (quite logically). Put in another way, walking all threads
> > means touching remote memory except for the few task_struct which are
> > local to the recomputing CPU. In the 128-CPU example, if each node has 2
> > CPUs (think Altix), only 2 accesses are to the local node and 126
> > accesses are to remote memory. These are already quite expensive, but it
> > gets even worse. The timer interrupt is generated locally for each CPU,
> > so if the process is spread over all of them (see above), all CPUs
> > recompute the timer (hint: write access), thus constantly invalidating
> > local caches of the remote memory.
> >
> > And those computations cannot run in parallel, of course, because
> > everything is serialized on sighand->siglock (see run_posix_cpu_timers).
> > So, each time one CPU finishes walking the threads, there's already a
> > queue of all the other CPUs waiting to do the same again. For this kind
> > of load, those 26 usec (approx. 41600 CPU cycles on a 1.6 GHz CPU) may
> > be well insufficient. Even more so, if somebody else occasionally takes
> > write lock on tasklist_lock. It may so happen that the system spends all
> > CPU cycles re-computing the profiling timers over and over again and
> > never make any progress. Note this is partly caused by the fact that the
> > time needed to handle the profiling timer is itself also counted to the
> > profiling time.
> >
> > This simply does not scale for large systems (and I was not even
> > considering those really large 1024p ones).
> >
> > Ideas
> >
> > Any ideas for improvement? Obviously, storing per-process expiration
> > times in one place instead of dividing the interval by the number of
> > currently running threads would help here. But it would spoil the fast
> > path (when there are no timers at all). Avoiding that multiple threads
> > re-compute the same values might also help (although I've got no idea
> > how to implement this).
> >
> > Any more comments (before I start hacking something you'll NAK)?
> >
> > BTW did POSIX really mean to allow an indefinitely small interval for
> > the SIGPROF?
> >
> > Petr Tesarik
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to
majordomo@vger.kernel.org
> More majordomo info at
http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at
http://www.tux.org/lkml/