Although Robin probably had broader experience, I think we have both had
opportunity to examine the workloads and configuration of a reasonable
sample of the active (and historical) large (>=512c) shared memory systems.
Some workloads and configurations are specialized, and perhaps less
stressing than the mixed, volatile loads and array of services most of
these systems are expected to handle, but the specialized loads have been
the exceptions in my experience. That may change as the price/core
continues to go down and pseudo-shared memory systems based on cluster
interconnects become more common and possibly even viable, but don't hold
your breath.
If those get more cycles to my users, I'll start reading the list religiously!
Early in 2.6.28 might work for us. 2.6.27 would be nice. Yes, we'd like a
distribution vendor(s) to pull it. If we ask nicely, the one which matters
to me (and my users) is quite likely to take it if it has been accepted
early in the next cycle. They've been very good about that sort of thing
(for which I'm very thankful). So while it's extra administrivia, I'm not
the one who has to fill out the forms and write up the justification ;-)
But the opposite question: Does the patch proposed have significant risk or
drawbacks? We know it offers a minor but noticeable performance
improvement for at least some of the small set of systems it affects. Is
it an unreasonable risk for other systems - or is there a known group of
systems it would affect which would not benefit, or which it might even
harm? Would a revision of it be acceptable, and if so, (based on answers
to the prior questions) what criteria should a revision meet, and what time
frame should we target?
It was not a high priority, and I didn't push on it until after the trouble
with proc_pid_readdir was resolved (and the fix floated downstream to me).
Sorry, but it was lost in higher priority work, and not something
nagging at me, as I had already made the change on the systems I build for.
I'd say the breakpoint - where increasing the size of the pid hash starts
having a useful return - is more like 512 or 1024. On NUMA boxes (which I
think covers most, if not all, of the large processor count systems), running a
list in the bucket (which more often than not will be remote) can be
expensive, so we'd like to be closer to 1 process / bucket.
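
To put rough numbers on "1 process / bucket", here is a small
back-of-the-envelope sketch. It is plain userspace C, not kernel code; the
function names are made up for illustration, and it assumes processes spread
evenly across buckets, which a real hash will only approximate.

```c
#include <assert.h>
#include <limits.h>

/* Average chain length for a hash table with 2^shift buckets,
 * assuming an even spread of entries (an idealization). */
static double avg_chain_len(unsigned long nr_procs, unsigned int shift)
{
    assert(shift < sizeof(unsigned long) * CHAR_BIT);
    return (double)nr_procs / (double)(1UL << shift);
}

/* Smallest shift that gives at most one process per bucket on average. */
static unsigned int shift_for_one_per_bucket(unsigned long nr_procs)
{
    unsigned int shift = 0;

    while (shift < sizeof(unsigned long) * CHAR_BIT - 1 &&
           (1UL << shift) < nr_procs)
        shift++;
    return shift;
}
```

With a hypothetical 200,000 processes, a 4096-bucket table (shift 12)
averages about 49 entries per bucket, and it takes shift 18 to get down to
one or fewer. Each of those extra list steps is a likely remote reference
on a NUMA box.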
Only once a day? Easy silly season, for having two major distributions
taking a snapshot on 2.6.27... I can see that getting annoying, and it's
an unfortunate follow on effect of how Linux gets delivered to users who
require commercial support and/or 3rd party application certifications for
whatever reason (which unfortunately includes my users)... Developers and
users both need to push the major distributions to offer something
reasonably current - we're both stuck with this silliness until users can
count on new development being delivered in something a bit shorter than
two years...
Caught in the middle, I ask both sides to push on the distributions at
every opportunity! <push push>.
Is it? I hadn't noticed, but I usually only go for the things users are in
my cubicle complaining about, and I'm way downstream, so if it's not a
problem there, I won't notice until I can get some time on a system to play
with something current (within the next week or two, I hope). I can look
then, if you'd like.
Is there a general problem?
The last time we had trouble with the pid infrastructure, I believe it was
the result of a patch leaking through, which, frankly, was quite poor. I
believe its deficiencies have been addressed, and it looks like we now
have a respectable implementation which should serve us well for a while.
There certainly is room for major architectural improvements. Your ideas
for moving from a hash to a radix are a good direction to take, and are
something we should work on as processor counts continue to grow. It is
likely that we stand to gain in both raw cycles consumed as well as memory
consumption - but we're not going to see that tomorrow.
I would think reducing process counts is also a longer term project. I
wouldn't be looking at 2.6.28 for that, but rather 2.6.30 or so. Most
(possibly all) of the worst offenders appear to be using create_workqueue,
which I don't expect will be trivial to change. If someone picked up the
task today, it might be ready for 2.6.29, but we may want more soak time,
as it looks to me like an intrusive change with a high potential for
unexpected consequences.
From where I'm sitting, the current mechanism seems to do reasonably well,
even with very large numbers of processes (hundreds of thousands), provided
that the hash table is large enough to account for increased use. The
immediate barrier to adequate performance on large systems (that is, not
unnecessarily wasting a significant portion of cycles) is the unreasonably
low cap on the size of the hash table: it's an artificial limit, based on
an outdated set of expectations about the sizes of systems. As such, it's
easy to extend the useful life of the current implementation with very
little cost or effort.
A major rework with more efficient resource usage may be a higher priority
for someone looking at higher processor counts with (relatively) tiny
memory sizes. If such people exist, it should not be difficult to take
them into account when sizing the existing pid hash.
That's a short term (tomorrow-ish), very low risk project with immediate
benefit: a small patch with no effect on systems <512c, which grows the pid
hash when it is likely to be beneficial and there is plenty of memory to spare.
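
For illustration only, the shape of such a change might look like the
sketch below. This is not the proposed patch: the function names, the
memory-based scaling factor, and the cap parameters are all assumptions of
mine, written as userspace C so the arithmetic can be checked. Only the
512-processor threshold comes from the discussion above.

```c
#include <assert.h>

/* Threshold from the discussion: smaller systems are left unchanged. */
#define LARGE_SYSTEM_CPUS 512

/* Position of the highest set bit, 1-based; 0 for x == 0.
 * Hypothetical userspace stand-in for a find-last-set helper. */
static unsigned int fls_ul(unsigned long x)
{
    unsigned int n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Pick a hash shift that grows with memory, clamped to [4, cap].
 * The cap is raised only on large processor count systems.
 * The "megabytes * 4" scaling is an assumption of this sketch. */
static unsigned int pid_hash_shift(unsigned long megabytes,
                                   unsigned int nr_cpus,
                                   unsigned int small_cap,
                                   unsigned int large_cap)
{
    unsigned int cap = (nr_cpus >= LARGE_SYSTEM_CPUS) ? large_cap : small_cap;
    unsigned int shift = fls_ul(megabytes * 4);

    if (shift < 4)
        shift = 4;
    if (shift > cap)
        shift = cap;
    return shift;
}
```

With example caps of 12 and 16, a 1024-processor box with 1 TB of memory
would get shift 16, while a 256-processor box with the same memory stays at
12, exactly where it would be without the change.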
I'd really like to see an increased limit to the size of the pid hash in
the near term. If we can reduce process counts, we might revisit the
sizing. Better would be to start work on a more resource efficient
implementation to eliminate it before we have to revisit it. Ideal would
be to move ahead with all three. I don't see any (sensible) reason for any
of these steps to be mutually exclusive.
--
Stephen Champion Silicon Graphics Site Team
schamp@(sgi.com|nas.nasa.gov) NASA Advanced Supercomputing
--