[PATCH] Replace completions with semaphores

Previous thread: Linux 2.6.25-rc9 by Linus Torvalds on Friday, April 11, 2008 - 4:51 pm. (2 messages)

Next thread: get a still sysrq-t call trace, not a moving one by Gary Shi on Friday, April 11, 2008 - 5:43 pm. (1 message)
To: <linux-kernel@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Friday, April 11, 2008 - 5:00 pm

A long long time ago when dinosaurs roamed the earth and tech
company stock valuations were high (2001), the primitive
of the completion was introduced to save us from a problem
with semaphores. Those interested in the details can examine
http://www.ussg.iu.edu/hypermail/linux/kernel/0107.3/0674.html and
related messages, but everyone else can take my word for it that, with
the generic semaphores I hope will be merged in 2.6.26, semaphores no
longer have the problem that caused us to switch to completions.

So does it make sense to retain the completion as a primitive
in Linux? On the one hand, it clearly denotes how one uses it --
that it's initially 'locked' and becomes 'unlocked' later. On the
other hand, it is Yet Another Thing to be maintained; I had to
add wait_for_completion_killable(), probably there needs to be a
wait_for_completion_killable_timeout() too.

I have a cheesy hack to remove them without causing too much disturbance.
Obviously, I would never suggest this for inclusion -- #define completion
semaphore should be enough to raise anyone's eyebrows -- but it's a
starting point to decide whether or not to get rid of completions or
retain them in some way while basing their implementation on semaphores,
or leave them the hell alone.

I'm a little puzzled at the numbers I get for replacing completions with
semaphores. The individual files change very little in size:

text data bss dec hex filename
1280 0 0 1280 500 kernel/semaphore-before.o
1656 0 0 1656 678 kernel/semaphore.o
36816 8270 376 45462 b196 kernel/sched-before.o
35938 8270 392 44600 ae38 kernel/sched.o
369958 678538 4287224 5335720 516aa8 kernel/built-in-before.o
371154 678538 4287288 5336980 516f94 kernel/built-in.o
7388295 1401040 4712200 13501535 ce045f vmlinux-before
7380903 1401040 4712000 13493943 cde6b7 vmlinux

So despite a net gain of 500 bytes when comparing semaphore.o and sch...

To: Matthew Wilcox <matthew@...>
Cc: <linux-kernel@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Saturday, April 12, 2008 - 8:24 am

My take on it is the opposite: kill off semaphores and leave the
completions around.

--

To: Peter Zijlstra <peterz@...>
Cc: Matthew Wilcox <matthew@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Sunday, April 13, 2008 - 3:05 am

exactly. I first misread the subject line because i parsed it as
"replace semaphores with completions" and i thought "cool", then did i
realize that the patch does it the other way around. Converting
completions to semaphores just makes no sense.

Besides all your arguments later in this thread about how problematic
locking primitives semaphores are because they are so vague (which are
all correct), there's also another aspect: completions are faster a bit
in theory, because they know that they will schedule most of the time -
while semaphores assume that they will _not_ schedule. (And that's
exactly because the intent of the developer when using a completion is
crystal clear.)

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Peter Zijlstra <peterz@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Sunday, April 13, 2008 - 8:52 am

In practice though, the current implementation is slower. Of course,
that's fixable, and I strongly suspect that the current users of
completions simply don't care about speed -- the normal use of
completions is in startup and shutdown paths where a millisecond extra
isn't going to be noticable.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 12:54 pm

I'd be surprised if it was in that range, milisecond range would
definetely mean that something was completely bused in that area :-).
IOW, I'd be surprised if you can measure much of a difference, even in
microbencmarks.

And performance does matter somewhat, it's not ONLY used in
startup/shutdown scenarios. The block layer uses it for sync requests,
for instance.

--
Jens Axboe

--

To: Matthew Wilcox <matthew@...>
Cc: Peter Zijlstra <peterz@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 11:41 am

completions and semaphores act in the sub-microsecond range, not in the
milliseconds range.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Peter Zijlstra <peterz@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 1:46 pm

I've clearly misled both you and Jens, for which I apologise. My point
was "even if they were a millisecond slower, nobody would notice",
rather than "I've measured it and they're a millisecond slower" or even
"from eyeballing it, I estimate they're about a millisecond slower".

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Peter Zijlstra <peterz@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>
Cc: <linux-kernel@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Saturday, April 12, 2008 - 1:26 pm

OK, so all three of you have got the wrong end of the stick.

My point is that the struct semaphore and the struct completion are
essentially the same data structure, operated on in essentially the same
way and having essentially the same purpose.

It would look bloody odd to write (code taken from megasas_mgmt_ioctl_fw() in
drivers/scsi/megaraid/megaraid_sas.c):

if (wait_for_completion_interruptible(&instance->ioctl_completion)) {
error = -ERESTARTSYS;
goto out_kfree_ioc;
}
error = megasas_mgmt_fw_ioctl(instance, user_ioc, ioc);
complete(&instance->ioctl_sem);

What I'm trying to get a feeling for is whether people find it similarly
odd to use semaphores where we currently use completions. We *used*
to, but I don't find that a compelling reason.

Arnd contacted me off-list and made the very sensible suggestion of:

struct completion {
struct semaphore sem;
}

That lets us eliminate the duplicate code since all the completion
functions become very thin wrappers around semaphore operations.

I'll note that the semaphore code I hae queued for 2.6.26 is slightly
more efficient than the current implementation of completions because
completions use the generic waitqueue code and thus do an indirect
function call per wakeup. Of course, there's no reason completions
couldn't use the same technique as my semaphore code ... but then they
would be identical to semaphores ;-)

So I'm going to redo my patch using Arnd's idea because it allows us
to eliminate code without changing the API at all. We can continue the
discussion about eliminating one or the other API, but my opinion is that
eliminating either is a mistake. The choice of API indicates how the
author thinks about the code, which is crucial for those reading the code.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compar...

To: Matthew Wilcox <matthew@...>
Cc: Peter Zijlstra <peterz@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Saturday, April 12, 2008 - 3:53 pm

> Arnd contacted me off-list and made the very sensible suggestion of:
>
> struct completion {
> struct semaphore sem;
> }
>
> That lets us eliminate the duplicate code since all the completion
> functions become very thin wrappers around semaphore operations.

Just make sure you don't forget the history of completions... As
Linus said long ago (http://lwn.net/2001/0802/a/lt-completions.php3):

In case anybody cares, the race was that Linux semaphores only protect the
accesses _inside_ the semaphore, while the accesses by the semaphores
themselves can "race" in the internal implementation. That helps make an
efficient implementation, but it means that the race was:

cpu #1 cpu #2

DECLARE_MUTEX_LOCKED(sem);
..
down(&sem); up(&sem);
return;
wake_up(&sem.wait) /*BOOM*/
--

To: Roland Dreier <rdreier@...>
Cc: Matthew Wilcox <matthew@...>, Peter Zijlstra <peterz@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Sunday, April 13, 2008 - 9:55 am

Thanks for bringing this back to attention -- I wasn't aware of the
message you cited.

My opinion about the above race is that this race has nothing to do
with the semaphore concept, but that the race is caused by the way in
which the semaphore object is used. Using any object after it has been
destroyed is asking for trouble.

Bart.
--

To: Bart Van Assche <bart.vanassche@...>
Cc: Roland Dreier <rdreier@...>, Peter Zijlstra <peterz@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Sunday, April 13, 2008 - 10:22 am

I think you need to re-read more carefully.

The users of the semaphore were doing nothing wrong. They were not
using the object after it was destroyed.

The i386 implementation of the semaphore was calling wake_up() after
setting the counter to allow cpu #0 to proceed. That was faster for the
common case, but had this problem. completions were careful not to do
that, and the semaphore implementation I wrote doesn't do that either.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Roland Dreier <rdreier@...>
Cc: Peter Zijlstra <peterz@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Saturday, April 12, 2008 - 4:47 pm

Yes, that text appears in the URL I provided in the mail that started
this thread ;-)

The semaphore rewrite I did does not have this problem (it's less
efficient than the hand-optimised assembler, but much more maintainable).
You're supposed to be using mutexes if you want efficiency anyway.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Roland Dreier <rdreier@...>, Peter Zijlstra <peterz@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Sunday, April 13, 2008 - 3:08 am

but semaphores will be _removed_, _completely_. Rewriting them in
generic C code is just the first step towards that - it consolidates all
the myriads of semaphore implementations that Linux has spread out.

your proposed change to change completions to semaphores is totally
backwards and prolongs an API we want to get rid of. Did you miss this
aspect of the mutex rewrite, of the semaphore-to-mutex,
semaphore-to-completions and semaphore-to-rwsem conversions?

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Matthew Wilcox <matthew@...>, Roland Dreier <rdreier@...>, Peter Zijlstra <peterz@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Sunday, April 13, 2008 - 10:55 am

I know the semaphore-to-mutex rewrites were essential a.o. to make
further merging of the real-time tree into the mainstream Linux kernel
possible. But the semaphore concept is more powerful than the mutex
concept: semaphores can be used to let one thread wait for a state
change reported by another thread. Should all state-change-waiting be
implemented via wait_event*() functions ?

One of the strengths of the Linux kernel is that the barrier for new
developers is low: in theory anyone familiar with the C programming
language can start writing kernel drivers. Most people still learn
kernel development via the third edition of the "Linux Device Drivers"
book, and with some luck or some help, they come across an overview of
the 2.6 kernel API changes (http://lwn.net/Articles/2.6-kernel-api/).
The LWN book is getting outdated after all the 2.6 kernel API changes,
and the page with 2.6 kernel API changes was last updated six months
ago. Where can a kernel developer find up to date information about
kernel programming ?

Bart.
--

To: Bart Van Assche <bart.vanassche@...>
Cc: Matthew Wilcox <matthew@...>, Roland Dreier <rdreier@...>, Peter Zijlstra <peterz@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>
Date: Monday, April 14, 2008 - 1:12 pm

The failure to update the API changes page is just me not managing to
get around to it. I'll do my best to take care of that in the next few
days. Apologies for that.

Updating LDD (which isn't really "the LWN book" though it's hosted here)
will take a little longer. I'd like to find a way to produce an LDD4
with quality at least as good as LDD3, but which doesn't fill the world
with immediately-obsolete bricks of dead trees. Still working on it...

jon
--

To: Jonathan Corbet <corbet@...>
Cc: Bart Van Assche <bart.vanassche@...>, Matthew Wilcox <matthew@...>, Roland Dreier <rdreier@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>
Date: Monday, April 14, 2008 - 1:33 pm

Books should only be used to obtain the general picture, any details
will be instantly-obsolete, esp at the pace Linux changes.

Most of the concepts from LDD3 are still valid, many of the details are
dead wrong.

Can't we make LDD4 a high level book, explcitly mentioning how people
should go about obtaining details? Like go ask on #kernelnewbies and the
sorts.

The thing I always tell #kernelnewbies people is to look at a related
driver (of course that kite doesn't always fly). Another good way to
learn stuff is to just read the implementation.

A 'trick' that is often useful is to look in git to see how something
was changed, provided you knew how to do it some time in the past.

--

To: Peter Zijlstra <peterz@...>
Cc: Jonathan Corbet <corbet@...>, Matthew Wilcox <matthew@...>, Roland Dreier <rdreier@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>
Date: Monday, April 14, 2008 - 2:38 pm

What I remember from the first time I wrote a network driver is that
reading the chapter in LDD3 about network drivers was the fastest way
to get started. Learning how to write a network driver by reverse
engineering existing drivers would have required much more time. My
opinion is that mentioning all relevant details in a book makes sense.

Bart.
--

To: Ingo Molnar <mingo@...>
Cc: Roland Dreier <rdreier@...>, Peter Zijlstra <peterz@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Sunday, April 13, 2008 - 8:57 am

I think you're dreaming. We have 11 places in the tree that currently
demand to use a counting semaphore (with 'n' plausibly greater than 1).
If we remove the facility, these places are going to invent their own

Yes -- because it wasn't raised at the time for semaphore-to-completions.
It was raised for semaphore-to-mutex, of course. And I agree that we
/want/ to be rid of counting semaphores, but I don't see a way to do it.
I do believe tht we should be converting all the semaphore users to
mutexes that we can (eg scx200_wdt.c), and converting semaphore users to
completions where that makes sense (eg libusual.c). But there's still
the other dozen places to support.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Roland Dreier <rdreier@...>, Peter Zijlstra <peterz@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 11:39 am

which ones exactly are these places that demand the use of a counting
semaphore? I cannot think of a single place where it's the best choice,
let alone one where it's the only choice.

And if there's just 5 isolated places in the kernel that would need a
different locking primitive then i say that it's probably not worth it.
Locking is at the center of every data structure and the clarity and
simplicity of basic data structure operations is paramount IMHO.

Semaphores are i think a relic from the days when people thought that
parallelism can be best achieved via complexity and rafinery of locking
primitives - regardless of how usable and debuggable those APIs end up
to be. Today we can go extremely parallel with very simple and robust

i'm not worried about weird locking anymore - lockdep tends to expose
them rather nicely and the lack of lockdep coverage for open-coded
locking schemes tends to be a barrier as well. The past 2 years have
shown that people want lockdep coverage for just about everything where
it can be applied: workqueues, special locks, etc. (there's even
questions about whether it could be extended to cover user-space locks.)

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Matthew Wilcox <matthew@...>, Peter Zijlstra <peterz@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 11:58 am

> which ones exactly are these places that demand the use of a counting
> semaphore? I cannot think of a single place where it's the best choice,
> let alone one where it's the only choice.

Two of the places that use semaphores are drivers/infiniband/hw/mthca
and drivers/net/mlx4 -- in both cases, the device firmware allows up to
"N" outstanding firmware commands to be in flight, and the driver uses a
semaphore to handle issuing firmware commands. That is, down() when we
want to issue a command, and up() when the firmware responds that the
command is complete.

What would you suggest as a better way to code this? This is an honest
question -- there probably is a more elegant way to handle this
situation and I really would like to learn about it.

Also, the argument that removing semaphores makes the kernel as a whole
better does make sense to me; I wouldn't be opposed to basically
open-coding semaphores in terms of wait_event() in the driver or
something like that, but I wouldn't say that such an implementation is
locally more readable or maintainable if we look only at the driver
code.

- R.
--

To: Roland Dreier <rdreier@...>
Cc: Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 12:32 pm

Yeah, I would open code it. But this is indeed a sane usage of the
counting semaphore because there is no priority inversion.

--

To: Peter Zijlstra <peterz@...>
Cc: Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 1:46 pm

But when you open code that, how is it different from just having
semaphores?

-Andi
--

To: Andi Kleen <andi@...>
Cc: Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 1:54 pm

Because we can then eventually get rid of semaphores, so those people
cannot mistakenly use them. Its just too easy to create prio inversion
with them around.

--

To: Peter Zijlstra <peterz@...>
Cc: Andi Kleen <andi@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 3:16 pm

And it'll be far harder to even find the priority inversion problems in
the open coded alternatives.
--

To: Peter Zijlstra <peterz@...>
Cc: Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 3:16 pm

But then you end up with lots of likely subtly buggy open coded
implementations. Also not good.

For me it sounds like you just want to rename semaphores to some other
name that people don't use them for normal locking?

-Andi

--

To: Andi Kleen <andi@...>
Cc: Peter Zijlstra <peterz@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, April 15, 2008 - 2:18 am

Would it really be a good idea to give a synchronization concept that
behaves exactly like a semaphore another name than "semaphore" ? The
semaphore concept is well known and is taught in every computer
science course.

Bart.
--

To: Bart Van Assche <bart.vanassche@...>
Cc: Andi Kleen <andi@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, April 15, 2008 - 2:46 am

Are the ramifications wrt priority inversion taught? Is it made clear
that its hard to validate because there is no clear resource owner?

Afaik, non of these subjects are touched upon in the CS-101 courses and
that is exactly the problem. So you can say they are not well know, they
are just widely misunderstood.

And yes, if there are more hand a very few such users it doesn't make
sense to keep them open coded.

--

To: Peter Zijlstra <peterz@...>
Cc: Andi Kleen <andi@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, April 15, 2008 - 3:17 am

Regarding semaphores and priority inversion: I have never recommended
the use of semaphores over mutexes, all I recommended is to keep the
name "semaphore" for something that behaves like a semaphore. There
might be better ways to discourage the use of the semaphore API, e.g.
letting the compiler print a warning every time a semaphore function
is called unless one or another #define has been enabled.

Regarding priority inheritance: does the above mean that you consider
priority inheritance as an optimal solution for realizing real-time
behavior in the kernel ? Are you aware of the fundamental problems
associated with priority inheritance ? These issues are well explained
in Victor Yodaiken's paper "Against priority inheritance". See also
http://www.linuxdevices.com/files/misc/yodaiken-july02.pdf .

Bart.
--

To: Bart Van Assche <bart.vanassche@...>
Cc: Andi Kleen <andi@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, April 15, 2008 - 4:44 am

That sounds horrible; I really prefer targeted replacements like

Priority inheritance isn't ideal, but comming from a general purpose
kernel that wasn't build from scratch to accomodate hard realtime its
basically the only option.

Also things like lockdep are a real help to a lot of developers - loads
of locking bugs never make it into the kernel because of it.

--

To: Peter Zijlstra <peterz@...>
Cc: Bart Van Assche <bart.vanassche@...>, Andi Kleen <andi@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Tuesday, April 15, 2008 - 12:09 pm

I agree. I'd rather _keep_ the completions, and get rid of the semaphores,
so I think the whole fundamental patch here under discussion is broken.

We should do:
- continue replacing semaphores that are used purely for mutual exclusion
with mutexes.
- probably add support for completions to do counting
- and then eventually just do the _reverse_ of what the current patch
does, namely replace the existing counting semaphore usage with
completions.

Hmm?

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Tuesday, April 15, 2008 - 12:27 pm

But that's just a semaphore, isn't it?

-Andi
--

To: Andi Kleen <andi@...>
Cc: Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Tuesday, April 15, 2008 - 12:57 pm

Exactly. But the point here is:

- nobody should use semaphores anyway (use mutexes)
- making *more* code use semaphores is wrong
- completions have a different _mental_ model

IOW, this is not about implementation issues. It's about how you think
about the operations.

We should _not_ implement completions as semaphores, simply because we
want to get *rid* of semaphores some day.

So rather than this long and involved patch series that first makes
semaphores generic, and then makes them be used as completions, I'd much
rather just skip this whole pointless exercise entirely.

Why have "generic semaphores" at all, if we want to get rid of them?

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Tuesday, April 15, 2008 - 1:15 pm

For normal locks. But if you have N number of outstanding
events you need to wait for the semaphore is the right primitive.

And it seems there is a not high but non trivial number

Ok so you just want to rename it.

Fine for me. I always found up() and down() unintuitive anyways

Because we still "counted completions" for some things and that's the
same code?

Rather i suspect the real problem is not the name, but just not sure
it gets abused. That is largely more a review problem and as far as I
can figure out basically all the usual reviewers take care of that
anyways. But renaming it also probably wouldn't hurt.

[IMHO I always thought we should have a maintained single "list of
things for reviewers to watch out for" list somewhere]

-Andi

--

To: Andi Kleen <andi@...>
Cc: Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Tuesday, April 15, 2008 - 1:26 pm

Right. Except if you were to just use completions instead (which could be
trivially extended to do that).

MUCH more trivial than this complex series.

(You may think that the "Replace completions with semaphores" patch is not
very complicated, but it *is* - it depends very intimately on the big
patch-series that basically turns semaphores into what completions are
now!)

In other words, what makes me not like this is hat we first turn
semaphores into the generic code (which is largely what completions were:
just a special case of the generic semaphores!) and then turns completions
into these things. That just doesn't make any sense to me!

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Tuesday, April 15, 2008 - 1:41 pm

complex? big? The big bits are dealing with renaming asm/semaphore.h to
linux/semaphore.h, and I've dropped those out now. There's a couple of
up-front patches which add inclusions of asm/semaphore.h to files which
were missing it. Then I add the new semaphore implementation, delete

Blame me for not realising that completions were semaphores under a
different name.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Tuesday, April 15, 2008 - 2:14 pm

The origin of completions is literally the semaphore code - just
simplified to use spinlocks and be usable as just a mutex. We used to use
semaphores, and because of the subtle race with lockless semaphores I
wrote that stupid completion code as a "generic semaphore with a very
specific usage schenario" and called them "completions".

The completions _could_ have been extended/used as mutex semaphores, but
the difference was really the mental model for them. That then limited the
implementation of them: the functions working on completions are defined
on purpose to be limited - it doesn't really have "up()" and "down()"
functions: "complete()" is really a up(), but "wait_for_completion()" is
more like a "wait_until_I_could_do_a_trydown()" function.

Would it make sense to use completions for countable events too? Yeah. In
fact, we have some things that really would like to do counting, both in
the sense of "wait for <n> events to all complete" _and_ in the sense of
"allow up to <n> events to be outstanding". Both of which could be done as
a counting function (just make "complete" increment the counter, and then
make "wait for <n> events" initialize it to negative, while "allow <n>
outstanding events" would be a positive counter, and make
"wait_for_completion()" basically be a "decrement and wait until it
is zero".

IOW, completions() really follow the same patterns as semaphores, and it
*does* make sense to just have one single code-base. But if we want to
make semaphores go away, I think that it would be better to implement
semaphores in terms of "extended completions" rather than the other way
around. That way, we could one day really get rid of semaphores entirely.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Matthew Wilcox <matthew@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 12:07 pm

Hmm, why not introduce a provide_slots(),
wait_for_slot() and free_slot() API?

provide_slots() := initialize a resource with N available slots
wait_for_slot() := wait for a slot in countable resource to be free
free_slot() := make a slot available

That would:
- fit the use case nicely
- could be optimized for countable resources
with upper limits
- allow to measure slot utilization
- ...
- is not called semaphore :-)

And isn't this the same problem (called scheduling) as:
- keeping N cpus busy
- keeping N (disk) spindles busy
- keeping N nic transmit queues busy
- ...

???

But maybe I oversimplify here :-)

Best Regards

Ingo Oeser
--

To: Ingo Oeser <ioe-lkml@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 12:16 pm

The kcounter API I'm working on has something in common with this. But
it adds the more important feature:

- Is debuggable

I've also researched the current users of semaphores and completions and
the API has to be extended a few different ways to take account of some
reasonable usage patterns.

The basic idea is that you get back a cookie from the kcounter_claim()
which you have to hand to the kcounter_release() function so it
knows which one you released. It's similar to mutex debugging except
that mutexes can use the thread_info pointer as an implicit cookie.
Completions and semaphores have usage patterns where the resource is
released from interrupt context.

I'm not finished yet, but this is the current API:

extern void kcounter_init(struct kcounter *kc, unsigned int count);

extern long __must_check kcounter_claim(struct kcounter *kc);
extern long __must_check kcounter_claim_interruptible(struct kcounter *kc);
extern long __must_check kcounter_claim_timeout(struct kcounter *kc,
long jiffies);
extern long __must_check kcounter_claim_interruptible_timeout(
struct kcounter *kc, long jiffies);
extern long __must_check kcounter_try_claim(struct kcounter *kc);
extern int __must_check kcounter_has_resources(struct kcounter *kc);
extern void kcounter_release(struct kcounter *kc, long resource);

#ifdef CONFIG_DEBUG_KCOUNTER
extern void kcounter_add_resource(struct kcounter *kc);
extern int __must_check kcounter_remove_resource(struct kcounter *kc);
#else
#define kcounter_add_resource(kc) kcounter_release(kc, 0)
#define kcounter_remove_resource(kc) kcounter_claim(kc)
#endif
extern void kcounter_add_all_resources(struct kcounter *kc);

The kcounter_add_resource()/kcounter_remove_resource() API is for where
we're actually adding new slots rather than releasing an already-acquired
slot. Without debugging they're the same thing, but conceptually,

It's not nearly as complex. Each o...

To: Matthew Wilcox <matthew@...>
Cc: Ingo Oeser <ioe-lkml@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 12:31 pm

So in addition to the kcounter we need to save a token in a data structure?
In fact, there must be a data structure that can house that token. So you
can no longer live with a pointer just to a device descriptor, but every
individual use of a resource must have an associated data structure?

Regards
Oliver

--

To: Oliver Neukum <oliver@...>
Cc: Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 12:50 pm

On Wed, 16 Apr 2008 18:31:08 +0200

yup. For kcounters there is a clear owner for each "slot".
[This is the part that makes it debugable again)

Now for non-debug builds, the space taken for the token can be.. zero

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--

To: Arjan van de Ven <arjan@...>
Cc: Oliver Neukum <oliver@...>, Ingo Oeser <ioe-lkml@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 12:58 pm

Hm. That might be worth changing the API for. Right now, I'm
overloading the return value of kcounter_claim with an errno and a
cookie. Maybe we should go for:

extern int __must_check kcounter_claim(struct kcounter *kc, kcounter_cookie_t *resource);

with the return value being 0 or errno and on success, kcounter_claim
writes a cookie to 'resource'?

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Oliver Neukum <oliver@...>, Ingo Oeser <ioe-lkml@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 1:08 pm

On Wed, 16 Apr 2008 10:58:11 -0600

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--

To: Arjan van de Ven <arjan@...>
Cc: Oliver Neukum <oliver@...>, Ingo Oeser <ioe-lkml@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 2:10 pm

OK, here's the first version that even compiles. It's not completed --
a few functions need to be written, and I certainly haven't tried
to convert anything to use it. Also, documentation would be good.
But it's useful for people to take pot-shots at the API and suggest
other things that are worth debugging checks.

One thing I want to add is a 'check if we can free this kcounter'.
ie are there any outstanding resources claimed. And I'm not checking
for a double-release of a resource yet (trivial to add, but I need to
grab lunch).

commit 4496d41a540897039138a7d67dbef51f96dd0b09
Author: Matthew Wilcox <matthew@wil.cx>
Date: Wed Apr 16 14:04:28 2008 -0400

kcounters

diff --git a/include/linux/kcounter.h b/include/linux/kcounter.h
new file mode 100644
index 0000000..35f83ed
--- /dev/null
+++ b/include/linux/kcounter.h
@@ -0,0 +1,58 @@
+#ifndef __LINUX_KCOUNTER_H
+#define __LINUX_KCOUNTER_H
+/*
+ * include/linux/kcounter.h
+ *
+ * Copyright (c) 2008 Intel Corporation
+ * Author: Matthew Wilcox <willy@linux.intel.com>
+ *
+ * Distributed under the terms of the GNU GPL, version 2
+ *
+ * kcounters model a situation where you have a certain number of resources
+ * and tasks must sleep to acquire a resource if there are none available.
+ * New resources may be added or existing ones removed.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+#ifdef CONFIG_DEBUG_KCOUNTER
+typedef struct { struct kcounter_owner *owner; } kcounter_cookie_t;
+#else
+typedef struct { } kcounter_cookie_t;
+#endif
+
+struct kcounter {
+ spinlock_t lock;
+ unsigned int count;
+ struct list_head wait_list;
+#ifdef CONFIG_DEBUG_KCOUNTER
+ unsigned int size;
+ struct kcounter_owner *owners;
+#endif
+};
+
+extern void kcounter_init(struct kcounter *kc, unsigned int count);
+
+extern int __must_check kcounter_claim(struct kcounter *, kcounter_cookie_t *);
+extern int __must_check kcounter_claim_interruptible(struct kcounter *,
+ kcount...

To: Arjan van de Ven <arjan@...>
Cc: Oliver Neukum <oliver@...>, Ingo Oeser <ioe-lkml@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 1:12 pm

I was considering that, but then we can't rely on __must_check to warn
whether people are actually checking the errno or just storing the
cookie somewhere.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Oliver Neukum <oliver@...>
Cc: Ingo Oeser <ioe-lkml@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 12:34 pm

That's right. Do you have an example where this would be inconvenient?
I couldn't find one. For example, with USB, you could place one in the
struct urb.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Ingo Oeser <ioe-lkml@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 12:42 pm

That's a data structure we really want to shrink. And furthermore, the needs
of the use cases should shape the locking primitives, not the reverse.

Regards
Oliver
--

To: Oliver Neukum <oliver@...>
Cc: Ingo Oeser <ioe-lkml@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 12:44 pm

The cookies aren't checked by the kcounter implementation if
CONFIG_DEBUG_KCOUNTER isn't set. So you can avoid storing them if it's
*really* that important to shrink your data structures.

You'll still have to check the return value from the various
kcounter_claim* APIs for errors, of course.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Oliver Neukum <oliver@...>, Ingo Oeser <ioe-lkml@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Ingo Molnar <mingo@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 12:47 pm

> The cookies aren't checked by the kcounter implementation if
> CONFIG_DEBUG_KCOUNTER isn't set. So you can avoid storing them if it's
> *really* that important to shrink your data structures.

I guess if anyone cared we could have something analogous to
DECLARE_PCI_UNMAP_ADDR() et al that compiles to nothing if
CONFIG_DEBUG_KCOUNTER isn't set. But I'm not sure it's worth it.
Certainly for the mthca/mlx4 cases of semaphore use there's no problem
holding onto a cookie.

- R.
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Tuesday, April 15, 2008 - 1:05 pm

i very much agree with the "get rid of semaphores" argument - the reason
why i initially supported the "move to generic semaphores" step was
because i saw it basically as the precursor to full removal: it is the
removal of semaphores from all architectures - with a small generic
compatibility wrapper to handle the remaining few uses of semaphores.

i got thoroughly surprised by the "increase the scope of semaphores"
angle to the patchset though, and in hindsight i'd rather see neither of
those generalizations and see semaphores die a slow but sure natural
death than to see their prolongation :-/

perhaps the 'remove all semaphores from all architectures' privilege
should be a final step kept for the lucky guy who manages to get rid of
the last remaining semaphore use in the kernel? :)

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Tuesday, April 15, 2008 - 2:50 pm

Hm. I thought you initially supported it because it deleted so much
code. I don't want to go and add down_killable() to each architecture

I'm fully in favour of reducing the number of semaphore users, and
eventually eliminating them. Arjan and I discussed a way to do that
just now ... I'll write some code, see how it looks.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 8:37 am

... the killable sleeps should and are already propagated everywhere - i
never thought of them as a semaphore-only feature.

killable sleeps are probably the next best thing to true
interruptability.

btw., has anyone thought about killable sync/fsync syscalls - would that

cool!

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Matthew Wilcox <matthew@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 8:50 am

killable stat and readdir would be even more important I would say.

When sync/fsync takes too long it's typically just a kernel bug
of some sort (like that long running MM starvation issue that stalls
writes on some kinds of background activity) that should be really just
fixed.

But handling down network servers which hit stat/readdir etc. is
a real situation not explained by a bug.

For example a standard situation that hits me regularly is that
I save something in firefox on a different server and then later
turn that machine off. Then next time I try to save something
in firefox it first blocks forever in stat()ing that down directory.

-Andi

--

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <peterz@...>, Bart Van Assche <bart.vanassche@...>, Roland Dreier <rdreier@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>
Date: Wednesday, April 16, 2008 - 8:59 am

I think that's fixed now. Can you check with 2.6.25-rc1 or later?
If it's unkillable, can you let me know the wchan for that process?

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Peter Zijlstra <peterz@...>
Cc: Andi Kleen <andi@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, April 15, 2008 - 9:15 am

I agree that lockdep is a real help for kernel developers. But please
keep in mind that locking order checking has its limitations. While
locking order checking can detect certain very important classes of
deadlocks, it can't detect all classes of possible deadlocks. Consider
e.g. the example mentioned by Roland Dreier, where semaphores are used
to count the number of elements in a queue. Deadlocks triggered by
waiting for a certain semaphore value can't be detected by
lockdep-style algorithms. And renaming semaphores into something else
won't help.

Bart.
--

To: Peter Zijlstra <peterz@...>
Cc: Andi Kleen <andi@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 2:09 pm

I'm not for open coding anything .. This "writes in flight" type code
happens often enough we should have something that covers it..
completions come fairly close (since they're just like locked
semaphores) only it's not easy to initialize them to allow multiple
completes before the waiting starts..

Daniel

--

To: Peter Zijlstra <peterz@...>
Cc: Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 12:56 pm

On Mon, 14 Apr 2008 18:32:28 +0200

Maybe we need a "counter" primitive instead?
From a conceptual point of view that even makes sense

(the implementation can be pretty much the current semaphore one of course)

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--

To: Arjan van de Ven <arjan@...>
Cc: Peter Zijlstra <peterz@...>, Roland Dreier <rdreier@...>, Ingo Molnar <mingo@...>, Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Monday, April 14, 2008 - 1:50 pm

I'm only too happy to rename semaphore.c to atomic_counter.c and do
appropriate renames. Or maybe 'kcounter' would be more in vogue for a
name ;-) I'd like a name that implies sleeping, but I'm not able to
think of one right now.

Then semaphores and completions each become wrappers around counters
and everybody's happy. Right?

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Saturday, April 12, 2008 - 2:05 pm

Which is exactly the point, semaphores are rarely the right tool for the
job (I'm currently only aware of XFS that actually uses them as actual
semaphores, and the driver model that uses it to get around lockdep -
still need to come up with a good solution for that).

Even worse, semaphores actively harm the development of linux-rt and
arguably linux kernel development as a whole by not being part of
lockdep - so making a mistake will not be noticed until its too late.

Completions OTOH are not the typical exclusion primitive, they are more
a synchronisation primitive - not unlike wait4(). Typically not things
concerned about priority inversion.

--

To: Peter Zijlstra <peterz@...>
Cc: Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Saturday, April 12, 2008 - 3:04 pm

I guess you haven't grepped the tree recently then. We have 11 places
which call sema_init with a number other than 0 or 1 as the second
argument:

./fs/xfs/linux-2.6/sema.h:#define initnsema(sp, val, name) sema_init(sp, val)
./drivers/net/mlx4/cmd.c: sema_init(&priv->cmd.event_sem, priv->cmd.max_cmds);
./drivers/char/viotape.c: sema_init(&reqSem, VIOTAPE_MAXREQ);
./drivers/infiniband/hw/mthca/mthca_cmd.c: sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds);
./drivers/acpi/osl.c: sema_init(sem, initial_units);
./drivers/scsi/megaraid/megaraid_sas.c: sema_init(&instance->ioctl_sem, MEGASAS_INT_CMDS);
./drivers/scsi/megaraid/megaraid_mm.c: sema_init(&adapter->kioc_semaphore, lld_adp->max_kioc);
./drivers/video/omap/hwa742.c: sema_init(&hwa742.req_sema, i - IRQ_REQ_POOL_SIZE);
./drivers/video/omap/blizzard.c: sema_init(&blizzard.req_sema, i - IRQ_REQ_POOL_SIZE);
./drivers/usb/misc/usblcd.c: sema_init(&dev->limit_sem, USB_LCD_CONCURRENT_WRITES);
./drivers/usb/usb-skeleton.c: sema_init(&dev->limit_sem, WRITES_IN_FLIGHT);

It's nice to know that nobody is using __DECLARE_SEMAPHORE_GENERIC or
__SEMAPHORE_INITIALIZER at this time. I'll delete those from my semaphore
tree soon.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Ingo Oeser <ioe-lkml@...>, Daniel Walker <dwalker@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Saturday, April 12, 2008 - 3:16 pm

Luckily those are mostly drivers, but yeah - I haven't actually grepped
the tree I just know the few sem users I've ran into myself.

Still doesn't make it a good idea to have semaphores.

--

To: Matthew Wilcox <matthew@...>
Cc: Peter Zijlstra <peterz@...>, Ingo Oeser <ioe-lkml@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Saturday, April 12, 2008 - 2:01 pm

The above doesn't look all that odd to me. It may be that you've seen

I would just re-write completions keeping the name and API in tact, make
them better and just leave semaphores alone..

Daniel

--

To: Matthew Wilcox <matthew@...>
Cc: <linux-kernel@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Saturday, April 12, 2008 - 2:43 am

semaphores.. The problem with semaphores is the lack of a strict API,
and loose usage. I've seen a lot of "creative" locking in the kernel,
and if we allow that we're just asking for continued maintainability
problems in the code that uses semaphores. At times I've spent hours
trying to figure out what a semaphore is doing, or suppose to be doing.
If we enforce strict usage of semaphores, then we'll basically reproduce
mutex usage, and we have a generic mutex already..

Daniel

--

To: Daniel Walker <dwalker@...>
Cc: Matthew Wilcox <matthew@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Date: Saturday, April 12, 2008 - 6:31 am

Hi Daniel,

Seconded!

Ask any software engineer, to describe what a semaphore is and
how it behaves. You'll get the full spectrum of possible semaphore
implementation options. Ask any advanced software engineer and he'll
ask "What kind of semaphore?".

Same here. Or figuring out what kind of semphore behavior the developer

Or have different API for using the same semaphores in different
creative ways. It might be hard to implement lockdep for that :-)

No, I'm happy that Linux has so many more advanced concurrency concepts,
that the pre-dinosaur locking mechanism (semaphore) is usually
the last on your list.

Best Regards

Ingo Oeser
--

Previous thread: Linux 2.6.25-rc9 by Linus Torvalds on Friday, April 11, 2008 - 4:51 pm. (2 messages)

Next thread: get a still sysrq-t call trace, not a moving one by Gary Shi on Friday, April 11, 2008 - 5:43 pm. (1 message)