Re: [RFC PATCH] Fair low-latency rwlock v5

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: Paul E. McKenney <paulmck@...>
Cc: Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>
Date: Tuesday, August 19, 2008 - 2:04 am

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:

Agreed. I am careful about overdesigning for RT though, given that once
we get it right for the current thread vs softirq vs hardirq mess, the
solution for preemptable critical sections might come more naturally.


See below about RT kernel.


Technical detail : x86_64 cannot pass immediate values > 2^32 to most
instructions; it must load an immediate value in a register, which would
add a few instructions. Therefore, I prefer to use the lowest bits
possible given the NR_CPUS configuration.


Actually, given that softirqs and hardirqs are really just tasks in -RT,
I guess threads could be given the whole 45 bits available for threads,
softirqs and irqs (0 .. 3*log_2(NR_CPUS)-1).

However, in that case, we would also like to have as much preempable
writer critical sections as we have reader critical sections. Therefore,
it would be best to split the 60 bits in two :

30 bits for reader threads
30 bits for writer threads

However, I wonder if the whole algorithm included in this rwlock
primitive to make sure that interrupt latency < softirq latency < thread
latency won't just vanish in -rt kernels. We might want to group those
threads by priority (e.g. grouping irq threads together and maybe
leaving softirq and normal threads in the same bunch), but we would then
impact softirq latency a little bit more.

The problem of this approach wrt RT kernels is that we cannot provide
enough "priority groups" (current irq, softirq and threads in mainline
kernel) for all the subtile priority levels of RT kernels. The more
groups we add, the less threads we allow on the system.

So basically, the relationship between the max number of threads (T) and
the number of reader priorities goes as follow on a 64 bits machine :

T writers subscribed count bits
1 bit for writer mutex

for first priority group :
T reader count bits
(no need of reader exclusion bit because the writer subscribed count
bits and the writer mutex act as exclusion)

for each other priority group :
T reader count bits
1 reader exclusion bit (set by the writer)

We have the inequality :

64 >= (T + 1) + T + (NR_PRIORITIES - 1) * (T + 1)

64 >= (2T + 1) + (NR_PRIORITIES - 1) * (T + 1)
63 - 2T >= (NR_PRIORITIES - 1) * (T + 1)
((63 - 2T) / (T + 1)) + 1 >= NR_PRIORITIES

Therefore :

Thread bits  |  Max number of threads  |  Number of priorities
  31         |    2147483648           |          1
  20         |       1048576           |          2
  15         |         32768           |          3
  12         |          4096           |          4
   9         |           512           |          5
   8         |           256           |          6
   7         |           128           |          7
   6         |            64           |          8
   5         |            32           |          9
   4         |            16           |         10
   3         |             8           |         15

Starting from here, we have more priority groups than threads in the
system, which becomes somewhat pointless... :)

So currently, for the mainline kernel, I chose 3 priority levels thread,
softirq, irq), which gives me 32768 max CPU in the system because I
choose to disable preemption. However, we can think of ways to tune that
in the direction we prefer. We could also hybrid those : having more
bits for some groups which have preemptable threads (for which we need
a max. of nr. threads) and less bits for other groups where preemption
is disabled (where we only need enough bits to cound NR_CPUS)

Ideas are welcome...



I thought it would have been sane to make is a power of two, but you are
right.. I should use the neat "roundup_pow_of_two()" :)


gcc is smart nowadays, it gets inlined automatically when it is small
enough and just called once.


Good point, will fix. The atomic_long_cmpxchg executed before we exit
the busy loop takes care of ordering.


Yep, I've done that in my fast/slowpath rewrite. Will be posted soonish.


Yep.


What would make sure that we have correct read ordering between
rwlock->value and the data we are trying to protect then ?

Given that we are a writer, we might argue that we are not interested in
ordering memory reads, only writes.

Yeah, given that all we are interested to know is when a reader has
exited its critical section (we are not actually interested in reading
any other data that this reader might have written for us), it should be
ok to just use a barrier().


Actually, I changed it quite a bit in the new version. I merged the
subscription with the first "trylock" so we can do the two in a single
operation, which either gets the critical section quickly, or
subscribes.


Yes, but it's more than that. I'll do a table resuming what is allowed
by reader contexts :

IRQ reader context can access the reader lock when :

- Lock not contended
- Reader threads are holding the lock
- Writers are subscribed
- A writer is holding the WRITE_MUTEX (which excludes other writers)
- A writer is locking out softirqs.

SoftIRQ reader context can access the reader lock when :

- Lock not contended
- Reader threads are holding the lock
- Writers are subscribed
- A writer is holding the WRITE_MUTEX (which excludes other writers)

Thread reader context can access the reader lock when :

- Lock not contended
- Reader threads are holding the lock

Note that the fact that "Writers are subscribed" is not present for the
thread reader condition to obtain the lock implies that we favor writers
against thread readers. This is however not the case for IRQ and
Softirq.


In fair_write_trylock_irq_uncontended(), which becomes the fastpath in
the new patch version, I disable softirqs and irqs, and try to get the
lock in one go if not contended. If that fails, then I start to lock-out
execution contexts from the lock, one at a time.

I am not sure that I understand your comment completely though.



Right.


If interrupts would be disabled before calling :


It would be busy-looping, waiting for softirqs to exit their read-side
critical section, while interrupts are disabled. Given that those
softirqs can run for longer than the duration for which we would like to
disable interrupts (the ciritical section itself can be long, but
softirqs could also be interrupted, and more than once), it's better to
let softirqs exit their critical sections before disabling irqs.


 :)


Because we do


in fair_write_unlock_irq().

Actually, if we could do, in the slow path :

local_bh_disable();
wait for bh to exit
local_irq_disable();
local_bh_enable();
wait for irq to exit

We could then remove the local_bh_enable() from the unlock and therefore
remove the local_bh_disable/enable pair from the fast path.

However, is it legal to reenable bottom-halves when interrupts are off,
or is there some check done like the preempt_check_resched() code which
forbids that ?

Considering what I see in _local_bh_enable_ip(), it seems wrong :

  WARN_ON_ONCE(in_irq() || irqs_disabled());

Ah, and there is a preempt_check_resched(); call at the end of
_local_bh_enable_ip. That explains it. I wonder if there would be some
way around that limitation. Given this comment :

/*
 * Special-case - softirqs can safely be enabled in
 * cond_resched_softirq(), or by __do_softirq(),
 * without processing still-pending softirqs:
 */
void _local_bh_enable(void)

There seems to be a few special-cases, which gives me hope (but no clear
solution). :/



Good idea, I'll add a check in the slowpath, but I'll leave it out of
the fast path to keep it, well, fast.


This fair_write_lock should _only_ be used if there are no bh nor irq
readers for this lock. Therefore we do not disable interrupts. Actually,
I should add a WARN_ON here to be certain it is used properly. A reader
thread which reads this data structure with interrupts disabled is
considered as reading from interrupt context in my semantic (implemented
in the new patch version).

Actually, all my writer slow paths should have the following :

  WARN_ON_ONCE(in_interrupt() || irqs_disabled());

Which checks if bh are disabled, if I am in irq context or in bh context
and checks if irqs are disabled. All these requirements must be met to
be able to correctly wait for contended reader threads without
increasing the irq or softirq latency.


It's updated in the following version, the use case is found in comment
in the code.

Thanks a lot for the review !

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
[PATCH] x86_64 : support atomic ops with 64 bits integer val..., Mathieu Desnoyers, (Sat Aug 16, 3:39 am)
Re: [PATCH] x86_64 : support atomic ops with 64 bits integer..., Mathieu Desnoyers, (Sat Aug 16, 11:43 am)
[RFC PATCH] Fair rwlock, Mathieu Desnoyers, (Sat Aug 16, 5:19 pm)
Re: [RFC PATCH] Fair rwlock, Linus Torvalds, (Sat Aug 16, 5:33 pm)
[RFC PATCH] Fair low-latency rwlock v3, Mathieu Desnoyers, (Sun Aug 17, 3:53 am)
Re: [RFC PATCH] Fair low-latency rwlock v3, Linus Torvalds, (Sun Aug 17, 12:17 pm)
[RFC PATCH] Fair low-latency rwlock v5, Mathieu Desnoyers, (Sun Aug 17, 3:10 pm)
Re: [RFC PATCH] Fair low-latency rwlock v5, Peter Zijlstra, (Mon Aug 25, 3:20 pm)
Re: [RFC PATCH] Fair low-latency rwlock v5, Paul E. McKenney, (Mon Aug 18, 7:25 pm)
Re: [RFC PATCH] Fair low-latency rwlock v5, Mathieu Desnoyers, (Tue Aug 19, 2:04 am)
Re: [RFC PATCH] Fair low-latency rwlock v5, Mathieu Desnoyers, (Tue Aug 19, 3:33 am)
Re: [RFC PATCH] Fair low-latency rwlock v5, Linus Torvalds, (Tue Aug 19, 12:48 pm)
[RFC PATCH] Writer-biased low-latency rwlock v8, Mathieu Desnoyers, (Thu Aug 21, 4:50 pm)
Re: [RFC PATCH] Writer-biased low-latency rwlock v8, Linus Torvalds, (Thu Aug 21, 5:00 pm)
Re: [RFC PATCH] Writer-biased low-latency rwlock v8, H. Peter Anvin, (Thu Aug 21, 5:26 pm)
Re: [RFC PATCH] Writer-biased low-latency rwlock v8, Linus Torvalds, (Thu Aug 21, 5:41 pm)
Re: [RFC PATCH] Writer-biased low-latency rwlock v8, Linus Torvalds, (Thu Aug 21, 5:15 pm)
Re: [RFC PATCH] Writer-biased low-latency rwlock v8, Mathieu Desnoyers, (Sat Aug 23, 1:09 am)
Re: [RFC PATCH] Writer-biased low-latency rwlock v8, Linus Torvalds, (Sat Aug 23, 2:02 pm)
Re: [RFC PATCH] Writer-biased low-latency rwlock v8, Mathieu Desnoyers, (Sat Aug 23, 4:30 pm)
Re: [RFC PATCH] Writer-biased low-latency rwlock v8, Linus Torvalds, (Sat Aug 23, 5:40 pm)
Re: [RFC PATCH] Writer-biased low-latency rwlock v8, Linus Torvalds, (Thu Aug 21, 6:22 pm)
Re: [RFC PATCH] Fair low-latency rwlock v5, Mathieu Desnoyers, (Tue Aug 19, 5:06 am)
Re: [RFC PATCH] Fair low-latency rwlock v5, Linus Torvalds, (Mon Aug 18, 2:59 pm)
Re: [RFC PATCH] Fair low-latency rwlock v5 (updated benchmar..., Mathieu Desnoyers, (Sun Aug 17, 5:30 pm)