Re: [PATCH] seqlock: serialize against writers

Previous thread: Out of mtrrs by "J.A. Magallón" on Friday, August 29, 2008 - 8:32 am. (3 messages)

Next thread: Re: Out of mtrrs by "J.A. Magallón" on Friday, August 29, 2008 - 9:17 am. (2 messages)
From: Gregory Haskins
Date: Friday, August 29, 2008 - 8:44 am

*Patch submitted for inclusion in PREEMPT_RT 26-rt4.  Applies to 2.6.26.3-rt3*

Hi Ingo, Steven, Thomas,
  Please consider for -rt4.  This fixes a nasty deadlock on my systems under
  heavy load.

-Greg

----

Seqlocks have always advertised that readers do not "block", but this was
never really true.  Readers have always logically blocked at the head of
the critical section under contention with writers, regardless of whether
they were allowed to run code or not.

Recent changes in this space (88a411c07b6fedcfc97b8dc51ae18540bd2beda0)
have turned this into a more explicit blocking operation in mainline.
However, this change highlights a short-coming in -rt because the
normal seqlock_ts are preemptible.  This means that we can potentially
deadlock should a reader spin waiting for a write critical-section to end
while the writer is preempted.

This patch changes the internal implementation to use a rwlock and forces
the readers to serialize with the writers under contention.  This will
have the advantage that -rt seqlocks_t will sleep the reader if deadlock
were imminent, and it will pi-boost the writer to prevent inversion.

This fixes a deadlock discovered under testing where all high prioritiy
readers were hogging the cpus and preventing a writer from releasing the
lock.

Since seqlocks are designed to be used as rarely-write locks, this should
not affect the performance in the fast-path

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/seqlock.h |   76 +++++++++++++++++++++++++----------------------
 1 files changed, 40 insertions(+), 36 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 345d726..7010169 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -3,9 +3,11 @@
 /*
  * Reader/writer consistent mechanism without starving writers. This type of
  * lock for data where the reader wants a consistent set of information
- * and is willing to retry if the information changes.  Readers ...
From: Andi Kleen
Date: Friday, August 29, 2008 - 9:09 am

Does this even work under x86-64? x86-64 uses seqlocks in user space
in its vsyscalls. And read_lock() definitely doesn't work there because 
it writes.

You would need at least to disable vsyscall gettimeofday(), making
it much much slower.

Perhaps you tested on one of the systems where the vsyscalls need
to fallback for other reasons? (e.g. one using pmtimer for timing).

-Andi
--

From: Gregory Haskins
Date: Friday, August 29, 2008 - 9:10 am

Im running it on a x86_64 box as we speak.  How can I tell if there is a
certain mode that is permitting this?

-Greg

From: Andi Kleen
Date: Friday, August 29, 2008 - 9:22 am

If the boot up says you're running with PMtimer then it uses the fallback
(usually happens on pre Fam10h AMD boxes). A typical Intel box
would use the faster ring 3 only TSC path and then explode with your
change I bet. 

Or step with gdb through gettimeofday() and see if it does a syscall.

-Andi
--

From: Gregory Haskins
Date: Friday, August 29, 2008 - 9:26 am

It seems to be running fine with no indication it has fallen back.=20
Perhaps I need a certain workload to bring out the issue?

Here are some details of my system:

-------------------

ghaskins@test:~> uname -a
Linux test 2.6.26.3-rt3-rt #2 SMP PREEMPT RT Fri Aug 29 10:47:17 EDT
2008 x86_64 x86_64 x86_64 GNU/Linux
ghaskins@test:~> cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Xeon(R) CPU            5130  @ 2.00GHz
stepping    : 6
cpu MHz        : 1995.006
cache size    : 4096 KB
physical id    : 0
siblings    : 2
core id        : 0
cpu cores    : 2
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx
tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips    : 3992.49
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

processor    : 1
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Xeon(R) CPU            5130  @ 2.00GHz
stepping    : 6
cpu MHz        : 1995.006
cache size    : 4096 KB
physical id    : 3
siblings    : 2
core id        : 0
cpu cores    : 2
apicid        : 6
initial apicid    : 6
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx
tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips    : 3990.05
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

processor    : 2
vendor_id    : GenuineIntel
cpu family  ...
From: Steven Rostedt
Date: Friday, August 29, 2008 - 9:34 am

Perhaps you never hit the slow path in userland. That's the only place it 
would write. Perhaps add a dummy static variable in the fast path, and 
write to it. See if that crashes you apps.

-- Steve
--

From: Gregory Haskins
Date: Friday, August 29, 2008 - 9:35 am

Yeah, ideas crossed in the mail ;)

I could just force all of the seqbegins to hit the slowpath by hacking
the code and see what happens (aside from slowing down, of course ;)

Question: Which seqlock_t does userspace use?  I assume it uses
seqlock_t and not raw_seqlock_t.  But the only reason that I ask is that
I converted raw_seqlock_t to use the new style as well to be consistent,
even though it is not strictly necessary for the same reasons.  So if
perchance userspace uses the raw variant, I could solve this issue by
only re-working the seqlock_t variant.  Kind of a long shot, but figured
I would mention it :)

-Greg



From: Andi Kleen
Date: Friday, August 29, 2008 - 9:45 am

Only if you don't believe it will really crash? I think it's pretty

There's no raw_seqlock_t anywhere in mainline?


I guess you could define a new seqlock_t which is explicitely user space
safe. That might avoid such issues in the future. But then
that would likely require some code duplication and be ugly.

On the other hand whatever problem you fixing in the kernel
(to be honest it's still unclear to me what the problem is)
needs to be likely fixed for the userland lock too.

-Andi

--

From: Steven Rostedt
Date: Friday, August 29, 2008 - 9:53 am

The subject forgot to add "RT" in the brackets.



I'm not convinced that the raw_seqlocks (mainline normal seqlocks) has a 
problem anyway.

-- Steve

--

From: Gregory Haskins
Date: Friday, August 29, 2008 - 10:00 am

(continuing from IRC)

Agreed.   I converted them to be consistent.  Steve just told me that
userspace actually uses the raw_seqlock_t variant, so the answer is
simple.  Just leave raw_seqlock_t alone and the patch will work fine.

Thoughts?

-Greg


From: Gregory Haskins
Date: Friday, August 29, 2008 - 10:00 am

Well, I guess it was just to prove to myself that I broke something
because I dont understand how the vsyscall interface works.  But given
Yeah, understood.  There is both in -rt and I was just saying that we
technically only need the seqlock_t fix in -rt.  So if raw_seqlock_t
could be left pristine and solve this problem, that is an acceptable

Yeah, it would possibly be a problem in both cases.

The problem I am addressing only exists in -rt since it has seqlock_t
and raw_seqlock_t (with the former using preemptible spinlock_t's).=20
Since the underlying seqlock_t->spinlock_t is preemptible, you can see
that one thread that does:

{
    write_seqlock();
    /* asl */
    write_sequnlock();
}

while other high-prio threads do

do { read_seqbegin(); /* asl */; } while (read_seqretry());

The readers could preempt the writer mid critical section and enter a
live-locked loop.

raw_seqlock_t (which is equivalent to a mainline seqlock_t) do not have
this problem because the spinlock acquisition inside the write_seqlock
disables preemption.

HTH

-Greg


From: Steven Rostedt
Date: Friday, August 29, 2008 - 9:58 am

I answered this on IRC, but this is for the rest of those reading this 
thread.

Userspace (vsyscalls) can only use raw_seqlock_t. And only the read 
version for that matter. Since the read of raw_seqlock_t is just that, a 
read, no writes, and no jumping to other functions on contention.

The vsyscalls should never use the -rt seqlock_t. Not modifying the raws 
here should make us golden.

-- Steve

--

From: Gregory Haskins
Date: Friday, August 29, 2008 - 9:29 am

Thinking about this some more, perhaps the issue is I am not hitting the
contended path in vsyscall?

-Greg


From: Andi Kleen
Date: Friday, August 29, 2008 - 9:37 am

Yes it will be only contended when gettimeofday() races with the timer 
interrupt.  You could try to run gettimeofday() in a loop and see how
long it holds up.

But anyways from the theory you should crash when it happens. 
Writes to kernel data are not allowed in vsyscalls and your read_lock clearly 
does a write.

-Andi

-- 
ak@linux.intel.com
--

From: Gregory Haskins
Date: Friday, August 29, 2008 - 9:41 am

Oh I don't deny that it does.  The compiler neatly reminded me that it
could no longer be "const".  I just was ignorant of the userspace
requirement ;)

But we *do* have a serious problem here.  A non-preemptible seqlock_t
will do bad things if it preempts the writer, so we need some kind of
solution here, one way or the other.  So suggestions welcome :)  I
realize this is only an issue currently in PREEMPT_RT so perhaps most
will not care... but I do need to solve this at least for this branch.

I currently do not see a way to solve this problem that doesn't involve
some heavier involvement in the read path (read: const seqlock_t need
not apply).  Given that, we either need to address the const requirement
for userspace, alter userspace's usage, or find another way to address
the deadlock.  Any ideas?

-Greg


From: Andi Kleen
Date: Friday, August 29, 2008 - 10:08 am

In the worst case you can always disable the vsyscall gtod() and
thus the need for a user space safe seqlock_t, but it will
hurt some workloads significantly. 

-Andi

--

From: Stephen Hemminger
Date: Friday, August 29, 2008 - 9:57 am

On Fri, 29 Aug 2008 11:44:58 -0400

I have mixed feelings about this. The point of this primitive was for
cases where write contention was rare and writers only did small updates.
So what you did was fix the primitive for cases where it is being misused.
Do you have a real example where this is a problem? If so then the
user of seqlock should be fixed, rather than fixing seqlock.
--

From: Steven Rostedt
Date: Friday, August 29, 2008 - 10:02 am

[Empty message]
From: Gregory Haskins
Date: Friday, August 29, 2008 - 11:03 am

*Patch submitted for inclusion in PREEMPT_RT 26-rt4.  Applies to 2.6.26.3-rt3*

Hi Ingo, Steven, Thomas,
  Please consider for -rt4.  This fixes a nasty deadlock on my systems under
  heavy load.

[
Changelog:
	v2: only touch seqlock_t because raw_seqlock_t doesn't require
	    serialization and userspace cannot modify data during a read

	v1: initial release
]

-Greg

----
seqlock: serialize against writers

Seqlocks have always advertised that readers do not "block", but this was
never really true.  Readers have always logically blocked at the head of
the critical section under contention with writers, regardless of whether
they were allowed to run code or not.

Recent changes in this space (88a411c07b6fedcfc97b8dc51ae18540bd2beda0)
have turned this into a more explicit blocking operation in mainline.
However, this change highlights a short-coming in -rt because the
normal seqlock_ts are preemptible.  This means that we can potentially
deadlock should a reader spin waiting for a write critical-section to end
while the writer is preempted.

This patch changes the internal implementation to use a rwlock and forces
the readers to serialize with the writers under contention.  This will
have the advantage that -rt seqlocks_t will sleep the reader if deadlock
were imminent, and it will pi-boost the writer to prevent inversion.

This fixes a deadlock discovered under testing where all high prioritiy
readers were hogging the cpus and preventing a writer from releasing the
lock.

Since seqlocks are designed to be used as rarely-write locks, this should
not affect the performance in the fast-path

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/seqlock.h |   51 +++++++++++++++++++++++------------------------
 1 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 345d726..605fcdb 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -3,9 +3,11 @@
 /*
  * ...
From: Gregory Haskins
Date: Friday, August 29, 2008 - 11:12 am

Hi Andi,
  As it turns out, my distcc backend was an x86_64 machine running the
v1 patch and I started to notice sometime today that certain cc1 jobs
were sometimes (albeit rarely) segfaulting on me.  I noticed that before
I even published the first patch, but I chalked it up to a corrupt .o on
my NFS home.  Plus I was forgetting that the distcc machine was running
the patch, and I would probably have never made the userspace connection
had you not mentioned it.  In any case, I self-built this v2 patch with
v2 applied, and the segfaults have gone away.  So I think we know
several things:

1) You were right that this would cause an issue if the slow path is hit.=

2) Steven was right that userspace must use raw_seqlock_t because it no
longer crashes with v2.
3) I am satisfied that my primary concern is still properly addressed.

Hopefully everyone is satisfied with this patch now.

Thanks for the help.  It's much appreciated!

-Greg


From: Peter Zijlstra
Date: Saturday, August 30, 2008 - 4:17 am

Ah, the point I was missing is higher-priority realtime task, in which

Still dont like this patch, once you have a rwlock you might as well go
all the way. Esp since this half-arsed construct defeats PI in certain
cases.

--

From: Gregory Haskins
Date: Saturday, August 30, 2008 - 5:32 am

Why?  A full rwlock will still be much slower since the readers will
always need an atomic op.  This construct only uses atomic ops in the
slow path under contention, which should be rare, and is thus still

Ouch.  While I admit that you can still get into inversion scenarios
once the reader leaves the seqbegin, this is the nature of seqlocks.=20
The only ways I can think of to get around this involve atomic ops in
the fast path, which I think should be avoided.  What would you suggest
otherwise?

-Greg





From: Peter Zijlstra
Date: Saturday, August 30, 2008 - 5:38 am

Since we're talking -rt here, determinism rules, so bite the bullet and
do full PI.

The only reason we made all that stuff preemptable is to gain
determinism, that also means we have to do the PI thing.



--

From: Gregory Haskins
Date: Saturday, August 30, 2008 - 6:05 am

Yeah, you have a point.  I still think this patch will solve the
deadlock thing, so please consider it in the interim.  I will whip up a
full PI solution next week.

-Greg


From: Peter Zijlstra
Date: Saturday, August 30, 2008 - 4:08 am

I think the technical term is livelock.

So the problem is that the write side gets preempted, and the read side
spins in a non-preemptive fashion?

Looking at the code, __read_seqbegin() doesn't disable preemption, so
even while highly inefficient when spinning against a preempted lock, it
shouldn't livelock, since the spinner can get preempted giving the

Not quite, seqlocks never suffered the cacheline bounce rwlocks have -
which was they strongest point - so I very much not like this change.

As to the x86_64 gtod-vsyscall, that uses a raw_seqlock_t on -rt, which
is still non-preemptable and should thus not be affected by this
livelock scenario.

--

From: Gregory Haskins
Date: Tuesday, September 2, 2008 - 5:45 am

Hi Guys,

I realized the prologue was not sufficiently descriptive based on the
feedback I received.  Therefore, here is a V3 with a new description.
The patch content itself is identical to v2.

-Greg

---------------------------------

seqlock: serialize against writers

There are currently several problems in -rt w.r.t. seqlock objects. RT
moves mainline seqlock_t to "raw_seqlock_t", and creates a new seqlock_t
object that is fully preemptible.  Being preemptible is a great step
towards deterministic behavior, but there are a few areas that need
additional measures to protect new vulnerabilities created by the
preemptible code. For the purposes of demonstration, consider three tasks
of different priority: A, B, and C.  A is the logically highest, and C is
the lowest.  A is trying to acquire a seqlock read critical section, while
C is involved in write locks.

Problem 1) If A spins in seqbegin due to writer contention retries, it may
prevent C from running even if C currently holds the write lock.  This
is a deadlock.

Problem 2) B may preempt C, preventing it from releasing the write
critical section.  In this case, A becomes inverted behind B.

Problem 3) Lower priority tasks such as C may continually acquire the write
section which subsequently causes A to continually retry and thus fail to
make forward progress.  Since C is lower priority it ideally should not
cause delays in A. E.g. C should block if A is in a read-lock and C is <= A.

This patch addresses Problems 1 & 2, and leaves 3 for a later time.

This patch changes the internal seqlock_t implementation to substitute
a rwlock for the basic spinlock previously used, and forces the readers
to serialize with the writers under contention.  Blocking on the read_lock
simultaneously sleeps A (preventing problem 1), while boosting C to A's
priority (preventing problem 2).  Non reader-to-writer contended
acquisitions, which are the predominant mode, remain free of atomic
operations.  Therefore the fast path should ...
From: Gregory Haskins
Date: Tuesday, September 2, 2008 - 6:01 am

[Empty message]
From: Gregory Haskins
Date: Tuesday, September 2, 2008 - 6:29 am

[ here is the updated prologue rebased against the proper tree (26.3-rt3) ]

--------------------------

seqlock: serialize against writers

There are currently several problems in -rt w.r.t. seqlock objects. RT
moves mainline seqlock_t to "raw_seqlock_t", and creates a new seqlock_t
object that is fully preemptible.  Being preemptible is a great step
towards deterministic behavior, but there are a few areas that need
additional measures to protect new vulnerabilities created by the
preemptible code. For the purposes of demonstration, consider three tasks
of different priority: A, B, and C.  A is the logically highest, and C is
the lowest.  A is trying to acquire a seqlock read critical section, while
C is involved in write locks.

Problem 1) If A spins in seqbegin due to writer contention retries, it may
prevent C from running even if C currently holds the write lock.  This
is a deadlock.

Problem 2) B may preempt C, preventing it from releasing the write
critical section.  In this case, A becomes inverted behind B.

Problem 3) Lower priority tasks such as C may continually acquire the write
section which subsequently causes A to continually retry and thus fail to
make forward progress.  Since C is lower priority it ideally should not
cause delays in A. E.g. C should block if A is in a read-lock and C is <= A.

This patch addresses Problems 1 & 2, and leaves 3 for a later time.

This patch changes the internal seqlock_t implementation to substitute
a rwlock for the basic spinlock previously used, and forces the readers
to serialize with the writers under contention.  Blocking on the read_lock
simultaneously sleeps A (preventing problem 1), while boosting C to A's
priority (preventing problem 2).  Non reader-to-writer contended
acquisitions, which are the predominant mode, remain free of atomic
operations.  Therefore the fast path should not be perturbed by this
change.

This fixes a real-world deadlock discovered under testing where all
high priority readers were ...
Previous thread: Out of mtrrs by "J.A. Magallón" on Friday, August 29, 2008 - 8:32 am. (3 messages)

Next thread: Re: Out of mtrrs by "J.A. Magallón" on Friday, August 29, 2008 - 9:17 am. (2 messages)