*Patch submitted for inclusion in PREEMPT_RT 26-rt4. Applies to 2.6.26.3-rt3* Hi Ingo, Steven, Thomas, Please consider for -rt4. This fixes a nasty deadlock on my systems under heavy load. -Greg ---- Seqlocks have always advertised that readers do not "block", but this was never really true. Readers have always logically blocked at the head of the critical section under contention with writers, regardless of whether they were allowed to run code or not. Recent changes in this space (88a411c07b6fedcfc97b8dc51ae18540bd2beda0) have turned this into a more explicit blocking operation in mainline. However, this change highlights a short-coming in -rt because the normal seqlock_ts are preemptible. This means that we can potentially deadlock should a reader spin waiting for a write critical-section to end while the writer is preempted. This patch changes the internal implementation to use a rwlock and forces the readers to serialize with the writers under contention. This will have the advantage that -rt seqlocks_t will sleep the reader if deadlock were imminent, and it will pi-boost the writer to prevent inversion. This fixes a deadlock discovered under testing where all high prioritiy readers were hogging the cpus and preventing a writer from releasing the lock. Since seqlocks are designed to be used as rarely-write locks, this should not affect the performance in the fast-path Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- include/linux/seqlock.h | 76 +++++++++++++++++++++++++---------------------- 1 files changed, 40 insertions(+), 36 deletions(-) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 345d726..7010169 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -3,9 +3,11 @@ /* * Reader/writer consistent mechanism without starving writers. This type of * lock for data where the reader wants a consistent set of information - * and is willing to retry if the information changes. Readers ...
Does this even work under x86-64? x86-64 uses seqlocks in user space in its vsyscalls. And read_lock() definitely doesn't work there because it writes. You would need at least to disable vsyscall gettimeofday(), making it much much slower. Perhaps you tested on one of the systems where the vsyscalls need to fallback for other reasons? (e.g. one using pmtimer for timing). -Andi --
Im running it on a x86_64 box as we speak. How can I tell if there is a certain mode that is permitting this? -Greg
If the boot up says you're running with PMtimer then it uses the fallback (usually happens on pre Fam10h AMD boxes). A typical Intel box would use the faster ring 3 only TSC path and then explode with your change I bet. Or step with gdb through gettimeofday() and see if it does a syscall. -Andi --
It seems to be running fine with no indication it has fallen back.=20 Perhaps I need a certain workload to bring out the issue? Here are some details of my system: ------------------- ghaskins@test:~> uname -a Linux test 2.6.26.3-rt3-rt #2 SMP PREEMPT RT Fri Aug 29 10:47:17 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux ghaskins@test:~> cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU 5130 @ 2.00GHz stepping : 6 cpu MHz : 1995.006 cache size : 4096 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca lahf_lm bogomips : 3992.49 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU 5130 @ 2.00GHz stepping : 6 cpu MHz : 1995.006 cache size : 4096 KB physical id : 3 siblings : 2 core id : 0 cpu cores : 2 apicid : 6 initial apicid : 6 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca lahf_lm bogomips : 3990.05 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family ...
Perhaps you never hit the slow path in userland. That's the only place it would write. Perhaps add a dummy static variable in the fast path, and write to it. See if that crashes you apps. -- Steve --
Yeah, ideas crossed in the mail ;) I could just force all of the seqbegins to hit the slowpath by hacking the code and see what happens (aside from slowing down, of course ;) Question: Which seqlock_t does userspace use? I assume it uses seqlock_t and not raw_seqlock_t. But the only reason that I ask is that I converted raw_seqlock_t to use the new style as well to be consistent, even though it is not strictly necessary for the same reasons. So if perchance userspace uses the raw variant, I could solve this issue by only re-working the seqlock_t variant. Kind of a long shot, but figured I would mention it :) -Greg
Only if you don't believe it will really crash? I think it's pretty There's no raw_seqlock_t anywhere in mainline? I guess you could define a new seqlock_t which is explicitely user space safe. That might avoid such issues in the future. But then that would likely require some code duplication and be ugly. On the other hand whatever problem you fixing in the kernel (to be honest it's still unclear to me what the problem is) needs to be likely fixed for the userland lock too. -Andi --
The subject forgot to add "RT" in the brackets. I'm not convinced that the raw_seqlocks (mainline normal seqlocks) has a problem anyway. -- Steve --
(continuing from IRC) Agreed. I converted them to be consistent. Steve just told me that userspace actually uses the raw_seqlock_t variant, so the answer is simple. Just leave raw_seqlock_t alone and the patch will work fine. Thoughts? -Greg
Well, I guess it was just to prove to myself that I broke something
because I dont understand how the vsyscall interface works. But given
Yeah, understood. There is both in -rt and I was just saying that we
technically only need the seqlock_t fix in -rt. So if raw_seqlock_t
could be left pristine and solve this problem, that is an acceptable
Yeah, it would possibly be a problem in both cases.
The problem I am addressing only exists in -rt since it has seqlock_t
and raw_seqlock_t (with the former using preemptible spinlock_t's).=20
Since the underlying seqlock_t->spinlock_t is preemptible, you can see
that one thread that does:
{
write_seqlock();
/* asl */
write_sequnlock();
}
while other high-prio threads do
do { read_seqbegin(); /* asl */; } while (read_seqretry());
The readers could preempt the writer mid critical section and enter a
live-locked loop.
raw_seqlock_t (which is equivalent to a mainline seqlock_t) do not have
this problem because the spinlock acquisition inside the write_seqlock
disables preemption.
HTH
-Greg
I answered this on IRC, but this is for the rest of those reading this thread. Userspace (vsyscalls) can only use raw_seqlock_t. And only the read version for that matter. Since the read of raw_seqlock_t is just that, a read, no writes, and no jumping to other functions on contention. The vsyscalls should never use the -rt seqlock_t. Not modifying the raws here should make us golden. -- Steve --
Thinking about this some more, perhaps the issue is I am not hitting the contended path in vsyscall? -Greg
Yes it will be only contended when gettimeofday() races with the timer interrupt. You could try to run gettimeofday() in a loop and see how long it holds up. But anyways from the theory you should crash when it happens. Writes to kernel data are not allowed in vsyscalls and your read_lock clearly does a write. -Andi -- ak@linux.intel.com --
Oh I don't deny that it does. The compiler neatly reminded me that it could no longer be "const". I just was ignorant of the userspace requirement ;) But we *do* have a serious problem here. A non-preemptible seqlock_t will do bad things if it preempts the writer, so we need some kind of solution here, one way or the other. So suggestions welcome :) I realize this is only an issue currently in PREEMPT_RT so perhaps most will not care... but I do need to solve this at least for this branch. I currently do not see a way to solve this problem that doesn't involve some heavier involvement in the read path (read: const seqlock_t need not apply). Given that, we either need to address the const requirement for userspace, alter userspace's usage, or find another way to address the deadlock. Any ideas? -Greg
In the worst case you can always disable the vsyscall gtod() and thus the need for a user space safe seqlock_t, but it will hurt some workloads significantly. -Andi --
On Fri, 29 Aug 2008 11:44:58 -0400 I have mixed feelings about this. The point of this primitive was for cases where write contention was rare and writers only did small updates. So what you did was fix the primitive for cases where it is being misused. Do you have a real example where this is a problem? If so then the user of seqlock should be fixed, rather than fixing seqlock. --
*Patch submitted for inclusion in PREEMPT_RT 26-rt4. Applies to 2.6.26.3-rt3* Hi Ingo, Steven, Thomas, Please consider for -rt4. This fixes a nasty deadlock on my systems under heavy load. [ Changelog: v2: only touch seqlock_t because raw_seqlock_t doesn't require serialization and userspace cannot modify data during a read v1: initial release ] -Greg ---- seqlock: serialize against writers Seqlocks have always advertised that readers do not "block", but this was never really true. Readers have always logically blocked at the head of the critical section under contention with writers, regardless of whether they were allowed to run code or not. Recent changes in this space (88a411c07b6fedcfc97b8dc51ae18540bd2beda0) have turned this into a more explicit blocking operation in mainline. However, this change highlights a short-coming in -rt because the normal seqlock_ts are preemptible. This means that we can potentially deadlock should a reader spin waiting for a write critical-section to end while the writer is preempted. This patch changes the internal implementation to use a rwlock and forces the readers to serialize with the writers under contention. This will have the advantage that -rt seqlocks_t will sleep the reader if deadlock were imminent, and it will pi-boost the writer to prevent inversion. This fixes a deadlock discovered under testing where all high prioritiy readers were hogging the cpus and preventing a writer from releasing the lock. Since seqlocks are designed to be used as rarely-write locks, this should not affect the performance in the fast-path Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- include/linux/seqlock.h | 51 +++++++++++++++++++++++------------------------ 1 files changed, 25 insertions(+), 26 deletions(-) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 345d726..605fcdb 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -3,9 +3,11 @@ /* * ...
Hi Andi, As it turns out, my distcc backend was an x86_64 machine running the v1 patch and I started to notice sometime today that certain cc1 jobs were sometimes (albeit rarely) segfaulting on me. I noticed that before I even published the first patch, but I chalked it up to a corrupt .o on my NFS home. Plus I was forgetting that the distcc machine was running the patch, and I would probably have never made the userspace connection had you not mentioned it. In any case, I self-built this v2 patch with v2 applied, and the segfaults have gone away. So I think we know several things: 1) You were right that this would cause an issue if the slow path is hit.= 2) Steven was right that userspace must use raw_seqlock_t because it no longer crashes with v2. 3) I am satisfied that my primary concern is still properly addressed. Hopefully everyone is satisfied with this patch now. Thanks for the help. It's much appreciated! -Greg
Ah, the point I was missing is higher-priority realtime task, in which Still dont like this patch, once you have a rwlock you might as well go all the way. Esp since this half-arsed construct defeats PI in certain cases. --
Why? A full rwlock will still be much slower since the readers will always need an atomic op. This construct only uses atomic ops in the slow path under contention, which should be rare, and is thus still Ouch. While I admit that you can still get into inversion scenarios once the reader leaves the seqbegin, this is the nature of seqlocks.=20 The only ways I can think of to get around this involve atomic ops in the fast path, which I think should be avoided. What would you suggest otherwise? -Greg
Since we're talking -rt here, determinism rules, so bite the bullet and do full PI. The only reason we made all that stuff preemptable is to gain determinism, that also means we have to do the PI thing. --
Yeah, you have a point. I still think this patch will solve the deadlock thing, so please consider it in the interim. I will whip up a full PI solution next week. -Greg
I think the technical term is livelock. So the problem is that the write side gets preempted, and the read side spins in a non-preemptive fashion? Looking at the code, __read_seqbegin() doesn't disable preemption, so even while highly inefficient when spinning against a preempted lock, it shouldn't livelock, since the spinner can get preempted giving the Not quite, seqlocks never suffered the cacheline bounce rwlocks have - which was they strongest point - so I very much not like this change. As to the x86_64 gtod-vsyscall, that uses a raw_seqlock_t on -rt, which is still non-preemptable and should thus not be affected by this livelock scenario. --
Hi Guys, I realized the prologue was not sufficiently descriptive based on the feedback I received. Therefore, here is a V3 with a new description. The patch content itself is identical to v2. -Greg --------------------------------- seqlock: serialize against writers There are currently several problems in -rt w.r.t. seqlock objects. RT moves mainline seqlock_t to "raw_seqlock_t", and creates a new seqlock_t object that is fully preemptible. Being preemptible is a great step towards deterministic behavior, but there are a few areas that need additional measures to protect new vulnerabilities created by the preemptible code. For the purposes of demonstration, consider three tasks of different priority: A, B, and C. A is the logically highest, and C is the lowest. A is trying to acquire a seqlock read critical section, while C is involved in write locks. Problem 1) If A spins in seqbegin due to writer contention retries, it may prevent C from running even if C currently holds the write lock. This is a deadlock. Problem 2) B may preempt C, preventing it from releasing the write critical section. In this case, A becomes inverted behind B. Problem 3) Lower priority tasks such as C may continually acquire the write section which subsequently causes A to continually retry and thus fail to make forward progress. Since C is lower priority it ideally should not cause delays in A. E.g. C should block if A is in a read-lock and C is <= A. This patch addresses Problems 1 & 2, and leaves 3 for a later time. This patch changes the internal seqlock_t implementation to substitute a rwlock for the basic spinlock previously used, and forces the readers to serialize with the writers under contention. Blocking on the read_lock simultaneously sleeps A (preventing problem 1), while boosting C to A's priority (preventing problem 2). Non reader-to-writer contended acquisitions, which are the predominant mode, remain free of atomic operations. Therefore the fast path should ...
[ here is the updated prologue rebased against the proper tree (26.3-rt3) ] -------------------------- seqlock: serialize against writers There are currently several problems in -rt w.r.t. seqlock objects. RT moves mainline seqlock_t to "raw_seqlock_t", and creates a new seqlock_t object that is fully preemptible. Being preemptible is a great step towards deterministic behavior, but there are a few areas that need additional measures to protect new vulnerabilities created by the preemptible code. For the purposes of demonstration, consider three tasks of different priority: A, B, and C. A is the logically highest, and C is the lowest. A is trying to acquire a seqlock read critical section, while C is involved in write locks. Problem 1) If A spins in seqbegin due to writer contention retries, it may prevent C from running even if C currently holds the write lock. This is a deadlock. Problem 2) B may preempt C, preventing it from releasing the write critical section. In this case, A becomes inverted behind B. Problem 3) Lower priority tasks such as C may continually acquire the write section which subsequently causes A to continually retry and thus fail to make forward progress. Since C is lower priority it ideally should not cause delays in A. E.g. C should block if A is in a read-lock and C is <= A. This patch addresses Problems 1 & 2, and leaves 3 for a later time. This patch changes the internal seqlock_t implementation to substitute a rwlock for the basic spinlock previously used, and forces the readers to serialize with the writers under contention. Blocking on the read_lock simultaneously sleeps A (preventing problem 1), while boosting C to A's priority (preventing problem 2). Non reader-to-writer contended acquisitions, which are the predominant mode, remain free of atomic operations. Therefore the fast path should not be perturbed by this change. This fixes a real-world deadlock discovered under testing where all high priority readers were ...
