[HACK] convert i_alloc_sem for direct_io.c craziness!

Previous thread: 2.6.23-rc7-mm1: panic in scheduler by Lee Schermerhorn on Monday, September 24, 2007 - 5:12 pm. (6 messages)

Next thread: [git] CFS-devel, latest code by Ingo Molnar on Monday, September 24, 2007 - 5:45 pm. (37 messages)
To: LKML <linux-kernel@...>
Cc: linux-rt-users <linux-rt-users@...>, <mingo@...>, Thomas Gleixner <tglx@...>
Date: Monday, September 24, 2007 - 5:14 pm

Hopefully I will get some attention from those that are responsible for
fs/direct_io.c

Ingo and Thomas,

This patch converts the i_alloc_sem into a compat_rw_semaphore for the -rt
patch. Seems that the code in fs/direct_io.c does some nasty logic with
the i_alloc_sem. For DIO_LOCKING, I'm assuming that the i_alloc_sem is
used as a reference counter for pending requests. When the request is
made, the down_read is performed. When the request is handled by the block
softirq, then that softirq does an up on the request. So the owner is not
the same between down and up. When all requests are handled, the semaphore
counter should be zero. This keeps away any write access while requests
are pending.

Now this may all be well and dandy for vanilla Linux, but it breaks
miserbly when converted to -rt.

1) In RT rw_semaphores must be up'd by the same thread that down's it.

2) We can't do PI on the correct processes.

This patch converts (for now) the i_alloc_sem into a compat_rw_semaphore
to give back the old features to the sem. This fixes deadlocks that we've
been having WRT direct_io. But unfortunately, it now opens up
unbonded priority inversion with this semaphore. But really, those that
can be affected by this, shouldn't be doing disk IO anyway.

The real fix would be to get rid of the read semaphore trickery in
direct_io.c.

Signed-off-by: Steve Rostedt <rostedt@goodmis.org>

Index: linux-2.6.23-rc4-rt1/include/linux/fs.h
===================================================================
--- linux-2.6.23-rc4-rt1.orig/include/linux/fs.h 2007-09-24 16:58:59.000000000 -0400
+++ linux-2.6.23-rc4-rt1/include/linux/fs.h 2007-09-24 16:59:11.000000000 -0400
@@ -577,7 +577,7 @@ struct inode {
umode_t i_mode;
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
struct mutex i_mutex;
- struct rw_semaphore i_alloc_sem;
+ struct compat_rw_semaphore i_alloc_sem;
const struct inode_operations *i_op;
const struct file_operations *i_fop; /* former ->i_op-...

To: Steven Rostedt <rostedt@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <mingo@...>, Thomas Gleixner <tglx@...>
Date: Monday, October 1, 2007 - 3:52 pm

Yeah. This is used for performing concurrent asynchronous O_DIRECT

Do you have any suggestions for locking constructs that RT would prefer?

The core problem is that async IO is in flight while no process holds
the usual locks in system calls. We don't want the blocks referenced by
IOs to be freed and realocated some where else while the IO is in
flight. Hence the i_alloc_sem acquiry in the file block modification
paths: vmtruncate - free, notify_change - a proxy for allocating writes.

The agents are:

- many tasks issuing concurrent async IO and exiting from system calls
while the IO is still in flight

- operations completed in interrupt handlers from storage devices

- tasks changing file block mapping via system calls

There's some long-term work to integrate the locking between the
buffered and O_DIRECT paths, but it's not close to ready.

- z
-

To: Zach Brown <zach.brown@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <mingo@...>, Thomas Gleixner <tglx@...>
Date: Monday, October 1, 2007 - 4:39 pm

Basically, anything that maps to a simple mutex. Anything more complex
gets real messy real quick.

Locks that have non-exclusive states become non-deterministic because an
unbounded number of contexts can be in this state. Hence acquisition of
the exclusive state has unbounded time. Even when limited to a bounded
number, the ramifications to the PI graph will get you a head-ache.

Also, non-owner locks, ie. semaphores (asymetric acquisition vs release
contexts) are unusable because the lack of ownership undermines PI - who
to boost?

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <mingo@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, October 2, 2007 - 2:03 pm

I'm worried that the aio+dio implementation of concurrent pending IOs
just doesn't map well to PI.

Would a hack with a mutex and counts help at all? It seems like it
would still have the same problem. The count increments don't
transfer ownership to the count decrements and the wake up.

io submission from tasks:
down(&inode->i_mutex);
atomic_inc(&inode->in_flight);
up(&inode->i_mutex);

io completion from interrupts:
if(atomic_dec_and_test(&inode->in_flight))
wake_up(&inode->waiting);

file allocation in tasks:
down(&inode->i_mutex);
wait_event(inode->waiting, atomic_read(&inode->in_flight) == 0);
up(&inode->i_mutex);

(yeah, yeah, starvation -- it's just a demonstration)

In any case, this seems like it's not a very high priority now that
RT has Steven's work-around. If it does become a priority can you
guys let linux-fsdevel know?

- z
-

To: Zach Brown <zach.brown@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <mingo@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, October 2, 2007 - 2:12 pm

Hi Zach!

Thanks for the responses.

--

This looks more like a completion. Actually, completions don't have PI

Actually, it may still be a high priority. We have one BUG currently that
seems to trigger starvation. But we don't know yet if it's a bug in the
test or in the FS code yet. I don't know the full details yet, but we do
have someone currently looking at it. These tests are what found this
issue in the first place.

When more info becomes available, I will definitely CC the linux-fsdevel
list. Thanks for letting me know about it. (I usually get confused by all
the different lists that are out there).

-- Steve

-

To: Steven Rostedt <rostedt@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <mingo@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, October 2, 2007 - 2:18 pm

OK, thanks. (Yeah, it's not a great situation.)

- z
-

To: Steven Rostedt <rostedt@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <mingo@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, September 25, 2007 - 8:42 am

On Mon, 24 Sep 2007 17:14:26 -0400 (EDT) Steven Rostedt

How about teaching {up,down}_read_non_owner() to barf on rw_semaphore
-

To: Peter Zijlstra <peterz@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <mingo@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, September 25, 2007 - 11:29 am

--

Sure thing!

This patch prevents rw_semaphore in PREEMPT_RT from performing
down_read_non_owner and up_read_non_owner. If this must be used, then
either convert to a completion or use compat_rw_semaphore.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-2.6.23-rc4-rt1/include/linux/rt_lock.h
===================================================================
--- linux-2.6.23-rc4-rt1.orig/include/linux/rt_lock.h 2007-09-25 09:38:56.000000000 -0400
+++ linux-2.6.23-rc4-rt1/include/linux/rt_lock.h 2007-09-25 09:42:19.000000000 -0400
@@ -248,25 +248,20 @@ do { \
__rt_rwsem_init((sem), #sem, &__key); \
} while (0)

+extern void __dont_do_this_in_rt(struct rw_semaphore *rwsem);
+
+#define rt_down_read_non_owner(rwsem) __dont_do_this_in_rt(rwsem)
+#define rt_up_read_non_owner(rwsem) __dont_do_this_in_rt(rwsem)
+
extern void fastcall rt_down_write(struct rw_semaphore *rwsem);
extern void fastcall
rt_down_read_nested(struct rw_semaphore *rwsem, int subclass);
extern void fastcall
rt_down_write_nested(struct rw_semaphore *rwsem, int subclass);
extern void fastcall rt_down_read(struct rw_semaphore *rwsem);
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-extern void fastcall rt_down_read_non_owner(struct rw_semaphore *rwsem);
-#else
-# define rt_down_read_non_owner(rwsem) rt_down_read(rwsem)
-#endif
extern int fastcall rt_down_write_trylock(struct rw_semaphore *rwsem);
extern int fastcall rt_down_read_trylock(struct rw_semaphore *rwsem);
extern void fastcall rt_up_read(struct rw_semaphore *rwsem);
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-extern void fastcall rt_up_read_non_owner(struct rw_semaphore *rwsem);
-#else
-# define rt_up_read_non_owner(rwsem) rt_up_read(rwsem)
-#endif
extern void fastcall rt_up_write(struct rw_semaphore *rwsem);
extern void fastcall rt_downgrade_write(struct rw_semaphore *rwsem);

Index: linux-2.6.23-rc4-rt1/kernel/rt.c
===================================================================
--- linux-2.6.23-rc4-rt1.orig/kern...

Previous thread: 2.6.23-rc7-mm1: panic in scheduler by Lee Schermerhorn on Monday, September 24, 2007 - 5:12 pm. (6 messages)

Next thread: [git] CFS-devel, latest code by Ingo Molnar on Monday, September 24, 2007 - 5:45 pm. (37 messages)