[PATCH] documentation: explain memory barriers

Previous thread: [PATCH 2/4] x86: Add UV bios call infrastructure by Russ Anderson on Monday, September 22, 2008 - 2:09 pm. (5 messages)

Next thread: [PATCH 3/4] x86: Add UV partition call by Russ Anderson on Monday, September 22, 2008 - 2:10 pm. (2 messages)
From: Mikulas Patocka
Date: Monday, September 22, 2008 - 2:10 pm

Hi

Here is a patch for MM livelock. The original bug report follows after the 
patch. To reproduce the bug on my computer, I had to change bs=4M to 
bs=65536 in examples in the original report.

Mikulas

---

Fix starvation in memory management.

The bug happens when one process is doing sequential buffered writes to
a block device (or file) and another process is attempting to execute
sync(), fsync() or direct-IO on that device (or file). This syncing
process will wait indefinitelly, until the first writing process
finishes.

For example, run these two commands:
dd if=/dev/zero of=/dev/sda1 bs=65536 &
dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct

The bug is caused by sequential walking of address space in
write_cache_pages and wait_on_page_writeback_range: if some other
process is constantly making dirty and writeback pages while these
functions run, the functions will wait on every new page, resulting in
indefinite wait.

To fix the starvation, I declared a mutex starvation_barrier in struct
address_space. When the mutex is taken, anyone making dirty pages on
that address space should stop. The functions that walk address space
sequentially first estimate a number of pages to process. If they
process more than this amount (plus some small delta), it means that
someone is making dirty pages under them --- they take the mutex and
hold it until they finish. When the mutex is taken, the function
balance_dirty_pages will wait and not allow more dirty pages being made.

The mutex is not really used as a mutex, it does not prevent access to
any critical section. It is used as a barrier that blocks new dirty
pages from being created. I use mutex and not wait queue to make sure
that the starvation can't happend the other way (if there were wait
queue, excessive calls to write_cache_pages and
wait_on_page_writeback_range would block balance_dirty_pages forever;
with mutex it is guaranteed that every process eventually makes
progress).

The essential property ...
From: Andrew Morton
Date: Monday, September 22, 2008 - 5:48 pm

On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)

Shouldn't happen.  All the data-syncing functions should have an upper
bound on the number of pages which they attempt to write.  In the
example above, we end up in here:

int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
				loff_t end, int sync_mode)
{
	int ret;
	struct writeback_control wbc = {
		.sync_mode = sync_mode,
		.nr_to_write = mapping->nrpages * 2,	<<--
		.range_start = start,
		.range_end = end,
	};

so generic_file_direct_write()'s filemap_write_and_wait() will attempt
to write at most 2* the number of pages which are in cache for that inode.

I'd say that either a) that logic got broken or b) you didn't wait long
enough, and we might need to do something to make it not wait so long.

But before we patch anything we should fully understand what is
happening and why the current anti-livelock code isn't working in this
case.

--

From: Mikulas Patocka
Date: Tuesday, September 23, 2008 - 3:34 pm

See write_cache_pages:

if (wbc->sync_mode != WB_SYNC_NONE)
        wait_on_page_writeback(page);	(1)
if (PageWriteback(page) ||
    !clear_page_dirty_for_io(page)) {
        unlock_page(page);		(2)
        continue;
}
ret = (*writepage)(page, wbc, data);
if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
        unlock_page(page);
        ret = 0;
}
if (ret || (--(wbc->nr_to_write) <= 0))
        done = 1;

--- so if it goes by points (1) and (2), the counter is not decremented, 
yet the function waits for the page. If there is constant stream of 
writeback pages being generated, it waits on each on them --- that is, 
forever. I have seen livelock in this function. For you that example with 
two dd's, one buffered write and the other directIO read doesn't work? For 
me it livelocks here.

wait_on_page_writeback_range is another example where the livelock 
happened, there is no protection at all against starvation.


BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous thing 
to me.

Imagine this case: You have two pages with indices 4 and 5 dirty in a 
file. You call fsync(). It sets nr_to_write to 4.

Meanwhile, another process makes pages 0, 1, 2, 3 dirty.

The fsync() process goes to write_cache_pages, writes the first 4 dirty 
pages and exits because it goes over the limit.

result --- you violate fsync() semantics, pages that were dirty before 

Mikulas
--

From: Andrew Morton
Date: Tuesday, September 23, 2008 - 3:49 pm

On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)

um, OK.  So someone else is initiating IO for this inode and this
thread *never* gets to initiate any writeback.  That's a bit of a
surprise.

How do we fix that?  Maybe decrement nt_to_write for these pages as

yup, that's pretty much unfixable, really, unless new locks are added
which block threads which are writing to unrelated sections of the
file, and that could hurt some workloads quite a lot, I expect.

Hopefully high performance applications are instantiating the file
up-front and are using sync_file_range() to prevent these sorts of
things from happening.  But they probably aren't.


--

From: Mikulas Patocka
Date: Tuesday, September 23, 2008 - 4:11 pm

And what do you want to do with wait_on_page_writeback_range? When I 
solved that livelock in write_cache_pages(), I got another livelock in 

It is fixable with the patch I sent --- it doesn't take any locks unless 
the starvation happens. Then, you don't have to use .nr_to_write for 
fsync anymore.

Another solution could be to record in page structure jiffies when the 
page entered dirty state and writeback state. The start writeback/wait on 
writeback functions could then trivially ignore pages that were 

--- for databases it is pretty much possible that one thread is writing 
already journaled data (so it doesn't care when the data are really 
written) and another thread is calling fsync() on the same inode 
simultaneously --- so fsync() could mistakenly write the data generated by 
the first thread and ignore the data generated by the second thread, that 
it should really write.

Mikulas
--

From: Andrew Morton
Date: Tuesday, September 23, 2008 - 4:46 pm

On Tue, 23 Sep 2008 19:11:51 -0400 (EDT)


I agree that the patch is low-impact and relatively straightforward. 
The main problem is making the address_space larger - there can (and
often are) millions and millions of these things in memory.  Making it
larger is a big deal.  We should work hard to seek an alternative and
afacit that isn't happening here.

We already have existing code and design which attempts to avoid
livelock without adding stuff to the address_space.  Can it be modified
--

From: Mikulas Patocka
Date: Wednesday, September 24, 2008 - 11:50 am

I reworked my patch to use a bit in address_space->flags and hashes wait 
queues, so it doesn't take any extra memory. I'm sending it in three 
parts.
1 - make generic function wait_action_schedule
2 - fix the livelock, the logic is exactly the same as in my previous 
patch, wait_on_bit_lock is used instead of mutexes
3 - remove that nr_pages * 2 limit, because it causes misbehavior and 
possible data loss.

Mikulas
--

From: Mikulas Patocka
Date: Wednesday, September 24, 2008 - 11:51 am

A generic function wait_action_schedule that allows to use wait_on_bit_lock just
like mutexes.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/wait.h |    8 +++++++-
 kernel/wait.c        |    7 +++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

Index: linux-2.6.27-rc7-devel/include/linux/wait.h
===================================================================
--- linux-2.6.27-rc7-devel.orig/include/linux/wait.h	2008-09-24 03:20:59.000000000 +0200
+++ linux-2.6.27-rc7-devel/include/linux/wait.h	2008-09-24 03:26:34.000000000 +0200
@@ -513,7 +513,13 @@ static inline int wait_on_bit_lock(void 
 		return 0;
 	return out_of_line_wait_on_bit_lock(word, bit, action, mode);
 }
-	
+
+/**
+ * wait_action_schedule - this function can be passed to wait_on_bit or
+ * wait_on_bit_lock and it will call just schedule().
+ */
+int wait_action_schedule(void *);
+
 #endif /* __KERNEL__ */
 
 #endif
Index: linux-2.6.27-rc7-devel/kernel/wait.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/kernel/wait.c	2008-09-24 03:22:58.000000000 +0200
+++ linux-2.6.27-rc7-devel/kernel/wait.c	2008-09-24 03:24:10.000000000 +0200
@@ -251,3 +251,10 @@ wait_queue_head_t *bit_waitqueue(void *w
 	return &zone->wait_table[hash_long(val, zone->wait_table_bits)];
 }
 EXPORT_SYMBOL(bit_waitqueue);
+
+int wait_action_schedule(void *word)
+{
+	schedule();
+	return 0;
+}
+EXPORT_SYMBOL(wait_action_schedule);

--

From: Mikulas Patocka
Date: Wednesday, September 24, 2008 - 11:52 am

Avoid starvation when walking address space.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/pagemap.h |    1 +
 mm/filemap.c            |   20 ++++++++++++++++++++
 mm/page-writeback.c     |   37 ++++++++++++++++++++++++++++++++++++-
 mm/truncate.c           |   24 +++++++++++++++++++++++-
 4 files changed, 80 insertions(+), 2 deletions(-)

Index: linux-2.6.27-rc7-devel/include/linux/pagemap.h
===================================================================
--- linux-2.6.27-rc7-devel.orig/include/linux/pagemap.h	2008-09-24 02:57:37.000000000 +0200
+++ linux-2.6.27-rc7-devel/include/linux/pagemap.h	2008-09-24 02:59:04.000000000 +0200
@@ -21,6 +21,7 @@
 #define	AS_EIO		(__GFP_BITS_SHIFT + 0)	/* IO error on async write */
 #define AS_ENOSPC	(__GFP_BITS_SHIFT + 1)	/* ENOSPC on async write */
 #define AS_MM_ALL_LOCKS	(__GFP_BITS_SHIFT + 2)	/* under mm_take_all_locks() */
+#define AS_STARVATION	(__GFP_BITS_SHIFT + 3)	/* an anti-starvation barrier */
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
Index: linux-2.6.27-rc7-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/filemap.c	2008-09-24 02:59:33.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/filemap.c	2008-09-24 03:13:47.000000000 +0200
@@ -269,10 +269,19 @@ int wait_on_page_writeback_range(struct 
 	int nr_pages;
 	int ret = 0;
 	pgoff_t index;
+	long pages_to_process;
 
 	if (end < start)
 		return 0;
 
+	/*
+	 * Estimate the number of pages to process. If we process significantly
+	 * more than this, someone is making writeback pages under us.
+	 * We must pull the anti-starvation plug.
+	 */
+	pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	pages_to_process += (pages_to_process >> 3) + 16;
+
 	pagevec_init(&pvec, 0);
 	index = start;
 	while ((index <= end) &&
@@ -288,6 +297,10 @@ int wait_on_page_writeback_range(struct 
 			if (page->index > end)
 ...
From: Andrew Morton
Date: Wednesday, October 1, 2008 - 10:54 pm

Please don't send multiple patches with identical titles - think up a
good, unique, meaningful title for each patch.


Please include a full changelog with each iteration of each patch.

That changelog should explain the reason for playing games with

This sequence appears twice and it would probably be clearer to

This is copied three times and perhaps also should be factored out.

Please note that an effort has been made to make mm/filemap.c look

This sequence is repeated three or four times and should be pulled out
into a well-commented function.  That comment should explain the logic
behind the use of these barriers, please.


--

From: Mikulas Patocka
Date: Sunday, October 5, 2008 - 3:11 pm

Hi

I removed the repeated code and create a new bit mutexes. They are 
space-efficient mutexes that consume only one bit. See the next 3 patches.

If you are concerned about the size of an inode, I can convert other 
mutexes to bit mutexes: i_mutex and inotify_mutex. I could also create 
bit_spinlock (one-bit spinlock that uses test_and_set_bit) and save space 
for address_space->tree_lock, address_space->i_mmap_lock, 
address_space->private_lock, inode->i_lock.

Look at it and say what you think about the idea of condensing mutexes 
into single bits.

--

From: Nick Piggin
Date: Saturday, October 11, 2008 - 5:06 am

I wouldn't worry for now. mutexes can be unlocked much faster than bit
mutexes, especially in the fastpath. And due to slab, it would be

We have that already. It is much much faster to unlock spinlocks than
bit spinlocks in general (if you own the word exclusively, then it's
not, but then you would be less likely to save space), and we can also

Looks pretty good to me.
--

From: Mikulas Patocka
Date: Monday, October 20, 2008 - 1:14 pm

Maybe inotify_mutex. You are right that i_mutex is so heavily contended 
that slowing it down to save few words wouldn't be good. Do you know about 

BTW. why do spinlocks on x86(64) have 32 bits and not 8 bits or 16 bits? 
Are atomic 32-bit instuctions faster?

Can x86(86) system have 256 CPUs?

Mikulas
--

From: Nick Piggin
Date: Monday, October 20, 2008 - 6:51 pm

Don't really know, no. I think most desktop environments use it to

In the case of <= 256 CPUs, they could be an unsigned short I think.
Probably it has never been found to be a huge win because they are
often beside other ints or longs. I think I actually booted up the
kernel with 16-bit spinlocks when doing the FIFO locks, but never

Well, none that I know of which actually exist. SGI is hoping to have
4096 CPU x86 systems as far as I can tell.
--

From: Mikulas Patocka
Date: Sunday, October 5, 2008 - 3:14 pm

A bit mutex implementation.

Bit mutex is a space-efficient mutex that consumes exactly one bit in
the apropriate structure (if mutex debugging is turned off).  Bit mutex
is implemented as a bit in unsigned long field. Other bits in the field
may be used for other purposes, if test/set/clear_bit functions are used
to manipulate them. There is no wait queue in the structure containing
the bit mutex; hashed wait queues for waiting on bits are used.

Additional structure struct bit_mutex is needed, it contains lock
debugging & dependency information. When the kernel is compiled without
lock debugging, this structure is empty.

Bit mutexes are used with functions
bit_mutex_init
bit_mutex_lock
bit_mutex_unlock
bit_mutex_is_locked
These functions take 3 arguments: pointer to the unsigned long field,
the bit position and pointer to struct bit_mutex.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/bit-mutex.h |   98 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/wait.h      |    8 +++
 kernel/Makefile           |    2 
 kernel/bit-mutex-debug.c  |   55 +++++++++++++++++++++++++
 kernel/wait.c             |    7 +++
 5 files changed, 169 insertions(+), 1 deletion(-)

Index: linux-2.6.27-rc8-devel/include/linux/bit-mutex.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.27-rc8-devel/include/linux/bit-mutex.h	2008-10-05 19:58:30.000000000 +0200
@@ -0,0 +1,98 @@
+#ifndef __LINUX_BIT_MUTEX_H
+#define __LINUX_BIT_MUTEX_H
+
+#include <linux/bitops.h>
+#include <linux/lockdep.h>
+#include <linux/wait.h>
+#include <linux/sched.h>
+
+struct bit_mutex {
+#ifdef CONFIG_DEBUG_MUTEXES
+	unsigned long		*field;
+	int 			bit;
+	struct thread_info	*owner;
+	const char		*name;
+	void			*magic;
+#endif
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+#endif
+};
+
+#ifndef CONFIG_DEBUG_MUTEXES
+
+static inline void bit_mutex_init(unsigned long ...
From: Mikulas Patocka
Date: Sunday, October 5, 2008 - 3:14 pm

Fix starvation in memory management.

The bug happens when one process is doing sequential buffered writes to
a block device (or file) and another process is attempting to execute
sync(), fsync() or direct-IO on that device (or file). This syncing
process will wait indefinitelly, until the first writing process
finishes.

For example, run these two commands:
dd if=/dev/zero of=/dev/sda1 bs=65536 &
dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct

The bug is caused by sequential walking of address space in
write_cache_pages and wait_on_page_writeback_range: if some other
process is constantly making dirty and writeback pages while these
functions run, the functions will wait on every new page, resulting in
indefinite wait.

To fix the starvation, I declared a bit mutex starvation_barrier in
struct address_space. The bit AS_STARVATION_BARRIER in
address_space->flags is used for actual locking. When the mutex is
taken, anyone making dirty pages on that address space should stop. The
functions that walk address space sequentially first estimate a number
of pages to process. If they process more than this amount (plus some
small delta), it means that someone is making dirty pages under them ---
they take the mutex and hold it until they finish. When the mutex is
taken, the function balance_dirty_pages will wait and not allow more
dirty pages being made.

The mutex is not really used as a mutex, it does not prevent access to
any critical section. It is used as a barrier that blocks new dirty
pages from being created. I use mutex and not wait queue to make sure
that the starvation can't happend the other way (if there were wait
queue, excessive calls to write_cache_pages and
wait_on_page_writeback_range would block balance_dirty_pages forever;
with mutex it is guaranteed that every process eventually makes
progress).

The essential property of this patch is that if the starvation doesn't
happen, no additional locks are taken and no atomic operations are
performed. So ...
From: Arjan van de Ven
Date: Sunday, October 5, 2008 - 3:33 pm

On Sun, 5 Oct 2008 18:14:50 -0400 (EDT)

are you sure?
isn't the right fix to just walk the file pages only once?
--

From: Mikulas Patocka
Date: Sunday, October 5, 2008 - 4:02 pm

It walks the pages only once --- and waits on each on them. But because 
new pages are contantly appearing under it, that "walk only once" becomes 
infinite loop (well, finite, until the whole disk is written).

Mikulas
--

From: Arjan van de Ven
Date: Sunday, October 5, 2008 - 4:07 pm

On Sun, 5 Oct 2008 19:02:57 -0400 (EDT)

well. fsync() promises that everything that's dirty at the time of the
call will hit the disk. That is not something you can compromise.
The only way out would be is to not allow new dirty during an fsync()...
which is imo even worse.

Submit them all in one go, then wait, should not be TOO bad. Unless a
lot was dirty already, but then you go back to "but it has to go to
disk".


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Mikulas Patocka
Date: Sunday, October 5, 2008 - 4:18 pm

The problem here is that you have two processes --- one is writing, the 
other is simultaneously syncing. The syncing process can't distinguish the 
pages that were created before fsync() was invoked (it has to wait on 
them) and the pages that were created while fsync() was running (it 
doesn't have to wait on them) --- so it waits on them all. The result is 
livelock, it waits indefinitely, because more and more pages are being 
created.

The patch changes it so that if it waits long enough, it stops the other 
writers creating dirty pages.

Or, how otherwise would you implement "Submit them all in one go, then 
wait"? The current code is:
you grab page 0, see it is under writeback, wait on it
you grab page 1, see it is under writeback, wait on it
you grab page 2, see it is under writeback, wait on it
you grab page 3, see it is under writeback, wait on it
...
--- and the other process is just making more and more writeback pages 
while your waiting routine run. So the waiting is indefinite.

Mikulas
--

From: Arjan van de Ven
Date: Sunday, October 5, 2008 - 4:28 pm

On Sun, 5 Oct 2008 19:18:22 -0400 (EDT)

for pages it already passed that's not true.

it shouldn't be doing this. It should be "submit the lot"
"wait on the lot that got submitted".

keeping away new writers is just shifting the latency to an innocent
party (since the fsync() is just bloody expensive, the person doing it
should be paying the price, not the rest of the system), and a grave
mistake.

If the fsync() implementation isn't smart enough, sure, lets improve
it. But not by shifting latency around... lets make it more efficient
at submitting IO.
If we need to invent something like "chained IO" where if you wait on
the last of the chain, you wait on the entirely chain, so be it.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Mikulas Patocka
Date: Sunday, October 5, 2008 - 5:01 pm

And how do you want to implement that "wait on the lot that got 
submitted". Note that you can't allocate an array or linked list that 
holds pointers to all submitted pages. And if you add anything to the page 
structure or radix tree, you have to serialize concurrent sync,fsync 

I assume that if very few people complained about the livelock till now, 
very few people will see degraded write performance. My patch blocks the 
writes only if the livelock happens, so if the livelock doesn't happen in 

This looks madly complicated. And ineffective, because if some page was 
submitted before fsync() was invoked, and is under writeback while fsync() 
is called, fsync() still has to wait on it.

Mikulas
--

From: Arjan van de Ven
Date: Sunday, October 5, 2008 - 5:30 pm

On Sun, 5 Oct 2008 20:01:46 -0400 (EDT)

I object to calling this a livelock. It's not. 
And yes, fsync is slow and lots of people are seeing that.
It's not helped by how ext3 is implemented (where fsync is effectively
equivalent of a sync for many cases).

so?
just make a chain per inode always...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Mikulas Patocka
Date: Sunday, October 5, 2008 - 8:30 pm

It unlocks itself when the whole disk is written, and it can be several 
hours (or days, if you have many-terabyte array). So formally it is not 
livelock, from the user experience it is --- he sees unkillable process in 

The point is that many fsync()s may run in parallel and you have just one 
inode and just one chain. And if you add two-word list_head to a page, to 
link it on this list, many developers will hate it for increasing its 
size.

See the work dobe by Nick Piggin somewhere in this thread. He uses just 
one bit in radix tree to mark pages to process. But he needs to serialize 
all syncs on a given file, they no longer run in parallel.

Mikulas
--

From: Arjan van de Ven
Date: Sunday, October 5, 2008 - 9:20 pm

why to a page?
a list head in the inode and chain up the bios....
or not make an actual list but just a "is the previous one done" thing
it's not all that hard to get something that works on a per inode basis,
that gives "wait for all io upto this one".



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Mikulas Patocka
Date: Monday, October 6, 2008 - 6:00 am

And if you want to wait for a bio submitted by a different process? 

So code it :)

Mikulas
--

From: Arjan van de Ven
Date: Monday, October 6, 2008 - 6:50 am

On Mon, 6 Oct 2008 09:00:14 -0400 (EDT)

the point is that the kernel would always chain it to the inode,
independent of who or when it is submitted


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Mikulas Patocka
Date: Monday, October 6, 2008 - 1:44 pm

If you add a list to an inode, you need to protect it with a spinlock. So 
you take one more spinlock for any write bio submitted --- a lot of 
developers would hate it.

Another problem: how do you want to walk all dirty pages and submit bio 
for them?

The act of allocating and submission of bio can block (if you run out of 
some mempool) and in this case it wait until some other bio is finished. 
During this time, more dirty pages can be created.

Also, if you find a page that is both dirty and under writeback, you need 
to wait until a writeback finishes and then initiate another writeback 
(because the old writeback may be writing stale data). You again, block, 
and more dirty pages can appear.

And if you block and more dirty pages appear, you are prone to the 
livelock.

[ In Nick Piggin's patch, it is needed to lock the whole address space, 
mark dirty pages in one non-blocking pass and write marked pages again in 
a blocking pass --- so that if more dirty pages appear while bios are 
submitted, the new pages will be skipped ]

Mikulas
--

From: Pavel Machek
Date: Wednesday, October 8, 2008 - 3:56 am

8 hours of process in D state is a livelock. And we can do minimal fix
here, this almost never happens in real life anyway.

Latency imposed of writer should not be a problem...
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Dave Chinner
Date: Sunday, October 5, 2008 - 7:51 pm

fsync() waiting on pre-issued writeback pages is the correct
behaviour.

IOW, if the page is under writeback at the time an fsync() is
issued (e.g. issued by pdflush), the page was *not clean* at the
time the fsync() was called and hence must be clean when fsync()
returns. fsync() needs to wait for all pages under I/O at the time
it is called, not just the dirty pages it issues I/O on.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Mikulas Patocka
Date: Sunday, October 5, 2008 - 3:16 pm

Fix violation of sync()/fsync() semantics. Previous code walked up to
mapping->nrpages * 2 pages. Because pages could be created while
__filemap_fdatawrite_range was in progress, it could lead to a
misbehavior.  Example: there are two pages in address space with indices
4, 5. Both are dirty.  Someone calls __filemap_fdatawrite_range, it sets
.nr_to_write = 4.  Meanwhile, some other process creates dirty pages 0,
1, 2, 3.  __filemap_fdatawrite_range writes pages 0, 1, 2, 3, finds out
that it reached the limit and exits.

Result: pages that were dirty before __filemap_fdatawrite_range was
invoked were not written.

With starvation protection from the previous patch, this
mapping->nrpages * 2 logic is no longer needed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 mm/filemap.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Index: linux-2.6.27-rc7-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/filemap.c	2008-09-24 14:47:01.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/filemap.c	2008-09-24 15:01:23.000000000 +0200
@@ -202,6 +202,11 @@ static int sync_page_killable(void *word
  * opposed to a regular memory cleansing writeback.  The difference between
  * these two operations is that if a dirty page/buffer is encountered, it must
  * be waited upon, and not just skipped over.
+ *
+ * Because new pages dirty can be created while this is executing, that
+ * mapping->nrpages * 2 condition is unsafe. If we are doing data integrity
+ * write, we must write all the pages. AS_STARVATION bit will eventually prevent
+ * creating more dirty pages to avoid starvation.
  */
 int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 				loff_t end, int sync_mode)
@@ -209,7 +214,7 @@ int __filemap_fdatawrite_range(struct ad
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
-		.nr_to_write = mapping->nrpages * 2,
+		.nr_to_write = sync_mode == ...
From: Randy Dunlap
Date: Wednesday, October 8, 2008 - 6:12 pm

Should this be added to CodingStyle or some other file?




From: Randy Dunlap <randy.dunlap@oracle.com>

We want all uses of memory barriers to be explained in the source
code.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
---
 Documentation/SubmitChecklist |    3 +++
 1 file changed, 3 insertions(+)

--- lin2627-rc9-docs.orig/Documentation/SubmitChecklist
+++ lin2627-rc9-docs/Documentation/SubmitChecklist
@@ -85,3 +85,6 @@ kernel patches.
 23: Tested after it has been merged into the -mm patchset to make sure
     that it still works with all of the other queued patches and various
     changes in the VM, VFS, and other subsystems.
+
+24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the
+    source code that explains the logic of what they are doing and why.
--

From: Chris Snook
Date: Wednesday, October 8, 2008 - 6:17 pm

Seriously?  When a barrier is used, it's generally self-evident what 
it's doing.  The real problem is when a barrier is *not* used.  It would 
probably be more helpful to comment every non-barrier line of code to 
explain why it's okay if memory operations are getting reordered.

-- Chris
--

From: Andrew Morton
Date: Wednesday, October 8, 2008 - 6:31 pm

fs/buffer.c:sync_buffer().  Have fun.
--

From: Chris Snook
Date: Wednesday, October 8, 2008 - 10:51 pm

The real disaster there is the clear_buffer_##name macro and friends, as
evidenced by fs/ext2/inode.c:599

                clear_buffer_new(bh_result); /* What's this do? */

I'm completely in favor of documenting everything that can potentially interact
with that train wreck, but I maintain that the vast majority of memory barriers
are self-evident.

-- Chris
--

From: Nick Piggin
Date: Thursday, October 9, 2008 - 10:29 am

That's not a disaster. It is relatively easy even if you have no
idea about any of that code to a) work out what BH_New flag does,
see where it gets set and where it gets cleared, and then work out
what that does. Actually, buffer.c used to leak BH_New in some
cases, but now it should be a bug if BH_New is found to be set
there (buffer.c should take any BH_New buffers, initialize them
appropriately, and clear BH_New; a dangling BH_New would indicate
uninitialized data going to or coming from the block).

No, they're easy, because you can find every single place where any
one cares about them with a single simple grep.

Again, fs/buffer.c:sync_buffer(). Which stores and/or loads is that
barrier sequencing against which subsequent stores and/or loads? Why?

For another fun example, mm/filemap.c:sync_page. This actually has a
big comment, but (to me) it isn't even evident then which loads and
stores are being sequenced against which subsequent ones because it
is not explicitly documented. And I do have some experience in adding
barriers to existing mm code where they have been missed completely.



They are self-evident if you have spent a lot of time getting the
state machine and locking/concurrency model in your mind. If you
have not, then how do you know there is not some memory operation
way back or forward in the instruction stream that is supposed to
be ordered by this barrier?

All memory barriers have to be documented, except acquire/release
for critical sections, IMO.
--

From: Ben Hutchings
Date: Thursday, October 9, 2008 - 2:58 am

Acquire and release barriers attached to operations are usually self-
evident; standalone wmb() and rmb() much less so.  It is helpful to be
explicit about exactly which memory operations need to be ordered, which
are often not the memory operations immediately preceding and following
it.  "all" may have been a bit strong though.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--

From: Nick Piggin
Date: Thursday, October 9, 2008 - 2:27 pm

No, I don't think so. We should absolutely force "all". That allows nobody
to be lazy, no confusion, and reminds people that memory barriers are not
easy to follow for a new reader of the code, or necessarily even the author,
6 months later. If somebody is too lazy to write a comment, they can use
locks

One last quick quiz, easier than the earlier ones...
mm/vmscan.c:__remove_mapping has a score of lines documenting exactly
what memory operations are being ordered, and even an example of what
happens if the ordering is not folllowed. This is a pretty good comment,
if I say so myself. However, it has one deficiency in that it doesn't
explicitly state where the write barrier(s) is (IMO the comments for one
part of an ordering protocol should reference the other parts of the
protcol).

Where are the store barriers, or why are they not required?
--

From: Valdis.Kletnieks
Date: Wednesday, October 8, 2008 - 6:50 pm

"what they are doing" will almost always be "flush value to RAM" or similar.
How about this instead:

+ 24: All memory barriers ({e.g., barrier(), rmb(), wmb()} need a comment in the
+    source code that explains the race condition being prevented, and states
+    the location of the other code or hardware feature that races with this.
+
+    An example comment:
+
+ 	/*
+ 	 * If we don't do a wmb() here, the RBFROBNIZ register on the XU293
+ 	 * card will get confused and wedge the hardware...
+ 	 */
+	wmb();
From: Nick Piggin
Date: Thursday, October 9, 2008 - 10:35 am

Memory barriers don't flush anything anywhere. It's simple: you must

This comment is no good, because it doesn't tell you what the memory barrier
does. It tells you what might happen if you omit it.

 /*
  * If we don't do a wmb() here, the store to the RBFROBNIZ,
  * above, might reach the device before the store X, below.
  * 
  * If that happens, then the XU293  card will get confused
  * and wedge the hardware...
  */
  wmb();

If you don't comment like that, then how does the reader know that the wmb
is not *also* supposed to order the store with any other of the limitless
subsequent stores until the next memory ordering operation? Or any of the
previous stores since the last one?
--

From: Valdis.Kletnieks
Date: Wednesday, October 8, 2008 - 11:52 pm

That's what I get for commenting on stuff when I'm into a 40-hour week by
Wednesday. :)

I was speaking of the generic programmer who does stuff like:

	x = 10; /* set x to 10 */ 

for "what they are doing".  You know the type. ;)

"flush value to RAM", "force memory barrier operation", and I think I've seen
a few kzalloc()'s that have "allocate zero'ed memory" on them. "what they are
doing" is usually not worth writing down, but being verbose for the *why*
is almost always good, especially for things like memory barriers that
almost nobody can get their brains wrapped around (how many flame-fests per

Even better (as I missed the "also supposed to know" case).  My general point
was that a concrete example would improve Randy's original patch by showing
what sort of things should be in the comment, and your correction pointed
out *why* a concrete example was needed. ;)

From: Mikulas Patocka
Date: Wednesday, September 24, 2008 - 11:53 am

Fix violation of sync()/fsync() semantics. Previous code walked up to
mapping->nrpages * 2 pages. Because pages could be created while
__filemap_fdatawrite_range was in progress, it could lead to a misbehavior.
Example: there are two pages in address space with indices 4, 5. Both are dirty.
Someone calls __filemap_fdatawrite_range, it sets .nr_to_write = 4.
Meanwhile, some other process creates dirty pages 0, 1, 2, 3.
__filemap_fdatawrite_range writes pages 0, 1, 2, 3, finds out that it reached
the limit and exits.
Result: pages that were dirty before __filemap_fdatawrite_range was invoked were
not written.

With starvation protection from the previous patch, this mapping->nrpages * 2
logic is no longer needed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 mm/filemap.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Index: linux-2.6.27-rc7-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/filemap.c	2008-09-24 14:47:01.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/filemap.c	2008-09-24 15:01:23.000000000 +0200
@@ -202,6 +202,11 @@ static int sync_page_killable(void *word
  * opposed to a regular memory cleansing writeback.  The difference between
  * these two operations is that if a dirty page/buffer is encountered, it must
  * be waited upon, and not just skipped over.
+ *
+ * Because new pages dirty can be created while this is executing, that
+ * mapping->nrpages * 2 condition is unsafe. If we are doing data integrity
+ * write, we must write all the pages. AS_STARVATION bit will eventually prevent
+ * creating more dirty pages to avoid starvation.
  */
 int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 				loff_t end, int sync_mode)
@@ -209,7 +214,7 @@ int __filemap_fdatawrite_range(struct ad
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
-		.nr_to_write = mapping->nrpages * 2,
+		.nr_to_write = sync_mode == ...
From: Nick Piggin
Date: Thursday, October 2, 2008 - 7:32 pm

I think the problem has been misidentified, or else I have misread the
code. See below. I hope I'm right, because I think the patches are pretty
heavy on complexity in these already complex paths...

It would help if you explicitly identify the exact livelock. Ie. give a

*What* is, forever? Data integrity syncs should have pages operated on
in-order, until we get to the end of the range. Circular writeback could

What's the actual problem, though? nr_to_write should not be used for
data integrity operations, and it should not be critical for other
writeout. Upper layers should be able to deal with it rather than

Wow, that's really nasty. Sad we still have known data integrity problems

Why is it unfixable? Just ignore nr_to_write, and write out everything
properly, I would have thought.

Some things may go a tad slower, but those are going to be the things
that are using fsync, in which cases they are going to hurt much more
from the loss of data integrity than a slowdown.

Unfortunately because we have played fast and loose for so long, they
expect this behaviour, were tested and optimised with it, and systems
designed and deployed with it, and will notice performance regressions
if we start trying to do things properly. This is one of my main
arguments for doing things correctly up-front, even if it means a
massive slowdown in some real or imagined workload: at least then we
will hear about complaints and be able to try to improve them rather
than setting ourselves up for failure later.
/rant

Anyway, in this case, I don't think there would be really big problems.
Also, I think there is a reasonable optimisation that might improve it
(2nd last point, in attached patch).

OK, so after glancing at the code... wow, it seems like there are a lot
of bugs in there.
From: Andrew Morton
Date: Thursday, October 2, 2008 - 7:40 pm

That can cause fsync to wait arbitrarily long if some other process is
writing the file.  This happens.
--

From: Nick Piggin
Date: Thursday, October 2, 2008 - 7:59 pm

It can be fixed without touching non-fsync paths (see my next email for

What does such a thing? It would have been nicer to ask them to not do
that then, or get them to use range syncs or something. Now that's much
harder because we've accepted the crappy workaround for so long.

It's far far worse to just ignore data integrity of fsync because of the
problem. Should at least have returned an error from fsync in that case,
or make it interruptible or something.
--

From: Andrew Morton
Date: Thursday, October 2, 2008 - 8:14 pm

If a file has one dirty page at offset 1000000000000000 then someone
does an fsync() and someone else gets in first and starts madly writing
pages at offset 0, we want to write that page at 1000000000000000. 
Somehow.

I expect there's no solution which avoids blocking the writers at some
stage.
--

From: Nick Piggin
Date: Thursday, October 2, 2008 - 8:47 pm

Damn people...


See my other email. Something roughly like this would do the trick
(hey, it actually boots and runs and does fix the problem too).

It's ugly because we don't have quite the right radix tree operations 
yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
lookups). But the theory is to up-front tag the pages that we need to
get to disk.

Completely no impact or slowdown to any writers (although it does add
8 bytes of tags to the radix tree node... but doesn't increase memory
footprint as such due to slab).
From: Andrew Morton
Date: Thursday, October 2, 2008 - 8:56 pm

It needs exclusion to protect all those temp tags.  Is do_fsync()'s
i_mutex sufficient?  It's qute unobvious (and unmaintainable?) that all


Can we reduce the amount of copy-n-pasting here?
--

From: Nick Piggin
Date: Thursday, October 2, 2008 - 9:07 pm

Yeah... it does need a lock, which I brushed under the carpet :P
I was going to just say use i_mutex, but then we really would start
impacting on other fastpaths (eg writers).

Possibly a new mutex in the address_space? That way we can say
"anybody who holds this mutex is allowed to use the tag for anything"
and it doesn't have to be fsync specific (whether that would be of


Yeah... I went to break the sync/async cases into two, but it looks like
it may not have been worthwhile. Just another branch might be the best
way to go.

As far as the c&p in setting the FSYNC tag, yes that should all go away
if the radix-tree is up to scratch. Basically:

radix_tree_tag_set_if_tagged(start, end, ifWRITEBACK|DIRTY, setFSYNC);

should be able to replace the whole thing, and we'd hold the tree_lock, so
we would not have to take the page lock etc. Basically it would be much
nicer... even somewhere close to a viable solution.
--

From: Andrew Morton
Date: Thursday, October 2, 2008 - 9:17 pm

That's another, umm 24 bytes minimum in the address_space (and inode). 
That's fairly ouch, which is why Miklaus did that hokey bit-based

Yup.  Could add another do-this flag in the writeback_control, perhaps.
--

From: Nick Piggin
Date: Thursday, October 2, 2008 - 9:29 pm

Well yeah, it would be a bit based mutex in mapping->flags with

Yeah... possibly we could just _always_ do the PAGECACHE_TAG_FSYNC thing
if mode != WB_SYNC_NONE. I think if we had the infrastructure there to
do it all, it should always be something we want to do for data integrity
writeout.
--

From: Mikulas Patocka
Date: Friday, October 3, 2008 - 4:43 am

That filemap_fdatawrite and filemap_fdatawait in fsync() aren't really 
called under i_mutex (see do_fsync).

So the possible solutions are:

1. Add jiffies when the page was diried and wroteback to struct page
+ no impact on locking and concurrency
- increases the structure by 8 bytes

2. Stop the writers when the starvation happens (what I did)
+ doesn't do any locking if the livelock doesn't happen
- locks writers when the livelock happens (I think it's not really serious 
--- because very few people complained about the livelock, very few people 
will see performance degradation from blocking the writers).

3. Add another bit to radix tree (what Nick did)
+ doesn't ever block writers
- unconditionally takes the lock on fsync path and serializates concurrent 
syncs/fsyncs. Probably low overhead too ... or I don't know, is there any 
possible situation when more processes execute sync() in parallel and user 
would see degradations if those syncs were serialized?

Mikulas
--

From: Nick Piggin
Date: Friday, October 3, 2008 - 5:27 am

Maybe it is because not much actually does sequential writes to a massive
file or block device while trying to fsync it as well? I don't know. You
could still have cases where fsync takes much longer than expected but it

--

From: Mikulas Patocka
Date: Friday, October 3, 2008 - 6:53 am

At most twice the time it would normally take (one loop of writeback queue 
until it detects the livelock and the other loop until it drains all the 
new pages that were created during the first loop).

While with solution (3) it would take only once for the whole writeback 
queue.

--

From: Nick Piggin
Date: Thursday, October 2, 2008 - 7:54 pm

OK, I have been able to reproduce it somewhat. It is not a livelock,
but what is happening is that direct IO read basically does an fsync
on the file before performing the IO. The fsync gets stuck behind the
dd that is dirtying the pages, and ends up following behind it and
doing all its IO for it.

The following patch avoids the issue for direct IO, by using the range
syncs rather than trying to sync the whole file.

The underlying problem I guess is unchanged. Is it really a problem,
though? The way I'd love to solve it is actually by adding another bit
or two to the pagecache radix tree,  that can be used to transiently tag
the tree for future operations. That way we could record the dirty and
writeback pages up front, and then only bother with operating on them.

That's *if* it really is a problem. I don't have much pity for someone
doing buffered IO and direct IO to the same pages of the same file :)
From: Mikulas Patocka
Date: Friday, October 3, 2008 - 4:26 am

LVM does (that is where the bug was discovered). Basically, it scans all 
the block devices with direct IO and if someone else does buffered IO on 
any device simultaneously, it locks up.

That fsync-vs-write livelock is quite improbably (why would some 
application do it?) --- although it could be used as a DoS --- getting 
unkillable process.

But there is another possible real-world problem --- sync-vs-write --- 
i.e. admin plugs in two disks and copies data from one to the other. 
Meanwhile, some unrelated server process executes sync(). The server goes 
into coma until the copy finishes.

Mikulas
--

From: Nick Piggin
Date: Friday, October 3, 2008 - 5:31 am

Scans all block devices with direct IO? Hmm, why, I wonder? Should

I'm not sure.
--

From: Mikulas Patocka
Date: Friday, October 3, 2008 - 6:50 am

LVM must not allocate any memory when doing IO because it suspends the 
block device and memory allocation could trigger writeback on the 
suspended device and deadlock.

So it preallocates heap and stack, mlockall()s itself and does direct IO.

Well, there are still two allocations on direct IO path --- one in 
__blockdev_direct_IO and the other in dio_bio_alloc. If you have an idea 
how to get rid of them (especially that kzalloc(sizeof(*dio), GFP_KERNEL), 
that bio allocation would be quite easy to avoid), it would be benefical.

Mikulas
--

From: Alasdair G Kergon
Date: Friday, October 3, 2008 - 7:50 am

True, but unrelated to the scanning, which LVM performs *prior* to
entering such a state.

We use direct IO while scanning because it's essential all nodes in a
cluster see the same updated version of the data after any node updated
it.

Alasdair
-- 
agk@redhat.com
--

From: Alasdair G Kergon
Date: Friday, October 3, 2008 - 7:36 am

Scan in the sense of it reading a few sectors from each potential LVM
device to check whether it contains an LVM label.

Alasdair
-- 
agk@redhat.com
--

From: david
Date: Friday, October 3, 2008 - 8:52 am

I've seen lots of discussions here about different options in syncing. in 
this case a fix is to do a fsync of a range. I've also seen discussions of 
how the kernel filesystem code can do ordered writes without having to 
wait for them with the use of barriers, is this capability exported to 
userspace? if so, could you point me at documentation for it?

David Lang
--

From: Mikulas Patocka
Date: Sunday, October 5, 2008 - 5:04 pm

It fixes the bug in concurrent direct read+buffed write, but won't fix the 

It isn't. And it is good that it isn't --- the more complicated API, the 
more maintenance work.

--

From: david
Date: Sunday, October 5, 2008 - 5:19 pm

I can understand that most software would not want to deal with 
complications like this, but for things thta have requirements similar to 
journaling filesystems (databases for example) it would seem that there 
would be advantages to exposing this capabilities.

David Lang
--

From: Mikulas Patocka
Date: Sunday, October 5, 2008 - 8:42 pm

If you invent new interface that allows submitting several ordered IOs 
from userspace, it will require excessive maintenance overhead over long 
period of time. So it should be only justified, if the performance 
improvement is excessive as well.

It should not be like "here you improve 10% performance on some synthetic 
benchmark in one application that was rewritten to support the new 
interface" and then create a few more security vulnerabilities (because of 
the complexity of the interface) and damage overall Linux progress, 
because everyone is catching bugs in the new interface and checking it for 
correctness.

Mikulas
--

From: david
Date: Monday, October 6, 2008 - 8:37 pm

the same benchmarks that show that it's far better for the in-kernel 
filesystem code to use write barriers should apply for FUSE filesystems.

this isn't a matter of a few % in performance, if an application is 
sync-limited in a way that can be converted to write-ordered the potential 
is for the application to speed up my many times.

programs that maintain indexes or caches of data that lives in other files 
will be able to write data && barrier && write index && fsync and double 
their performance vs write data && fsync && write index && fsync

databases can potentially do even better, today they need to fsync data to 
disk before they can update their journal to indicate that the data has 
been written, with a barrier they could order the writes so that the write 
to the journal doesn't happen until the writes of the data. they would 
neve need to call an fsync at all (when emptying the journal)

for systems without solid-state drives or battery-backed caches, the 
ability to eliminate fsyncs by being able to rely on the order of the 
writes is a huge benifit.

David Lang
--

From: Mikulas Patocka
Date: Tuesday, October 7, 2008 - 8:44 am

FUSE is slow by design, and it is used in cases where performance isn't 

They can do: write data with O_SYNC; write another piece of data with 
O_SYNC.

And the only difference from barriers is the waiting time after the first 
O_SYNC before the second I/O is submitted (such delay wouldn't happen with 
barriers).

And now I/O delay is in milliseconds and process wakeup time is tens of 
microseconds, it doesn't look like eliminating process wakeup time would 

Good databases can pack several user transactions into one fsync() write. 
If the database server is properly engineered, it accumulates all user 
transactions committed so far into one chunk, writes that chunk with one 
fsync() call and then reports successful commit to the clients.

So if you increase fsync() latency, it should have no effect on the 
transactional throughput --- only on latency of transactions. Similarly, 
if you decrease fsync() latency, it won't increase number of processed 
transactions.

Certainly, there are primitive embedded database libraries that fsync() 

I may ask --- where are the applications that require extra slow fsync() 
latency? Databases are not that, they batch transactions.

If you want to improve things, you can try:
* implement O_DSYNC (like O_SYNC, but doesn't update inode mtime)
* implement range_fsync and range_fdatasync (sync on file range --- the 
kernel has already support for that, you can just add a syscall)
* turn on FUA bit for O_DSYNC writes, that eliminates the need to flush 
drive cache in O_DSYNC call

--- these are definitely less invasive than new I/O submitting interface.

--

From: david
Date: Tuesday, October 7, 2008 - 10:16 am

FUSE is slow, but I don't believe that it's a design goal for it to be 
slow, it's a limitation of the implementation. so things that could speed 

each sync write needs to wait for a disk rotation (and a seek if you are 
writing to different files). if you only do two writes you save one disk 

if there are multiple users doing transactions at the same time they will 
benifit from overlapping the fsyncs. but each user session cannot complete 

only if you have all your transactions happening in parallel. in the real 
world programs sometimes need to wait for one transaction to complete so 

but all of these require that the application stop and wait for each 
seperate write to take place before proceeding to the next step.

if this doesn't matter, then why the big push to have the in-kernel 
filesystems start using barriers? I understood that this resulted in large 
performance increases in the places that they are used from just being 
able to avoid having to drain the entire request queue, and you are saying 
that the applications would not only need to wait for the queue to flush, 
but for the disk to acknowledge the write.

syncs are slow, in some cases _very_ slow.

David Lang
--

Previous thread: [PATCH 2/4] x86: Add UV bios call infrastructure by Russ Anderson on Monday, September 22, 2008 - 2:09 pm. (5 messages)

Next thread: [PATCH 3/4] x86: Add UV partition call by Russ Anderson on Monday, September 22, 2008 - 2:10 pm. (2 messages)