Hi Here is a patch for MM livelock. The original bug report follows after the patch. To reproduce the bug on my computer, I had to change bs=4M to bs=65536 in examples in the original report. Mikulas --- Fix starvation in memory management. The bug happens when one process is doing sequential buffered writes to a block device (or file) and another process is attempting to execute sync(), fsync() or direct-IO on that device (or file). This syncing process will wait indefinitelly, until the first writing process finishes. For example, run these two commands: dd if=/dev/zero of=/dev/sda1 bs=65536 & dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct The bug is caused by sequential walking of address space in write_cache_pages and wait_on_page_writeback_range: if some other process is constantly making dirty and writeback pages while these functions run, the functions will wait on every new page, resulting in indefinite wait. To fix the starvation, I declared a mutex starvation_barrier in struct address_space. When the mutex is taken, anyone making dirty pages on that address space should stop. The functions that walk address space sequentially first estimate a number of pages to process. If they process more than this amount (plus some small delta), it means that someone is making dirty pages under them --- they take the mutex and hold it until they finish. When the mutex is taken, the function balance_dirty_pages will wait and not allow more dirty pages being made. The mutex is not really used as a mutex, it does not prevent access to any critical section. It is used as a barrier that blocks new dirty pages from being created. I use mutex and not wait queue to make sure that the starvation can't happend the other way (if there were wait queue, excessive calls to write_cache_pages and wait_on_page_writeback_range would block balance_dirty_pages forever; with mutex it is guaranteed that every process eventually makes progress). The essential property ...
On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
Shouldn't happen. All the data-syncing functions should have an upper
bound on the number of pages which they attempt to write. In the
example above, we end up in here:
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
loff_t end, int sync_mode)
{
int ret;
struct writeback_control wbc = {
.sync_mode = sync_mode,
.nr_to_write = mapping->nrpages * 2, <<--
.range_start = start,
.range_end = end,
};
so generic_file_direct_write()'s filemap_write_and_wait() will attempt
to write at most 2* the number of pages which are in cache for that inode.
I'd say that either a) that logic got broken or b) you didn't wait long
enough, and we might need to do something to make it not wait so long.
But before we patch anything we should fully understand what is
happening and why the current anti-livelock code isn't working in this
case.
--
See write_cache_pages:
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page); (1)
if (PageWriteback(page) ||
!clear_page_dirty_for_io(page)) {
unlock_page(page); (2)
continue;
}
ret = (*writepage)(page, wbc, data);
if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
unlock_page(page);
ret = 0;
}
if (ret || (--(wbc->nr_to_write) <= 0))
done = 1;
--- so if it goes by points (1) and (2), the counter is not decremented,
yet the function waits for the page. If there is constant stream of
writeback pages being generated, it waits on each on them --- that is,
forever. I have seen livelock in this function. For you that example with
two dd's, one buffered write and the other directIO read doesn't work? For
me it livelocks here.
wait_on_page_writeback_range is another example where the livelock
happened, there is no protection at all against starvation.
BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous thing
to me.
Imagine this case: You have two pages with indices 4 and 5 dirty in a
file. You call fsync(). It sets nr_to_write to 4.
Meanwhile, another process makes pages 0, 1, 2, 3 dirty.
The fsync() process goes to write_cache_pages, writes the first 4 dirty
pages and exits because it goes over the limit.
result --- you violate fsync() semantics, pages that were dirty before
Mikulas
--
On Tue, 23 Sep 2008 18:34:20 -0400 (EDT) um, OK. So someone else is initiating IO for this inode and this thread *never* gets to initiate any writeback. That's a bit of a surprise. How do we fix that? Maybe decrement nt_to_write for these pages as yup, that's pretty much unfixable, really, unless new locks are added which block threads which are writing to unrelated sections of the file, and that could hurt some workloads quite a lot, I expect. Hopefully high performance applications are instantiating the file up-front and are using sync_file_range() to prevent these sorts of things from happening. But they probably aren't. --
And what do you want to do with wait_on_page_writeback_range? When I solved that livelock in write_cache_pages(), I got another livelock in It is fixable with the patch I sent --- it doesn't take any locks unless the starvation happens. Then, you don't have to use .nr_to_write for fsync anymore. Another solution could be to record in page structure jiffies when the page entered dirty state and writeback state. The start writeback/wait on writeback functions could then trivially ignore pages that were --- for databases it is pretty much possible that one thread is writing already journaled data (so it doesn't care when the data are really written) and another thread is calling fsync() on the same inode simultaneously --- so fsync() could mistakenly write the data generated by the first thread and ignore the data generated by the second thread, that it should really write. Mikulas --
On Tue, 23 Sep 2008 19:11:51 -0400 (EDT) I agree that the patch is low-impact and relatively straightforward. The main problem is making the address_space larger - there can (and often are) millions and millions of these things in memory. Making it larger is a big deal. We should work hard to seek an alternative and afacit that isn't happening here. We already have existing code and design which attempts to avoid livelock without adding stuff to the address_space. Can it be modified --
I reworked my patch to use a bit in address_space->flags and hashes wait queues, so it doesn't take any extra memory. I'm sending it in three parts. 1 - make generic function wait_action_schedule 2 - fix the livelock, the logic is exactly the same as in my previous patch, wait_on_bit_lock is used instead of mutexes 3 - remove that nr_pages * 2 limit, because it causes misbehavior and possible data loss. Mikulas --
A generic function wait_action_schedule that allows to use wait_on_bit_lock just
like mutexes.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
include/linux/wait.h | 8 +++++++-
kernel/wait.c | 7 +++++++
2 files changed, 14 insertions(+), 1 deletion(-)
Index: linux-2.6.27-rc7-devel/include/linux/wait.h
===================================================================
--- linux-2.6.27-rc7-devel.orig/include/linux/wait.h 2008-09-24 03:20:59.000000000 +0200
+++ linux-2.6.27-rc7-devel/include/linux/wait.h 2008-09-24 03:26:34.000000000 +0200
@@ -513,7 +513,13 @@ static inline int wait_on_bit_lock(void
return 0;
return out_of_line_wait_on_bit_lock(word, bit, action, mode);
}
-
+
+/**
+ * wait_action_schedule - this function can be passed to wait_on_bit or
+ * wait_on_bit_lock and it will call just schedule().
+ */
+int wait_action_schedule(void *);
+
#endif /* __KERNEL__ */
#endif
Index: linux-2.6.27-rc7-devel/kernel/wait.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/kernel/wait.c 2008-09-24 03:22:58.000000000 +0200
+++ linux-2.6.27-rc7-devel/kernel/wait.c 2008-09-24 03:24:10.000000000 +0200
@@ -251,3 +251,10 @@ wait_queue_head_t *bit_waitqueue(void *w
return &zone->wait_table[hash_long(val, zone->wait_table_bits)];
}
EXPORT_SYMBOL(bit_waitqueue);
+
+int wait_action_schedule(void *word)
+{
+ schedule();
+ return 0;
+}
+EXPORT_SYMBOL(wait_action_schedule);
--
Avoid starvation when walking address space.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
include/linux/pagemap.h | 1 +
mm/filemap.c | 20 ++++++++++++++++++++
mm/page-writeback.c | 37 ++++++++++++++++++++++++++++++++++++-
mm/truncate.c | 24 +++++++++++++++++++++++-
4 files changed, 80 insertions(+), 2 deletions(-)
Index: linux-2.6.27-rc7-devel/include/linux/pagemap.h
===================================================================
--- linux-2.6.27-rc7-devel.orig/include/linux/pagemap.h 2008-09-24 02:57:37.000000000 +0200
+++ linux-2.6.27-rc7-devel/include/linux/pagemap.h 2008-09-24 02:59:04.000000000 +0200
@@ -21,6 +21,7 @@
#define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
#define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
#define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */
+#define AS_STARVATION (__GFP_BITS_SHIFT + 3) /* an anti-starvation barrier */
static inline void mapping_set_error(struct address_space *mapping, int error)
{
Index: linux-2.6.27-rc7-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/filemap.c 2008-09-24 02:59:33.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/filemap.c 2008-09-24 03:13:47.000000000 +0200
@@ -269,10 +269,19 @@ int wait_on_page_writeback_range(struct
int nr_pages;
int ret = 0;
pgoff_t index;
+ long pages_to_process;
if (end < start)
return 0;
+ /*
+ * Estimate the number of pages to process. If we process significantly
+ * more than this, someone is making writeback pages under us.
+ * We must pull the anti-starvation plug.
+ */
+ pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+ pages_to_process += (pages_to_process >> 3) + 16;
+
pagevec_init(&pvec, 0);
index = start;
while ((index <= end) &&
@@ -288,6 +297,10 @@ int wait_on_page_writeback_range(struct
if (page->index > end)
...Please don't send multiple patches with identical titles - think up a good, unique, meaningful title for each patch. Please include a full changelog with each iteration of each patch. That changelog should explain the reason for playing games with This sequence appears twice and it would probably be clearer to This is copied three times and perhaps also should be factored out. Please note that an effort has been made to make mm/filemap.c look This sequence is repeated three or four times and should be pulled out into a well-commented function. That comment should explain the logic behind the use of these barriers, please. --
Hi I removed the repeated code and create a new bit mutexes. They are space-efficient mutexes that consume only one bit. See the next 3 patches. If you are concerned about the size of an inode, I can convert other mutexes to bit mutexes: i_mutex and inotify_mutex. I could also create bit_spinlock (one-bit spinlock that uses test_and_set_bit) and save space for address_space->tree_lock, address_space->i_mmap_lock, address_space->private_lock, inode->i_lock. Look at it and say what you think about the idea of condensing mutexes into single bits. --
I wouldn't worry for now. mutexes can be unlocked much faster than bit mutexes, especially in the fastpath. And due to slab, it would be We have that already. It is much much faster to unlock spinlocks than bit spinlocks in general (if you own the word exclusively, then it's not, but then you would be less likely to save space), and we can also Looks pretty good to me. --
Maybe inotify_mutex. You are right that i_mutex is so heavily contended that slowing it down to save few words wouldn't be good. Do you know about BTW. why do spinlocks on x86(64) have 32 bits and not 8 bits or 16 bits? Are atomic 32-bit instuctions faster? Can x86(86) system have 256 CPUs? Mikulas --
Don't really know, no. I think most desktop environments use it to In the case of <= 256 CPUs, they could be an unsigned short I think. Probably it has never been found to be a huge win because they are often beside other ints or longs. I think I actually booted up the kernel with 16-bit spinlocks when doing the FIFO locks, but never Well, none that I know of which actually exist. SGI is hoping to have 4096 CPU x86 systems as far as I can tell. --
A bit mutex implementation.
Bit mutex is a space-efficient mutex that consumes exactly one bit in
the apropriate structure (if mutex debugging is turned off). Bit mutex
is implemented as a bit in unsigned long field. Other bits in the field
may be used for other purposes, if test/set/clear_bit functions are used
to manipulate them. There is no wait queue in the structure containing
the bit mutex; hashed wait queues for waiting on bits are used.
Additional structure struct bit_mutex is needed, it contains lock
debugging & dependency information. When the kernel is compiled without
lock debugging, this structure is empty.
Bit mutexes are used with functions
bit_mutex_init
bit_mutex_lock
bit_mutex_unlock
bit_mutex_is_locked
These functions take 3 arguments: pointer to the unsigned long field,
the bit position and pointer to struct bit_mutex.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
include/linux/bit-mutex.h | 98 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/wait.h | 8 +++
kernel/Makefile | 2
kernel/bit-mutex-debug.c | 55 +++++++++++++++++++++++++
kernel/wait.c | 7 +++
5 files changed, 169 insertions(+), 1 deletion(-)
Index: linux-2.6.27-rc8-devel/include/linux/bit-mutex.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.27-rc8-devel/include/linux/bit-mutex.h 2008-10-05 19:58:30.000000000 +0200
@@ -0,0 +1,98 @@
+#ifndef __LINUX_BIT_MUTEX_H
+#define __LINUX_BIT_MUTEX_H
+
+#include <linux/bitops.h>
+#include <linux/lockdep.h>
+#include <linux/wait.h>
+#include <linux/sched.h>
+
+struct bit_mutex {
+#ifdef CONFIG_DEBUG_MUTEXES
+ unsigned long *field;
+ int bit;
+ struct thread_info *owner;
+ const char *name;
+ void *magic;
+#endif
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+#endif
+};
+
+#ifndef CONFIG_DEBUG_MUTEXES
+
+static inline void bit_mutex_init(unsigned long ...Fix starvation in memory management. The bug happens when one process is doing sequential buffered writes to a block device (or file) and another process is attempting to execute sync(), fsync() or direct-IO on that device (or file). This syncing process will wait indefinitelly, until the first writing process finishes. For example, run these two commands: dd if=/dev/zero of=/dev/sda1 bs=65536 & dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct The bug is caused by sequential walking of address space in write_cache_pages and wait_on_page_writeback_range: if some other process is constantly making dirty and writeback pages while these functions run, the functions will wait on every new page, resulting in indefinite wait. To fix the starvation, I declared a bit mutex starvation_barrier in struct address_space. The bit AS_STARVATION_BARRIER in address_space->flags is used for actual locking. When the mutex is taken, anyone making dirty pages on that address space should stop. The functions that walk address space sequentially first estimate a number of pages to process. If they process more than this amount (plus some small delta), it means that someone is making dirty pages under them --- they take the mutex and hold it until they finish. When the mutex is taken, the function balance_dirty_pages will wait and not allow more dirty pages being made. The mutex is not really used as a mutex, it does not prevent access to any critical section. It is used as a barrier that blocks new dirty pages from being created. I use mutex and not wait queue to make sure that the starvation can't happend the other way (if there were wait queue, excessive calls to write_cache_pages and wait_on_page_writeback_range would block balance_dirty_pages forever; with mutex it is guaranteed that every process eventually makes progress). The essential property of this patch is that if the starvation doesn't happen, no additional locks are taken and no atomic operations are performed. So ...
On Sun, 5 Oct 2008 18:14:50 -0400 (EDT) are you sure? isn't the right fix to just walk the file pages only once? --
It walks the pages only once --- and waits on each on them. But because new pages are contantly appearing under it, that "walk only once" becomes infinite loop (well, finite, until the whole disk is written). Mikulas --
On Sun, 5 Oct 2008 19:02:57 -0400 (EDT) well. fsync() promises that everything that's dirty at the time of the call will hit the disk. That is not something you can compromise. The only way out would be is to not allow new dirty during an fsync()... which is imo even worse. Submit them all in one go, then wait, should not be TOO bad. Unless a lot was dirty already, but then you go back to "but it has to go to disk". -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
The problem here is that you have two processes --- one is writing, the other is simultaneously syncing. The syncing process can't distinguish the pages that were created before fsync() was invoked (it has to wait on them) and the pages that were created while fsync() was running (it doesn't have to wait on them) --- so it waits on them all. The result is livelock, it waits indefinitely, because more and more pages are being created. The patch changes it so that if it waits long enough, it stops the other writers creating dirty pages. Or, how otherwise would you implement "Submit them all in one go, then wait"? The current code is: you grab page 0, see it is under writeback, wait on it you grab page 1, see it is under writeback, wait on it you grab page 2, see it is under writeback, wait on it you grab page 3, see it is under writeback, wait on it ... --- and the other process is just making more and more writeback pages while your waiting routine run. So the waiting is indefinite. Mikulas --
On Sun, 5 Oct 2008 19:18:22 -0400 (EDT) for pages it already passed that's not true. it shouldn't be doing this. It should be "submit the lot" "wait on the lot that got submitted". keeping away new writers is just shifting the latency to an innocent party (since the fsync() is just bloody expensive, the person doing it should be paying the price, not the rest of the system), and a grave mistake. If the fsync() implementation isn't smart enough, sure, lets improve it. But not by shifting latency around... lets make it more efficient at submitting IO. If we need to invent something like "chained IO" where if you wait on the last of the chain, you wait on the entirely chain, so be it. -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
And how do you want to implement that "wait on the lot that got submitted". Note that you can't allocate an array or linked list that holds pointers to all submitted pages. And if you add anything to the page structure or radix tree, you have to serialize concurrent sync,fsync I assume that if very few people complained about the livelock till now, very few people will see degraded write performance. My patch blocks the writes only if the livelock happens, so if the livelock doesn't happen in This looks madly complicated. And ineffective, because if some page was submitted before fsync() was invoked, and is under writeback while fsync() is called, fsync() still has to wait on it. Mikulas --
On Sun, 5 Oct 2008 20:01:46 -0400 (EDT) I object to calling this a livelock. It's not. And yes, fsync is slow and lots of people are seeing that. It's not helped by how ext3 is implemented (where fsync is effectively equivalent of a sync for many cases). so? just make a chain per inode always... -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
It unlocks itself when the whole disk is written, and it can be several hours (or days, if you have many-terabyte array). So formally it is not livelock, from the user experience it is --- he sees unkillable process in The point is that many fsync()s may run in parallel and you have just one inode and just one chain. And if you add two-word list_head to a page, to link it on this list, many developers will hate it for increasing its size. See the work dobe by Nick Piggin somewhere in this thread. He uses just one bit in radix tree to mark pages to process. But he needs to serialize all syncs on a given file, they no longer run in parallel. Mikulas --
why to a page? a list head in the inode and chain up the bios.... or not make an actual list but just a "is the previous one done" thing it's not all that hard to get something that works on a per inode basis, that gives "wait for all io upto this one". -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
And if you want to wait for a bio submitted by a different process? So code it :) Mikulas --
On Mon, 6 Oct 2008 09:00:14 -0400 (EDT) the point is that the kernel would always chain it to the inode, independent of who or when it is submitted -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
If you add a list to an inode, you need to protect it with a spinlock. So you take one more spinlock for any write bio submitted --- a lot of developers would hate it. Another problem: how do you want to walk all dirty pages and submit bio for them? The act of allocating and submission of bio can block (if you run out of some mempool) and in this case it wait until some other bio is finished. During this time, more dirty pages can be created. Also, if you find a page that is both dirty and under writeback, you need to wait until a writeback finishes and then initiate another writeback (because the old writeback may be writing stale data). You again, block, and more dirty pages can appear. And if you block and more dirty pages appear, you are prone to the livelock. [ In Nick Piggin's patch, it is needed to lock the whole address space, mark dirty pages in one non-blocking pass and write marked pages again in a blocking pass --- so that if more dirty pages appear while bios are submitted, the new pages will be skipped ] Mikulas --
8 hours of process in D state is a livelock. And we can do minimal fix here, this almost never happens in real life anyway. Latency imposed of writer should not be a problem... -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
fsync() waiting on pre-issued writeback pages is the correct behaviour. IOW, if the page is under writeback at the time an fsync() is issued (e.g. issued by pdflush), the page was *not clean* at the time the fsync() was called and hence must be clean when fsync() returns. fsync() needs to wait for all pages under I/O at the time it is called, not just the dirty pages it issues I/O on..... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Fix violation of sync()/fsync() semantics. Previous code walked up to
mapping->nrpages * 2 pages. Because pages could be created while
__filemap_fdatawrite_range was in progress, it could lead to a
misbehavior. Example: there are two pages in address space with indices
4, 5. Both are dirty. Someone calls __filemap_fdatawrite_range, it sets
.nr_to_write = 4. Meanwhile, some other process creates dirty pages 0,
1, 2, 3. __filemap_fdatawrite_range writes pages 0, 1, 2, 3, finds out
that it reached the limit and exits.
Result: pages that were dirty before __filemap_fdatawrite_range was
invoked were not written.
With starvation protection from the previous patch, this
mapping->nrpages * 2 logic is no longer needed.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
mm/filemap.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
Index: linux-2.6.27-rc7-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/filemap.c 2008-09-24 14:47:01.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/filemap.c 2008-09-24 15:01:23.000000000 +0200
@@ -202,6 +202,11 @@ static int sync_page_killable(void *word
* opposed to a regular memory cleansing writeback. The difference between
* these two operations is that if a dirty page/buffer is encountered, it must
* be waited upon, and not just skipped over.
+ *
+ * Because new pages dirty can be created while this is executing, that
+ * mapping->nrpages * 2 condition is unsafe. If we are doing data integrity
+ * write, we must write all the pages. AS_STARVATION bit will eventually prevent
+ * creating more dirty pages to avoid starvation.
*/
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
loff_t end, int sync_mode)
@@ -209,7 +214,7 @@ int __filemap_fdatawrite_range(struct ad
int ret;
struct writeback_control wbc = {
.sync_mode = sync_mode,
- .nr_to_write = mapping->nrpages * 2,
+ .nr_to_write = sync_mode == ...Should this be added to CodingStyle or some other file?
From: Randy Dunlap <randy.dunlap@oracle.com>
We want all uses of memory barriers to be explained in the source
code.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
---
Documentation/SubmitChecklist | 3 +++
1 file changed, 3 insertions(+)
--- lin2627-rc9-docs.orig/Documentation/SubmitChecklist
+++ lin2627-rc9-docs/Documentation/SubmitChecklist
@@ -85,3 +85,6 @@ kernel patches.
23: Tested after it has been merged into the -mm patchset to make sure
that it still works with all of the other queued patches and various
changes in the VM, VFS, and other subsystems.
+
+24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the
+ source code that explains the logic of what they are doing and why.
--
Seriously? When a barrier is used, it's generally self-evident what it's doing. The real problem is when a barrier is *not* used. It would probably be more helpful to comment every non-barrier line of code to explain why it's okay if memory operations are getting reordered. -- Chris --
fs/buffer.c:sync_buffer(). Have fun. --
The real disaster there is the clear_buffer_##name macro and friends, as
evidenced by fs/ext2/inode.c:599
clear_buffer_new(bh_result); /* What's this do? */
I'm completely in favor of documenting everything that can potentially interact
with that train wreck, but I maintain that the vast majority of memory barriers
are self-evident.
-- Chris
--
That's not a disaster. It is relatively easy even if you have no idea about any of that code to a) work out what BH_New flag does, see where it gets set and where it gets cleared, and then work out what that does. Actually, buffer.c used to leak BH_New in some cases, but now it should be a bug if BH_New is found to be set there (buffer.c should take any BH_New buffers, initialize them appropriately, and clear BH_New; a dangling BH_New would indicate uninitialized data going to or coming from the block). No, they're easy, because you can find every single place where any one cares about them with a single simple grep. Again, fs/buffer.c:sync_buffer(). Which stores and/or loads is that barrier sequencing against which subsequent stores and/or loads? Why? For another fun example, mm/filemap.c:sync_page. This actually has a big comment, but (to me) it isn't even evident then which loads and stores are being sequenced against which subsequent ones because it is not explicitly documented. And I do have some experience in adding barriers to existing mm code where they have been missed completely. They are self-evident if you have spent a lot of time getting the state machine and locking/concurrency model in your mind. If you have not, then how do you know there is not some memory operation way back or forward in the instruction stream that is supposed to be ordered by this barrier? All memory barriers have to be documented, except acquire/release for critical sections, IMO. --
Acquire and release barriers attached to operations are usually self- evident; standalone wmb() and rmb() much less so. It is helpful to be explicit about exactly which memory operations need to be ordered, which are often not the memory operations immediately preceding and following it. "all" may have been a bit strong though. Ben. -- Ben Hutchings, Senior Software Engineer, Solarflare Communications Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. --
No, I don't think so. We should absolutely force "all". That allows nobody to be lazy, no confusion, and reminds people that memory barriers are not easy to follow for a new reader of the code, or necessarily even the author, 6 months later. If somebody is too lazy to write a comment, they can use locks One last quick quiz, easier than the earlier ones... mm/vmscan.c:__remove_mapping has a score of lines documenting exactly what memory operations are being ordered, and even an example of what happens if the ordering is not folllowed. This is a pretty good comment, if I say so myself. However, it has one deficiency in that it doesn't explicitly state where the write barrier(s) is (IMO the comments for one part of an ordering protocol should reference the other parts of the protcol). Where are the store barriers, or why are they not required? --
"what they are doing" will almost always be "flush value to RAM" or similar.
How about this instead:
+ 24: All memory barriers ({e.g., barrier(), rmb(), wmb()} need a comment in the
+ source code that explains the race condition being prevented, and states
+ the location of the other code or hardware feature that races with this.
+
+ An example comment:
+
+ /*
+ * If we don't do a wmb() here, the RBFROBNIZ register on the XU293
+ * card will get confused and wedge the hardware...
+ */
+ wmb();
Memory barriers don't flush anything anywhere. It's simple: you must This comment is no good, because it doesn't tell you what the memory barrier does. It tells you what might happen if you omit it. /* * If we don't do a wmb() here, the store to the RBFROBNIZ, * above, might reach the device before the store X, below. * * If that happens, then the XU293 card will get confused * and wedge the hardware... */ wmb(); If you don't comment like that, then how does the reader know that the wmb is not *also* supposed to order the store with any other of the limitless subsequent stores until the next memory ordering operation? Or any of the previous stores since the last one? --
That's what I get for commenting on stuff when I'm into a 40-hour week by Wednesday. :) I was speaking of the generic programmer who does stuff like: x = 10; /* set x to 10 */ for "what they are doing". You know the type. ;) "flush value to RAM", "force memory barrier operation", and I think I've seen a few kzalloc()'s that have "allocate zero'ed memory" on them. "what they are doing" is usually not worth writing down, but being verbose for the *why* is almost always good, especially for things like memory barriers that almost nobody can get their brains wrapped around (how many flame-fests per Even better (as I missed the "also supposed to know" case). My general point was that a concrete example would improve Randy's original patch by showing what sort of things should be in the comment, and your correction pointed out *why* a concrete example was needed. ;)
Fix violation of sync()/fsync() semantics. Previous code walked up to
mapping->nrpages * 2 pages. Because pages could be created while
__filemap_fdatawrite_range was in progress, it could lead to a misbehavior.
Example: there are two pages in address space with indices 4, 5. Both are dirty.
Someone calls __filemap_fdatawrite_range, it sets .nr_to_write = 4.
Meanwhile, some other process creates dirty pages 0, 1, 2, 3.
__filemap_fdatawrite_range writes pages 0, 1, 2, 3, finds out that it reached
the limit and exits.
Result: pages that were dirty before __filemap_fdatawrite_range was invoked were
not written.
With starvation protection from the previous patch, this mapping->nrpages * 2
logic is no longer needed.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
mm/filemap.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
Index: linux-2.6.27-rc7-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/filemap.c 2008-09-24 14:47:01.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/filemap.c 2008-09-24 15:01:23.000000000 +0200
@@ -202,6 +202,11 @@ static int sync_page_killable(void *word
* opposed to a regular memory cleansing writeback. The difference between
* these two operations is that if a dirty page/buffer is encountered, it must
* be waited upon, and not just skipped over.
+ *
+ * Because new pages dirty can be created while this is executing, that
+ * mapping->nrpages * 2 condition is unsafe. If we are doing data integrity
+ * write, we must write all the pages. AS_STARVATION bit will eventually prevent
+ * creating more dirty pages to avoid starvation.
*/
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
loff_t end, int sync_mode)
@@ -209,7 +214,7 @@ int __filemap_fdatawrite_range(struct ad
int ret;
struct writeback_control wbc = {
.sync_mode = sync_mode,
- .nr_to_write = mapping->nrpages * 2,
+ .nr_to_write = sync_mode == ...I think the problem has been misidentified, or else I have misread the code. See below. I hope I'm right, because I think the patches are pretty heavy on complexity in these already complex paths... It would help if you explicitly identify the exact livelock. Ie. give a *What* is, forever? Data integrity syncs should have pages operated on in-order, until we get to the end of the range. Circular writeback could What's the actual problem, though? nr_to_write should not be used for data integrity operations, and it should not be critical for other writeout. Upper layers should be able to deal with it rather than Wow, that's really nasty. Sad we still have known data integrity problems Why is it unfixable? Just ignore nr_to_write, and write out everything properly, I would have thought. Some things may go a tad slower, but those are going to be the things that are using fsync, in which cases they are going to hurt much more from the loss of data integrity than a slowdown. Unfortunately because we have played fast and loose for so long, they expect this behaviour, were tested and optimised with it, and systems designed and deployed with it, and will notice performance regressions if we start trying to do things properly. This is one of my main arguments for doing things correctly up-front, even if it means a massive slowdown in some real or imagined workload: at least then we will hear about complaints and be able to try to improve them rather than setting ourselves up for failure later. /rant Anyway, in this case, I don't think there would be really big problems. Also, I think there is a reasonable optimisation that might improve it (2nd last point, in attached patch). OK, so after glancing at the code... wow, it seems like there are a lot of bugs in there.
That can cause fsync to wait arbitrarily long if some other process is writing the file. This happens. --
It can be fixed without touching non-fsync paths (see my next email for What does such a thing? It would have been nicer to ask them to not do that then, or get them to use range syncs or something. Now that's much harder because we've accepted the crappy workaround for so long. It's far far worse to just ignore data integrity of fsync because of the problem. Should at least have returned an error from fsync in that case, or make it interruptible or something. --
If a file has one dirty page at offset 1000000000000000 then someone does an fsync() and someone else gets in first and starts madly writing pages at offset 0, we want to write that page at 1000000000000000. Somehow. I expect there's no solution which avoids blocking the writers at some stage. --
Damn people... See my other email. Something roughly like this would do the trick (hey, it actually boots and runs and does fix the problem too). It's ugly because we don't have quite the right radix tree operations yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range lookups). But the theory is to up-front tag the pages that we need to get to disk. Completely no impact or slowdown to any writers (although it does add 8 bytes of tags to the radix tree node... but doesn't increase memory footprint as such due to slab).
It needs exclusion to protect all those temp tags. Is do_fsync()'s i_mutex sufficient? It's qute unobvious (and unmaintainable?) that all Can we reduce the amount of copy-n-pasting here? --
Yeah... it does need a lock, which I brushed under the carpet :P I was going to just say use i_mutex, but then we really would start impacting on other fastpaths (eg writers). Possibly a new mutex in the address_space? That way we can say "anybody who holds this mutex is allowed to use the tag for anything" and it doesn't have to be fsync specific (whether that would be of Yeah... I went to break the sync/async cases into two, but it looks like it may not have been worthwhile. Just another branch might be the best way to go. As far as the c&p in setting the FSYNC tag, yes that should all go away if the radix-tree is up to scratch. Basically: radix_tree_tag_set_if_tagged(start, end, ifWRITEBACK|DIRTY, setFSYNC); should be able to replace the whole thing, and we'd hold the tree_lock, so we would not have to take the page lock etc. Basically it would be much nicer... even somewhere close to a viable solution. --
That's another, umm 24 bytes minimum in the address_space (and inode). That's fairly ouch, which is why Miklaus did that hokey bit-based Yup. Could add another do-this flag in the writeback_control, perhaps. --
Well yeah, it would be a bit based mutex in mapping->flags with Yeah... possibly we could just _always_ do the PAGECACHE_TAG_FSYNC thing if mode != WB_SYNC_NONE. I think if we had the infrastructure there to do it all, it should always be something we want to do for data integrity writeout. --
That filemap_fdatawrite and filemap_fdatawait in fsync() aren't really called under i_mutex (see do_fsync). So the possible solutions are: 1. Add jiffies when the page was diried and wroteback to struct page + no impact on locking and concurrency - increases the structure by 8 bytes 2. Stop the writers when the starvation happens (what I did) + doesn't do any locking if the livelock doesn't happen - locks writers when the livelock happens (I think it's not really serious --- because very few people complained about the livelock, very few people will see performance degradation from blocking the writers). 3. Add another bit to radix tree (what Nick did) + doesn't ever block writers - unconditionally takes the lock on fsync path and serializates concurrent syncs/fsyncs. Probably low overhead too ... or I don't know, is there any possible situation when more processes execute sync() in parallel and user would see degradations if those syncs were serialized? Mikulas --
Maybe it is because not much actually does sequential writes to a massive file or block device while trying to fsync it as well? I don't know. You could still have cases where fsync takes much longer than expected but it --
At most twice the time it would normally take (one loop of writeback queue until it detects the livelock and the other loop until it drains all the new pages that were created during the first loop). While with solution (3) it would take only once for the whole writeback queue. --
OK, I have been able to reproduce it somewhat. It is not a livelock, but what is happening is that direct IO read basically does an fsync on the file before performing the IO. The fsync gets stuck behind the dd that is dirtying the pages, and ends up following behind it and doing all its IO for it. The following patch avoids the issue for direct IO, by using the range syncs rather than trying to sync the whole file. The underlying problem I guess is unchanged. Is it really a problem, though? The way I'd love to solve it is actually by adding another bit or two to the pagecache radix tree, that can be used to transiently tag the tree for future operations. That way we could record the dirty and writeback pages up front, and then only bother with operating on them. That's *if* it really is a problem. I don't have much pity for someone doing buffered IO and direct IO to the same pages of the same file :)
LVM does (that is where the bug was discovered). Basically, it scans all the block devices with direct IO and if someone else does buffered IO on any device simultaneously, it locks up. That fsync-vs-write livelock is quite improbably (why would some application do it?) --- although it could be used as a DoS --- getting unkillable process. But there is another possible real-world problem --- sync-vs-write --- i.e. admin plugs in two disks and copies data from one to the other. Meanwhile, some unrelated server process executes sync(). The server goes into coma until the copy finishes. Mikulas --
Scans all block devices with direct IO? Hmm, why, I wonder? Should I'm not sure. --
LVM must not allocate any memory when doing IO because it suspends the block device and memory allocation could trigger writeback on the suspended device and deadlock. So it preallocates heap and stack, mlockall()s itself and does direct IO. Well, there are still two allocations on direct IO path --- one in __blockdev_direct_IO and the other in dio_bio_alloc. If you have an idea how to get rid of them (especially that kzalloc(sizeof(*dio), GFP_KERNEL), that bio allocation would be quite easy to avoid), it would be benefical. Mikulas --
True, but unrelated to the scanning, which LVM performs *prior* to entering such a state. We use direct IO while scanning because it's essential all nodes in a cluster see the same updated version of the data after any node updated it. Alasdair -- agk@redhat.com --
Scan in the sense of it reading a few sectors from each potential LVM device to check whether it contains an LVM label. Alasdair -- agk@redhat.com --
I've seen lots of discussions here about different options in syncing. in this case a fix is to do a fsync of a range. I've also seen discussions of how the kernel filesystem code can do ordered writes without having to wait for them with the use of barriers, is this capability exported to userspace? if so, could you point me at documentation for it? David Lang --
It fixes the bug in concurrent direct read+buffed write, but won't fix the It isn't. And it is good that it isn't --- the more complicated API, the more maintenance work. --
I can understand that most software would not want to deal with complications like this, but for things thta have requirements similar to journaling filesystems (databases for example) it would seem that there would be advantages to exposing this capabilities. David Lang --
If you invent new interface that allows submitting several ordered IOs from userspace, it will require excessive maintenance overhead over long period of time. So it should be only justified, if the performance improvement is excessive as well. It should not be like "here you improve 10% performance on some synthetic benchmark in one application that was rewritten to support the new interface" and then create a few more security vulnerabilities (because of the complexity of the interface) and damage overall Linux progress, because everyone is catching bugs in the new interface and checking it for correctness. Mikulas --
the same benchmarks that show that it's far better for the in-kernel filesystem code to use write barriers should apply for FUSE filesystems. this isn't a matter of a few % in performance, if an application is sync-limited in a way that can be converted to write-ordered the potential is for the application to speed up my many times. programs that maintain indexes or caches of data that lives in other files will be able to write data && barrier && write index && fsync and double their performance vs write data && fsync && write index && fsync databases can potentially do even better, today they need to fsync data to disk before they can update their journal to indicate that the data has been written, with a barrier they could order the writes so that the write to the journal doesn't happen until the writes of the data. they would neve need to call an fsync at all (when emptying the journal) for systems without solid-state drives or battery-backed caches, the ability to eliminate fsyncs by being able to rely on the order of the writes is a huge benifit. David Lang --
FUSE is slow by design, and it is used in cases where performance isn't They can do: write data with O_SYNC; write another piece of data with O_SYNC. And the only difference from barriers is the waiting time after the first O_SYNC before the second I/O is submitted (such delay wouldn't happen with barriers). And now I/O delay is in milliseconds and process wakeup time is tens of microseconds, it doesn't look like eliminating process wakeup time would Good databases can pack several user transactions into one fsync() write. If the database server is properly engineered, it accumulates all user transactions committed so far into one chunk, writes that chunk with one fsync() call and then reports successful commit to the clients. So if you increase fsync() latency, it should have no effect on the transactional throughput --- only on latency of transactions. Similarly, if you decrease fsync() latency, it won't increase number of processed transactions. Certainly, there are primitive embedded database libraries that fsync() I may ask --- where are the applications that require extra slow fsync() latency? Databases are not that, they batch transactions. If you want to improve things, you can try: * implement O_DSYNC (like O_SYNC, but doesn't update inode mtime) * implement range_fsync and range_fdatasync (sync on file range --- the kernel has already support for that, you can just add a syscall) * turn on FUA bit for O_DSYNC writes, that eliminates the need to flush drive cache in O_DSYNC call --- these are definitely less invasive than new I/O submitting interface. --
FUSE is slow, but I don't believe that it's a design goal for it to be slow, it's a limitation of the implementation. so things that could speed each sync write needs to wait for a disk rotation (and a seek if you are writing to different files). if you only do two writes you save one disk if there are multiple users doing transactions at the same time they will benifit from overlapping the fsyncs. but each user session cannot complete only if you have all your transactions happening in parallel. in the real world programs sometimes need to wait for one transaction to complete so but all of these require that the application stop and wait for each seperate write to take place before proceeding to the next step. if this doesn't matter, then why the big push to have the in-kernel filesystems start using barriers? I understood that this resulted in large performance increases in the places that they are used from just being able to avoid having to drain the entire request queue, and you are saying that the applications would not only need to wait for the queue to flush, but for the disk to acknowledge the write. syncs are slow, in some cases _very_ slow. David Lang --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgbe: fix several counter register errata |
| Linux Ker |
