Hi lists,
I'd like to report the following problem : after ~ 10 days' uptime on a
Debian 2.6.26-1-686 kernel, my system becomes extremely sluggish and
unresponsive and the OOM-killer starts targeting even innocent processes like
identd or rsync (when the swap is disabled). The machine is low on RAM (192
MB) but this has never been a problem before. As for the slowness, strace
shows that the brk() syscall takes ages to complete; the blocking processes
are in the D state (and for some reason the kernel gives no wchan info).
Free reports a large and constantly growing 'buffers' figure (more than
50% of the available memory). No userland processes seems to be consuming
lare amounts of memory (sum of RSS in 'ps aux' is about 30 megs), and the
overall system is mostly idle.
for instance, the following command reports
$ free ; dpkg -l '*' >/dev/null; dmesg | tail -n 2
total used free shared buffers cached
Mem: 190356 178920 11436 0 126104 12520
-/+ buffers/cache: 40296 150060
Swap: 0 0 0
/* Waits a minute or so */
Killed
[811245.911859] Out of memory: kill process 6639 (dpkg-query) score 7292 or a child
[811245.911972] Killed process 6639 (dpkg-query)
Config has nothing unusual except that all fs are ext4, which I am begining
to suspect may be the culprit, as every night some process stat()s and
read()s something like 850,000 files, which is maybe fs-stressing; also,
umounting then remounting /home makes the 'buffers' drop from 127304 to
55920 and partially solves the problem.
For info, on umount the last lines in the dmesg read:
[811842.209432] EXT4-fs: mballoc: 12288 blocks 12288 reqs (0 success)
[811842.209547] EXT4-fs: mballoc: 13038 extents scanned, 5902 goal hits,
6315 2^N hits, 0 breaks, 0 lost
[811842.209638] EXT4-fs: mballoc: 768 generated and it took 12577042
[811842.209695] EXT4-fs: mballoc: 1588424 preallocated, 371043 discarded
[811842.372165] ...[ Reply-to set to linux-ext4@vger.kernel.org ] Can you send the output of /proc/meminfo and /proc/slabinfo? - Ted --
Of course. However since I unmounted and remounted /home the 'buffer' line is now only 59megs, and they are still not dropped when a program tries to malloc all the memory. I'll tell next time the problem shows up (it can take ten days) MemTotal: 190356 kB MemFree: 12300 kB Buffers: 59652 kB Cached: 21612 kB SwapCached: 5508 kB Active: 84868 kB Inactive: 78116 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 190356 kB LowFree: 12300 kB SwapTotal: 506036 kB SwapFree: 474880 kB Dirty: 1252 kB Writeback: 0 kB AnonPages: 8268 kB Mapped: 5484 kB Slab: 8488 kB SReclaimable: 4472 kB SUnreclaim: 4016 kB PageTables: 1192 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 601212 kB Committed_AS: 139452 kB VmallocTotal: 835576 kB VmallocUsed: 4344 kB VmallocChunk: 830840 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 4096 kB slabinfo - version: 2.1 # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> nfsd4_delegations 0 0 212 18 1 : tunables 120 60 0 : slabdata 0 0 0 nfsd4_stateids 0 0 72 53 1 : tunables 120 60 0 : slabdata 0 0 0 nfsd4_files 0 0 40 92 1 : tunables 120 60 0 : slabdata 0 0 0 nfsd4_stateowners 0 0 344 11 1 : tunables 54 27 0 : slabdata 0 0 0 nfs_direct_cache 0 0 76 50 1 : tunables 120 60 0 : slabdata 0 0 0 nfs_write_data 36 36 416 9 1 : tunables 54 27 0 : slabdata 4 4 0 nfs_read_data 32 36 ...
Are you willing to patch and recompile your kernel? If so, the
following patch would be very helpful in determining what is going on.
It allows us to see what buffer heads are in use for a particular
block device. Attached please the kernel patch and the user program.
- Ted
P.S. Unfortunately, all of the code to debug buffer head leaks was
dropped when the buffer cache was moved into the page cache. Any
comments about a refined version of patch getting merged into the
mainline kernel as a debugging measure?
Now that the machine is again in the 100M+ in buffers (still unreleased when a program asks for all the memory), I launched the program on the devices which support / and /home. I also attached /proc/meminfo and /proc/slabinfo In both cases it freezes solid the machine for more than a minute or so, and it overflows the dmesg with messages. Here is shown what I got [166632.382632] buffer dirty: block 35491 count 1 [166632.386827] buffer dirty: block 35493 count 3 [166632.391019] buffer dirty: block 35494 count 1 [166632.395251] buffer dirty: block 35496 count 1 [166632.399446] buffer dirty: block 35497 count 2 [166632.403644] buffer dirty: block 35498 count 3 [166632.407979] buffer dirty: block 35499 count 1 [166632.412221] buffer dirty: block 35501 count 2 [166632.416464] buffer dirty: block 35502 count 1 [166632.420726] buffer dirty: block 35503 count 1 [166632.424962] buffer dirty: block 35504 count 3 [166632.429169] buffer dirty: block 35505 count 1 [166632.433410] buffer dirty: block 35506 count 1 [166632.437580] buffer dirty: block 35507 count 3 [166632.441787] buffer dirty: block 35508 count 1 [166632.446056] buffer dirty: block 35509 count 2 [166632.450333] buffer dirty: block 35510 count 2 [166632.454579] buffer dirty: block 35512 count 1 [166632.458791] buffer dirty: block 35513 count 1 [166632.462964] buffer dirty: block 35514 count 2 [166632.467136] buffer dirty: block 35515 count 1 [166632.471309] buffer dirty: block 35517 count 1 [166632.475482] buffer dirty: block 35518 count 2 [166632.479651] buffer dirty: block 35519 count 2 [166632.483933] buffer dirty: block 35522 count 1 [166632.488066] buffer dirty: block 35523 count 1 [166632.492162] buffer dirty: block 35524 count 1 [166632.496258] buffer dirty: block 35525 count 4 [166632.500350] buffer dirty: block 35527 count 1 [166632.504476] buffer dirty: block 35528 count 2 [166632.508607] buffer dirty: block 35529 count 3 [166632.512736] buffer dirty: block 35533 count 3 [166632.516831] buffer dirty: ...
Can you check and see if you got more of the messages recorded in
/var/log/messages? Once you do, can you take the block numbers, and
pull them out into a single command file to feed to debugfs.
Take the column of block numbers, and tack on "icheck " at the
beginning, like so:
icheck 35491 35493 35494 35496 35497 35498 35499 35501 ...
You can put a thousand or so block numbers on each line; then it's
probably better to start a new line with "icheck " at the beginning.
Then take that script and run it through debugfs:
debugfs /dev/XXX < icheck.in > icheck.out
That will result in a file icheck.out that looks like this:
debugfs: icheck 33347
Block Inode number
33347 8193
33348 8193
33350 8196
33351 8197
...
Now you'll need to take the inode numbers returned in icheck.out, and
create another file called ncheck.in that will take the inode numbers
and turn them into pathnames. (I find that using emacs's
kill-rectangle command very handy for doing this sort of thing, but
other people will like to use awk, and I'm sure there's some way to do
this using vi but I don't know what it is. :-) It's also a good idea
to take the inode numbers and run them through "sort -u" to get rid of
duplicates before putting them on a single line and adding ncheck to
them. So what you want is to create a file ncheck.in that looks like this:
ncheck 8193 8196 8197 ....
... and then feed that to debugfs again:
debugfs /dev/XXX < ncheck.in > ncheck.out
That will produce a file that looks like this:
debugfs: ncheck 8193
Inode Pathname
8193 /ext4
...
The next thing I'd ask you to do is to look at the pathnames and
eyeball them; are they all directories? Files? Files that you have
modified earlier? If you're not sure, you can look at a particular
inode either by giving its pathname:
debugfs: stat /ext4
or by its inode number, in angle brackets:
debugfs: stat <8193>
What I'm trying to do here is to get a pattern of what might be ...Unfortunately, there are no more messages in the syslog than in the dmesg. For the two fs the only inode which shows up is the inode 8 (this seems to be the journal. According to 'stat <8>' in debugfs it looks like the journal is 134Megs long. I don't remember exactly how I created the fs, but i'm sure I did not specified the journal size. Does it seem reasonable for a 6,6G fs? However seeing only the journal inode could be because I only had the last 300 blocks left in the dmesg. A small number of them did not belong to any inode. Should I reboot with a very big ring buffer? --
134 Megs sounds wrong. What does dumpe2fs -h say? I'm guessing you didn't calculate it quite correctly. I did some poking around myself, and noticed that a lot of in-use buffers hanging around from the journal inode. The following patch should fix that problem. I'm still doing some more testingto make sure there aren't any other buffer head leaks, but this is seems to fix the worst of the problems. Can you let me know how this works for you? - Ted jbd2: Fix buffer head leak when writing the commit block Also make sure the buffer heads are marked clean before submitting bh for writing. The previous code was marking the buffer head dirty, which would have forced an unneeded write (and seek) to the journal for no good reason. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index e91f051..c2b04cd 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -127,8 +127,7 @@ static int journal_submit_commit_record(journal_t *journal, JBUFFER_TRACE(descriptor, "submit commit block"); lock_buffer(bh); - get_bh(bh); - set_buffer_dirty(bh); + clear_buffer_dirty(bh); set_buffer_uptodate(bh); bh->b_end_io = journal_end_buffer_io_sync; @@ -157,7 +156,7 @@ static int journal_submit_commit_record(journal_t *journal, /* And try again, without the barrier */ lock_buffer(bh); - set_buffer_uptodate(bh); + clear_buffer_uptodate(bh); set_buffer_dirty(bh); ret = submit_bh(WRITE, bh); } --
Just so it doesn't get lost (discussed w/ Ted today) I think this 2nd hunk flipped the wrong buffer funtion; this makes much more sense to me: @@ -157,7 +156,7 @@ static int journal_submit_commit_record(journal_t *journal, /* And try again, without the barrier */ lock_buffer(bh); set_buffer_uptodate(bh); - set_buffer_dirty(bh); + clear_buffer_dirty(bh); ret = submit_bh(WRITE, bh); } -Eric --
I'm running 2.6.26.2 (ext3) and dumpe2fs show me: Journal size: 128M # fdisk -l Disk /dev/sda: 100.0 GB, 100030242816 bytes 255 heads, 63 sectors/track, 12161 cylinders Units = cilindros of 16065 * 512 = 8225280 bytes Disk identifier: 0xe5d80ea3 Dispositivo Boot Start End Blocks Id System /dev/sda1 1 889 7139328 27 Unknown Partition 1 does not end on cylinder boundary. /dev/sda2 * 889 4780 31250000 7 HPFS ou NTFS Partition 2 does not end on cylinder boundary. /dev/sda3 4781 7212 19527480 83 Linux Partition 3 does not end on cylinder boundary. /dev/sda4 7212 12161 39758040 5 Extended Partition 4 does not end on cylinder boundary. /dev/sda5 7212 7274 498928+ 82 Linux swap /dev/sda6 7274 12161 39259048+ 83 Linux Regards, Renato --
Here are updates to the two patches I sent you. The first fixes some printk labels and requires that the debugging dump ioctl be triggered by root; the second fixes the think-o that Eric pointed out earlier this morning. - Ted
Hey Quentin, just wanted to check in and find out if the patches seem
to be working for you. They seem to solve the problem on my side; I
can no longer reproduce the problem. Is it working for you?
Thanks, regards,
- Ted
--
I rebooted, but as I didn't know exactly how to trigger the bug except than by waiting a few days I was waiting to see. Do you have an idea about how to trigger the bug? Regards, Quentin Godfroy --
The bug is that each time a transaction commits, if the buffer head hasn't been leaked already, it will leak pinning the memory until the total size of the journal is taking up memory. If you have a 2gigs of memory, and a single filesystem with 256meg journal, you might not notice the problem at all. If you less than 256 megs of memory, you'll notice fairly quickly. So you can test to see if you are running into problems by running the buffer_dump program to trigger a dump of buffers in use. Some buffers will always be pinned, such as the super block, block group descriptors, and journal super blocks. And there will be some number of buffers that are in use; but the number of dirty buffers should be no more than, oh, 30 or 40, and over the next couple of hours, it should not be growing. Even before you start runninng to memory exhaustion problems, the number of in-use buffers should relatively quick rise to hundreds or thousands after a few hours of run-time. Regards, - Ted --
Indeed after a couple of days of uptime the number of dirty blocks do not go further than ~50, so I think the bug is corrected as far as I am concerned. By the way, why does the kernel not commit to memory these remaining buffers when the memory is scarse (say instead of firing an OOM killer)? Regards Quentin Godfroy --
The bug was the jbd2 code was bumping the reference count for the buffers, so they were considered "in-use" and thus couldn't be freed and released for other uses. Regards, - Ted --
We actually fixed one in 6be2ded1d7c51b39144b9f07d2c839e1bd8707f1. It is not really a memory leak. But yes it was not releasing some the prealloc space memory. -aneesh --
