Dear developers,

This is one of my production servers, which suddenly started to freeze (crash) some weeks ago. I have done everything I can (I think); could somebody please give me some suggestions?

Mar 24 19:22:28 alfa kernel: BUG: Bad page map in process httpd pte:2bf1e025 pmd:1535b5067
Mar 24 19:22:28 alfa kernel: page:ffffea0000f1b250 flags:4000000000000404 count:1 mapcount:-1 mapping:(null) index:0
Mar 24 19:22:28 alfa kernel: addr:00000037b4008000 vm_flags:08000875 anon_vma:(null) mapping:ffff88022b5d25a8 index:8
Mar 24 19:22:28 alfa kernel: vma->vm_ops->fault: filemap_fault+0x0/0x34d
Mar 24 19:22:28 alfa kernel: vma->vm_file->f_op->mmap: xfs_file_mmap+0x0/0x33
Mar 24 19:22:28 alfa kernel: Pid: 7512, comm: httpd Not tainted 2.6.32.10 #2
Mar 24 19:22:28 alfa kernel: Call Trace:
Mar 24 19:22:28 alfa kernel: [<ffffffff810c2ea3>] print_bad_pte+0x210/0x229
Mar 24 19:22:28 alfa kernel: [<ffffffff810c3c98>] unmap_vmas+0x44b/0x787
Mar 24 19:22:28 alfa kernel: [<ffffffff810c81d5>] exit_mmap+0xb0/0x133
Mar 24 19:22:28 alfa kernel: [<ffffffff81041f83>] mmput+0x48/0xb9
Mar 24 19:22:28 alfa kernel: [<ffffffff810463b0>] exit_mm+0x105/0x110
Mar 24 19:22:28 alfa kernel: [<ffffffff81371287>] ? tty_audit_exit+0x28/0x85
Mar 24 19:22:28 alfa kernel: [<ffffffff810477a0>] do_exit+0x1e9/0x6d2
Mar 24 19:22:28 alfa kernel: [<ffffffff81053c37>] ? __dequeue_signal+0xf1/0x127
Mar 24 19:22:28 alfa kernel: [<ffffffff81047d00>] do_group_exit+0x77/0xa1
Mar 24 19:22:28 alfa kernel: [<ffffffff810560f7>] get_signal_to_deliver+0x32c/0x37f
Mar 24 19:22:28 alfa kernel: [<ffffffff8100a484>] do_notify_resume+0x90/0x740
Mar 24 19:22:28 alfa kernel: [<ffffffff8102724b>] ? __bad_area_nosemaphore+0x178/0x1a2
Mar 24 19:22:28 alfa kernel: [<ffffffff810272b9>] ? __bad_area+0x44/0x4d
Mar 24 19:22:28 alfa kernel: [<ffffffff8100bba2>] retint_signal+0x46/0x84
Mar 24 19:22:28 alfa kernel: Disabling lock debugging due to kernel taint
Mar 24 19:22:28 alfa kernel: swap_free: Bad swap file entry ...
On Thu, 25 Mar 2010 11:29:25 +0800

Hmm.. here is a summary of the corruption (from the log), but no idea.
==
process's address   pte          pnf->pte->page
00000037b4008000    2bf1e025     -> PG_reserved
00000037b400a000    d900000000   -> bad swap
00000037b400c000    2bfe8025     -> PG_reserved
00000037b400d000    12bfe9025    -> belongs to some other files' page cache
00000037b400e000    ff00000000   -> bad swap
00000037b400f000    5400000000   -> bad swap
...
00000037b4019000    ff00000000   -> bad swap
==
All ptes are on the same pmd 1535b5067.

I suspect some kind of buffer overflow bug overwrites the page table...
Because the ptes for addresses 00000037b4008000...00000037b400f000 are at the head of a page (used for pmd), some data on page [0x1535b4000..0x1535b5000) caused a buffer overflow and broke the page table in [0x1535b5000...0x1535b6000).

Is this bug found from 2.6.28.10 ?
If I investigate this issue, I'll check the owner of page 0x1535b4000 by crash dump.

Thanks,
--
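The "head of a page" observation can be checked with a little arithmetic. The following is a minimal sketch, assuming the usual x86-64 layout of 4 KiB pages with 512 eight-byte entries per page-table page; the helper names are hypothetical, and masking only the low 12 bits of the pmd value is a simplification that happens to be enough for the value in the log.

```python
PAGE_SHIFT = 12       # assumption: 4 KiB pages
PTRS_PER_PTE = 512    # assumption: 512 entries per page-table page (x86-64)
PTE_SIZE = 8          # bytes per entry


def pte_slot(vaddr):
    """Return (index, byte offset) of the pte for vaddr inside its page-table page."""
    index = (vaddr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)
    return index, index * PTE_SIZE


def table_page_base(pmd_value):
    """Drop the low 12 flag bits of a pmd entry to get the page-table page address."""
    return pmd_value & ~0xFFF


if __name__ == "__main__":
    base = table_page_base(0x1535B5067)  # pmd value from the log
    for vaddr in (0x37B4008000, 0x37B4019000):  # first and last address in the summary
        index, offset = pte_slot(vaddr)
        print(f"vaddr {vaddr:#x}: pte index {index}, stored at {base + offset:#x}")
```

Under these assumptions the damaged entries sit within the first few hundred bytes of the page at 0x1535b5000, which is what one would expect if something ran off the end of the preceding page.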
----- Original Message ----- From: "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> To: "Américo Wang" <xiyou.wangcong@gmail.com> Cc: "Janos Haar" <janos.haar@netcenter.hu>; <linux-kernel@vger.kernel.org>; <linux-mm@kvack.org> Sent: Thursday, March 25, 2010 7:31 AM

No, the bug I sent was from 2.6.32.10 (you can check it in the messages file at the link).
The story began around March 9-10, but unfortunately the system was not always able to write to the messages file. (On Mar 13 at 11:20:09 I triggered the sysrq process and memory information dumps; you can see them at the link below.)
We had more crashes with 2.6.28.10 over the next few days, and the server was removed for testing (a 7-day hole in the log), but it looked stable.
Here are the more serious crashes from 2.6.28.10:
http://download.netcenter.hu/bughunt/20100324/marc11-14
To me they all look memory, swap and XFS related.
I tested and repaired all the filesystems offline, corrected the errors left by the previous crashes, then disabled swap, but nothing helped. :(
Finally, on March 21, I replaced the kernel with 2.6.32.10, and the crashes seemed to be gone, but only for 4 days (you can see the first dump in my first mail).

Thanks for all the help,
--
Hello,

Another issue with this production server. Can somebody point me in the right direction, or confirm whether this is a hardware problem or not?
The messages file is here: http://download.netcenter.hu/bughunt/20100324/marc30

Thanks,
Janos Haar

Mar 30 18:51:43 alfa kernel: BUG: unable to handle kernel paging request at 000000320000008c
Mar 30 18:51:43 alfa kernel: IP: [<ffffffff811d755b>] xfs_iflush_cluster+0x148/0x35a
Mar 30 18:51:43 alfa kernel: PGD 102d7a067 PUD 0
Mar 30 18:51:43 alfa kernel: Oops: 0000 [#1] SMP
Mar 30 18:51:43 alfa kernel: last sysfs file: /sys/class/misc/rfkill/dev
Mar 30 18:51:43 alfa kernel: CPU 0
Mar 30 18:51:43 alfa kernel: Modules linked in: hidp l2cap crc16 bluetooth rfkill ipv6 video output sbs sbshc battery ac parport_pc lp parport serio_raw 8250_pnp 8250 serial_core shpchp button i2c_i801 i2c_core pcspkr
Mar 30 18:51:43 alfa kernel: Pid: 3242, comm: flush-8:16 Not tainted 2.6.32.10 #2
Mar 30 18:51:43 alfa kernel: RIP: 0010:[<ffffffff811d755b>] [<ffffffff811d755b>] xfs_iflush_cluster+0x148/0x35a
Mar 30 18:51:43 alfa kernel: RSP: 0000:ffff880228ce5b60 EFLAGS: 00010206
Mar 30 18:51:43 alfa kernel: RAX: 0000003200000000 RBX: ffff8801537947d0 RCX: 000000000000001a
Mar 30 18:51:43 alfa kernel: RDX: 0000000000000020 RSI: 00000000000c6cc2 RDI: 0000000000000001
Mar 30 18:51:43 alfa kernel: RBP: ffff880228ce5bd0 R08: ffff880228ce5b20 R09: ffff8801ea436928
Mar 30 18:51:43 alfa kernel: R10: 00000000000c6cc2 R11: 0000000000000001 R12: ffff8800b630b11a
Mar 30 18:51:43 alfa kernel: R13: ffff8801bd54ab30 R14: ffff88022962d2b8 R15: 00000000000c6ca0
Mar 30 18:51:43 alfa kernel: FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Mar 30 18:51:43 alfa kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Mar 30 18:51:43 alfa kernel: CR2: 000000320000008c CR3: 0000000168e75000 CR4: 00000000000006f0
Mar 30 18:51:43 alfa kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 30 18:51:43 ...
Hi, Probably no, it looks like an XFS bug or a write-back bug. --
Hello,

----- Original Message ----- From: "Américo Wang" <xiyou.wangcong@gmail.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <linux-kernel@vger.kernel.org>; "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; "Jens Axboe" <axboe@kernel.dk> Sent: Thursday, April 01, 2010 12:37 PM

Today I got this again, exactly the same. (If somebody wants the log, just ask.)
Here is an excerpt:

Apr 1 18:50:02 alfa kernel: possible SYN flooding on port 80. Sending cookies.
Apr 2 21:16:59 alfa kernel: BUG: unable to handle kernel paging request at 000000010000008c
Apr 2 21:16:59 alfa kernel: IP: [<ffffffff811d755b>] xfs_iflush_cluster+0x148/0x35a
Apr 2 21:16:59 alfa kernel: PGD a7374067 PUD 0
Apr 2 21:16:59 alfa kernel: Oops: 0000 [#1] SMP
Apr 2 21:16:59 alfa kernel: last sysfs file: /sys/class/misc/rfkill/dev
Apr 2 21:16:59 alfa kernel: CPU 1
Apr 2 21:16:59 alfa kernel: Modules linked in: hidp l2cap crc16 bluetooth rfkill ipv6 video output sbs sbshc battery ac parport_pc lp parport 8250_pnp serio_raw shpchp 8250 serial_core i2c_i801 button pcspkr i2c_core
Apr 2 21:16:59 alfa kernel: Pid: 3118, comm: flush-8:16 Not tainted 2.6.32.10 #2
Apr 2 21:16:59 alfa kernel: RIP: 0010:[<ffffffff811d755b>] [<ffffffff811d755b>] xfs_iflush_cluster+0x148/0x35a
Apr 2 21:16:59 alfa kernel: RSP: 0000:ffff88022849db60 EFLAGS: 00010206
Apr 2 21:16:59 alfa kernel: RAX: 0000000100000000 RBX: ffff8801535b47d0 RCX: 000000000000001a
Apr 2 21:16:59 alfa kernel: RDX: 0000000000000020 RSI: ffff880178e49158 RDI: ffff88022a5c8138
Apr 2 21:16:59 alfa kernel: RBP: ffff88022849dbd0 R08: 0000000000000001 R09: ffff880137ba67a0
Apr 2 21:16:59 alfa kernel: R10: ffff88022849db50 R11: 0000000000000020 R12: ffff880137ba6858
Apr 2 21:16:59 alfa kernel: R13: ffff880115f4cd68 R14: ffff88022953a9e0 R15: 000000000061d440
Apr 2 21:16:59 alfa kernel: FS: 0000000000000000(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
Apr 2 21:16:59 ...
Small hint - please put the subsystem the bug occurred in in the subject line. I missed this in the firehose of lkml traffic because there was nothing to indicate to me it was in XFS. Something like: "Kernel crash in xfs_iflush_cluster" Won't get missed quite so easily.... This may be a fixed problem - what kernel are you running? Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Hello, The current kernel version is 2.6.32.10. Are there any significant fixes for me in the latest (.11) or in the next (33.x)? Thanks, Janos ----- Original Message ----- From: "Dave Chinner" <david@fromorbit.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: "Américo Wang" <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; <axboe@kernel.dk> Sent: Saturday, April 03, 2010 1:09 AM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look --
The fixes for this bug are queued up already for the next 2.6.32.x release. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Dave, Thank you for your answer. As I said before, this is a production server with an important service. Can you please send me the fix as soon as it is done, even just for testing? Or point me in the right direction to get it? Thanks a lot, Janos Haar ----- Original Message ----- From: "Dave Chinner" <david@fromorbit.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; <axboe@kernel.dk> Sent: Sunday, April 04, 2010 12:37 PM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look --
It's in 2.6.33 if you want to upgrade the kernel, or, if you don't want to wait for the next 2.6.32.x kernel, you can apply this series of 19 patches yourself: http://oss.sgi.com/archives/xfs/2010-03/msg00125.html Cheers, Dave. -- Dave Chinner david@fromorbit.com --
----- Original Message ----- From: "Dave Chinner" <david@fromorbit.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; <axboe@kernel.dk> Sent: Tuesday, April 06, 2010 12:45 AM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look

Generally, for this system, I much prefer the extra-stable series, but in this case I will try 2.6.33, because these two versions are close to each other and I don't want to apply 19 patches manually. :-)
I will try it and report the result this week.
Thanks to you and to all the people who work on XFS and Linux. :-)

Best Regards,
--
Hello,

Sorry, but I still have the problem with 2.6.33.2.

Apr 8 03:04:51 alfa kernel: BUG: unable to handle kernel paging request at 000000610000008c
Apr 8 03:04:51 alfa kernel: IP: [<ffffffff811f17c4>] xfs_iflush_cluster+0x148/0x35a
Apr 8 03:04:51 alfa kernel: PGD 22258a067 PUD 0
Apr 8 03:04:51 alfa kernel: Oops: 0000 [#1] SMP
Apr 8 03:04:51 alfa kernel: last sysfs file: /sys/class/misc/rfkill/dev
Apr 8 03:04:51 alfa kernel: CPU 2
Apr 8 03:04:51 alfa kernel: Pid: 3049, comm: xfssyncd Not tainted 2.6.33.2 #1 DP35DP/
Apr 8 03:04:51 alfa kernel: RIP: 0010:[<ffffffff811f17c4>] [<ffffffff811f17c4>] xfs_iflush_cluster+0x148/0x35a
Apr 8 03:04:51 alfa kernel: RSP: 0018:ffff880228e3bca0 EFLAGS: 00010206
Apr 8 03:04:51 alfa kernel: RAX: 0000006100000000 RBX: ffff880153795750 RCX: 000000000000001a
Apr 8 03:04:51 alfa kernel: RDX: 0000000000000020 RSI: 00000000003dfdf4 RDI: 0000000000000005
Apr 8 03:04:51 alfa kernel: RBP: ffff880228e3bd10 R08: ffff880228e3bc60 R09: ffff8801c5d6e1b8
Apr 8 03:04:51 alfa kernel: R10: 00000000003dfdf4 R11: 0000000000000005 R12: 000000000000001a
Apr 8 03:04:51 alfa kernel: R13: ffff8800b1d920d8 R14: ffff88022a7cabe0 R15: 00000000003ddf80
Apr 8 03:04:51 alfa kernel: FS: 0000000000000000(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
Apr 8 03:04:51 alfa kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 8 03:04:51 alfa kernel: CR2: 000000610000008c CR3: 00000002222db000 CR4: 00000000000006e0
Apr 8 03:04:51 alfa kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 8 03:04:51 alfa kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 8 03:04:51 alfa kernel: Process xfssyncd (pid: 3049, threadinfo ffff880228e3a000, task ffff880228e66040)
Apr 8 03:04:51 alfa kernel: Stack:
Apr 8 03:04:51 alfa kernel: ffff8800b1d920d8 ffff8800466bc100 ffff880228c32580 ffffffffffffffe0
Apr 8 03:04:51 alfa kernel: <0> 0000000000000020 ffff880228e24930 ...
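The three xfs_iflush_cluster oopses reported so far can be compared side by side. The sketch below is only an observation on the register dumps quoted in this thread, not a diagnosis: it assumes RAX held the base pointer being dereferenced (the logs do not say so explicitly), and the helper names are hypothetical.

```python
# (label, RAX, faulting address) copied from the three oops reports in this thread
OOPSES = [
    ("Mar 30, 2.6.32.10", 0x0000003200000000, 0x000000320000008C),
    ("Apr 2, 2.6.32.10", 0x0000000100000000, 0x000000010000008C),
    ("Apr 8, 2.6.33.2", 0x0000006100000000, 0x000000610000008C),
]

KERNEL_HALF_START = 0xFFFF800000000000  # start of the upper canonical half on x86-64


def fault_offset(rax, fault_addr):
    """Distance between the faulting address and the value in RAX."""
    return fault_addr - rax


def in_kernel_half(addr):
    """True if addr lies in the upper (kernel) half of the address space."""
    return addr >= KERNEL_HALF_START


if __name__ == "__main__":
    for label, rax, fault in OOPSES:
        print(
            f"{label}: offset {fault_offset(rax, fault):#x}, "
            f"low 32 bits of RAX {rax & 0xFFFFFFFF:#x}, "
            f"kernel pointer: {in_kernel_half(rax)}"
        )
```

In all three reports the fault is 0x8c past RAX, the low 32 bits of RAX are zero, and RAX is not in the kernel half of the address space, unlike the ffff88... pointers in the neighbouring registers. That is consistent with the same code path reading a damaged pointer each time, but it does not by itself say what damaged it.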
Yeah, there's still a fix that needs to be backported to .33 to solve this problem. It's in the series for 2.6.32.x, so maybe pulling the 2.6.32-stable-queue tree in the meantime is your best bet. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
----- Original Message ----- From: "Dave Chinner" <david@fromorbit.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; <axboe@kernel.dk> Sent: Thursday, April 08, 2010 4:58 AM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look Ok, thank you. But where can i find this tree? Thanks, --
Perhaps Dave meant the stable-queue? http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git Then again, 2.6.34-rc3 needs testing too! :-) Christian. -- BOFH excuse #98: The vendor put the bug there. --
Hello, I have just started to test the stable-queue patch series on 2.6.32.10. It is running now; we will see... 2.6.33.2 crashed 4 times in the last 3 days. :-( That was worse than the original 2.6.32.10. (I am very interested in the result; anyway, this is the last chance for this server. The owner has given me an ultimatum: if the server crashes again in the next week, I need to replace the entire HW, the OS, and the services as well...) Thanks a lot for the help. Best Regards, Janos Haar ----- Original Message ----- From: "Christian Kujau" <lists@nerdbynature.de> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: "Dave Chinner" <david@fromorbit.com>; <axboe@kernel.dk>; "LKML" <linux-kernel@vger.kernel.org>; <xfs@oss.sgi.com>; <linux-mm@kvack.org>; <xiyou.wangcong@gmail.com>; <kamezawa.hiroyu@jp.fujitsu.com> Sent: Friday, April 09, 2010 11:37 PM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look --
I would recommend you to use a distribution-released kernel, rather than a stable kernel from kernel.org, because usually the distribution maintains a longer supported kernel than kernel.org. Just a little suggestion. Hope it helps for you to choose Linux. ;) Thanks. -- Live like a child, think like the god. --
Hi, ----- Original Message ----- From: "Américo Wang" <xiyou.wangcong@gmail.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: "Christian Kujau" <lists@nerdbynature.de>; <david@fromorbit.com>; <axboe@kernel.dk>; "LKML" <linux-kernel@vger.kernel.org>; <xfs@oss.sgi.com>; <linux-mm@kvack.org>; <xiyou.wangcong@gmail.com>; <kamezawa.hiroyu@jp.fujitsu.com> Sent: Saturday, April 10, 2010 10:06 AM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a

Personally, I really like Linux, and have used it since 1990. :-) I have set up about 30 servers or more... Usually I can't use the distro-release kernels, because they are usually too old. Additionally, I can find bugs in any software and in any kernel version.... B-) This is not the first time I have reported kernel bugs, maybe the 4th or 5th time... (I was the one who helped to solve the original NBD deadlock problem as well, around 2005.) Anyway, thanks for your suggestion. ;-) Cheers, --
Dave,

Now the server looks stable, but it has only been running for 23 hours at this point.
Now I can see these and similar messages:

Apr 10 09:59:09 alfa kernel: Filesystem "sdb2": corrupt dinode 673160714, extent total = -1392508927, nblocks = 5. Unmount and run xfs_repair.
Apr 10 09:59:09 alfa kernel: ffff880153797a00: 49 4e 81 a4 01 02 00 01 00 00 00 30 00 00 00 30 IN.........0...0
Apr 10 09:59:09 alfa kernel: Filesystem "sdb2": XFS internal error xfs_iformat(1) at line 332 of file fs/xfs/xfs_inode.c. Caller 0xffffffff811d70d6
Apr 10 09:59:09 alfa kernel:
Apr 10 09:59:09 alfa kernel: Pid: 2324, comm: updatedb Not tainted 2.6.32.10 #3
Apr 10 09:59:09 alfa kernel: Call Trace:
Apr 10 09:59:09 alfa kernel: [<ffffffff811cf87d>] xfs_error_report+0x41/0x43
Apr 10 09:59:09 alfa kernel: [<ffffffff811d70d6>] ? xfs_iread+0xb1/0x184
Apr 10 09:59:09 alfa kernel: [<ffffffff811cf8d1>] xfs_corruption_error+0x52/0x5e
Apr 10 09:59:09 alfa kernel: [<ffffffff811d6c68>] xfs_iformat+0x10d/0x4ca
Apr 10 09:59:09 alfa kernel: [<ffffffff811d70d6>] ? xfs_iread+0xb1/0x184
Apr 10 09:59:09 alfa kernel: [<ffffffff811d70d6>] xfs_iread+0xb1/0x184
Apr 10 09:59:09 alfa kernel: [<ffffffff811d3ee2>] xfs_iget+0x2c3/0x455
Apr 10 09:59:09 alfa kernel: [<ffffffff811eab8b>] xfs_lookup+0x82/0xb3
Apr 10 09:59:09 alfa kernel: [<ffffffff811f5a8f>] xfs_vn_lookup+0x45/0x86
Apr 10 09:59:09 alfa kernel: [<ffffffff810e3f73>] do_lookup+0xde/0x1ca
Apr 10 09:59:09 alfa kernel: [<ffffffff810e65b6>] __link_path_walk+0x84e/0xcb3
Apr 10 09:59:09 alfa kernel: [<ffffffff810e4462>] ? path_init+0xaf/0x156
Apr 10 09:59:09 alfa kernel: [<ffffffff810e6a6e>] path_walk+0x53/0x9c
Apr 10 09:59:09 alfa kernel: [<ffffffff810e6b9e>] do_path_lookup+0x2f/0xac
Apr 10 09:59:09 alfa kernel: [<ffffffff810e7603>] user_path_at+0x57/0x91
Apr 10 09:59:09 alfa kernel: [<ffffffff810ec2e5>] ? dput+0x54/0x132
Apr 10 09:59:09 alfa kernel: [<ffffffff810df492>] ? cp_new_stat+0xfb/0x114
Apr 10 09:59:09 alfa kernel: [<ffffffff810df670>] ...
Hi, OK, here comes the funny part: I have got several messages from the kernel saying that one of my XFS filesystems (sdb2) has corrupted inodes, but my xfs_repair (v. 2.8.11) says the FS is clean and shiny. Should I upgrade my xfs_repair, or is this another bug? :-) Thanks, Janos ----- Original Message ----- From: "Dave Chinner" <david@fromorbit.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; <axboe@kernel.dk> Sent: Thursday, April 08, 2010 4:58 AM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look --
v2.8.11 is positively ancient. :/ I'd upgrade (current is 3.1.1) and re-run repair again. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
----- Original Message ----- From: "Dave Chinner" <david@fromorbit.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; <axboe@kernel.dk> Sent: Monday, April 12, 2010 2:11 AM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look

OK, I will get the new repair today.
BTW, since I tested the FS with 2.8.11, this morning I found this in the log:

...
Apr 12 00:41:10 alfa kernel: XFS mounting filesystem sdb2 # This was the point of check with xfs_repair v2.8.11
Apr 13 03:08:33 alfa kernel: xfs_da_do_buf: bno 32768
Apr 13 03:08:33 alfa kernel: dir: inode 474253931
Apr 13 03:08:33 alfa kernel: Filesystem "sdb2": XFS internal error xfs_da_do_buf(1) at line 2020 of file fs/xfs/xfs_da_btree.c. Caller 0xffffffff811c4fa6
Apr 13 03:08:33 alfa kernel:
Apr 13 03:08:33 alfa kernel: Pid: 27304, comm: 01vegzet_runner Not tainted 2.6.32.10 #3
Apr 13 03:08:33 alfa kernel: Call Trace:
Apr 13 03:08:33 alfa kernel: [<ffffffff811cf87d>] xfs_error_report+0x41/0x43
Apr 13 03:08:33 alfa kernel: [<ffffffff811c4fa6>] ? xfs_da_read_buf+0x2a/0x2c
Apr 13 03:08:33 alfa kernel: [<ffffffff811c4c30>] xfs_da_do_buf+0x2a6/0x5aa
Apr 13 03:08:33 alfa kernel: [<ffffffff811c4fa6>] xfs_da_read_buf+0x2a/0x2c
Apr 13 03:08:33 alfa kernel: [<ffffffff811ca0f1>] ? xfs_dir2_leaf_lookup_int+0x104/0x259
Apr 13 03:08:33 alfa kernel: [<ffffffff811ca0f1>] xfs_dir2_leaf_lookup_int+0x104/0x259
Apr 13 03:08:33 alfa kernel: [<ffffffff811ca56e>] xfs_dir2_leaf_lookup+0x26/0xb5
Apr 13 03:08:33 alfa kernel: [<ffffffff811c6d60>] ? xfs_dir2_isleaf+0x21/0x52
Apr 13 03:08:33 alfa kernel: [<ffffffff811c74ea>] xfs_dir_lookup+0x104/0x157
Apr 13 03:08:33 alfa kernel: [<ffffffff811eab59>] xfs_lookup+0x50/0xb3
Apr 13 03:08:33 alfa kernel: [<ffffffff811f5a8f>] xfs_vn_lookup+0x45/0x86
Apr 13 03:08:33 alfa kernel: [<ffffffff810e4164>] ...
A corrupted directory. There have been several different types of
So the bad inodes are:
$ awk '/corrupt inode/ { print $10 } /dir: inode/ { print $8 }' messages | sort -n -u
474253931
474253936
474253937
474253938
474253939
474253940
474253941
474253943
474253945
474253946
474253947
474253948
474253949
474253950
474253951
673160704
673160708
673160712
673160713
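As a rough cross-check of the list above, the inode numbers can be bucketed by aligned groups. This is a sketch under a stated assumption: 32 inodes per cluster (for example 256-byte inodes in 8 KiB cluster buffers), which the thread does not confirm for this filesystem. `INODES_PER_GROUP` and `group_inodes` are hypothetical names.

```python
from collections import defaultdict

INODES_PER_GROUP = 32  # assumption: 256-byte inodes in 8 KiB cluster buffers

# Inode numbers copied from the awk output above
BAD_INODES = [
    474253931, 474253936, 474253937, 474253938, 474253939, 474253940,
    474253941, 474253943, 474253945, 474253946, 474253947, 474253948,
    474253949, 474253950, 474253951,
    673160704, 673160708, 673160712, 673160713,
]


def group_inodes(inodes, per_group=INODES_PER_GROUP):
    """Bucket inode numbers by the first inode number of their aligned group."""
    groups = defaultdict(list)
    for ino in inodes:
        groups[ino // per_group * per_group].append(ino)
    return dict(groups)


if __name__ == "__main__":
    for start, members in sorted(group_inodes(BAD_INODES).items()):
        print(f"group starting at {start}: {len(members)} bad inodes")
```

With that group size every listed inode falls into one of two buckets, starting at 474253920 and 673160704.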
It looks like the bad inodes are confined to two inode clusters. The
nature of the errors - bad block mappings and bad extent counts -
makes me think you might have bad memory in the machine:
$ awk '/xfs_da_do_buf: bno/ { printf "%x\n", $8 }' messages | sort -n -u
4d8000
5e0000
7f8001
8000
8001
10000
10001
20001
28001
38000
270001
370001
548001
568000
568001
600000
600001
618000
618001
628000
628001
650001
I think they should all be 0 or 1, and:
$ awk '/corrupt inode/ { split($13, a, ")"); printf "%x\n", a[1] }' messages | sort -n -u
fffffffffd000001
6b000001
1000001
75000001
I think they should all be 1, too.
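To see how the values above differ from the expected 0 or 1, the unexpected bits can be ORed together. This is a minimal sketch with hypothetical helper names; truncating the extent counts to 32 bits is an assumption based on the negative "extent total" the kernel printed earlier.

```python
# Hex values copied from the two awk outputs above
BNO_VALUES = [
    0x4D8000, 0x5E0000, 0x7F8001, 0x8000, 0x8001, 0x10000, 0x10001, 0x20001,
    0x28001, 0x38000, 0x270001, 0x370001, 0x548001, 0x568000, 0x568001,
    0x600000, 0x600001, 0x618000, 0x618001, 0x628000, 0x628001, 0x650001,
]
EXTENT_VALUES = [0xFFFFFFFFFD000001, 0x6B000001, 0x1000001, 0x75000001]


def stray_bits(values, expected_mask=0x1, width=64):
    """OR together every bit outside expected_mask that is set in any value."""
    mask = 0
    for value in values:
        mask |= value & ((1 << width) - 1) & ~expected_mask
    return mask


def bit_range(mask):
    """Return (lowest, highest) set bit positions of a non-zero mask."""
    return (mask & -mask).bit_length() - 1, mask.bit_length() - 1


if __name__ == "__main__":
    bno_mask = stray_bits(BNO_VALUES)
    ext_mask = stray_bits(EXTENT_VALUES, width=32)  # assumption: 32-bit counter
    print(f"bno stray bits {bno_mask:#x}, bits {bit_range(bno_mask)}")
    print(f"extent count stray bits {ext_mask:#x}, bits {bit_range(ext_mask)}")
```

In both sets the unexpected bits stay inside a single 8-bit window (bits 15 to 22 for the block numbers, bits 24 to 31 for the extent counts) while the low bit looks sane, so the damage is larger than a one or two bit flip but still narrowly confined.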
I've seen this sort of error pattern before on a machine that had a
bad DIMM. If the corruption is on disk then the buffers were
corrupted between the time that the CPU writes to them and being
written to disk. If there is no corruption on disk, then the CPU is
reading bad data from memory...
If you run:
$ xfs_db -r -c "inode 474253940" -c p /dev/sdb2
Then I can confirm whether there is corruption on disk or not.
Probably best to sample multiple of the inode numbers from the above
list of bad inodes.
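One way to sample several inodes is to generate the same read-only xfs_db invocation for each of them. The sketch below only builds and prints the command lines (it runs nothing); the sample of inode numbers is an arbitrary pick from the list above, and `xfs_db_commands` is a hypothetical helper.

```python
import shlex


def xfs_db_commands(inodes, device="/dev/sdb2"):
    """Build one read-only 'xfs_db -r -c "inode N" -c p DEVICE' argv per inode."""
    return [
        ["xfs_db", "-r", "-c", f"inode {ino}", "-c", "p", device]
        for ino in inodes
    ]


if __name__ == "__main__":
    # Arbitrary sample covering both groups of bad inode numbers
    sample = [474253931, 474253940, 474253951, 673160704, 673160713]
    for argv in xfs_db_commands(sample):
        print(shlex.join(argv))
```

Printing the commands rather than executing them keeps the decision of when to touch the device with the administrator.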
FWIW, I'd strongly suggest backing up everything you can first
before running an updated xfs_repair....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
----- Original Message ----- From: "Dave Chinner" <david@fromorbit.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; <axboe@kernel.dk> Sent: Tuesday, April 13, 2010 10:39 AM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look

Here is the log:
http://download.netcenter.hu/bughunt/20100413/debug.log
xfs_db segfaults. :-)
BTW, about memory corruption: at the beginning of March, one of my bets was a memory problem too, but the server was offline for 7 days with memtest86 running on the hardware the whole time, and it passed all 8GB 74 times without any bit error.
I don't think it is a memory problem; additionally, the server can create big .tar.gz files without CRC problems.
If I force myself to think of a hardware memory problem, I can only think of the RAID card's cache memory, which I can't test with memtest86. Or the cache on the HDD's PCB...
On the other hand, I have seen several people report memory corruption with these kernel versions. Can we check this and decide for certain which is the problem (hw or sw)?
I mean, if I am right, a hardware memory problem causes only 1-2 bit corruption
Yes, I know that too. :-)

Thanks,
--
There are multiple fields in the inode that are corrupted. I am really surprised that xfs-repair - even an old version - is not
Yup, it probably ran off into la-la land chasing corrupted
Yes, it could be something like that, too, but the only way to test
I haven't heard of any significant memory corruption problems in 2.6.32 or 2.6.33, but it is a possibility given the nature of the corruption. However, it may have only happened once and be completely unreproducible.
I'd suggest fixing the existing corruption first, and then seeing if it re-appears. If it does reappear, then we know there's a
RAM ECC guarantees correction of single bit errors and detection of double bit errors (which cause the kernel to panic, IIRC). I can't tell you what happens when larger errors occur, though...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
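The "correct one bit, detect two bits" property mentioned above can be illustrated with a toy code. This is a minimal sketch of an extended Hamming (8,4) code protecting a 4-bit value; real ECC memory uses much wider codes, and nothing here reflects the actual hardware in this thread. The function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # make the total number of set bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    parity_ok = sum(code) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        code[syndrome] ^= 1             # single error; syndrome 0 means bit 0 itself
        status = "corrected"
    else:
        return "uncorrectable", None    # non-zero syndrome with even parity: two flips
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                        # flip one bit
    print(decode(word))
    word[6] ^= 1                        # flip a second bit
    print(decode(word))
```

With three or more flipped bits a code like this can silently "correct" to the wrong value, which matches the caveat about larger errors.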
Dave,

----- Original Message ----- From: "Dave Chinner" <david@fromorbit.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; <axboe@kernel.dk> Sent: Tuesday, April 13, 2010 1:34 PM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look

I think I now know the reason....
My case is becoming more and more interesting.
(Just a little reminder: on Tuesday night I ran the old 2.8.11 xfs_repair on the partition which was reported as corrupt by the kernel, but it was clean. The system was not restarted!)
As you suggested, today I tried to make a backup of the data.
During the copy, the kernel reported a lot of corrupted entries again, and finally the kernel crashed! (with the 19 patch pack)
Unfortunately the kernel couldn't write the debug info into the syslog.
The system restarted automatically, the service is running again, and I can't make another backup attempt because the owner won't allow it.
Tonight, when the traffic was in its low period, I stopped the service, unmounted the partition, and repeated xfs_repair on the previously reported partition in several ways.
Here you can see the results:
xfs_repair 2.8.11 run #1: http://download.netcenter.hu/bughunt/20100413/repair2811-nr1.log
xfs_repair 2.8.11 run #2: http://download.netcenter.hu/bughunt/20100413/repair2811-nr2.log
echo 3 >/proc/sys/vm/drop_caches - performed
xfs_repair 2.8.11 run #3: http://download.netcenter.hu/bughunt/20100413/repair2811-nr3.log
xfs_repair 3.1.1 run #1: http://download.netcenter.hu/bughunt/20100413/repair311-nr1.log
xfs_repair 3.1.1 run #2: sorry, I had no time to play more offline. :-(
To me, it looks like the FS got corrupted between Tuesday night and tonight.
Note: because I am expecting kernel crashes, the dirty data flush was set to a timeout of only some milliseconds, to prevent too much ...
So this successfully detected and repaired the corruption. I don't think this is new corruption - the corrupted inode numbers are the
These two are clearing lost+found and rediscovering the disconnected inodes that were discovered in the first pass. Nothing
Can you reproduce the corruption again now that the filesystem has been repaired? I want to know (if the corruption appears again)
If your hardware doesn't have ECC, then you can't rule out anything - even a dodgy power supply can cause this sort of transient problem. I'm not saying that this is the cause, but I've been assuming that you're actually running hardware with ECC on RAM,
If you can take the performance hit, turn on the kernel memory leak detector and see if that catches anything.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
Dave,

The corruption + crash reproduced. (unfortunately)
http://download.netcenter.hu/bughunt/20100413/messages-15
Apr 14 01:06:33 alfa kernel: XFS mounting filesystem sdb2
This was the point where xfs_repair was run several times.

Regards,
Janos

----- Original Message ----- From: "Dave Chinner" <david@fromorbit.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; <axboe@kernel.dk> Sent: Wednesday, April 14, 2010 2:16 AM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look --
OK, the inodes that are corrupted are different, so there's still something funky going on here. I still would suggest replacing the RAID controller to rule that out as the cause. FWIW, do you have any other servers with similar h/w, s/w and workloads? If so, are they seeing problems? Can you recompile the kernel with CONFIG_XFS_DEBUG enabled and reboot into it before you repair and remount the filesystem again? (i.e. so that we know that we have started with a clean filesystem and the debug kernel) I'm hoping that this will catch the corruption much sooner, perhaps before it gets to disk. Note that this will cause the machine to panic when corruption is detected, and it is much, much more careful about checking in-memory structures so there is a CPU overhead involved as well. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
----- Original Message ----- From: "Dave Chinner" <david@fromorbit.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; <axboe@kernel.dk> Sent: Thursday, April 15, 2010 11:23 AM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look

This was not a cheap card and I can't replace it, because I have only one, and the owner has already decided that I need to replace the entire server on Saturday.
I have only 2 days to get useful debug information while the server is online.
This is bad for testing too, because the workload will disappear, and we
This is a web-based game, which generates a lot of small files on the corrupted filesystem, and as far as I can see, the corruption happens only when writing, not when reading.
I can copy big gz files across the partitions multiple times, compare them, and test their CRCs, and there is a cron tester which tests 12GB of gz files hourly and can't find any problem. This shows me that the corruption only happens when writing, and not in the content, but in the FS.
This makes a RAID card problem less likely, am I right? :-)
Additionally, in the last 3 days I have tried 2 times to cp -aR the entire partition to another, and both times the corruption appeared ON THE SOURCE and finally the kernel crashed.
step 1. repair
step 2 run the game (files generated...)
step 3 start copy partition's data in background
step 4 corruption reported by kernel
step 5 kernel crashed during write
Can this be a race between read and write?
BTW, I have 2 servers with this game; the differences are these:
- The game's language
- The HW's structure is similar, but all the parts are from totally different brands, except the Intel CPU. :-)
- The workload is lower on the stable server
- The stable server is not selected for replacement. :-)
The important matches:
- The base OS is FC6 on both
- The actual kernel on ...
----- Original Message ----- From: "Dave Chinner" <david@fromorbit.com> To: "Janos Haar" <janos.haar@netcenter.hu> Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; <axboe@kernel.dk> Sent: Thursday, April 15, 2010 11:23 AM Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look

News:
(Reminder of the current state: xfs_repair fixed the FS, then the kernel reported the corruption again and crashed; I wrote the previous letter to report this.)
Yesterday I stopped the service and ran xfs_repair (new version only) on 2 filesystems, but they were clean! (This shows me that the reported corruption was only in memory, or that the kernel repaired it on reboot.)
(XFS debug was turned on beforehand.)
This morning I have more messages in the syslog from sdb2 again.
At this point, I don't know what to think.
http://download.netcenter.hu/bughunt/20100413/messages-16

Regards,
--
