Since 2.6.18 I do not have suspend to RAM; now I am starting to lose suspen= d=20 to disk :) Environment - vanilla kernel (2.6.22-rc6 currently + squashfs + single=20 pata_ali patch to switch off DMA on CD-ROM), single root on reiserfs, libat= a=20 with pata_ali driver. Until 2.6.22-rc I never had problems with hibernation. With 2.6.22-rc syste= m=20 hung at least once in every rcX. Up to rc6 those lockups were absolutely=20 silent (black screen without reaction to any key). In rc6 I just got=20 something different. After resume I got on screem: swsusp: Marking nosave pages: 000000000009f000-0000000000100000 swsusp: Basic memory bitmaps created swsusp: Basic memory bitmaps freed After that it just sits there doing nothing. Ther was brief sound of HDD bu= t I=20 suspect it was related more to power-on. System was responding to power-on= =20 button press: ACPI Error (event-0305): No installed handler for fixed event [00000002=20 20070125] And SysRq was functioning. Unfortunately I do not have serial console so I= =20 copy manually stacks from several last screens of output; I have tried to=20 make a photo but right now my kbluetooth is refusing to work at all so I=20 cannot transfer them :( (but I suspect quality would be too bad anyway) laptop_mode D io_schedule+0xe/0x20 sync_buffer+0x35/0x40 __wait_on_bit+0x45/0x70 out_of_line_wait_on_bit+0x6c/0x80 __wait_on_buffer+0x27/0x30 search_by_key+0x15e/0x1250 [reiserfs] reiserfs_read_locked_inode+0x64/0x570 [reiserfs] reiserfs_iget+0x7e/0xa0 [reiserfs] reiserfs_lookup+0xc7/0x120 [reiserfs] do_lookup+0x138/0x180 __link_path_walk+0x787/0xce0 link_path_walk+0x44/0xc0 path_walk+0x18/0x20 do_path_lookup_0x88/0x210 __path_lookupintent_open+0x4d/0x90 path_lookup_open+0x1f/0x30 open_exec+0x28/0xb0 do_execve+0x36/0x1d0 sys_execve+0x2e/0x80 sysenter_past_esp+0x5f/0x99 90clock ...
I see you're using CFQ as the default IO scheduler. Can you please switch to AS and see if that changes anything? Greetings, Rafael -- "Premature optimization is the root of all evil." - Donald Knuth -
If you could explain how to ... (I never understood what those two numbers= =20 mean :) ) Here is disassembled function 4168 .section .sched.text 4169 .p2align 4,,15 4170 .globl io_schedule 4171 .type io_schedule,@function 4172 io_schedule: 4173 0cd0 55 pushl %ebp 4174 0cd1 89E5 movl %esp,%ebp 4175 4176 0cd3 FF05140A incl per_cpu__runqueues+2388 4176 0000 4177 4178 0cd9 E8FCFFFF call schedule 4178 FF 4179 4180 0cde FF0D140A decl per_cpu__runqueues+2388 4180 0000 4181 4182 0ce4 5D popl %ebp 4183 0ce5 C3 ret Sure, but given that I have no idea how to reproduce the lockup, we may nev= er=20 know whether it actually helped.
gdb vmlinux (gdb) l *io_schedule+0xe Regards, Michal -- LOG http://www.stardust.webpages.pl/log/ -
Michal has already done that. :-) Well, if the lockup never happens with AS, that will indicate something ... Greetings, Rafael -- "Premature optimization is the root of all evil." - Donald Knuth -
(gdb) l *io_schedule+0xe
0xc02aa84e is in io_schedule (include2/asm/atomic.h:110).
105 *
106 * Atomically decrements @v by 1.
107 */
108 static __inline__ void atomic_dec(atomic_t *v)
109 {
110 __asm__ __volatile__(
111 LOCK_PREFIX "decl %0"
112 :"+m" (v->counter));
113 }
Well, I was about to say that it is probably gone as it hit again. Now with=
=20
IDE, so we at least can rule out libata. But reiserfs is still in path, so =
I=20
Cc list. It is stock 2.6.22. I will switch to AS, but it took 2 weeks to=20
happen with CFS and in one week I will be off for a couple of weeks.
Here is hand-copied information about locks and blocked processes.
Showing all locks held in the system:
1 lock held by syslogd/2515
#0: (&inode->i_mutex){--..}, at: [<c02ab4d15>] mutex_lock+0x21/0x30
1 lock held by X/3800:
#0: (&mm->mmap_sem){----}, at: [<c02ae2e8>] do_page_fault+0x1e8/0x5e0
6 times migetty - 1 lock held by mingetty/3838:
#0: (&tty->atomic_read_lock){--..}, at: [<c02ab0915>]=20
mutex_lock_interruptible+0x21/0x30
2 times zsh - 1 lock held by zsh/4276:
#0: (&tty->atomic_read_lock){--..}, at: [<c02ab0915>]=20
mutex_lock_interruptible+0x21/0x30
1 lock held by consolehelper-g/21231:
#0: (&inode->i_mutex){--..}, at: [<c02ab4b1>] mutex_lock+0x21/0x30
1 lock held by kio_http/31282:
#0: (&inode->i_mutex){--..}, at: [<c02ab4b1>] mutex_lock+0x21/0x30
and list of blocked tasks:
syslogd
io_schedule+0xe/0x20
sync_buffer+0x35/0x40
__wait_on_bit+0x45/0x70
out_of_line_wait_on_bit+0x50/0x60
__wait_on_buffer+0x27/0x30
flush_commit_list+0x397/0x610 [reiserfs]
do_journal_end+0xadc/0xc90 [reiserfs]
journal_end_sync+0x5d/0x70 [reiserfs]
reiserfs_commit_for_inode+0x17e/0x1a0 [reiserfs]
reiserfs_sync_file+0x2d/0x70 [reiserfs]
do_fsync+0x28/0x40
sys_fsync+0xd/0x10
sysenter_past_esp+0x5f/0x99
X
io_schedule+0xe/0x20
sync_page+0x3a/0x50
__wait_on_bit_lock+0x3f/0x70
...I thought it is gone but it just happened again with 2.6.23-rc5. I thought = I=20 have been running AS but no, I did use CFQ. Now I definitely switched to AS= =20 default; let's see ...
Sorry for top posting, but this is MAYBE a related matter, i am not sure. the thing is, i am running with libata and reiserfs on a raid5 with 6 disks, and after i changed to libata it has worked excellently (before it used to give DMA errors and then go boom). however now i sometimes, if theres some load on the array, see that the hdd leds go fully on, and for ~10sec to 1 min, all IO just stops complete, and after the time, it resumes and works perfectly. any ideas? and i also bring this up as it may give a clue as to what causes this. - I also use CFQ -
Actually, I see laptop_mode being locked. laptop_mode does disk spindowns, which is somehow unusual. Does it happen w/o laptop mode? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
I just had the same lockup on resume using AS with 2.6.23-rc5.
Hm. Does your root partition sit on reiserfs? -
yes - "single root on reiserfs"
And you never know when it happens ... I have no idea. It's probably related to your hardware configuration somehow, as it doesn't seem to be reproducible in general. Regards, Rafael -
=46or the record - it happened again under 2.6.25-rc5; it reall drives me nuts. Of course every time it happens I do not have infrastructure to use netconsole :( I am about to reformat the whole with ext3 as this seems to always hang somewhere in fs access and looks like not many people use reiser today.
Well, it may be related to having the root partition on reiserfs. I've never seen anything like this with non-root reiserfs partitions ... Thanks, Rafael --
This one looks the hell similar to what I posted and Ingo is using ext3 ... http://marc.info/?l=3Dlinux-kernel&m=3D120479427912295&w=3D2
Hm, this is not related to hibernation. Perhaps hibernation just makes it easier to trigger the bug. There's entire thread related to this problem, please have a look at it: http://lkml.org/lkml/2008/3/16/55 Thanks, Rafael --
As Alan pointed, the patch is likely to cause other issues; also I had the issue with libata as well :(
