Re: [possible regression] 2.6.22 reiserfs/libata sporadically hangs on resume from hibernation

Previous thread: How to enable dev_dbg messaging by Jay Cliburn on Friday, June 29, 2007 - 8:29 pm. (2 messages)

Next thread: Re: [SPARC64]: Add linux/pagemap.h to asm/tlb.h by Sam Ravnborg on Saturday, June 30, 2007 - 12:17 am. (1 message)
From: Andrey Borzenkov
Date: Friday, June 29, 2007 - 9:59 pm

Since 2.6.18 I do not have suspend to RAM; now I am starting to lose suspen=
d=20
to disk :)

Environment - vanilla kernel (2.6.22-rc6 currently + squashfs + single=20
pata_ali patch to switch off DMA on CD-ROM), single root on reiserfs, libat=
a=20
with pata_ali driver.

Until 2.6.22-rc I never had problems with hibernation. With 2.6.22-rc syste=
m=20
hung at least once in every rcX. Up to rc6 those lockups were absolutely=20
silent (black screen without reaction to any key). In rc6 I just got=20
something different. After resume I got on screem:

swsusp: Marking nosave pages: 000000000009f000-0000000000100000
swsusp: Basic memory bitmaps created
swsusp: Basic memory bitmaps freed

After that it just sits there doing nothing. Ther was brief sound of HDD bu=
t I=20
suspect it was related more to power-on. System was responding to power-on=
=20
button press:

ACPI Error (event-0305): No installed handler for fixed event [00000002=20
20070125]

And SysRq was functioning. Unfortunately I do not have serial console so I=
=20
copy manually stacks from several last screens of output; I have tried to=20
make a photo but right now my kbluetooth is refusing to work at all so I=20
cannot transfer them :( (but I suspect quality would be too bad anyway)

laptop_mode D
	io_schedule+0xe/0x20
	sync_buffer+0x35/0x40
	__wait_on_bit+0x45/0x70
	out_of_line_wait_on_bit+0x6c/0x80
	__wait_on_buffer+0x27/0x30
	search_by_key+0x15e/0x1250 [reiserfs]
	reiserfs_read_locked_inode+0x64/0x570 [reiserfs]
	reiserfs_iget+0x7e/0xa0 [reiserfs]
	reiserfs_lookup+0xc7/0x120 [reiserfs]
	do_lookup+0x138/0x180
	__link_path_walk+0x787/0xce0
	link_path_walk+0x44/0xc0
	path_walk+0x18/0x20
	do_path_lookup_0x88/0x210
	__path_lookupintent_open+0x4d/0x90
	path_lookup_open+0x1f/0x30
	open_exec+0x28/0xb0
	do_execve+0x36/0x1d0
	sys_execve+0x2e/0x80
	sysenter_past_esp+0x5f/0x99

90clock ...
From: Rafael J. Wysocki
Date: Saturday, June 30, 2007 - 1:37 pm

I see you're using CFQ as the default IO scheduler.  Can you please switch to
AS and see if that changes anything?

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth
-

From: Andrey Borzenkov
Date: Saturday, June 30, 2007 - 2:34 pm

If you could explain how to ... (I never understood what those two numbers=
=20
mean :) ) Here is disassembled function

 4168                   .section .sched.text
 4169                   .p2align 4,,15
 4170                   .globl io_schedule
 4171                   .type io_schedule,@function
 4172                   io_schedule:
 4173 0cd0 55           pushl %ebp
 4174 0cd1 89E5         movl %esp,%ebp
 4175
 4176 0cd3 FF05140A     incl per_cpu__runqueues+2388
 4176      0000
 4177
 4178 0cd9 E8FCFFFF     call schedule
 4178      FF
 4179
 4180 0cde FF0D140A     decl per_cpu__runqueues+2388
 4180      0000
 4181
 4182 0ce4 5D           popl %ebp
 4183 0ce5 C3           ret

Sure, but given that I have no idea how to reproduce the lockup, we may nev=
er=20
know whether it actually helped.
From: Michal Piotrowski
Date: Saturday, June 30, 2007 - 4:33 pm

gdb vmlinux

(gdb) l *io_schedule+0xe

Regards,
Michal

-- 
LOG
http://www.stardust.webpages.pl/log/
-

From: Rafael J. Wysocki
Date: Sunday, July 1, 2007 - 3:09 am

Michal has already done that. :-)


Well, if the lockup never happens with AS, that will indicate something ...

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth
-

From: Andrey Borzenkov
Date: Sunday, July 15, 2007 - 11:56 am

(gdb) l *io_schedule+0xe
0xc02aa84e is in io_schedule (include2/asm/atomic.h:110).
105      *
106      * Atomically decrements @v by 1.
107      */
108     static __inline__ void atomic_dec(atomic_t *v)
109     {
110             __asm__ __volatile__(
111                     LOCK_PREFIX "decl %0"
112                     :"+m" (v->counter));
113     }

Well, I was about to say that it is probably gone as it hit again. Now with=
=20
IDE, so we at least can rule out libata. But reiserfs is still in path, so =
I=20
Cc list. It is stock 2.6.22. I will switch to AS, but it took 2 weeks to=20
happen with CFS and in one week I will be off for a couple of weeks.

Here is hand-copied information about locks and blocked processes.

Showing all locks held in the system:

1 lock held by syslogd/2515
 #0: (&inode->i_mutex){--..}, at: [<c02ab4d15>] mutex_lock+0x21/0x30
1 lock held by X/3800:
 #0: (&mm->mmap_sem){----}, at: [<c02ae2e8>] do_page_fault+0x1e8/0x5e0
6 times migetty - 1 lock held by mingetty/3838:
 #0: (&tty->atomic_read_lock){--..}, at: [<c02ab0915>]=20
mutex_lock_interruptible+0x21/0x30
2 times zsh - 1 lock held by zsh/4276:
 #0: (&tty->atomic_read_lock){--..}, at: [<c02ab0915>]=20
mutex_lock_interruptible+0x21/0x30
1 lock held by consolehelper-g/21231:
 #0: (&inode->i_mutex){--..}, at: [<c02ab4b1>] mutex_lock+0x21/0x30
1 lock held by kio_http/31282:
 #0: (&inode->i_mutex){--..}, at: [<c02ab4b1>] mutex_lock+0x21/0x30

and list of blocked tasks:

syslogd
 io_schedule+0xe/0x20
 sync_buffer+0x35/0x40
 __wait_on_bit+0x45/0x70
 out_of_line_wait_on_bit+0x50/0x60
 __wait_on_buffer+0x27/0x30
 flush_commit_list+0x397/0x610 [reiserfs]
 do_journal_end+0xadc/0xc90 [reiserfs]
 journal_end_sync+0x5d/0x70 [reiserfs]
 reiserfs_commit_for_inode+0x17e/0x1a0 [reiserfs]
 reiserfs_sync_file+0x2d/0x70 [reiserfs]
 do_fsync+0x28/0x40
 sys_fsync+0xd/0x10
 sysenter_past_esp+0x5f/0x99

X
 io_schedule+0xe/0x20
 sync_page+0x3a/0x50
 __wait_on_bit_lock+0x3f/0x70
 ...
From: Andrey Borzenkov
Date: Sunday, September 2, 2007 - 4:29 am

I thought it is gone but it just happened again with 2.6.23-rc5. I thought =
I=20
have been running AS but no, I did use CFQ. Now I definitely switched to AS=
=20
default; let's see ...
From: Kasper Sandberg
Date: Sunday, September 2, 2007 - 7:49 am

Sorry for top posting, but this is MAYBE a related matter, i am not
sure.

the thing is, i am running with libata and reiserfs on a raid5 with 6
disks, and after i changed to libata it has worked excellently (before
it used to give DMA errors and then go boom).

however now i sometimes, if theres some load on the array, see that the
hdd leds go fully on, and for ~10sec to 1 min, all IO just stops
complete, and after the time, it resumes and works perfectly.

any ideas? and i also bring this up as it may give a clue as to what
causes this. - I also use CFQ


-

From: Pavel Machek
Date: Saturday, June 30, 2007 - 4:03 pm

Actually, I see laptop_mode being locked. laptop_mode does disk
spindowns, which is somehow unusual. Does it happen w/o laptop mode?
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Andrey Borzenkov
Date: Sunday, September 9, 2007 - 7:00 am

I just had the same lockup on resume using AS with 2.6.23-rc5.
From: Rafael J. Wysocki
Date: Sunday, September 9, 2007 - 10:07 am

Hm.  Does your root partition sit on reiserfs?
-

From: Andrey Borzenkov
Date: Sunday, September 9, 2007 - 10:40 am

yes - "single root on reiserfs"
From: Andrey Borzenkov
Date: Wednesday, November 21, 2007 - 10:12 am

[Empty message]
From: Rafael J. Wysocki
Date: Wednesday, November 21, 2007 - 12:47 pm

And you never know when it happens ...

I have no idea.  It's probably related to your hardware configuration somehow,
as it doesn't seem to be reproducible in general.

Regards,
Rafael
-

From: Andrey Borzenkov
Date: Wednesday, March 12, 2008 - 9:08 pm

=46or the record - it happened again under 2.6.25-rc5; it reall drives
me nuts. Of course every time it happens I do not have infrastructure to use
netconsole :( I am about to reformat the whole with ext3 as this seems
to always hang somewhere in fs access and looks like not many people
use reiser today.
From: Rafael J. Wysocki
Date: Thursday, March 13, 2008 - 1:46 pm

Well, it may be related to having the root partition on reiserfs.  I've never
seen anything like this with non-root reiserfs partitions ...

Thanks,
Rafael
--

From: Andrey Borzenkov
Date: Sunday, March 16, 2008 - 12:04 pm

This one looks the hell similar to what I posted and Ingo is using ext3 ...

http://marc.info/?l=3Dlinux-kernel&m=3D120479427912295&w=3D2
From: Rafael J. Wysocki
Date: Sunday, March 16, 2008 - 12:25 pm

Hm, this is not related to hibernation.

Perhaps hibernation just makes it easier to trigger the bug.

There's entire thread related to this problem, please have a look at it:
http://lkml.org/lkml/2008/3/16/55

Thanks,
Rafael
--

From: Andrey Borzenkov
Date: Sunday, March 16, 2008 - 9:00 pm

As Alan pointed, the patch is likely to cause other issues; also I had the
issue with libata as well :(
Previous thread: How to enable dev_dbg messaging by Jay Cliburn on Friday, June 29, 2007 - 8:29 pm. (2 messages)

Next thread: Re: [SPARC64]: Add linux/pagemap.h to asm/tlb.h by Sam Ravnborg on Saturday, June 30, 2007 - 12:17 am. (1 message)