Re: [Bisected Regression in 2.6.35] A full tmpfs filesystem causes hibernation to hang

From: M. Vefa Bicakci
Date: Saturday, August 14, 2010 - 10:25 pm

Hello all,

I am using Debian Sid on a Toshiba Satellite A100 laptop. After testing
2.6.35 for a while, I noticed that sometimes my hibernation attempts
would fail. I should say that I never had such a problem before 2.6.35.
The hibernation process hangs with 2.6.35 after printing the following:

=== 8< ===
...
Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
PM: Preallocating image memory...
=== >8 ===

After a short investigation, I found out that this only happens when my
tmpfs filesystem on /tmp had a lot of data in it. When my tmpfs is empty,
I have no problems.

So I wrote a short script which fills up the tmpfs on /tmp and tries to
hibernate, and I bisected the kernel using this script.

The end result is that the following commit causes this regression:

=== 8< ===
commit bb21c7ce18eff8e6e7877ca1d06c6db719376e3c
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date:   Fri Jun 4 14:15:05 2010 -0700

    vmscan: fix do_try_to_free_pages() return value when priority==0 reclaim failure

    ...
=== >8 ===

I have run 2.6.35-rc6, 2.6.35 and 2.6.35.1 with this commit reverted,
and I am happy to say that I haven't experienced any problems for at
least 17 days.

It looks like this change was included with 2.6.35-rc1. I am sorry
for not testing earlier.

I am willing to do testing in case anyone would like me to try patches.

Regards,

M. Vefa Bicakci
--

From: KOSAKI Motohiro
Date: Monday, August 16, 2010 - 7:37 pm

Wow.

I'm very surprised by this report because 1) the above commit changed the
do_try_to_free_pages() return value, but 2) the current hibernation code
ignores this return value. Hmm... I have to investigate this very
interesting issue.

Thanks for this report, and please give me a bit of time.



--

From: KOSAKI Motohiro
Date: Sunday, August 22, 2010 - 4:06 am

Hmm...
I've tested the hibernation case for a while, but I had no luck. I couldn't
reproduce your issue. Very sorry. Can you please help with our debugging?
If possible, I'd like you to run the following three tests.

1. Please let me know your machine & test script

% cat /proc/meminfo
% cat /proc/vmstat
% cat /proc/zoneinfo
% df
% cat your-fills-up-the-tmpfs-script

2. Force a call to shrink_all_memory() and show the result

% cat /proc/meminfo
% cat /proc/zoneinfo
# echo 1 > /proc/sys/vm/shrink_all_memory
# tail /var/log/messages
% cat /proc/meminfo
% cat /proc/zoneinfo


3. reset zone_reclaim_stat and rerun shrink_all_memory

# echo 1 > /proc/sys/vm/reset_reclaim_stat
% cat /proc/meminfo
% cat /proc/zoneinfo
# echo 1 > /proc/sys/vm/shrink_all_memory
# tail /var/log/messages
% cat /proc/meminfo
% cat /proc/zoneinfo


From: M. Vefa Bicakci
Date: Sunday, August 22, 2010 - 9:28 am

First of all, thanks a lot for spending time on this regression
I have been experiencing. I really appreciate it.

Sorry to hear that you weren't able to reproduce the issue. Well the
good (or bad?) news is that I am able to reproduce it with 2.6.35.3
with your patches applied.

I should note that after applying your patches and trying a hibernation
with a full tmpfs, a printk prints extra information on the screen just
before the hibernation process hangs. The last time I ran it, it printed:

=== 8< ===
shrink_all_memory: req: 342067 reclaimed: 27062 free: 340221
=== >8 ===

A piece of information that may be relevant or irrelevant is that my
swap space is on a dm-crypt volume.

Appended are the results of the tests you asked me to carry out.
If you'd like, I can send in private a tarball containing this
information in separate files.

Once again, thanks a lot for helping out.


Please note that I filled up the tmpfs filesystem between step 1

MemTotal:        3104484 kB
MemFree:         2817616 kB
Buffers:           31156 kB
Cached:           142124 kB
SwapCached:            0 kB
Active:           116464 kB
Inactive:         137424 kB
Active(anon):      80852 kB
Inactive(anon):    24820 kB
Active(file):      35612 kB
Inactive(file):   112604 kB
Unevictable:          32 kB
Mlocked:              32 kB
HighTotal:       2226632 kB
HighFree:        1994008 kB
LowTotal:         877852 kB
LowFree:          823608 kB
SwapTotal:       1999540 kB
SwapFree:        1999540 kB
Dirty:               116 kB
Writeback:             0 kB
AnonPages:         80636 kB
Mapped:            43768 kB
Shmem:             25068 kB
Slab:              15120 kB
SReclaimable:       7516 kB
SUnreclaim:         7604 kB
KernelStack:        1856 kB
PageTables:         2420 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     3551780 kB
Committed_AS:     337784 kB
VmallocTotal:     122880 kB
VmallocUsed:       16308 ...
From: KOSAKI Motohiro
Date: Wednesday, August 25, 2010 - 1:55 am

!! Your swap partition is smaller than physical memory. As far as I know,
the swap partition needs to be twice the size of physical memory. Can you
please try changing your swap configuration?

Rafael, please correct me if I'm wrong.

Thanks.



--

From: M. Vefa Bicakci
Date: Wednesday, August 25, 2010 - 3:11 am

[Please see my reply in the other thread.]


--

From: Rafael J. Wysocki
Date: Wednesday, August 25, 2010 - 10:31 am

No, we only save 50% of RAM (at most) during hibernation.

Thanks,
Rafael
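
To put numbers on this using the /proc/meminfo values posted earlier in the
thread, here is a minimal sketch in C. It assumes the image is capped at
exactly half of MemTotal and ignores image metadata and anything else that
already occupies swap. The helper names are made up for illustration and are
not kernel functions.

```c
#include <stdbool.h>

/* Hypothetical helper: upper bound on the hibernation image in kB,
 * assuming at most 50% of RAM is saved, as stated above. */
unsigned long max_image_kb(unsigned long mem_total_kb)
{
	return mem_total_kb / 2;
}

/* Hypothetical helper: true if swap alone could hold such an image.
 * Pages swapped out while shrinking memory are not accounted for. */
bool swap_can_hold_image(unsigned long mem_total_kb,
			 unsigned long swap_total_kb)
{
	return max_image_kb(mem_total_kb) <= swap_total_kb;
}
```

With MemTotal = 3104484 kB and SwapTotal = 1999540 kB, max_image_kb() gives
1552242 kB, which is below the swap size, so under these assumptions the
original swap partition was not too small for the image itself.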
--

From: M. Vefa Bicakci
Date: Sunday, August 22, 2010 - 11:27 pm

Hello,


No problem. I did the tests again according to your new instructions,
and I am appending the results.

Regards,



MemTotal:        3104484 kB
MemFree:         1258780 kB
Buffers:           30820 kB
Cached:          1693836 kB
SwapCached:            0 kB
Active:          1670344 kB
Inactive:         137960 kB
Active(anon):    1633940 kB
Inactive(anon):    26224 kB
Active(file):      36404 kB
Inactive(file):   111736 kB
Unevictable:          32 kB
Mlocked:              32 kB
HighTotal:       2226632 kB
HighFree:         437684 kB
LowTotal:         877852 kB
LowFree:          821096 kB
SwapTotal:       1999540 kB
SwapFree:        1999540 kB
Dirty:                28 kB
Writeback:             0 kB
AnonPages:         83676 kB
Mapped:            44220 kB
Shmem:           1576520 kB
Slab:              17136 kB
SReclaimable:       9480 kB
SUnreclaim:         7656 kB
KernelStack:        1832 kB
PageTables:         2440 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     3551780 kB
Committed_AS:    1892636 kB
VmallocTotal:     122880 kB
VmallocUsed:       16308 kB
VmallocChunk:      92764 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       4096 kB
DirectMap4k:       24568 kB

nr_free_pages 314850
nr_inactive_anon 6480
nr_active_anon 408485
nr_inactive_file 27935
nr_active_file 9101
nr_unevictable 8
nr_mlock 8
nr_anon_pages 20919
nr_mapped 11055
nr_file_pages 431089
nr_dirty 8
nr_writeback 0
nr_slab_reclaimable 2370
nr_slab_unreclaimable 1914
nr_page_table_pages 610
nr_kernel_stack 229
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 394054
pgpgin 150719
pgpgout 5312
pswpin 0
pswpout 0
pgalloc_dma 2
pgalloc_normal 116354
pgalloc_high 2637682
pgalloc_movable 0
pgfree 3069531
pgactivate 2339016
pgdeactivate 160
pgfault 699155
pgmajfault ...
From: KOSAKI Motohiro
Date: Tuesday, August 24, 2010 - 5:48 pm

tmpfs files are using about 1.5GB memory.




My code doesn't cause the hang!


Hmm... Hmm...
To be honest, I have no idea why your hang happened.

1) Zone normal is not used. Your system doesn't need additional reclaim
   at all.
2) The reclaim logic doesn't seem to cause the hang.


Can you please try the following additional test?

# echo 8 > /proc/sysrq-trigger
# echo disk > /sys/power/state




--

From: KOSAKI Motohiro
Date: Wednesday, August 25, 2010 - 1:39 am

Can you please try to avoid using /tmp? As I said,

mount -t tmpfs none /mnt/another_tmpfs
dd if=/dev/zero of=/mnt/another_tmpfs/tmp bs=1024k count=1600
shred -vn1 /mnt/another_tmpfs/tmp


That is, we need to know whether your issue is a lots-of-anon-pages issue or a
filesystem-full issue.


--

From: M. Vefa Bicakci
Date: Wednesday, August 25, 2010 - 3:10 am

(Sorry, I am sending this again as I forgot to include LKML in the CC.)


Hello,

I have changed my configuration so that I have a 8 gigabyte large
swap partition, and I have also made the changes you suggested
to my hibernation/fill-tmpfs script so that a tmpfs other than
/tmp is used.

Unfortunately, nothing changed. I still get hangs after a few lines
are printed to the console. The last two lines are from your patch.
Here is an observation from an actual test which ended with a hang:

=== 8< ===
PM: Preallocating image memory
shrink_all_memory: start
shrink_all_memory: req: 375019 reclaimed: 48055 free: 326810
=== >8 ===

One thing I should note is that, before your commit, I never had
any problems even though my swap size was not two times the size
of my physical memory and even though I used tmpfs as /tmp.

If there is anything I can do to debug this problem please let me
know.

Regards,

M. Vefa Bicakci


--

From: M. Vefa Bicakci
Date: Thursday, August 26, 2010 - 3:36 am

Hello!

First of all, thanks a lot for your help - I really appreciate it.

I applied your new patches on top of your old patches. Hopefully that
was okay.

Unfortunately, it didn't work this time. Here's a sample output from the
new patch.

=== 8< ===
[58.050208] PM: Preallocating image memory...
[58.159881] shrink_all_memory start
[58.232411] PM: shrink memory: pass=1, req:312373 reclaimed:15864 free:358420
[58.342041] PM: shrink memory: pass=2, req:296509 reclaimed:21837 free:362167
[60.690035] PM: shrink memory: pass=3, req:274672 reclaimed:25982 free:348006
[61.754931] PM: shrink memory: pass=4, req:248690 reclaimed:49623 free:371589
[64.361714] PM: shrink memory: pass=5, req:199067 reclaimed:74683 free:396695
[64.361769] shrink_all_memory: req:124384 reclaimed:74683 free:396695
=== >8 ===

The interesting thing is that even though there is a lot of free memory at the
end, it still hangs. I also included the timestamps; note the one and two second
delays between the passes.

Please let me know if there is anything I can do.

Regards,

M. Vefa Bicakci
--

From: KOSAKI Motohiro
Date: Sunday, August 29, 2010 - 7:28 pm

Grr. I'm surprised by this result ;-)
shrink_all_memory() finished shrinking memory successfully, but your
system still hangs immediately afterwards. I have no idea why this
mystery occurs.

I prepared the next debugging patch. It adds plenty of debug printks. I hope
it sheds light on which path makes the system hang.

1. apply my new patch

2. Enable the following PM debug options in Kconfig

  [*] Power Management support
  [*]   Power Management Debug Support
  [*]     Extra PM attributes in sysfs for low-level debugging/testing
  [*]     Verbose Power Management debugging

3. Append the following kernel boot option to the grub configuration file

	no_console_suspend=1

4. Build the kernel and reboot
5. Some preparation
   # echo 8 > /proc/sysrq-trigger
   # cd /sys/power
   # echo 1 > pm_trace
   # echo 0 > pm_async


This is expected result because tmpfs shrink need swap-out. then

Please send me your .config and full dmesg.


Many thanks for helping us!
From: M. Vefa Bicakci
Date: Monday, August 30, 2010 - 9:54 am

Hello,

I have followed your instructions, with one exception: I have also
enabled CONFIG_PM_TRACE so that I would have /sys/power/pm_trace.

This time I had some more output, as expected. I double checked what
I typed while looking at the screen-shot I took with my camera. Here's
the output:

=== 8< ===
PM: Marking nosave pages: ...0009f000 - ...000100000
PM: basic memory bitmaps created
PM: Syncing filesystems ... done
Freezing user space processes ... (elapsed 0.01 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
PM: Preallocating image memory...
shrink_all_memory start
PM: shrink memory: pass=1, req:310171 reclaimed:15492 free:360936
PM: shrink memory: pass=2, req:294679 reclaimed:28864 free:373981
PM: shrink memory: pass=3, req:265815 reclaimed:60311 free:405374
PM: shrink memory: pass=4, req:205504 reclaimed:97870 free:443024
PM: shrink memory: pass=5, req:107634 reclaimed:146948 free:492141
shrink_all_memory: req:107634 reclaimed:146948 free:492141
PM: preallocate_image_highmem 556658 278329
PM: preallocate_image_memory 103139 103139
PM: preallocate_highmem_fraction 183908 556658 760831 -> 183908
=== >8 ===

According to your patch, the next output should have been
"preallocate_image_memory ...", but it never gets printed, so the
hang point should be that function.

I am attaching my dmesg output which I got after the failed hibernation
attempt and my .config file. Please note that the attached .config file
is a trimmed version of the .config I usually use on my computer. I trimmed
it so that it compiles faster, but (mostly) has support for devices I might
use.

Thanks a lot for your help, and please let me know if I can do anything else.

Regards,

M. Vefa Bicakci
From: KOSAKI Motohiro
Date: Monday, August 30, 2010 - 11:35 pm

Great!
I've attached a more verbose debug message patch and a trial bug-fix patch.

From: KOSAKI Motohiro
Date: Monday, August 30, 2010 - 11:54 pm

Oops, please apply the attached patch instead of 0002-add-gfp_noretry.patch.

Thanks.
From: M. Vefa Bicakci
Date: Tuesday, August 31, 2010 - 4:25 am

Hello!

I have applied the patches you mentioned, and rebuilt and tested the
2.6.35.4 kernel. I am really happy to say that your patches (cumulatively)
fixed the issue!

Unfortunately, because the hibernation is rather quick, I am having a
hard time getting screen-shots with my camera. If you would like, I can
try to put some sleeps around the code so that I can get the output for
you.

For the record, the attached patch is the cumulative version of all of
your patches. It applies cleanly to 2.6.35.4, and most importantly, it
fixes the issue.

All in all, thanks a lot!

Is there anything else I can do? Would you like me to try a trimmed
version of your patch, maybe without the debugging parts and the 5-pass
swap-out procedure, which I am not sure is essential or not?

Thanks again,

M. Vefa Bicakci
From: KOSAKI Motohiro
Date: Tuesday, August 31, 2010 - 5:48 pm

Rafael, this log means hibernate_preallocate_memory() has a bug.
It allocates memory in the following order:
 1. preallocate_image_highmem()  (i.e. __GFP_HIGHMEM)
 2. preallocate_image_memory()   (i.e. GFP_KERNEL)
 3. preallocate_highmem_fraction (i.e. __GFP_HIGHMEM)
 4. preallocate_image_memory()   (i.e. GFP_KERNEL)

But please imagine the following scenario (which is Vefa's scenario):
 - The system has 3GB of memory: 1GB is normal, 2GB is highmem.
 - All normal memory is free.
 - 1.5GB of highmem is used for tmpfs. The remaining 500MB is free.

In that case, hibernate_preallocate_memory() works as follows:

1. call preallocate_image_highmem(1GB)
2. call preallocate_image_memory(500M)		total 1.5GB allocated
3. call preallocate_highmem_fraction(660M)	total 2.2GB allocated

Then all of the normal zone memory is exhausted. The next preallocate_image_memory()
causes OOM, and oom_killer_disabled causes an infinite loop.
(Not taking oom_killer_disabled into account is a vmscan bug. I'll fix it soon.)

The problem is that alloc_pages(__GFP_HIGHMEM) -> alloc_pages(GFP_KERNEL) is
the wrong order. alloc_pages(__GFP_HIGHMEM) may allocate pages from a lower zone.
Then the next alloc_pages(GFP_KERNEL) leads to OOM.

Please consider the alloc_pages(GFP_KERNEL) -> alloc_pages(__GFP_HIGHMEM) order.
Even though the vmscan fix can avoid the infinite loop, the OOM situation might cause
a big slowdown on highmem machines. That doesn't seem good.
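
The scenario above can be played through with a toy userspace model. This is
only a sketch under stated assumptions, not kernel code: sizes are in MB, a
highmem request is assumed to use free highmem first, then fall back to free
normal memory, and only then reclaim (swap out) tmpfs pages from highmem,
while a GFP_KERNEL request can only be served from free normal memory. All
names (toy_zones, toy_alloc_kernel, toy_alloc_highmem) are invented for
illustration.

```c
/* Toy model of the two zones, in MB. Not kernel code. */
struct toy_zones {
	unsigned long normal_free;	/* free lowmem */
	unsigned long high_free;	/* free highmem */
	unsigned long high_reclaimable;	/* highmem held by tmpfs, swappable */
};

/* Take up to 'want' MB from a pool and return how much was taken. */
static unsigned long take(unsigned long *pool, unsigned long want)
{
	unsigned long got = want < *pool ? want : *pool;

	*pool -= got;
	return got;
}

/* Models alloc_pages(GFP_KERNEL): only the normal zone can satisfy it. */
unsigned long toy_alloc_kernel(struct toy_zones *z, unsigned long mb)
{
	return take(&z->normal_free, mb);
}

/* Models alloc_pages(__GFP_HIGHMEM): free highmem first, then fallback
 * to free normal memory, then (assumed) reclaim of tmpfs pages. */
unsigned long toy_alloc_highmem(struct toy_zones *z, unsigned long mb)
{
	unsigned long got = take(&z->high_free, mb);

	got += take(&z->normal_free, mb - got);
	got += take(&z->high_reclaimable, mb - got);
	return got;
}
```

Starting from { 1024, 512, 1536 } and requesting 1024 (highmem), 500 (kernel)
and 660 (highmem) in that order leaves normal_free at 0 in this model, so any
further toy_alloc_kernel() call returns 0. Issuing the kernel requests first
lets the same highmem requests succeed as well, because the highmem side can
still fall back on the reclaimable tmpfs pages.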


Thanks.


--

From: Rafael J. Wysocki
Date: Wednesday, September 1, 2010 - 3:02 pm

So, it looks like the problem will go away if we check if there are any normal
pages to allocate from before calling the last preallocate_image_memory()?


There's a problem with the ordering change that it wouldn't be clear how many
pages to request from the normal zone in step 1 and 3.

Thanks,
Rafael 

---
 kernel/power/snapshot.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -1259,7 +1259,7 @@ int hibernate_preallocate_memory(void)
 {
 	struct zone *zone;
 	unsigned long saveable, size, max_size, count, highmem, pages = 0;
-	unsigned long alloc, save_highmem, pages_highmem;
+	unsigned long alloc, save_highmem, pages_highmem, size_normal;
 	struct timeval start, stop;
 	int error;
 
@@ -1296,6 +1296,7 @@ int hibernate_preallocate_memory(void)
 		else
 			count += zone_page_state(zone, NR_FREE_PAGES);
 	}
+	size_normal = count;
 	count += highmem;
 	count -= totalreserve_pages;
 
@@ -1344,7 +1345,13 @@ int hibernate_preallocate_memory(void)
 	size = preallocate_highmem_fraction(size, highmem, count);
 	pages_highmem += size;
 	alloc -= size;
-	pages += preallocate_image_memory(alloc);
+	/* Check if there are any non-highmem pages to allocate from. */
+	if (alloc_normal < size_normal) {
+		size_normal -= alloc_normal;
+		if (alloc > size_normal)
+			alloc = size_normal;
+		pages += preallocate_image_memory(alloc);
+	}
 	pages += pages_highmem;
 
 	/*
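
The check added in the last hunk above can be restated as a small standalone
function. This is only a userspace sketch for reading along: the function name
is hypothetical, and alloc_normal is passed as a parameter here, whereas the
patch reads it as an existing variable in snapshot.c.

```c
/* Userspace restatement of the check in the hunk above.
 * alloc:        pages still wanted from the final preallocation
 * size_normal:  the non-highmem page count saved as size_normal in the patch
 * alloc_normal: non-highmem pages already preallocated
 * Returns the number of pages to request; 0 means the call is skipped. */
unsigned long clamp_normal_request(unsigned long alloc,
				   unsigned long size_normal,
				   unsigned long alloc_normal)
{
	unsigned long room;

	if (alloc_normal >= size_normal)
		return 0;

	room = size_normal - alloc_normal;
	return alloc > room ? room : alloc;
}
```

For example, with size_normal = 1000 and alloc_normal = 950, a request for 100
pages is cut down to 50, and with alloc_normal = 1000 no request is made at
all, which is the case that previously ran into OOM.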
--

From: KOSAKI Motohiro
Date: Wednesday, September 1, 2010 - 5:31 pm

Looks fine, but I have one question. hibernate_preallocate_memory() calls
preallocate_image_memory() twice. Why do you only care about the latter one?

OK, I see. Thanks for correcting my mistake.




--

From: Rafael J. Wysocki
Date: Thursday, September 2, 2010 - 12:57 pm

The first one is mandatory, ie. if we can't allocate the requested number of
pages at this point, we fail the entire hibernation.  In that case the
performance hit doesn't matter.

Thanks,
Rafael
--

From: Rafael J. Wysocki
Date: Thursday, September 2, 2010 - 1:24 pm

IOW, your patch at http://lkml.org/lkml/2010/9/2/262 is still necessary to
protect against the infinite loop in that case.

Thanks,
Rafael
--

From: KOSAKI Motohiro
Date: Thursday, September 2, 2010 - 5:13 pm

As far as I understand, we need to distinguish two kinds of allocation failure:
  1) failure because there is not enough memory
	-> yes, hibernation should fail
  2) failure because we have already allocated enough lower zone memory
	-> why should we fail?

If the system has a lot of memory, scenario (2) happens more frequently than (1).
I think we need to check the alloc_highmem and alloc_normal variables and call
preallocate_image_highmem() again instead of preallocate_image_memory()
if we've already allocated enough normal memory.

nit?



--

From: Rafael J. Wysocki
Date: Thursday, September 2, 2010 - 6:07 pm

Actually I thought about that, but we don't really see hibernation fail for
this reason.  In all of the tests I carried out the requested 50% of highmem
had been allocated before allocations from the normal zone started to be
made, even if highmem was 100% full at that point.  So this appears to be
a theoretical issue and covering it would require us to change the algorithm
entirely (eg. it doesn't make sense to call preallocate_highmem_fraction() down
the road if that happens).

Thanks,
Rafael
--

From: KOSAKI Motohiro
Date: Thursday, September 2, 2010 - 6:53 pm

OK, thanks. I think I've got your point. Please feel free to use my Reviewed-by
for your fix.

thanks.



--

From: Rafael J. Wysocki
Date: Friday, September 3, 2010 - 6:44 pm

Thanks.

In the meantime, though, I prepared a patch that should address the issue
entirely.  The patch is appended and if it looks good to you, I'd rather use it
instead of the previous one (it is still untested).

Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM / Hibernate: Avoid hitting OOM during preallocation of memory

There is a problem in hibernate_preallocate_memory() that it calls
preallocate_image_memory() with an argument that may be greater than
the number of available non-highmem memory pages.  This may trigger
the OOM condition which in turn can cause significant slowdown to
occur.

To avoid that, modify preallocate_image_memory() so that it checks
if there is a sufficient number of non-highmem pages to allocate from
before calling preallocate_image_pages() and change
hibernate_preallocate_memory() to try to allocate from highmem if
the number of pages allocated by preallocate_image_memory() is too
low.

Adjust free_unnecessary_pages() to take all possible memory
allocation patterns into account.

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 kernel/power/snapshot.c |   66 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 46 insertions(+), 20 deletions(-)

Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -1122,9 +1122,19 @@ static unsigned long preallocate_image_p
 	return nr_alloc;
 }
 
-static unsigned long preallocate_image_memory(unsigned long nr_pages)
+static unsigned long preallocate_image_memory(unsigned long nr_pages,
+					      unsigned long size_normal)
 {
-	return preallocate_image_pages(nr_pages, GFP_IMAGE);
+	unsigned long alloc;
+
+	if (size_normal <= alloc_normal)
+		return 0;
+
+	alloc = size_normal - alloc_normal;
+	if (nr_pages < alloc)
+		alloc = nr_pages;
+
+	return ...
From: KOSAKI Motohiro
Date: Sunday, September 5, 2010 - 7:08 pm

Yeah, this one looks nicer to me :)

Thanks, rafael!



--

From: M. Vefa Bicakci
Date: Monday, September 6, 2010 - 4:27 am

Dear Rafael Wysocki, Kosaki Motohiro and Minchan Kim,

Upon Kosaki Motohiro's kind request via an off-list e-mail,
I tested the following two patches separately with a vanilla
2.6.35.4 kernel:

Patch 1:
	http://lkml.org/lkml/2010/9/5/86

Patch 2:
	http://kerneltrap.org/mailarchive/linux-kernel/2010/9/4/4615426

The first of these was prepared by Minchan Kim, and it fixes
the issue; i.e. no hangs during hibernation with a full tmpfs.

However, the second patch, prepared by Rafael Wysocki, does *not*
fix the problem. I still experience hangs with a full tmpfs upon
hibernation.

As always, I am willing to test newer patches and help in debugging
this issue.

I really appreciate all of your help,

M. Vefa Bicakci
--

From: Rafael J. Wysocki
Date: Monday, September 6, 2010 - 11:43 am

What happens if you apply them both at the same time?

Thanks,
Rafael
--

From: M. Vefa Bicakci
Date: Monday, September 6, 2010 - 6:34 pm

Hello,

When I apply both of the patches, then I don't get any hangs with
hibernation. However, I do get another problem, which I am not sure
is related or not. I should note that I haven't experienced this
with only the vmscan.c patch, but maybe I haven't repeated my test
enough times.

One test consists of an automated run of 7 hibernate/thaw cycles. 

Here's what I got in dmesg in two of the iterations in one test.
Sorry for the long e-mail and the long lines.

=== 8< ===
[  166.512085] PM: Hibernation mode set to 'reboot'
[  166.516503] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[  166.517654] PM: Basic memory bitmaps created
[  166.518781] PM: Syncing filesystems ... done.
[  166.546308] Freezing user space processes ... (elapsed 0.01 seconds) done.
[  166.559596] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[  166.571649] PM: Preallocating image memory... 
[  185.712457] iwl3945: page allocation failure. order:0, mode:0xd0
[  185.714564] Pid: 1225, comm: iwl3945 Not tainted 2.6.35.4-test-mm5v2-vmscan+snapshot-dirty #7
[  185.715741] Call Trace:
[  185.716853]  [<c019aa67>] ? __alloc_pages_nodemask+0x577/0x630
[  185.718126]  [<f8a562c5>] ? iwl3945_rx_allocate+0x75/0x240 [iwl3945]
[  185.719379]  [<c03f0516>] ? schedule+0x356/0x730
[  185.720556]  [<f8a56d50>] ? iwl3945_rx_replenish+0x20/0x50 [iwl3945]
[  185.721914]  [<f8a56dbc>] ? iwl3945_bg_rx_replenish+0x3c/0x50 [iwl3945]
[  185.723929]  [<c014b167>] ? worker_thread+0x117/0x1f0
[  185.725745]  [<f8a56d80>] ? iwl3945_bg_rx_replenish+0x0/0x50 [iwl3945]
[  185.727097]  [<c014ebd0>] ? autoremove_wake_function+0x0/0x40
[  185.728468]  [<c014b050>] ? worker_thread+0x0/0x1f0
[  185.730235]  [<c014e854>] ? kthread+0x74/0x80
[  185.731601]  [<c014e7e0>] ? kthread+0x0/0x80
[  185.732919]  [<c0103cb6>] ? kernel_thread_helper+0x6/0x10
[  185.734851] Mem-Info:
[  185.736144] DMA per-cpu:
[  185.737439] CPU    0: hi:    0, btch:   1 usd:   0
[  185.738635] CPU    1: hi:    0, btch:  ...
From: KOSAKI Motohiro
Date: Monday, September 6, 2010 - 6:58 pm

Hm, interesting.

Rafael's patch seems to work as intended: it preallocates a lot of memory and
then releases the over-allocated memory. But on your system, iwl3945 allocates memory
concurrently. If it tries to allocate before the hibernation code releases the
extra memory, it may get an allocation failure.

So, I'm not sure which behavior is desired.
  1) preallocate plenty of memory
	pros) hibernate faster
	cons) risk of network card memory allocation failure
  2) preallocate little memory
	pros) no network card memory allocation failure
	cons) hibernate slower

But I wonder why this kernel thread is not frozen. AFAIK, hibernation
doesn't need network capability. Is this really intentional?

Rafael, could you please explain the design of hibernation and your
intention?

Vefa, note: this allocation failure doesn't cause any problem. It means the
network card can't receive one network packet. But during hibernation
we can't receive network packets anyway, so no problem.



--

From: Rafael J. Wysocki
Date: Tuesday, September 7, 2010 - 2:44 pm

It's a kernel thread, we don't freeze them by default, only the ones that
directly request to be frozen.

BTW, please note that the card probably allocates from normal zone and that

The design of the preallocator is pretty straightforward.

First, if there's already enough free memory to make a copy of all memory in
use, we simply allocate as much memory as needed for that copy and return
(the size >= saveable condition).

Next, we preallocate as much memory as to accommodate the largest possible
image.  A little more than 50% of RAM is preallocated in this step (this causes
some pages that were in use before to be freed, so the resulting image size is
a little below 50% of RAM).

Next, there is the sysfs file /sys/power/image_size that represents the user's
desired size of the image.  If this number is much less than 50% of RAM,
we do our best to force the mm subsystem to free more pages so that the
resulting image size is possibly close to the desired one.  So, I guess, if
Vefa writes a greater number into /sys/power/image_size (this is in bytes),
the problems should go away. :-)

Still, I see a way to improve things in my patch.  Namely, I guess the number
returned by minimum_image_size() may also be regarded as the number of
non-highmem pages we can't free with good approximation.  Thus the
second argument of preallocate_image_memory() should be
size_normal - "the number returned by minimum_image_size()".

[BTW, there seems to be a bug in minimum_image_size(), because if
saveable < size, this means that the minimum image size is equal to saveable
rather than 0.  This shouldn't happen, though.]

Vefa, can you please test the patch below with and without the
patch at http://lkml.org/lkml/2010/9/5/86 (please don't try to change
/sys/power/image_size yet)?

Thanks,
Rafael


---
 kernel/power/snapshot.c |   75 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 55 insertions(+), 20 deletions(-)

Index: ...
From: M. Vefa Bicakci
Date: Wednesday, September 8, 2010 - 5:56 am

Dear Rafael Wysocki,

I applied the patch below to a clean 2.6.35.4 tree and tested 6 hibernate/thaw
cycles consecutively. I am happy to report that it works properly.

Then I applied the patch at http://lkml.org/lkml/2010/9/5/86 (the "vmscan.c
patch") on top of the tree I used above, and I also ran 6 hibernate/thaw
cycles. Again, I am happy to report that this combination of patches also
works properly.

I should note a few things though,

1) I don't think I ever changed /sys/power/image_size, so we can rule out the
possibility of that option changing the results.

2) With the patch below, for the *first* hibernation operation, the computer
enters a "thoughtful" state without any disk activity for 6-8 (maybe 10)
seconds after printing "Preallocating image memory". It works properly after
the wait however.

3) For some reason, with the patch below by itself, or in combination with the
above-mentioned vmscan.c patch, I haven't seen any page allocation errors
regarding the iwl3945 driver. To be honest I am not sure why this change
occurred, but I think you might know.

4) I made sure that I was not being impatient with the previous snapshot.c
patch, so I tested that on its own once again, and I confirmed that hibernation
hangs with the older version of the snapshot.c patch.

I am very happy that we are getting closer to a solution. Please let me know
if there is anything I need to test further.

Regards,


--

From: Rafael J. Wysocki
Date: Wednesday, September 8, 2010 - 2:34 pm

That probably is a result of spending time in the memory allocator trying to

I think we just keep enough free pages in the normal zone all the time for the

Below is the patch I'd like to apply.  It should work just like the previous
one (there are a few fixes that shouldn't affect the functionality in it), but
please test it if you can.

I think the slowdown you saw in 2) may be eliminated by increasing the
image_size value, so I'm going to prepare a patch that will compute the
value automatically during boot so that it's approximately 50% of RAM.

Thanks,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM / Hibernate: Avoid hitting OOM during preallocation of memory

There is a problem in hibernate_preallocate_memory() that it calls
preallocate_image_memory() with an argument that may be greater than
the total number of available non-highmem memory pages.  If that's
the case, the OOM condition is guaranteed to trigger, which in turn
can cause significant slowdown to occur during hibernation.

To avoid that, make preallocate_image_memory() adjust its argument
before calling preallocate_image_pages(), so that the total number of
saveable non-highmem pages left is not less than the minimum size of
a hibernation image.  Change hibernate_preallocate_memory() to try to
allocate from highmem if the number of pages allocated by
preallocate_image_memory() is too low.

Modify free_unnecessary_pages() to take all possible memory
allocation patterns into account.

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 kernel/power/snapshot.c |   85 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 65 insertions(+), 20 deletions(-)

Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -1122,9 +1122,19 @@ static unsigned long preallocate_image_p
 ...
From: M. Vefa Bicakci
Date: Saturday, September 11, 2010 - 11:12 am

Hello,

Sorry for the late reply. I have been busy the past few days.


It contains 524288000, so I think it is set to 500 MB. I believe that this is

I am not sure if this is new with the latest patch, but the behavior
seems to continue with the later hibernation operations too, not just the
first one. I haven't confirmed whether I simply failed to notice the problem
with the previous version of the patch, but that is quite possible, since I
used to automate my tests. (I didn't automate my tests this time.)

However, considering that the kernel needs to worry about compacting 1500 MB
of data when hibernating with my tmpfs-is-full system, I guess these wait

I am happy to report that it works properly by itself when applied to
a clean 2.6.35.4 tree. I haven't had any problems (aside from the "thoughtful

I would be glad to test that patch as well, to see if it brings speed-ups.
Actually, I might test hibernation with a larger value written to

I really appreciate your help. Thanks a lot!

--

From: Rafael J. Wysocki
Date: Saturday, September 11, 2010 - 12:06 pm

I think that would improve things, as it probably is impossible to reduce the
image size to 500 MB on your system.

Anyway, I'll let you know when the patch is ready.

Thanks,
Rafael
--


OK, please try the patch below on top of the previous one and see if it makes
hibernation run faster on your system.

Thanks,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM / Hibernate: Make default image size depend on total RAM size

The default hibernation image size is currently hard coded and equal
to 500 MB, which is not a reasonable default on many contemporary
systems.  Make it equal 2/5 of the total RAM size (this is slightly
below the maximum, i.e. 1/2 of the total RAM size, and seems to be
generally suitable).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 Documentation/power/interface.txt |    2 +-
 kernel/power/main.c               |    1 +
 kernel/power/power.h              |    9 ++++++++-
 kernel/power/snapshot.c           |    7 ++++++-
 4 files changed, 16 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -46,7 +46,12 @@ static void swsusp_unset_page_forbidden(
  * size will not exceed N bytes, but if that is impossible, it will
  * try to create the smallest image possible.
  */
-unsigned long image_size = 500 * 1024 * 1024;
+unsigned long image_size;
+
+void __init hibernate_image_size_init(void)
+{
+	image_size = ((totalram_pages * 2) / 5) * PAGE_SIZE;
+}
 
 /* List of PBEs needed for restoring the pages that were allocated before
  * the suspend and included in the suspend image, but have also been
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -14,6 +14,9 @@ struct swsusp_info {
 } __attribute__((aligned(PAGE_SIZE)));
 
 #ifdef CONFIG_HIBERNATION
+/* kernel/power/snapshot.c */
+extern void __init hibernate_image_size_init(void);
+
 #ifdef CONFIG_ARCH_HIBERNATION_HEADER
 /* Maximum size of architecture ...
From: M. Vefa Bicakci
Date: Monday, September 13, 2010 - 8:40 am

Dear Rafael Wysocki,

I think I have good news. I took a clean 2.6.35.4 tree, and first applied
the latest version of your larger snapshot.c patch, and then the patch you
appended to your final e-mail in this thread.

Here is a comparison of the timings from a kernel without your patch, and
one with it.

=== 8< ===
Sep 11 10:22:24 debian kernel: [  499.968989] PM: Allocated 2531300 kbytes in 52.66 seconds (48.06 MB/s)
Sep 11 10:44:08 debian kernel: [  764.379131] PM: Allocated 2531308 kbytes in 143.41 seconds (17.65 MB/s)
Sep 11 10:48:41 debian kernel: [  920.626386] PM: Allocated 2531300 kbytes in 66.44 seconds (38.09 MB/s)
Sep 11 10:53:37 debian kernel: [ 1092.919140] PM: Allocated 2531316 kbytes in 81.28 seconds (31.14 MB/s)
...
Sep 13 01:26:09 debian kernel: [   94.948054] PM: Allocated 1804008 kbytes in 28.72 seconds (62.81 MB/s)
Sep 13 01:29:58 debian kernel: [  176.678880] PM: Allocated 1803992 kbytes in 34.44 seconds (52.38 MB/s)
Sep 13 01:33:48 debian kernel: [  253.336405] PM: Allocated 1804000 kbytes in 27.35 seconds (65.95 MB/s)
=== >8 ===

I didn't have your latest patch applied on September 11, and it was applied
last night.

It looks like there is a good improvement. I think the data rates look
faster on Sept. 13 because the kernel spent less time "thinking" while
compacting the memory image. (I don't think I have changed anything
in my configuration that could affect the data rates that much.)

Is it possible to have these patches applied to the 2.6.35 tree so that
the regression I reported is fixed? Should I e-mail Greg Kroah-Hartman
about this?

Once again, thanks a lot to you, Kosaki Motohiro and Minchan Kim!

--

From: Rafael J. Wysocki
Date: Monday, September 13, 2010 - 10:52 am

The "snapshot.c" patch has just been merged into Linus' tree as


and I've already told Greg that it should go into 2.6.35.y.

The second patch, however, only changes the default value of image_size, so it
is not -stable material.

As a workaround, you can change the init scripts on your system to set
/sys/power/image_size to the same value that's in it when the second patch is
applied.

Thanks,
Rafael
--

From: Rafael J. Wysocki
Date: Monday, September 6, 2010 - 11:46 am

OK, I'll put it into my linux-next branch, then.

Probably, though, I should modify the changelog, because what it really does
is to check if it makes sense to try to allocate from non-highmem pages, but it
doesn't really prevent the OOM from occurring.

Thanks,
Rafael
--

From: Rafael J. Wysocki
Date: Monday, September 6, 2010 - 12:54 pm

For completeness, below is the patch with the new changelog.

Thanks,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM / Hibernate: Avoid hitting OOM during preallocation of memory

There is a problem in hibernate_preallocate_memory() that it calls
preallocate_image_memory() with an argument that may be greater than
the total number of non-highmem memory pages that haven't been
already preallocated.  If that's the case, the OOM condition is
guaranteed to trigger, which in turn can cause significant slowdown
to occur.

To avoid that, make preallocate_image_memory() adjust its argument
before calling preallocate_image_pages(), so that it doesn't exceed
the number of non-highmem pages that weren't preallocated previously.
Change hibernate_preallocate_memory() to try to allocate from highmem
if the number of pages allocated by preallocate_image_memory() is too
low.  Modify free_unnecessary_pages() to take all possible memory
allocation patterns into account.

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 kernel/power/snapshot.c |   66 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 46 insertions(+), 20 deletions(-)

Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -1122,9 +1122,19 @@ static unsigned long preallocate_image_p
 	return nr_alloc;
 }
 
-static unsigned long preallocate_image_memory(unsigned long nr_pages)
+static unsigned long preallocate_image_memory(unsigned long nr_pages,
+					      unsigned long size_normal)
 {
-	return preallocate_image_pages(nr_pages, GFP_IMAGE);
+	unsigned long alloc;
+
+	if (size_normal <= alloc_normal)
+		return 0;
+
+	alloc = size_normal - alloc_normal;
+	if (nr_pages < alloc)
+		alloc = nr_pages;
+
+	return preallocate_image_pages(alloc, GFP_IMAGE);
 }
 
 #ifdef ...