Hello, the nasty memory-corrupting hibernation bug https://bugzilla.kernel.org/show_bug.cgi?id=15753 is back since 2.6.35.5. 2.6.35.4 works fine, 2.6.35.5 crashes after two days. It seems to be caused by b77c254d8d66e5e9aa81239fedba9f3d568097d9. -- Ondrej Zary --
Can you, _please_, always put all of the basic commit info (subject, author, who signed it off) in bug reports? In this particular case I don't have a -stable tree handy at the moment and can't even see what the commit is. Thanks, Rafael --
On Wed, 17 Nov 2010 21:53:52 +0100
You can simply do a google search for the commit ID. It is
commit b77c254d8d66e5e9aa81239fedba9f3d568097d9
Author: Hugh Dickins <hughd@google.com>
Date: Thu Sep 9 16:38:09 2010 -0700
swap: prevent reuse during hibernation
--
So I guess we should Cc Hugh ... --
Oops, my bad. I was in hurry and somehow missed that he's not on the Cc list. -- Ondrej Zary --
Embarrassing: I suspect that I've been confused, not for the first
time, by the fork-like nature of hibernation and its images.
I wonder if this patch below fixes it, Ondrej?
(And is it kernel swsusp or user swsusp that you're using? May not
matter at all, but will help us to think more clearly about it,
if the corruption remains after this patch.)
Rafael, do you agree that this patch was actually required even for
your original commit 452aa6999e6703ffbddd7f6ea124d3968915f3e3
mm/pm: force GFP_NOIO during suspend/hibernation and resume?
Or am I still just as confused? Or if not, are there more forking
places which require a similar patch?
Not signing it off yet,
Hugh
---
kernel/power/hibernate.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
--- 2.6.37-rc2/kernel/power/hibernate.c 2010-11-01 13:01:31.000000000 -0700
+++ linux/kernel/power/hibernate.c 2010-11-17 15:23:36.000000000 -0800
@@ -348,16 +348,19 @@ int hibernation_snapshot(int platform_mo
goto Recover_platform;
error = create_image(platform_mode);
- /* Control returns here after successful restore */
+ /* Control returns here after successful restore,
+ * and also after creating the image in memory (or failing to do so).
+ */
Resume_devices:
/* We may need to release the preallocated image pages here. */
- if (error || !in_suspend)
+ if (error || !in_suspend) {
swsusp_free();
+ set_gfp_allowed_mask(saved_mask);
+ }
dpm_resume_end(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
- set_gfp_allowed_mask(saved_mask);
resume_console();
Close:
platform_end(platform_mode);
--
It seems to work with this patch. 2nd day and no crashes yet. 2.6.35.4 is OK 2.6.35.4 + b77c254d8d66e5e9aa81239fedba9f3d568097d9 is bad -- Ondrej Zary --
Can you check if the problem is also fixed by the patch below, please?
Rafael
---
kernel/power/hibernate.c | 36 ++++++++++++++++++++++++++----------
kernel/power/power.h | 1 +
kernel/power/user.c | 1 +
3 files changed, 28 insertions(+), 10 deletions(-)
Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -29,6 +29,21 @@
#include "power.h"
+static gfp_t saved_gfp_mask;
+
+static void hibernate_restrict_gfp_mask(void)
+{
+ saved_gfp_mask = clear_gfp_allowed_mask(GFP_IOFS);
+}
+
+void hibernate_restore_gfp_mask(void)
+{
+ if (saved_gfp_mask) {
+ set_gfp_allowed_mask(saved_gfp_mask);
+ saved_gfp_mask = 0;
+ }
+}
+
static int nocompress = 0;
static int noresume = 0;
static char resume_file[256] = CONFIG_PM_STD_PARTITION;
@@ -327,7 +342,6 @@ static int create_image(int platform_mod
int hibernation_snapshot(int platform_mode)
{
int error;
- gfp_t saved_mask;
error = platform_begin(platform_mode);
if (error)
@@ -339,7 +353,7 @@ int hibernation_snapshot(int platform_mo
goto Close;
suspend_console();
- saved_mask = clear_gfp_allowed_mask(GFP_IOFS);
+ hibernate_restrict_gfp_mask();
error = dpm_suspend_start(PMSG_FREEZE);
if (error)
goto Recover_platform;
@@ -348,7 +362,10 @@ int hibernation_snapshot(int platform_mo
goto Recover_platform;
error = create_image(platform_mode);
- /* Control returns here after successful restore */
+ /*
+ * Control returns here (1) after the image has been created or the
+ * image creation has failed and (2) after a successful restore.
+ */
Resume_devices:
/* We may need to release the preallocated image pages here. */
@@ -357,7 +374,10 @@ int hibernation_snapshot(int platform_mo
dpm_resume_end(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
- set_gfp_allowed_mask(saved_mask);
+
+ if ...The patch does not apply to 2.6.35.4, 2.6.35.5 and also 2.6.36. What version -- Ondrej Zary --
The patch was against the current mainline.
The one below was rebased on top of 2.6.36, so please test it with this kernel.
Thanks,
Rafael
---
kernel/power/hibernate.c | 36 ++++++++++++++++++++++++++----------
kernel/power/power.h | 1 +
kernel/power/user.c | 2 ++
3 files changed, 29 insertions(+), 10 deletions(-)
Index: suspend-2.6/kernel/power/hibernate.c
===================================================================
--- suspend-2.6.orig/kernel/power/hibernate.c
+++ suspend-2.6/kernel/power/hibernate.c
@@ -29,6 +29,21 @@
#include "power.h"
+static gfp_t saved_gfp_mask;
+
+static void hibernate_restrict_gfp_mask(void)
+{
+ saved_gfp_mask = clear_gfp_allowed_mask(GFP_IOFS);
+}
+
+void hibernate_restore_gfp_mask(void)
+{
+ if (saved_gfp_mask) {
+ set_gfp_allowed_mask(saved_gfp_mask);
+ saved_gfp_mask = 0;
+ }
+}
+
static int noresume = 0;
static char resume_file[256] = CONFIG_PM_STD_PARTITION;
dev_t swsusp_resume_device;
@@ -326,7 +341,6 @@ static int create_image(int platform_mod
int hibernation_snapshot(int platform_mode)
{
int error;
- gfp_t saved_mask;
error = platform_begin(platform_mode);
if (error)
@@ -338,7 +352,7 @@ int hibernation_snapshot(int platform_mo
goto Close;
suspend_console();
- saved_mask = clear_gfp_allowed_mask(GFP_IOFS);
+ hibernate_restrict_gfp_mask();
error = dpm_suspend_start(PMSG_FREEZE);
if (error)
goto Recover_platform;
@@ -347,7 +361,10 @@ int hibernation_snapshot(int platform_mo
goto Recover_platform;
error = create_image(platform_mode);
- /* Control returns here after successful restore */
+ /*
+ * Control returns here (1) after the image has been created or the
+ * image creation has failed and (2) after a successful restore.
+ */
Resume_devices:
/* We may need to release the preallocated image pages here. */
@@ -356,7 +373,10 @@ int hibernation_snapshot(int platform_mo
dpm_resume_end(in_suspend ?
(error ? PMSG_RECOVER : ...Sorry for going quiet for so long. Thanks for testing, Ondrej. I completely agree with Rafael that my patch was inadequate: it turned out to be good enough for your useful feedback on the main path through hibernation, but missed a number of details. We do need something like Rafael's patch - thanks. (And I realize that it's tiresome and frustrating for you to be trying first that and then this etc; but we had bugs here precisely because this is a confusing area, so I think it is worth some effort Trivial point, I suppose, but it bothers me that PM is accumulating wrappers around wrappers around gfp_allowed_mask. Looks like clear_gfp_allowed_mask and set_gfp_allowed_mask (oddly asymmetrical) were not really what we need. How about scrapping them, and putting I'm worried that it's hard to find and maintain the places that need this restoration - and if we miss one, we won't find out about it for ages. It would help a lot if the gfp restoration always accompanies some other essential stage - thaw_processes() looks to be right, so could we skip conditionally restoring here if we do it then? I suggest that in part because I cannot find where you restore in the case that hibernate()'s swsusp_write() fails - that was the But this could be left until software_resume()'s thaw_processes()? Right, here you have one next to thaw_processes(). But the SNAPSHOT ioctls are a nightmarish maze to me, That might be good safety, but it does strongly suggest that you Hugh --
Good point. I'm not totally sure if coupling this with thaw_processes() is the right thing yet, but I guess it is, except for the s2disk-specific things Well, sorry about that. A few of them should just be dropped, but they are lingering so that oldish user space works with the new kernels. This is because of how s2disk works. If it fails to allocate enough swap, it tries again with minimum image size without thawing the other tasks. Now, because hibernation_snapshot() saves the "original" GFP mask, it has Thanks, Rafael --
Below is an updated patch in which I tried to address your comments.
I didn't find it very useful to couple pm_restore_gfp_mask() with the thawing
of tasks, but nevertheless I think all of the spots where it's needed are
covered now.
The patch has only been compile-tested for now, so caveat emptor.
Thanks,
Rafael
---
include/linux/gfp.h | 4 ++--
kernel/power/hibernate.c | 22 ++++++++++++----------
kernel/power/suspend.c | 5 ++---
kernel/power/user.c | 2 ++
mm/page_alloc.c | 18 +++++++++++-------
5 files changed, 29 insertions(+), 22 deletions(-)
Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -327,7 +327,6 @@ static int create_image(int platform_mod
int hibernation_snapshot(int platform_mode)
{
int error;
- gfp_t saved_mask;
error = platform_begin(platform_mode);
if (error)
@@ -339,7 +338,7 @@ int hibernation_snapshot(int platform_mo
goto Close;
suspend_console();
- saved_mask = clear_gfp_allowed_mask(GFP_IOFS);
+ pm_restrict_gfp_mask();
error = dpm_suspend_start(PMSG_FREEZE);
if (error)
goto Recover_platform;
@@ -348,7 +347,10 @@ int hibernation_snapshot(int platform_mo
goto Recover_platform;
error = create_image(platform_mode);
- /* Control returns here after successful restore */
+ /*
+ * Control returns here (1) after the image has been created or the
+ * image creation has failed and (2) after a successful restore.
+ */
Resume_devices:
/* We may need to release the preallocated image pages here. */
@@ -357,7 +359,10 @@ int hibernation_snapshot(int platform_mo
dpm_resume_end(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
- set_gfp_allowed_mask(saved_mask);
+
+ if (error || !in_suspend)
+ pm_restore_gfp_mask();
+
resume_console();
Close:
platform_end(platform_mode);
@@ -452,17 ...On Wed, 1 Dec 2010 01:38:53 +0100 if (atomic_dec_return(&gfp_mask_save_mode_counter)) if (atomic_inc_return(&gfp_mask_save_mode_counter) > 1) WARN_ONCE() or some ? Thanks, -Kame --
What exactly would that be useful for? Rafael --
Please note that pm_restore_gfp_mask() can be legitimately called before pm_restrict_gfp_mask() via the SNAPSHOT_CREATE_IMAGE hibernate ioctl, so the test you're suggesting wouldn't really work. Thanks, Rafael --
On Wed, 1 Dec 2010 23:23:31 +0100 Hm, I just wonder some tests not for breaking gfp_allowed_mask by - double call - forget to restore That will be fatal. I'm not very interested in implementation detail Thanks, -Kame --
That's easy. It's sufficient to do WARN_ON(saved_gfp_mask) at the beginning of pm_restrict_gfp_mask(). This also takes care of the "forget to restore" case to some extent, because pm_restrict_gfp_mask() will generally be only This is kind of difficult in general. You can't detect a missing pm_restore_gfp_mask() other than by checking if pm_restrict_gfp_mask() is not Well ... Thanks, Rafael --
So far as I can see, yes, they are: I'm still a bit worried about future changes leaving the wrong gfp mask behind by mistake; but I trust you more than I trust me on that, so you can add my Acked-by: Hugh Dickins <hughd@google.com> when it comes to sending in the patch. Thanks, --
Thanks a lot! Andrew, are the changes in page_alloc.c fine from your standpoint? --
(Sorry for the late response).
Well, not exactly.
First, IMO set_gfp_allowed_mask(saved_mask) should not be called before
dpm_resume_end(), because IO allocations may try to use suspended devices in
theory (although that's not likely due to the swsusp_free() before). So the
right thing to do appears to be:
if (error || !in_suspend) {
swsusp_free();
dpm_resume_end(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
if (error || !in_suspend)
set_gfp_allowed_mask(saved_mask);
resume_console();
Second, since we don't call set_gfp_allowed_mask(saved_mask) in the
(in_suspend && !error) case, hibernation_platform_enter() also should be
updated (I think we don't need to use set_gfp_allowed_mask() in it at all).
There's one more subtlety. Namely, the saving of an image by s2disk
may be aborted and in that case we need to restore the original
gfp_allowed_mask too, but I need to look deeper into the code to see how to do
it cleanly.
Thanks,
Rafael
--
