Re: [PATCH 1/3] Add flag to identify block swap devices

Previous thread: Re: [PATCH] permit to online CPUs before local memory comes online by minskey guo on Friday, May 7, 2010 - 12:19 am. (1 message)

Next thread: [git pull] more PCMCIA bugfixes for 2.6.34-rc6 by Dominik Brodowski on Friday, May 7, 2010 - 12:44 am. (1 message)
From: Nitin Gupta
Date: Friday, May 7, 2010 - 12:25 am

(tested on mainline but should apply to linux-next cleanly)

 * Changelog: v2 vs initial patches
   - directly add swap free callback to block_device_operations
     instead of using 'notifiers' for various swap events.

ramzswap driver creates RAM based block devices which can be
used (only) as swap disks. Pages swapped to these disks are
compressed and stored in memory itself.

However, these devices do not get any notification when a swap
slot is freed (swap_map[i] reaches 0). So, we cannot free memory
allocated corresponding to this swap slot. Such stale data can
quickly accumulate in (compressed) memory defeating the whole
purpose of such devices.

To overcome this problem, we now add a callback in struct
block_device_operations which is called as soon as a swap
slot is freed.

Nitin Gupta (3):
  Add flag to identify block swap devices
  Add swap slot free callback to block_device_operations
  ramzswap: Handler for swap slot free callback

 drivers/staging/ramzswap/TODO           |    5 -----
 drivers/staging/ramzswap/ramzswap_drv.c |   22 +++++++++++++---------
 include/linux/blkdev.h                  |    2 ++
 include/linux/swap.h                    |    1 +
 mm/swapfile.c                           |    5 +++++
 5 files changed, 21 insertions(+), 14 deletions(-)
 delete mode 100644 drivers/staging/ramzswap/TODO

--

From: Nitin Gupta
Date: Friday, May 7, 2010 - 12:25 am

Added SWP_BLKDEV flag to distinguish block and regular file backed
swap devices. We could also check if a swap is entire block device,
rather than a file, by:
S_ISBLK(swap_info_struct->swap_file->f_mapping->host->i_mode)
but, I think, simply checking this flag is more convenient.

Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---
 include/linux/swap.h |    1 +
 mm/swapfile.c        |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1f59d93..ec2b7a4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -146,6 +146,7 @@ enum {
 	SWP_DISCARDING	= (1 << 3),	/* now discarding a free cluster */
 	SWP_SOLIDSTATE	= (1 << 4),	/* blkdev seeks are cheap */
 	SWP_CONTINUED	= (1 << 5),	/* swap_map has count continuation */
+	SWP_BLKDEV	= (1 << 6),	/* its a block device */
 					/* add others here before... */
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6cd0a8f..ecb069e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1884,6 +1884,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		if (error < 0)
 			goto bad_swap;
 		p->bdev = bdev;
+		p->flags |= SWP_BLKDEV;
 	} else if (S_ISREG(inode->i_mode)) {
 		p->bdev = inode->i_sb->s_bdev;
 		mutex_lock(&inode->i_mutex);
-- 
1.6.6.1

--

From: jassi brar
Date: Friday, May 7, 2010 - 1:03 am

This might make it convenient for now but is likely to increase complexity and
redundancy. Why not define a macro/inline to figure that out?
--

From: Nitin Gupta
Date: Friday, May 7, 2010 - 1:16 am

Accessing such long pointer chain is maybe not good thing to do for
every swap_entry_free() call? Simple checking a flag is perhaps slightly
faster? I also can't see how this flag can later increase complexity
compared to creating new macro for this check.

Thanks,
Nitin
--

From: jassi brar
Date: Friday, May 7, 2010 - 1:56 am

I am not sure about effects on the speed, that needs to be seen.
 But here are points against using a new flag....

a) Defining a new flag hogs up precious real-estate of remaining just 2 bits
   (apparently the new flags are to go before SWP_SCANNING)

b) Over time, some dependent code might evolve to depend upon this flag, while
     others might use the indirection to deduct if it is block device
or not. That
    will make it complicated to make any relevant change in future as we'll have
    to keep track of more than one way to check the status. Creating a new
   macro doesn't create a new path to reach the status, but a FLAG does.
--

From: Pekka Enberg
Date: Friday, May 7, 2010 - 2:24 am

I think the flag is cleaner and I'm not convinced by the above
speculative arguments that we should switch to a static inline
function.

Hugh, Linus, would you mind voicing your opinion which way to go here?
--

From: Nigel Cunningham
Date: Friday, May 7, 2010 - 2:32 am

Hi again.


The more I read your patches, the more I think either I'm seriously 
confused (entirely possible!) or you are.

Don't you want to distinguish RAM backed swap from swap that's either a 
partition or a file? If that's the case, you should also be setting 
SWP_BLKDEV in the S_ISREG part that follows iff the p->bdev is a regular 
file.

Regards,

Nigel
--

From: Nitin Gupta
Date: Friday, May 7, 2010 - 12:25 am

This callback is required when RAM based devices are used as swap disks.
One such device is ramzswap which is used as compressed in-memory swap
disk.  For such devices, we need a callback as soon as a swap slot is no
longer used to allow freeing memory allocated for this slot.  Without this
callback, stale data can quickly accumulate in memory defeating the whole
purpose of such devices.

Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---
 include/linux/blkdev.h |    2 ++
 mm/swapfile.c          |    4 ++++
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6690e8b..413284a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1287,6 +1287,8 @@ struct block_device_operations {
 						unsigned long long);
 	int (*revalidate_disk) (struct gendisk *);
 	int (*getgeo)(struct block_device *, struct hd_geometry *);
+	/* this callback is with swap_lock and sometimes page table lock held */
+	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
 	struct module *owner;
 };
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ecb069e..f5ccc47 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -574,6 +574,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 
 	/* free if no reference */
 	if (!usage) {
+		struct gendisk *disk = p->bdev->bd_disk;
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
 		if (offset > p->highest_bit)
@@ -583,6 +584,9 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 			swap_list.next = p->type;
 		nr_swap_pages++;
 		p->inuse_pages--;
+		if ((p->flags & SWP_BLKDEV) &&
+				disk->fops->swap_slot_free_notify)
+			disk->fops->swap_slot_free_notify(p->bdev, offset);
 	}
 
 	return usage;
-- 
1.6.6.1

--

From: Nigel Cunningham
Date: Friday, May 7, 2010 - 2:22 am

Hi.


Is this p->flags & SWP_BLKDEV logic reversed? (Don't you want the 
notifier called for devices that aren't backed by a block device?)

I also wonder whether leaving the p->flags & SWP_BLKDEV part out might 
be a good idea. Other potential notifier users?

Regards,

Nigel
--

From: Nitin Gupta
Date: Friday, May 7, 2010 - 2:48 am

No, the logic here is correct: ramzswap is a block device for which
we want this callback. Though its a RAM backed, it is still a block
device.


For regular files, 'offset' used here makes little sense. For block devices,
its simply offset in real device. Also, I doubt if *files* would ever
like to have such a callback.

Thanks,
Nitin
--

From: Nigel Cunningham
Date: Friday, May 7, 2010 - 3:40 am

Hi.



Okay.

Entire series Acked-by: Nigel Cunningham <nigel@tuxonice.net>

Thanks.

Nigel
--

From: Nitin Gupta
Date: Friday, May 7, 2010 - 12:25 am

Install handler for swap_slot_free_notify callback which is called
when a swap slot is no longer used. This handler immediately frees
memory allocated corresponding to the given swap slot.

Signed-off-by: Nitin Gupta
---
 drivers/staging/ramzswap/TODO           |    5 -----
 drivers/staging/ramzswap/ramzswap_drv.c |   22 +++++++++++++---------
 2 files changed, 13 insertions(+), 14 deletions(-)
 delete mode 100644 drivers/staging/ramzswap/TODO

diff --git a/drivers/staging/ramzswap/TODO b/drivers/staging/ramzswap/TODO
deleted file mode 100644
index 8d64e28..0000000
--- a/drivers/staging/ramzswap/TODO
+++ /dev/null
@@ -1,5 +0,0 @@
-TODO:
-	- Add support for swap notifiers
-
-Please send patches to Greg Kroah-Hartman <greg@kroah.com> and
-Nitin Gupta <ngupta@vflare.org>
diff --git a/drivers/staging/ramzswap/ramzswap_drv.c b/drivers/staging/ramzswap/ramzswap_drv.c
index ee5eb12..ab15276 100644
--- a/drivers/staging/ramzswap/ramzswap_drv.c
+++ b/drivers/staging/ramzswap/ramzswap_drv.c
@@ -795,14 +795,6 @@ static int ramzswap_write(struct ramzswap *rzs, struct bio *bio)
 
 	src = rzs->compress_buffer;
 
-	/*
-	 * System swaps to same sector again when the stored page
-	 * is no longer referenced by any process. So, its now safe
-	 * to free the memory that was allocated for this page.
-	 */
-	if (rzs->table[index].page || rzs_test_flag(rzs, index, RZS_ZERO))
-		ramzswap_free_page(rzs, index);
-
 	mutex_lock(&rzs->lock);
 
 	user_mem = kmap_atomic(page, KM_USER0);
@@ -1295,9 +1287,21 @@ out:
 	return ret;
 }
 
+void ramzswap_slot_free_notify(struct block_device *bdev, unsigned long index)
+{
+	struct ramzswap *rzs;
+
+	rzs = bdev->bd_disk->private_data;
+	ramzswap_free_page(rzs, index);
+	rzs_stat64_inc(rzs, &rzs->stats.notify_free);
+
+	return;
+}
+
 static struct block_device_operations ramzswap_devops = {
 	.ioctl = ramzswap_ioctl,
-	.owner = THIS_MODULE,
+	.swap_slot_free_notify = ramzswap_slot_free_notify,
+	.owner = THIS_MODULE
 };
 
 static int ...
From: Pekka Enberg
Date: Friday, May 7, 2010 - 12:44 am

The series looks good to me:

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
--

From: Linus Torvalds
Date: Friday, May 7, 2010 - 7:51 am

Yeah, I think I'm finally ok with it too.

	Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

Thanks,

		Linus
--

From: Andrew Morton
Date: Friday, May 7, 2010 - 12:55 pm

On Fri,  7 May 2010 12:55:04 +0530

I didn't even realise that ramzswap had snuck in via the staging
backdoor.


<hasty review>

Looking at the changelogs I'm seeing no information about the
effectiveness of ramzswap - how much memory it saves.  As that's the
entire point of the driver, that would be a rather important thing to
have included in the commit comments.  We cannot make the decision to
merge ramzswap without this info.

The various lengthy pr_info() messages could do with some attention
from a native English speaker (sorry).

setup_swap_header() should use kmap_atomic().


The driver appears to be controlled by some nasty-looking ioctl against
some fd.  None of it is documented anywhere.  It should be.  You're
proposing here a permanent extension to the kernel ABI which we will
need to maintain for ever.  That's a big deal and it is the very first
thing reviewers will look at, before even considering the code.

The compat implications of the ioctl args will need to be examined. 

RZSIO_GET_STATS looks to be hopeless from a long-term maintainability
POV.  It's debug code and it would be better to move it into a debugfs
file, where we can then add and remove things at will.

The code would benefit from a lot more comments.  For example, take
add_backing_swap_extent().  What is its role in the overall operation
of the driver?  What are its locking preconditions?  The meaning of its
return value?  Generally, aim to be telling the reader the things whcih
he needs to know and which aren't obvious from the code itself.

add_backing_swap_extent() does an alloc_page(__GFP_ZERO).  That means
it's GFP_ATOMIC and not __GFP_HIGH.  But afacit it could have used the
much stronger GFP_KERNEL.  Was the lakc of GFP_KERNEL deliberate? 
There's no way for me to tell due to the lack of comments.  If it _was_
deliberate then there should be a comment telling me why.  If it wasn't
deliberate then, well, bug.

ramzswap_drv.h could use the __aligned() macro rather than ...
From: Nitin Gupta
Date: Friday, May 7, 2010 - 9:05 pm

Documentation (drivers/staging/ramzswap/ramzswap.txt) points to the project
home page which has lots of performance numbers -- both positive and
negative cases.

Lot of your points are regarding lack of documentation and code comments.
Instead of replying to all such points individually, let me first send

ramzswap.txt points to 'rzscontrol' manpage where the effect of each of
these ioctls is documented:
http://compcache.googlecode.com/hg/sub-projects/rzscontrol/man/rzscontrol.html



You cannot use /dev/ramzswap{0,1,2...} devices for anything other than
swap devices since they can only handle page-aligned I/O requests.
This restriction highly simplifies the code as handling compression/
decompression for sub-page requests is hard and wasteful.

If someone needs a generic in-memory compressed block device, they


I will soon send patches for adding all these code comments. I hope that
will make code more reviewable.


In the meantime, would it be possible to commit these swap free notify
patches?


Thanks for your detailed comments.
Nitin

--

From: Pekka Enberg
Date: Friday, May 7, 2010 - 11:29 pm

Hi Andrew,


There's some benchmarks at ramzswap pages:

http://code.google.com/p/compcache/wiki/Performance

http://code.google.com/p/compcache/wiki/SwapDiskVsRamz

[ snip bunch of comments from Andrew that need to be addressed,



We need it because the slab allocator is not a good fit for this special 
purpose driver due to fragmentation. Nitin, you had a nice web page 
showing all the relevant numbers but I can't find it anymore.

Andrew, FWIW, I'm ok with xvmalloc() for this particular driver. There 
was some discussion on making it more generic but I don't see it as a 

I agree that the whole graduation step from staging to kernel proper is 
not well-defined. Any suggestions? That said, I hope that doesn't stop 
us from merging this patch series because the lack of notifiers cripples 
the current ramzswap performance.

			Pekka
--

From: Pekka Enberg
Date: Saturday, May 8, 2010 - 12:26 am

I don't see the xvmalloc vs. kmalloc fragmentation numbers there. I 
thought you had some?
--

From: Nitin Gupta
Date: Saturday, May 8, 2010 - 12:32 am

TLSF vs kmalloc (SLUB):
http://code.google.com/p/compcache/wiki/AllocatorsComparison

By design, xvmalloc is very similar to TLSF. In fact, xvmalloc has less
metadata overhead than TLSF. So, we will surely get similar results for
xvmalloc vs kmalloc comparison too.

Thanks,
Nitin

--

From: Pekka Enberg
Date: Saturday, May 8, 2010 - 12:05 am

Another option would be sysfs if you want the stats to always be available.
--

Previous thread: Re: [PATCH] permit to online CPUs before local memory comes online by minskey guo on Friday, May 7, 2010 - 12:19 am. (1 message)

Next thread: [git pull] more PCMCIA bugfixes for 2.6.34-rc6 by Dominik Brodowski on Friday, May 7, 2010 - 12:44 am. (1 message)