(tested on mainline but should apply to linux-next cleanly)
* Changelog: v2 vs initial patches
- directly add swap free callback to block_device_operations
instead of using 'notifiers' for various swap events.
ramzswap driver creates RAM based block devices which can be
used (only) as swap disks. Pages swapped to these disks are
compressed and stored in memory itself.
However, these devices do not get any notification when a swap
slot is freed (swap_map[i] reaches 0). So, we cannot free memory
allocated corresponding to this swap slot. Such stale data can
quickly accumulate in (compressed) memory defeating the whole
purpose of such devices.
To overcome this problem, we now add a callback in struct
block_device_operations which is called as soon as a swap
slot is freed.
Nitin Gupta (3):
Add flag to identify block swap devices
Add swap slot free callback to block_device_operations
ramzswap: Handler for swap slot free callback
drivers/staging/ramzswap/TODO | 5 -----
drivers/staging/ramzswap/ramzswap_drv.c | 22 +++++++++++++---------
include/linux/blkdev.h | 2 ++
include/linux/swap.h | 1 +
mm/swapfile.c | 5 +++++
5 files changed, 21 insertions(+), 14 deletions(-)
delete mode 100644 drivers/staging/ramzswap/TODO
--
Added SWP_BLKDEV flag to distinguish block and regular file backed
swap devices. We could also check if a swap is entire block device,
rather than a file, by:
S_ISBLK(swap_info_struct->swap_file->f_mapping->host->i_mode)
but, I think, simply checking this flag is more convenient.
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---
include/linux/swap.h | 1 +
mm/swapfile.c | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1f59d93..ec2b7a4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -146,6 +146,7 @@ enum {
SWP_DISCARDING = (1 << 3), /* now discarding a free cluster */
SWP_SOLIDSTATE = (1 << 4), /* blkdev seeks are cheap */
SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */
+ SWP_BLKDEV = (1 << 6), /* its a block device */
/* add others here before... */
SWP_SCANNING = (1 << 8), /* refcount in scan_swap_map */
};
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6cd0a8f..ecb069e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1884,6 +1884,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (error < 0)
goto bad_swap;
p->bdev = bdev;
+ p->flags |= SWP_BLKDEV;
} else if (S_ISREG(inode->i_mode)) {
p->bdev = inode->i_sb->s_bdev;
mutex_lock(&inode->i_mutex);
--
1.6.6.1
--
This might make it convenient for now but is likely to increase complexity and redundancy. Why not define a macro/inline to figure that out? --
Accessing such long pointer chain is maybe not good thing to do for every swap_entry_free() call? Simple checking a flag is perhaps slightly faster? I also can't see how this flag can later increase complexity compared to creating new macro for this check. Thanks, Nitin --
I am not sure about effects on the speed, that needs to be seen.
But here are points against using a new flag....
a) Defining a new flag hogs up precious real-estate of remaining just 2 bits
(apparently the new flags are to go before SWP_SCANNING)
b) Over time, some dependent code might evolve to depend upon this flag, while
others might use the indirection to deduct if it is block device
or not. That
will make it complicated to make any relevant change in future as we'll have
to keep track of more than one way to check the status. Creating a new
macro doesn't create a new path to reach the status, but a FLAG does.
--
I think the flag is cleaner and I'm not convinced by the above speculative arguments that we should switch to a static inline function. Hugh, Linus, would you mind voicing your opinion which way to go here? --
Hi again. The more I read your patches, the more I think either I'm seriously confused (entirely possible!) or you are. Don't you want to distinguish RAM backed swap from swap that's either a partition or a file? If that's the case, you should also be setting SWP_BLKDEV in the S_ISREG part that follows iff the p->bdev is a regular file. Regards, Nigel --
This callback is required when RAM based devices are used as swap disks.
One such device is ramzswap which is used as compressed in-memory swap
disk. For such devices, we need a callback as soon as a swap slot is no
longer used to allow freeing memory allocated for this slot. Without this
callback, stale data can quickly accumulate in memory defeating the whole
purpose of such devices.
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---
include/linux/blkdev.h | 2 ++
mm/swapfile.c | 4 ++++
2 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6690e8b..413284a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1287,6 +1287,8 @@ struct block_device_operations {
unsigned long long);
int (*revalidate_disk) (struct gendisk *);
int (*getgeo)(struct block_device *, struct hd_geometry *);
+ /* this callback is with swap_lock and sometimes page table lock held */
+ void (*swap_slot_free_notify) (struct block_device *, unsigned long);
struct module *owner;
};
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ecb069e..f5ccc47 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -574,6 +574,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
/* free if no reference */
if (!usage) {
+ struct gendisk *disk = p->bdev->bd_disk;
if (offset < p->lowest_bit)
p->lowest_bit = offset;
if (offset > p->highest_bit)
@@ -583,6 +584,9 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
swap_list.next = p->type;
nr_swap_pages++;
p->inuse_pages--;
+ if ((p->flags & SWP_BLKDEV) &&
+ disk->fops->swap_slot_free_notify)
+ disk->fops->swap_slot_free_notify(p->bdev, offset);
}
return usage;
--
1.6.6.1
--
Hi. Is this p->flags & SWP_BLKDEV logic reversed? (Don't you want the notifier called for devices that aren't backed by a block device?) I also wonder whether leaving the p->flags & SWP_BLKDEV part out might be a good idea. Other potential notifier users? Regards, Nigel --
No, the logic here is correct: ramzswap is a block device for which we want this callback. Though its a RAM backed, it is still a block device. For regular files, 'offset' used here makes little sense. For block devices, its simply offset in real device. Also, I doubt if *files* would ever like to have such a callback. Thanks, Nitin --
Hi. Okay. Entire series Acked-by: Nigel Cunningham <nigel@tuxonice.net> Thanks. Nigel --
Install handler for swap_slot_free_notify callback which is called
when a swap slot is no longer used. This handler immediately frees
memory allocated corresponding to the given swap slot.
Signed-off-by: Nitin Gupta
---
drivers/staging/ramzswap/TODO | 5 -----
drivers/staging/ramzswap/ramzswap_drv.c | 22 +++++++++++++---------
2 files changed, 13 insertions(+), 14 deletions(-)
delete mode 100644 drivers/staging/ramzswap/TODO
diff --git a/drivers/staging/ramzswap/TODO b/drivers/staging/ramzswap/TODO
deleted file mode 100644
index 8d64e28..0000000
--- a/drivers/staging/ramzswap/TODO
+++ /dev/null
@@ -1,5 +0,0 @@
-TODO:
- - Add support for swap notifiers
-
-Please send patches to Greg Kroah-Hartman <greg@kroah.com> and
-Nitin Gupta <ngupta@vflare.org>
diff --git a/drivers/staging/ramzswap/ramzswap_drv.c b/drivers/staging/ramzswap/ramzswap_drv.c
index ee5eb12..ab15276 100644
--- a/drivers/staging/ramzswap/ramzswap_drv.c
+++ b/drivers/staging/ramzswap/ramzswap_drv.c
@@ -795,14 +795,6 @@ static int ramzswap_write(struct ramzswap *rzs, struct bio *bio)
src = rzs->compress_buffer;
- /*
- * System swaps to same sector again when the stored page
- * is no longer referenced by any process. So, its now safe
- * to free the memory that was allocated for this page.
- */
- if (rzs->table[index].page || rzs_test_flag(rzs, index, RZS_ZERO))
- ramzswap_free_page(rzs, index);
-
mutex_lock(&rzs->lock);
user_mem = kmap_atomic(page, KM_USER0);
@@ -1295,9 +1287,21 @@ out:
return ret;
}
+void ramzswap_slot_free_notify(struct block_device *bdev, unsigned long index)
+{
+ struct ramzswap *rzs;
+
+ rzs = bdev->bd_disk->private_data;
+ ramzswap_free_page(rzs, index);
+ rzs_stat64_inc(rzs, &rzs->stats.notify_free);
+
+ return;
+}
+
static struct block_device_operations ramzswap_devops = {
.ioctl = ramzswap_ioctl,
- .owner = THIS_MODULE,
+ .swap_slot_free_notify = ramzswap_slot_free_notify,
+ .owner = THIS_MODULE
};
static int ...The series looks good to me: Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> --
Yeah, I think I'm finally ok with it too. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Thanks, Linus --
On Fri, 7 May 2010 12:55:04 +0530 I didn't even realise that ramzswap had snuck in via the staging backdoor. <hasty review> Looking at the changelogs I'm seeing no information about the effectiveness of ramzswap - how much memory it saves. As that's the entire point of the driver, that would be a rather important thing to have included in the commit comments. We cannot make the decision to merge ramzswap without this info. The various lengthy pr_info() messages could do with some attention from a native English speaker (sorry). setup_swap_header() should use kmap_atomic(). The driver appears to be controlled by some nasty-looking ioctl against some fd. None of it is documented anywhere. It should be. You're proposing here a permanent extension to the kernel ABI which we will need to maintain for ever. That's a big deal and it is the very first thing reviewers will look at, before even considering the code. The compat implications of the ioctl args will need to be examined. RZSIO_GET_STATS looks to be hopeless from a long-term maintainability POV. It's debug code and it would be better to move it into a debugfs file, where we can then add and remove things at will. The code would benefit from a lot more comments. For example, take add_backing_swap_extent(). What is its role in the overall operation of the driver? What are its locking preconditions? The meaning of its return value? Generally, aim to be telling the reader the things whcih he needs to know and which aren't obvious from the code itself. add_backing_swap_extent() does an alloc_page(__GFP_ZERO). That means it's GFP_ATOMIC and not __GFP_HIGH. But afacit it could have used the much stronger GFP_KERNEL. Was the lakc of GFP_KERNEL deliberate? There's no way for me to tell due to the lack of comments. If it _was_ deliberate then there should be a comment telling me why. If it wasn't deliberate then, well, bug. ramzswap_drv.h could use the __aligned() macro rather than ...
Documentation (drivers/staging/ramzswap/ramzswap.txt) points to the project home page which has lots of performance numbers -- both positive and negative cases. Lot of your points are regarding lack of documentation and code comments. Instead of replying to all such points individually, let me first send ramzswap.txt points to 'rzscontrol' manpage where the effect of each of these ioctls is documented: http://compcache.googlecode.com/hg/sub-projects/rzscontrol/man/rzscontrol.html You cannot use /dev/ramzswap{0,1,2...} devices for anything other than swap devices since they can only handle page-aligned I/O requests. This restriction highly simplifies the code as handling compression/ decompression for sub-page requests is hard and wasteful. If someone needs a generic in-memory compressed block device, they I will soon send patches for adding all these code comments. I hope that will make code more reviewable. In the meantime, would it be possible to commit these swap free notify patches? Thanks for your detailed comments. Nitin --
Hi Andrew, There's some benchmarks at ramzswap pages: http://code.google.com/p/compcache/wiki/Performance http://code.google.com/p/compcache/wiki/SwapDiskVsRamz [ snip bunch of comments from Andrew that need to be addressed, We need it because the slab allocator is not a good fit for this special purpose driver due to fragmentation. Nitin, you had a nice web page showing all the relevant numbers but I can't find it anymore. Andrew, FWIW, I'm ok with xvmalloc() for this particular driver. There was some discussion on making it more generic but I don't see it as a I agree that the whole graduation step from staging to kernel proper is not well-defined. Any suggestions? That said, I hope that doesn't stop us from merging this patch series because the lack of notifiers cripples the current ramzswap performance. Pekka --
Oops, missed this part: xvmalloc performance numbers: http://code.google.com/p/compcache/wiki/xvMalloc http://code.google.com/p/compcache/wiki/xvMallocPerformance Thanks, Nitin --
I don't see the xvmalloc vs. kmalloc fragmentation numbers there. I thought you had some? --
TLSF vs kmalloc (SLUB): http://code.google.com/p/compcache/wiki/AllocatorsComparison By design, xvmalloc is very similar to TLSF. In fact, xvmalloc has less metadata overhead than TLSF. So, we will surely get similar results for xvmalloc vs kmalloc comparison too. Thanks, Nitin --
Another option would be sysfs if you want the stats to always be available. --
