The zram module creates RAM based block devices named /dev/zram<id> (<id> = 0, 1, ...). Pages written to these disks are compressed and stored in memory itself. One of the major changes done is the replacement of ioctls with sysfs interface. One of the advantages of this approach is we no longer depend on the userspace tool (rzscontrol) which was used to set various parameters and check statistics. Maintaining updated version of rzscontrol as changes were done to ioctls, statistics exported etc. was a major pain. Another significant change is the introduction of percpu stats and compression buffers. Earlier, we had per-device buffers protected by a mutex. This was a major bottleneck on multi-core systems. With these changes, benchmarks with fio[1] showed a speedup of about 20% for write performance on dual-core system (see patch 4/10 description for details). For easier testing, a single patch against 2.6.35-git8 has been uploaded at: http://compcache.googlecode.com/hg/sub-projects/mainline/zram_2.6.36-rc0.patch [1] fio I/O benchmarking tool: http://freshmeat.net/projects/fio/ Nitin Gupta (10): Replace ioctls with sysfs interface Remove need for explicit device initialization Use percpu stats Use percpu buffers Reduce per table entry overhead by 4 bytes Block discard support Increase compressed page size threshold Some cleanups Update zram documentation Document sysfs entries Documentation/ABI/testing/sysfs-block-zram | 99 ++++ drivers/staging/zram/Kconfig | 12 - drivers/staging/zram/Makefile | 2 +- drivers/staging/zram/xvmalloc.c | 22 +- drivers/staging/zram/zram.txt | 58 ++- drivers/staging/zram/zram_drv.c | 670 ++++++++++++++-------------- drivers/staging/zram/zram_drv.h | 138 ++---- drivers/staging/zram/zram_ioctl.h | 41 -- drivers/staging/zram/zram_sysfs.c | 253 +++++++++++ 9 files changed, 760 insertions(+), 535 ...
Creates per-device sysfs nodes in /sys/block/zram<id>/
Currently following stats are exported:
- disksize
- num_reads
- num_writes
- invalid_io
- zero_pages
- orig_data_size
- compr_data_size
- mem_used_total
By default, disksize is set to 0. So, to start using
a zram device, fist write a disksize value and then
initialize device by writing any positive value to
initstate. For example:
# initialize /dev/zram0 with 50MB disksize
echo 50*1024*1024 | bc > /sys/block/zram0/disksize
echo 1 > /sys/block/zram0/initstate
When done using a disk, issue reset to free its memory
by writing any positive value to reset node:
echo 1 > /sys/block/zram0/reset
This change also obviates the need for 'rzscontrol' utility.
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---
drivers/staging/zram/Kconfig | 12 --
drivers/staging/zram/Makefile | 2 +-
drivers/staging/zram/zram_drv.c | 186 ++++++++--------------------
drivers/staging/zram/zram_drv.h | 55 ++-------
drivers/staging/zram/zram_ioctl.h | 41 ------
drivers/staging/zram/zram_sysfs.c | 242 +++++++++++++++++++++++++++++++++++++
6 files changed, 306 insertions(+), 232 deletions(-)
delete mode 100644 drivers/staging/zram/zram_ioctl.h
create mode 100644 drivers/staging/zram/zram_sysfs.c
diff --git a/drivers/staging/zram/Kconfig b/drivers/staging/zram/Kconfig
index 4654ae2..da079f8 100644
--- a/drivers/staging/zram/Kconfig
+++ b/drivers/staging/zram/Kconfig
@@ -15,15 +15,3 @@ config ZRAM
See zram.txt for more information.
Project home: http://compcache.googlecode.com/
-
-config ZRAM_STATS
- bool "Enable statistics for compressed RAM disks"
- depends on ZRAM
- default y
- help
- Enable statistics collection for compressed RAM devices. Statistics
- are exported through ioctl interface, so you have to use zramconfig
- program to get them. This adds only a minimal overhead.
-
- If unsure, say Y.
-
diff --git ...Looks good to me (but I'm not a sysfs guy). These could probably use atomic_inc(), atomic64_inc(), and friends, no? --
Yes, I think we could use them. Anyways, they are replaced by percpu stats in patch 3, so probably this can be left as-is. Thanks, Nitin --
Maybe I'm just a weirdo, but I don't really use modules much. That effectively means that I'm stuck at boot with one zram device. Making it a read-only module param also means that someone can't add a second at runtime while the first is still in use. It doesn't seem to be used very pervasively, but there is a module_param_cb() function so you can register callbacks when the param gets updated. Might come in handy for this. -- Dave --
Currently, the user has to explicitly write a positive value to
initstate sysfs node before the device can be used. This event
triggers allocation of per-device metadata like memory pool,
table array and so on.
We do not pre-initialize all zram devices since the 'table' array,
mapping disk blocks to compressed chunks, takes considerable amount
of memory (8 bytes per page). So, pre-initializing all devices will
be quite wasteful if only few or none of the devices are actually
used.
This explicit device initialization from user is an odd requirement and
can be easily avoided. We now initialize the device when first write is
done to the device.
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---
drivers/staging/zram/zram_drv.c | 35 +++++++++++++++++++++++------------
drivers/staging/zram/zram_drv.h | 2 ++
drivers/staging/zram/zram_sysfs.c | 26 ++++----------------------
3 files changed, 29 insertions(+), 34 deletions(-)
diff --git a/drivers/staging/zram/zram_drv.c b/drivers/staging/zram/zram_drv.c
index 3f698a5..c5f84ee 100644
--- a/drivers/staging/zram/zram_drv.c
+++ b/drivers/staging/zram/zram_drv.c
@@ -207,9 +207,15 @@ static int zram_read(struct zram *zram, struct bio *bio)
u32 index;
struct bio_vec *bvec;
- zram_stat64_inc(zram, &zram->stats.num_reads);
+ if (unlikely(!zram->init_done)) {
+ set_bit(BIO_UPTODATE, &bio->bi_flags);
+ bio_endio(bio, 0);
+ return 0;
+ }
+ zram_stat64_inc(zram, &zram->stats.num_reads);
index = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT;
+
bio_for_each_segment(bvec, bio, i) {
int ret;
size_t clen;
@@ -275,16 +281,20 @@ out:
static int zram_write(struct zram *zram, struct bio *bio)
{
- int i;
+ int i, ret;
u32 index;
struct bio_vec *bvec;
- zram_stat64_inc(zram, &zram->stats.num_writes);
+ if (unlikely(!zram->init_done)) {
+ ret = zram_init_device(zram);
+ if (ret)
+ goto out;
+ }
+ zram_stat64_inc(zram, &zram->stats.num_writes);
index = bio->bi_sector >> ...AFAICT, most hardware block device drivers do things like this in the probe function. Why can't we do that for zram as well and drop the ->init_done and ->init_lock parts? --
I think probe is only for PCI devices? Maybe should hook into open function in struct block_device_operations. That way, we can also drop init_done and init_lock parts. Thanks, Nitin --
Also remove references to removed stats (ex: good_comress).
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---
drivers/staging/zram/zram_drv.c | 77 ++++++++++++++++--------------------
drivers/staging/zram/zram_drv.h | 33 +++++++++-------
drivers/staging/zram/zram_sysfs.c | 46 +++++++++++++++-------
3 files changed, 84 insertions(+), 72 deletions(-)
diff --git a/drivers/staging/zram/zram_drv.c b/drivers/staging/zram/zram_drv.c
index c5f84ee..f111b86 100644
--- a/drivers/staging/zram/zram_drv.c
+++ b/drivers/staging/zram/zram_drv.c
@@ -38,33 +38,27 @@ struct zram *devices;
/* Module params (documentation at end) */
unsigned int num_devices;
-static void zram_stat_inc(u32 *v)
+static void zram_add_stat(struct zram *zram,
+ enum zram_stats_index idx, s64 val)
{
- *v = *v + 1;
+ struct zram_stats_cpu *stats;
+
+ preempt_disable();
+ stats = __this_cpu_ptr(zram->stats);
+ u64_stats_update_begin(&stats->syncp);
+ stats->count[idx] += val;
+ u64_stats_update_end(&stats->syncp);
+ preempt_enable();
}
-static void zram_stat_dec(u32 *v)
+static void zram_inc_stat(struct zram *zram, enum zram_stats_index idx)
{
- *v = *v - 1;
+ zram_add_stat(zram, idx, 1);
}
-static void zram_stat64_add(struct zram *zram, u64 *v, u64 inc)
+static void zram_dec_stat(struct zram *zram, enum zram_stats_index idx)
{
- spin_lock(&zram->stat64_lock);
- *v = *v + inc;
- spin_unlock(&zram->stat64_lock);
-}
-
-static void zram_stat64_sub(struct zram *zram, u64 *v, u64 dec)
-{
- spin_lock(&zram->stat64_lock);
- *v = *v - dec;
- spin_unlock(&zram->stat64_lock);
-}
-
-static void zram_stat64_inc(struct zram *zram, u64 *v)
-{
- zram_stat64_add(zram, v, 1);
+ zram_add_stat(zram, idx, -1);
}
static int zram_test_flag(struct zram *zram, u32 index,
@@ -131,7 +125,7 @@ static void zram_set_disksize(struct zram *zram, size_t totalram_bytes)
static void zram_free_page(struct zram *zram, size_t index)
{
- u32 clen;
+ int clen;
void *obj;
struct ...Acked-by: Pekka Enberg <penberg@kernel.org> --
That reimplements include/linux/percpu_counter.h, poorly. Please see the June discussion "[PATCH v2 1/2] tmpfs: Quick token library to allow scalable retrieval of tokens from token jar" for some discussion. --
I read the discussion you pointed out but still fail to see how percpu_counters, with all their overhead, are better than simple pcpu variable used in current version. What is the advantage? Thanks, Nitin --
What overhead? Send numbers. Then extrapolate those numbers to a Firstly, they'd have saved all the time you spent duplicating them. Secondly, getting additional users of the standard facility results in more testing and perhaps enhancement of that facility, thus benefiting other users too. Thirdly, using the standard facility permits your code to leverage enhancements which others add. Fourthly, they would result in a smaller kernel. You didn't really need me to teach you the benefits of code reuse did you? Please do not merge this code unless there is a good reason to do so and it has been shown that the standard facility cannot be suitably fixed or enhanced to address the deficiency. --
Maybe do #define zram_add_stat(zram, index, val) this_cpu_add(zram->stats->count[index], val) instead? It creates an add in a single "atomic" per cpu instruction and deals with the fallback scenarios for processors that cannot handle 64 bit adds. --
Yes, this_cpu_add() seems sufficient. I can't recall why I used u64_stats_* but if it's not required for atomic access to 64-bit then why was it added to the mainline in the first place? Thanks, Nitin --
Because we wanted to have fast 64bit counters, even on 32bit arches, and this has litle to do with 'atomic' on one entity, but a group of counters. (check drivers/net/loopback.c, lines 91-94). No lock prefix used in fast path. We also wanted readers to read correct values, not a value being changed by a writer, with inconsistent 32bit halves. SNMP applications want monotonically increasing counters. this_cpu_add()/this_cpu_read() doesnt fit. Even for single counter, this_cpu_read(64bit) is not using an RMW (cmpxchg8) instruction, so you can get very strange results when low order 32bit wraps. --
How about fixing it so that everyone benefits? --
IMHO, this_cpu_read() is fine as is : a _read_ operation. Dont pretend it can be used in every context, its not true. --
The problem only exists on 32 bit platforms using 64 bit counters. If you would provide this functionality for the fallback case of 64 bit counters (here x86) in 32 bit arch code then you could use the this_cpu_* operations in all context without your special code being replicated in ohter places. The additional advantage would be that for the 64bit case you would have much faster and more compact code. --
My implementation is portable and use existing infrastructure, at the
time it was coded. BTW, its fast on 64bit too. As fast as previous
implementation. No extra code added. Please double check.
If you believe you can do better, please do so.
Of course, we added 64bit network stats to all 32bit arches only because
cost was acceptable. (I say all 32bit arches, because you seem to think
only x86 was the target)
Using this_cpu_{add|res}() fallback using atomic ops or spinlocks would
be slower than actual implemenation (smp_wmb() (nops on x86) and
increments).
--
Well if you use the this_cpu_add() on 64 bit you have a single instruction Ok then add your 64 bit on 32 bit implementation as a default for all 32 bit configurations. The point is to not have special implementations for particular counters. It would be great if the logic you developed for the network counters could be used in general for all who have issues with 64 That would be bad. But then I do not see anyone around that wants to implement such fallback behavior. The current generic fallback for 64 ops (if the arch does not provide it which 32 bit arches generally do not) is to switch off preempt and then perform the 64 bit op. There is also a irqsafe version as well. Have a look at inclue/linux/percpu.h. --
Sorry, was having other zram issues on ppc64 that prevented me from testing this. Stupid bug, removed the &. Anton zram: Free percpu data on module exit. Signed-off-by: Anton Blanchard <anton@samba.org> --- Index: powerpc.git/drivers/staging/zram/zram_drv.c =================================================================== --- powerpc.git.orig/drivers/staging/zram/zram_drv.c 2010-08-31 15:15:59.344290847 +1000 +++ powerpc.git/drivers/staging/zram/zram_drv.c 2010-09-01 12:35:02.964893575 +1000 @@ -483,8 +483,7 @@ void zram_reset_device(struct zram *zram xv_destroy_pool(zram->mem_pool); zram->mem_pool = NULL; - /* Reset stats */ - memset(&zram->stats, 0, sizeof(zram->stats)); + free_percpu(zram->stats); zram->disksize = zram_default_disksize(); mutex_unlock(&zram->init_lock); --
I'm getting an oops when running mkfs on zram:
NIP [d0000000030e0340] .zram_inc_stat+0x58/0x84 [zram]
[c00000006d58f720] [d0000000030e091c] .zram_make_request+0xa8/0x6a0 [zram]
[c00000006d58f840] [c00000000035795c] .generic_make_request+0x390/0x434
[c00000006d58f950] [c000000000357b14] .submit_bio+0x114/0x140
[c00000006d58fa20] [c000000000361778] .blkdev_issue_discard+0x1ac/0x250
[c00000006d58fb10] [c000000000361f68] .blkdev_ioctl+0x358/0x7fc
[c00000006d58fbd0] [c0000000001c1c1c] .block_ioctl+0x6c/0x90
[c00000006d58fc70] [c0000000001984c4] .do_vfs_ioctl+0x660/0x6d4
[c00000006d58fd70] [c0000000001985a0] .SyS_ioctl+0x68/0xb0
Since disksize no longer starts as 0 it looks like we can call
zram_make_request before the device has been initialised. The patch below
fixes the immediate problem but this would go away if we move the
initialisation function elsewhere (as suggested in another thread).
Signed-off-by: Anton Blanchard <anton@samba.org>
---
Index: powerpc.git/drivers/staging/zram/zram_drv.c
===================================================================
--- powerpc.git.orig/drivers/staging/zram/zram_drv.c 2010-09-01 12:35:14.286515175 +1000
+++ powerpc.git/drivers/staging/zram/zram_drv.c 2010-09-01 12:35:24.167930504 +1000
@@ -441,6 +441,12 @@ static int zram_make_request(struct requ
int ret = 0;
struct zram *zram = queue->queuedata;
+ if (unlikely(!zram->init_done)) {
+ set_bit(BIO_UPTODATE, &bio->bi_flags);
+ bio_endio(bio, 0);
+ return 0;
+ }
+
if (unlikely(!valid_io_request(zram, bio))) {
zram_inc_stat(zram, ZRAM_STAT_INVALID_IO);
bio_io_error(bio);
--
On Wed, 1 Sep 2010 13:51:35 +1000 So... what happened with this and your other bugfix in zram_reset_device()? --
Each zram device maintains an array (table) that maps
index within the device to the location of corresponding
compressed chunk. Currently we store 'struct page' pointer,
offset with page and various flags separately which takes
12 bytes per table entry. Now all these are encoded in a
single 'phys_add_t' value which results in savings of 4 bytes
per entry (except on PAE systems).
Unfortunately, cleanups related to some variable renames
were mixed in this patch. So, please bear some additional
noise.
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---
drivers/staging/zram/zram_drv.c | 256 ++++++++++++++++++++------------------
drivers/staging/zram/zram_drv.h | 24 +---
2 files changed, 140 insertions(+), 140 deletions(-)
diff --git a/drivers/staging/zram/zram_drv.c b/drivers/staging/zram/zram_drv.c
index c16e09a..efe9c93 100644
--- a/drivers/staging/zram/zram_drv.c
+++ b/drivers/staging/zram/zram_drv.c
@@ -42,6 +42,13 @@ struct zram *devices;
/* Module params (documentation at end) */
unsigned int num_devices;
+/*
+ * We do not allocate any memory for zero-filled pages.
+ * Rather, we simply mark them in corresponding table
+ * entry by setting this bit.
+ */
+#define ZRAM_ZERO_PAGE_MARK_BIT (1 << 0)
+
static void zram_add_stat(struct zram *zram,
enum zram_stats_index idx, s64 val)
{
@@ -65,37 +72,62 @@ static void zram_dec_stat(struct zram *zram, enum zram_stats_index idx)
zram_add_stat(zram, idx, -1);
}
-static int zram_test_flag(struct zram *zram, u32 index,
- enum zram_pageflags flag)
+static int page_zero_filled(void *ptr)
{
- return zram->table[index].flags & BIT(flag);
+ unsigned int pos;
+ unsigned long *page;
+
+ page = (unsigned long *)ptr;
+
+ for (pos = 0; pos != PAGE_SIZE / sizeof(*page); pos++) {
+ if (page[pos])
+ return 0;
+ }
+
+ return 1;
}
-static void zram_set_flag(struct zram *zram, u32 index,
- enum zram_pageflags flag)
+static int zram_is_zero_page(struct zram *zram, u32 index)
...The noise makes this patch pretty difficult to review properly. Care to spilt the patch into two pieces? --
Ok, I will split them as separate patches. Thanks for all the reviews and Acks. Nitin --
The 'discard' bio discard request provides information to
zram disks regarding blocks which are no longer in use by
filesystem. This allows freeing memory allocated for such
blocks.
When zram devices are used as swap disks, we already have
a callback (block_device_operations->swap_slot_free_notify).
So, the discard support is useful only when used as generic
(non-swap) disk.
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---
drivers/staging/zram/zram_drv.c | 25 +++++++++++++++++++++++++
drivers/staging/zram/zram_sysfs.c | 11 +++++++++++
2 files changed, 36 insertions(+), 0 deletions(-)
diff --git a/drivers/staging/zram/zram_drv.c b/drivers/staging/zram/zram_drv.c
index efe9c93..0f9785f 100644
--- a/drivers/staging/zram/zram_drv.c
+++ b/drivers/staging/zram/zram_drv.c
@@ -420,6 +420,20 @@ out:
return 0;
}
+static void zram_discard(struct zram *zram, struct bio *bio)
+{
+ size_t bytes = bio->bi_size;
+ sector_t sector = bio->bi_sector;
+
+ while (bytes >= PAGE_SIZE) {
+ zram_free_page(zram, sector >> SECTORS_PER_PAGE_SHIFT);
+ sector += PAGE_SIZE >> SECTOR_SHIFT;
+ bytes -= PAGE_SIZE;
+ }
+
+ bio_endio(bio, 0);
+}
+
/*
* Check if request is within bounds and page aligned.
*/
@@ -451,6 +465,12 @@ static int zram_make_request(struct request_queue *queue, struct bio *bio)
return 0;
}
+ if (unlikely(bio_rw_flagged(bio, BIO_RW_DISCARD))) {
+ zram_inc_stat(zram, ZRAM_STAT_DISCARD);
+ zram_discard(zram, bio);
+ return 0;
+ }
+
switch (bio_data_dir(bio)) {
case READ:
ret = zram_read(zram, bio);
@@ -606,6 +626,11 @@ static int create_device(struct zram *zram, int device_id)
blk_queue_io_min(zram->disk->queue, PAGE_SIZE);
blk_queue_io_opt(zram->disk->queue, PAGE_SIZE);
+ zram->disk->queue->limits.discard_granularity = PAGE_SIZE;
+ zram->disk->queue->limits.max_discard_sectors = UINT_MAX;
+ zram->disk->queue->limits.discard_zeroes_data = 1;
+ queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->queue);
+
...So freeing the page here will guarantee zeroed return on read? And since you set PAGE_SIZE as the discard granularity, the above loop could be coded more readable with the knowledge that ->bi_size is always a multiple of the page size. -- Jens Axboe Confidentiality Notice: This e-mail message, its contents and any attachments to it are confidential to the intended recipient, and may contain information that is privileged and/or exempt from disclosure under applicable law. If you are not the intended recipient, please immediately notify the sender and destroy the original e-mail message and any attachments (and any copies that may have been made) from your system or otherwise. Any unauthorized use, copying, disclosure or distribution of this information is strictly prohibited. --
For reads on freed/unwritten sectors, it simply returns success and does not touch the bio page. Is it better to zero the page in such Ok, I will cleanup it up. Thanks for comments. Nitin --
Well, you told the kernel that you return zeroes on discarded ranges:
zram->disk->queue->limits.discard_zeroes_data = 1;
So yes, if you intend to keep that, then you need to zero the
incoming pages that have been explicitly trimmed by a discard.
--
Jens Axboe
Confidentiality Notice: This e-mail message, its contents and any attachments to it are confidential to the intended recipient, and may contain information that is privileged and/or exempt from disclosure under applicable law. If you are not the intended recipient, please immediately notify the sender and destroy the original e-mail message and any attachments (and any copies that may have been made) from your system or otherwise. Any unauthorized use, copying, disclosure or distribution of this information is strictly prohibited.
--
Compression takes much more time than decompression. So, its quite wasteful in terms of both CPU cycles and memory usage to have a very low compressed page size threshold and thereby storing such not-so-well compressible pages as-is (uncompressed). So, increasing it from PAGE_SIZE/2 to PAGE_SIZE/8*7. A low threshold was useful when we had "backing swap" support where we could forward such pages to the backing device (applicable only when zram was used as swap disk). It is not yet configurable through sysfs but may be exported in future, along with threshold for average compression ratio. Signed-off-by: Nitin Gupta <ngupta@vflare.org> --- drivers/staging/zram/zram_drv.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/staging/zram/zram_drv.h b/drivers/staging/zram/zram_drv.h index 65e512d..bcc51ea 100644 --- a/drivers/staging/zram/zram_drv.h +++ b/drivers/staging/zram/zram_drv.h @@ -47,7 +47,7 @@ static const unsigned default_disksize_perc_ram = 25; * Pages that compress to size greater than this are stored * uncompressed in memory. */ -static const unsigned max_zpage_size = PAGE_SIZE / 4 * 3; +static const unsigned max_zpage_size = PAGE_SIZE / 8 * 7; /* * NOTE: max_zpage_size must be less than or equal to: -- 1.7.2.1 --
Hi Nitin,
The description makes sense but lacks any real data. What kind of
workloads did you test this with? Where does it help most? How much?
Pekka
--
- xvmalloc: Remove unnecessary stat_{inc,dec} and increment
pages_stored directly
- xvmalloc: Initialize pointers with NULL instead of 0
- zram: Remove verbose message when use sets insane disksize
- zram: Mark some messages as pr_debug
- zram: Refine some comments
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---
drivers/staging/zram/xvmalloc.c | 22 ++++-------------
drivers/staging/zram/zram_drv.c | 49 +++++++++++++-------------------------
drivers/staging/zram/zram_drv.h | 26 ++++----------------
3 files changed, 28 insertions(+), 69 deletions(-)
diff --git a/drivers/staging/zram/xvmalloc.c b/drivers/staging/zram/xvmalloc.c
index 3fdbb8a..6268f65 100644
--- a/drivers/staging/zram/xvmalloc.c
+++ b/drivers/staging/zram/xvmalloc.c
@@ -20,16 +20,6 @@
#include "xvmalloc.h"
#include "xvmalloc_int.h"
-static void stat_inc(u64 *value)
-{
- *value = *value + 1;
-}
-
-static void stat_dec(u64 *value)
-{
- *value = *value - 1;
-}
-
static int test_flag(struct block_header *block, enum blockflags flag)
{
return block->prev & BIT(flag);
@@ -187,7 +177,7 @@ static void insert_block(struct xv_pool *pool, struct page *page, u32 offset,
slindex = get_index_for_insert(block->size);
flindex = slindex / BITS_PER_LONG;
- block->link.prev_page = 0;
+ block->link.prev_page = NULL;
block->link.prev_offset = 0;
block->link.next_page = pool->freelist[slindex].page;
block->link.next_offset = pool->freelist[slindex].offset;
@@ -217,7 +207,7 @@ static void remove_block_head(struct xv_pool *pool,
pool->freelist[slindex].page = block->link.next_page;
pool->freelist[slindex].offset = block->link.next_offset;
- block->link.prev_page = 0;
+ block->link.prev_page = NULL;
block->link.prev_offset = 0;
if (!pool->freelist[slindex].page) {
@@ -232,7 +222,7 @@ static void remove_block_head(struct xv_pool *pool,
*/
tmpblock = get_ptr_atomic(pool->freelist[slindex].page,
pool->freelist[slindex].offset, ...Signed-off-by: Nitin Gupta <ngupta@vflare.org> --- Documentation/ABI/testing/sysfs-block-zram | 99 ++++++++++++++++++++++++++++ 1 files changed, 99 insertions(+), 0 deletions(-) create mode 100644 Documentation/ABI/testing/sysfs-block-zram diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram new file mode 100644 index 0000000..c8b3b48 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -0,0 +1,99 @@ +What: /sys/block/zram<id>/disksize +Date: August 2010 +Contact: Nitin Gupta <ngupta@vflare.org> +Description: + The disksize file is read-write and specifies the disk size + which represents the limit on the *uncompressed* worth of data + that can be stored in this disk. + +What: /sys/block/zram<id>/initstate +Date: August 2010 +Contact: Nitin Gupta <ngupta@vflare.org> +Description: + The disksize file is read-only and shows the initialization + state of the device. + +What: /sys/block/zram<id>/reset +Date: August 2010 +Contact: Nitin Gupta <ngupta@vflare.org> +Description: + The disksize file is write-only and allows resetting the + device. The reset operation frees all the memory assocaited + with this device. + +What: /sys/block/zram<id>/num_reads +Date: August 2010 +Contact: Nitin Gupta <ngupta@vflare.org> +Description: + The num_reads file is read-only and specifies the number of + reads (failed or successful) done on this device. + +What: /sys/block/zram<id>/num_writes +Date: August 2010 +Contact: Nitin Gupta <ngupta@vflare.org> +Description: + The num_writes file is read-only and specifies the number of + writes (failed or successful) done on this device. + +What: /sys/block/zram<id>/invalid_io +Date: August 2010 +Contact: Nitin Gupta <ngupta@vflare.org> +Description: + The invalid_io file is read-only and specifies the number of + non-page-size-aligned I/O requests issued to this device. + +What: /sys/block/zram<id>/notify_free +Date: August ...
Acked-by: Pekka Enberg <penberg@kernel.org> --
Update zram documentation to reflect transition form ioctl to sysfs interface. Signed-off-by: Nitin Gupta <ngupta@vflare.org> --- drivers/staging/zram/zram.txt | 58 +++++++++++++++++++++++++--------------- 1 files changed, 36 insertions(+), 22 deletions(-) diff --git a/drivers/staging/zram/zram.txt b/drivers/staging/zram/zram.txt index 520edc1..5f75d29 100644 --- a/drivers/staging/zram/zram.txt +++ b/drivers/staging/zram/zram.txt @@ -5,33 +5,35 @@ Project home: http://compcache.googlecode.com/ * Introduction -The zram module creates RAM based block devices: /dev/ramX (X = 0, 1, ...). -Pages written to these disks are compressed and stored in memory itself. -These disks allow very fast I/O and compression provides good amounts of -memory savings. +The zram module creates RAM based block devices named /dev/zram<id> +(<id> = 0, 1, ...). Pages written to these disks are compressed and stored +in memory itself. These disks allow very fast I/O and compression provides +good amounts of memory savings. Some of the usecases include /tmp storage, +use as swap disks, various caches under /var and maybe many more :) -See project home for use cases, performance numbers and a lot more. - -Individual zram devices are configured and initialized using zramconfig -userspace utility as shown in examples below. See zramconfig man page for -more details. +Statistics for individual zram devices are exported through sysfs nodes at +/sys/block/zram<id>/ * Usage Following shows a typical sequence of steps for using zram. -1) Load Modules: +1) Load Module: modprobe zram num_devices=4 - This creates 4 (uninitialized) devices: /dev/zram{0,1,2,3} + This creates 4 devices: /dev/zram{0,1,2,3} (num_devices parameter is optional. Default: 1) -2) Initialize: - Use zramconfig utility to configure and initialize individual - zram devices. For example: - zramconfig /dev/zram0 --init # uses default value of disksize_kb - zramconfig /dev/zram1 --disksize_kb=102400 # 100MB ...
This also removes the (mutex) lock which was used to protect these buffers. Tests with fio on dual core showed improvement of about 20% for write speed. fio job description: [global] bs=4k ioengine=libaio iodepth=8 size=1g direct=1 runtime=60 directory=/mnt/zdisk verify=meta verify=0x1234567890abcdef numjobs=2 [seq-write] rw=write Speedup was negligible with iodepth of 4 or less. Maybe gains will be more visible with higher number of cores. Results with above job file: Before: WRITE: io=2,048MB, aggrb=70,725KB/s WRITE: io=2,048MB, aggrb=71,938KB/s WRITE: io=2,048MB, aggrb=70,461KB/s After: WRITE: io=2,048MB, aggrb=86,691KB/s WRITE: io=2,048MB, aggrb=87,025KB/s WRITE: io=2,048MB, aggrb=88,700KB/s Speedup: ~20% Read performance remained almost the same since the read path was lockless earlier also (LZO decompressor does not require any additional buffer). Signed-off-by: Nitin Gupta <ngupta@vflare.org> --- drivers/staging/zram/zram_drv.c | 136 +++++++++++++++++++++++++-------------- drivers/staging/zram/zram_drv.h | 4 - 2 files changed, 88 insertions(+), 52 deletions(-) diff --git a/drivers/staging/zram/zram_drv.c b/drivers/staging/zram/zram_drv.c index f111b86..c16e09a 100644 --- a/drivers/staging/zram/zram_drv.c +++ b/drivers/staging/zram/zram_drv.c @@ -17,13 +17,14 @@ #include <linux/module.h> #include <linux/kernel.h> -#include <linux/bio.h> #include <linux/bitops.h> #include <linux/blkdev.h> #include <linux/buffer_head.h> +#include <linux/cpu.h> #include <linux/device.h> #include <linux/genhd.h> #include <linux/highmem.h> +#include <linux/notifier.h> #include <linux/slab.h> #include <linux/lzo.h> #include <linux/string.h> @@ -31,6 +32,9 @@ #include "zram_drv.h" +static DEFINE_PER_CPU(unsigned char *, compress_buffer); +static DEFINE_PER_CPU(unsigned char *, compress_workmem); + /* Globals */ static int zram_major; struct zram *devices; @@ -290,10 +294,10 @@ static int ...
The per-CPU buffer thing with this preempt_disable() trickery looks overkill to me. Most block device drivers seem to use mempool_alloc() for this sort of thing. Is there some reason you can't use that here? --
Other block drivers are allocating relatively small structs using mempool_alloc(). However, in case of zram, these buffers are quite large (compress_workmem is 64K!). So, allocating them on every write would probably be much slower than using a pre-allocated per-cpu buffer. Thanks, Nitin --
Hi Nitin,
The mempool API is precisely for that - using pre-allocated buffers
instead of allocating every time. The preempt_disable() games make the
code complex and have the downside of higher scheduling latencies so why
not give mempools a try?
Pekka
--
Hi, mempool_alloc() first calls alloc_fn with ~(__GFP_WAIT | __GFP_IO) and *then* falls down to pre-allocated buffers. So, it will always be slower than directly using pre-allocated buffers as is done currently. One trick we can use is to have alloc_fn such that it always returns failure with ~__GFP_WAIT and do actual allocation otherwise. But still it seems like unnecessary cost. Thanks, Nitin --
Hi Nitin,
We can always extend the mempool API with mempool_prealloc() function if
that turns out to be a problem. The per-CPU buffer with
preempt_disable() trickery isn't really the proper thing to do here. It
doesn't make much sense to disable preemption for compression that's
purely CPU bound.
Pekka
--
I applied the first 2 and last 2 of these patches and they will show up in the linux-next tree tomorrow. I stopped there due to Andrew's complaints about the per-cpu variable stuff. Please resolve this and redo the patch set and I will be glad to apply them. thanks, greg k-h --
Hi Nitin, I gave zram a try on a ppc64 box with a 64kB PAGE_SIZE. It looks like the xvmalloc allocator fails when we add in a large enough block (in this case 65532 bytes). flindex ends up as 127 which is larger than BITS_PER_LONG. We continually call grow_block inside find_block and fail: zram: Error allocating memory for compressed page: 0, size=467 Anton --
I'm curious why there are no killable processes on the system; it seems like the triggering task here, cat, would at least be killable itself. Could you post the tasklist dump that preceeds this (or, if you've disabled it try echo 1 > /proc/sys/vm/oom_dump_tasks first)? It's possible that if you have enough swap that none of the eligible tasks actually have non-zero badness scores either because they are being run as root or because the amount of RAM or swap is sufficiently high such that (task's rss + swap) / (total rss + swap) is never non-zero. And, since root tasks have a 3% bonus, it's possible these are all root tasks and no single task uses more than 3% of rss and swap. While this may not be the issue in your case, and can be confirmed with the tasklist dump if you can get it, we need to protect against these situations where eligible tasks may not be killed. Andrew, I'd like to propose this patch for 2.6.36-rc-series since the worst case is that the machine will panic if there are an exceptionally large number of tasks, each with little memory usage at the time of oom. oom: always return a badness score of non-zero for eligible tasks A task's badness score is roughly a proportion of its rss and swap compared to the system's capacity. The scale ranges from 0 to 1000 with the highest score chosen for kill. Thus, this scale operates on a resolution of 0.1% of RAM + swap. Admin tasks are also given a 3% bonus, so the badness score of an admin task using 3% of memory, for example, would still be 0. It's possible that an exceptionally large number of tasks will combine to exhaust all resources but never have a single task that uses more than 0.1% of RAM and swap (or 3.0% for admin tasks). This patch ensures that the badness score of any eligible task is never 0 so the machine doesn't unnecessarily panic because it cannot find a task to kill. Signed-off-by: David Rientjes <rientjes@google.com> --- mm/oom_kill.c | 9 +++++++-- 1 files ...
That was one odd part here. I didn't disable the tasklist dump, and It's a 64GB machine with ~30GB of swap and very little RSS. Your hypothesis seems correct. Just grepping through /proc/[0-9]*/oom_score shows nothing other than 0's. Trying this again, I just hung the system instead of OOM'ing straight away like last time. Your patch makes a lot of sense to me in any case where there aren't large-RSS tasks around using memory. That definitely applies here because of the amount in the compcache store and might also apply with ramfs and hugetlbfs. -- Dave --
Hmm, could you very that /proc/sys/vm/oom_dump_tasks is set? Perhaps it's getting cleared by something else before you use zram. The sysctl should Presumably you're not using a large amount of swap, either, or that would with the patch, you should still be calling the oom killer and instead of panicking it will go on a serial killing spree because everything that it wasn't judging as a candidate before (oom_score of 0) now is if it's truly killable (oom_score of 1). The patch is definitely needed for correctness since an oom_score of 0 implies the task is unkillable. We're apparently hanging in the exit path for the oom killed task or something is constantly respawning threads that repeatedly get killed. It appears as though nothing is actually a worthwhile target for the oom Agreed, we'll need to address hugepages specifically because they don't get accounted for in rss but do free memory when the task is killed. --
Nope. There's very little happening on the system except for me toying I'll give the patch a shot and see if I get any better behavior. But, I really do think the root cause here is compcache exhausting the system when you feed incompressible pages into it. We can kill all the tasks we want, but I think it'll continue to gobble memory up as fast as we free it. We either need to put some upper bounds on the amount of memory that compcache uses for its backing store, or reintroduce the code that lets They do sometimes. But, if they're preallocated, or stuck in a linked file on the filesystem, killing the task doesn't do any good. -- Dave --
Ok, I assume you aren't getting the typical "cat invoked oom-killer..." message, the memory state dump, etc., either, so there's something strange with your log level such that nothing under KERN_WARNING is getting through or you can't access the actual kernel log due to the panic. I can That certainly seems to be the case and is the true topic of this thread, so I don't want to hijack it any further since it's outside the scope of the oom killer :) But I'm still curious as to why the machine is hanging and not eventually panicking when we run out of killable tasks. It seems as though something is hanging in the exit path, meaning memory reserves aren't even safe from compcache, or there's something wrong in the oom killer retry logic, or you're simply forking more tasks, perhaps as a response to threads getting killed by the kernel, than we can kill. We'd certainly prefer to panic the machine if no work is getting done than simply killing everything that gets forked. The problem before was that we panicked too early before we killed anything and now we don't know when Indeed you're right, I meant s/hugepages/transparent hugepages/, sorry. It appears as though they get included in the rss of the allocating task, though, via MM_ANONPAGES, so this is already represented in the task's badness score. Thanks for trying the patch out, Dave, I hope we can add your Tested-by line and it can get pushed to the rc-series. --
Hi Dave, Sorry for late reply. Since last month I couldn't get any chance to work on this project. Ability to write out zram (compressed) memory to a backing disk seems really useful. However considering lkml reviews, I had to drop this feature. Anyways, I guess I will try to push this feature again. Also, please do not use linux-next/mainline version of compcache. Instead just use version in the project repository here: hg clone https://compcache.googlecode.com/hg/ compcache This is updated much more frequently and has many more bug fixes over the mainline. It will also be easier to fix bugs/add features much more quickly in this repo rather than sending them to lkml which can take long time. Thanks, Nitin --
I'd argue that zram is pretty useless without some ability to write to a backing store, unless you *really* know what is going to be stored in it and you trust the user. Otherwise, it's just too easy to OOM the system. I've been investigating backing the xvmalloc space with a tmpfs file. Instead of keeping page/offset pairs, you just keep a linear address inside the tmpfile file. There's an extra step needed to look up and lock the page cache page into place each time you go into the xvmalloc store, but it does seem to basically work. The patches are really rough and not quite functional, but I'm happy to share if you want to see them That looks like just a clone of the code needed to build the module. Kernel developers are pretty used to _some_ kernel tree being the authoritative source. Also, having it in a kernel tree makes it possible to get testing in places like linux-next, and it makes it easier for people to make patches or kernel trees on top of your work. There's not really a point to the code being in -staging if it isn't somewhat up-to-date or people can't generate patches to it. It sounds to me like we need to take it out of -staging. -- Dave --
I will try sending patches to sync mainline and hg code (along with some changes in pipeline), or maybe just take it out of -staging and send fresh patch series. Thanks, Nitin --
Or move it to a git tree. Then generating patches becomes tivial for all of us and keeping staging upto date becomes easier for you. --
What? No, the reason we put this into the kernel was so that _everyone_ could work on it, including the original developers. Going off and doing development somewhere else just isn't ok. Should I just delete this driver from the staging tree as you don't seem to want to work with Yes, developing in your own sandbox can always be faster, but there is no feedback loop. sad, greg k-h --
Getting it out of -staging wasn't my intent. Community is the reason I was finding it real hard to find time to properly discuss each patch over LKML, so I thought of shifting focus to local project repository and then later go through proper reviews. Thanks, Nitin --
So, should I delete the version in staging, or are you going to send patches to sync it up with your development version? thanks, greg k-h --
Deleting it from staging would not help much. Much more helpful would be to sync at least the mainline and linux-next version of the driver so it's easier to develop against these kernel trees. Initially, I thought -staging means that any reviewed change can quickly make it to *both* linux-next and more importantly -staging in mainline. Working/ Testing against mainline is much smoother than against linux-next. Thanks, Nitin --
Hi, We can't push the patches immediately to mainline because we need to respect the merge window. You shouldn't need to rely on linux-next for testing, though, but work directly against Greg's staging tree. Greg, where's the official tree at, btw? The tree at http://kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/ seems empty. Pekka --
Oops, I need to update the MAINTAINERS file, the proper place for the staging tree is now in git, at git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-next-2.6.git which feeds directly into the linux-next tree. hope this helps, greg k-h --
It makes it quickly to linux-next, right? Nothing goes "quickly" to mainline, other than the normal development Sure it's "easier", but that's not how kernel development is done, sorry. See Documentation/development-process/ for details. thanks, greg k-h --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
