Hello,
Prompted by a report from a user I have bisected a performance loss
apparently introduced by commit 54a6eb5c (mm: use two zonelist that are
filtered by GFP mask). The test is simple sequential writes to a 4 disk
raid5 array. Performance should be about 20% greater than 2.6.25 due to
commit 8b3e6cdc (md: introduce get_priority_stripe() to improve raid456
write performance). The sample data below shows sporadic performance
starting at 54a6eb5c. The '+' indicates where I hand applied 8b3e6cdc.
revision 2.6.25.8-fc8 2.6.25.9+ dac1d27b+ 18ea7e71+ 54a6eb5c+ 2.6.26-rc1 2.6.26-rc8
138 168 169 167 177 149 144
140 168 172 170 109 138 142
142 165 169 164 119 138 129
144 168 169 171 120 139 135
142 165 174 166 165 122 154
MB/s (avg) 141 167 171 168 138 137 141
% change 0% 18% 21% 19% -2% -3% 0%
result base good good good [bad] bad bad
Notable observations:
1/ This problem does not reproduce when ARCH=x86_64, i.e. 2.6.26-rc8 and
54a6eb5c show consistent performance at 170MB/s.
2/ Single drive performance appears to be unaffected
3/ A quick test shows that raid0 performance is also sporadic:
2147483648 bytes (2.1 GB) copied, 7.72408 s, 278 MB/s
2147483648 bytes (2.1 GB) copied, 7.78478 s, 276 MB/s
2147483648 bytes (2.1 GB) copied, 11.0323 s, 195 MB/s
2147483648 bytes (2.1 GB) copied, 8.41244 s, 255 MB/s
2147483648 bytes (2.1 GB) copied, 30.7649 s, 69.8 MB/s
System/Test configuration:
(2) Intel(R) Xeon(R) CPU 5150
mem=1024M
CONFIG_HIGHMEM4G=y (full config attached)
mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean
for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k ...(Christoph's address corrected and Andys added to cc)
That is not good at all as this patch is not a straight-forward revert but
I'm very curious as to why this doesn't affect x86_64. HIGHMEM is one
possibility if GFP_KERNEL is a major factor and it has to scan over the
unusable zone a lot. However, another remote possibility is that many function
calls are more expensive on x86 than on x86_64 (this is a wild guess based
on the registers available). Spectulative patch is below.
If 8b3e6cdc is reverted from 2.6.26-rc8, what do the figures look like?
i.e. is the zonelist filtering looking like a performance regression or is
Are these synced writes? i.e. is it possible the performance at the end
There was a deporkify patch which replaced inline function with normal
functions. I worried at the time that all the function calls in an
iterator may cause a performnace problem but I couldn't measure it so
assumed the reduction in text size was a plus. This is a partial
reporkify patch that should reduce the number of function calls that
take place at the cost of larger text. Can you try it out applied
against 2.6.26-rc8 please?
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/include/linux/mmzone.h linux-2.6.26-rc8-repork/include/linux/mmzone.h
--- linux-2.6.26-rc8-clean/include/linux/mmzone.h 2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-repork/include/linux/mmzone.h 2008-07-01 00:49:17.000000000 -0700
@@ -742,6 +742,15 @@ static inline int zonelist_node_idx(stru
#endif /* CONFIG_NUMA */
}
+static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+ return node_isset(zonelist_node_idx(zref), *nodes);
+#else
+ return 1;
+#endif /* CONFIG_NUMA */
+}
+
/**
* next_zones_zonelist - Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
* @z - The cursor used as a starting point for the search
@@ -754,10 +763,26 @@ ...Looking at the commit in question (54a6eb5c) there is one slight anomoly
in the conversion. When nr_free_zone_pages() was converted to the new
iterators it started using the offset parameter to limit the zones
traversed; which is not unreasonable as that appears to be the
parameters purpose. However, if we look at the original implementation
of this function (reproduced below) we can see it actually did nothing
with this parameter:
static unsigned int nr_free_zone_pages(int offset)
{
/* Just pick one node, since fallback list is circular */
unsigned int sum = 0;
struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
struct zone **zonep = zonelist->zones;
struct zone *zone;
for (zone = *zonep++; zone; zone = *zonep++) {
unsigned long size = zone->present_pages;
unsigned long high = zone->pages_high;
if (size > high)
sum += size - high;
}
return sum;
}
This version of the routine would only ever accumulate the free space as
found on zones in the GFP_KERNEL zonelist, all zones other than HIGHMEM,
regardless of its parameters.
Looking at its callers, there are only two:
unsigned int nr_free_buffer_pages(void)
{
return nr_free_zone_pages(gfp_zone(GFP_USER));
}
unsigned int nr_free_pagecache_pages(void)
{
return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
}
Before this commit both would return the same value. Following it,
nr_free_pagecache_pages() would now contain the number of pages in
HIGHMEM as well. This is used to initialise vm_total_pages, which is
used in a number of the heuristics related to reclaim.
vm_total_pages = nr_free_pagecache_pages();
If the issue was low memory pressure (which would not be an unreasonable
conjecture given the large number of internal I/O's we may generate
with RAID) we may well make different reclaim decisions before and after
this commit. It should also be noted that as this discrepancy would only
appear where there is HIGHMEM we might ...This looks kinda promising and depends heavily on how this patch was tested in isolation. Dan, can you post the patch you use on 2.6.25 because the commit in question should not have applied cleanly please? To be clear, 2.6.25 used the offset parameter correctly to get a zonelist with the right zones in it. However, with two-zonelist, there is only one that gets filtered so using GFP_KERNEL to find a zone is equivilant as it gets filtered based on offset. However, if this patch was tested in isolation, it could result in bogus values of vm_total_pages. Dan, can you confirm in your dmesg logs that the line like the following has similar values please? Built 1 zonelists in Zone order, mobility grouping on. Total pages: 258544 Right now though 2.6.26-rc8 and 2.6.25 report the same values for "Total pages" in the dmesg at least under qemu running with 1GB. If -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
The system is booted with mem=1024M on the kernel command line and with
or without Andy's patch this reports:
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 227584
Performance is still sporadic with the change. Moreover this condition
is reproducing even with CONFIG_NOHIGHMEM=y.
Let us take commit 8b3e6cdc out of the equation and just look at raid0
performance:
revision 2.6.25.8-fc8 54a6eb5c 54a6eb5c-nohighmem 2.6.26-rc8
279 278 273 277
281 278 275 277
281 113 68.7 66.8
279 69.2 277 73.7
278 75.6 62.5 80.3
MB/s (avg) 280 163 191 155
% change 0% -42% -32% -45%
result base bad bad bad
These numbers are taken from the results of:
for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
Where md0 is created by:
mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 0
I will try your debug patch next Mel, and then try to collect more data
with blktrace.
--
Dan
--
Ok, based on your other mail, 54a6eb5c here is a bisection point. The good figures are on par with the "good" kernel with some disasterous runs leading to a bad average. The thing is, the bad results are way worse than could be accounted for by two-zonelist alone. In fact, the figures look suspiciously like only 1 disk is in use as they are roughly quartered. Can you think of anything that would explain that? Can you also confirm that using a bisection point before two-zonelist runs steadily and with high performance as expected please? This is to rule out some other RAID patch being a factor. It would be worth running vmstat during the tests so we can see if IO rates are dropping from an overall system perspective. If possible, oprofile data from the same time would be helpful to see does it show up I tried reproducing this but I don't have the necessary hardware to even come close to reproducing your test case :( . I got some dbench results with oprofile but found no significant differences between 2.6.25 and 2.6.26-rc8. However, I did find the results of dbench varied less between runs with the "repork" patch that made next_zones_zonelist() an inline function. Have you tried that patch yet? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
raid0 in contrast to raid5 is fairly simple it just remaps the bio. Then it is up to the block layer to do some merging before it hits the block device driver. Not much room for error, especially since single drive performance seems to be unaffected. The data seems to show that The immediately preceeding commit 18ea7e71 always runs with high performance, and raid0 was virtually untouched during the merge window: $ git shortlog v2.6.25..v2.6.26-rc8 drivers/md/raid0.c Neil Brown (1): I watched /proc/vmstat during a dd run, and in the good case nr_writeback never drops below 21,000. In the bad case it may briefly break a few hundred pages. This picture generated by seekwatcher shows that there are gaps in i/o submission: http://userweb.kernel.org/~djbw/md0-bad2.png Compared to a good run: http://userweb.kernel.org/~djbw/md0-good.png What is interesting is that I was only able to get this data with the following command: ./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb When I tried to watch all four disks, i.e. 5 blktrace tasks running instead of 2: ./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb -d /dev/sdc -d /dev/sdd -d /dev/sde ...it always measured good performance, as if having more tasks running stimulates the vm to keep writeback going at a high rate. Could writeback be getting stalled on one cpu or more cpu's? Oprofile only seems to show that we are idling in the failure case: ---good--- samples % image name symbol name 18229 20.7171 vmlinux-18ea7e71 __copy_from_user_ll_nocache_nozero 4518 5.1347 vmlinux-18ea7e71 clear_user 2137 2.4287 vmlinux-18ea7e71 get_page_from_freelist 1929 2.1923 vmlinux-18ea7e71 constant_test_bit 1745 1.9832 vmlinux-18ea7e71 clear_bit 1320 1.5002 vmlinux-18ea7e71 kmem_cache_alloc 1290 1.4661 ...
One-line-summary: kswapd is almost certainly broken on !NUMA and a candidate
patch is below
Yes, we're stalling because the two-zonelist patch is breaking kswapd on
!NUMA machines. The impact is that kswapd never reclaims and processes always
directly reclaim. The result is you see large stalls.
Alexander, I believe this may also be why lockdep is triggering and you're
seeing periodic stalls in the XFS tests. The lockdep warning is a red herring
but the fact it's easier to trigger could be because kswapd is not working
and direct reclaim triggers the warning easily.
Alexander and Dan, could you test the patch below and let me know if it
fixes your respective problems please?
===
Subject: [PATCH] Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr
With the two-zonelist patches on !NUMA machines, there really is only one
zonelist as __GFP_THISNODE is meaningless. However, during initialisation, the
assumption is made that two zonelists exist when initialising zlcache_ptr. The
result is that pgdat->nr_zones is always 0. As kswapd uses this value to
determine what reclaim work is necessary, the result is that kswapd never
reclaims. This causes processes to stall frequently in low-memory situations
as they always direct reclaim. This patch initialises zlcache_ptr correctly.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
--- linux-2.6.26-rc8-clean/mm/page_alloc.c 2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c 2008-07-02 21:14:16.000000000 -0700
@@ -2328,7 +2328,8 @@ static void build_zonelists(pg_data_t *p
static void build_zonelist_cache(pg_data_t *pgdat)
{
pgdat->node_zonelists[0].zlcache_ptr = NULL;
- pgdat->node_zonelists[1].zlcache_ptr = NULL;
+ if ...This makes no sense. That whole thing is inside a #ifdef CONFIG_NUMA ... numa code .. #else ... this code .. #endif so CONFIG_NUMA will _not_ be set, and NUMA_BUILD is always 0. So why do that if (NUMA_BUILD) .. at all, when it is known to be false? So the patch may be correct, but wouldn't it be better to just remove the line entirely, instead of moving it into a conditional that cannot be true? Also, I'm not quite seeing why those zonelists should be zeroed out at all. Shouldn't a non-NUMA setup always aim to have node_zonelists[0] == node_zonelists[1] == all appropriate zones? I have to say, the whole mmzoen thing is confusing. The code makes my eyes bleed. I can't really follow it. Linus --
Because I'm a muppet and a bit cross-eyed from looking at this until the
problem would reveal itself. I knew this is needed fixing but choose the
node_zonelists[1] doesn't exist at all on non-NUMA so there is only
I'm looking at this too long to comment with anything other than a blank
stare :/
====
Subject: [PATCH] Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr
With the two-zonelist patches on !NUMA machines, there really is only one
zonelist as __GFP_THISNODE is meaningless. However, during initialisation, the
assumption is made that two zonelists exist when initialising zlcache_ptr. The
result is that pgdat->nr_zones is always 0. As kswapd uses this value to
determine what reclaim work is necessary, the result is that kswapd never
reclaims. This causes processes to stall frequently in low-memory situations
as they always direct reclaim. This patch initialises zlcache_ptr correctly.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
page_alloc.c | 1 -
1 file changed, 1 deletion(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
--- linux-2.6.26-rc8-clean/mm/page_alloc.c 2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c 2008-07-02 21:49:09.000000000 -0700
@@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *p
static void build_zonelist_cache(pg_data_t *pgdat)
{
pgdat->node_zonelists[0].zlcache_ptr = NULL;
- pgdat->node_zonelists[1].zlcache_ptr = NULL;
}
#endif /* CONFIG_NUMA */
--
Bug squished. # for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done 2048+0 records in 2048+0 records out 2147483648 bytes (2.1 GB) copied, 7.73352 s, 278 MB/s 2048+0 records in 2048+0 records out 2147483648 bytes (2.1 GB) copied, 7.6845 s, 279 MB/s 2048+0 records in 2048+0 records out 2147483648 bytes (2.1 GB) copied, 7.74428 s, 277 MB/s 2048+0 records in 2048+0 records out 2147483648 bytes (2.1 GB) copied, 7.65959 s, 280 MB/s 2048+0 records in 2048+0 records out 2147483648 bytes (2.1 GB) copied, 7.73107 s, 278 MB/s Tested-by: Dan Williams <dan.j.williams@intel.com> --
What a convoluted description. Simply put: We clobber the nr_zones field because we write beyond the bounds of the node_zonelists[] array in struct pglist_data. --
Subject: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation
The nr_zones field is getting clobbered due to a write beyond the bounds of
the node_zonelists[] array in struct pglist_data.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/page_alloc.c | 1 -
1 file changed, 1 deletion(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
--- linux-2.6.26-rc8-clean/mm/page_alloc.c 2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c 2008-07-02 21:49:09.000000000 -0700
@@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *p
static void build_zonelist_cache(pg_data_t *pgdat)
{
pgdat->node_zonelists[0].zlcache_ptr = NULL;
- pgdat->node_zonelists[1].zlcache_ptr = NULL;
}
#endif /* CONFIG_NUMA */
--
Heh. I already applied it as ObviouslyCorrect(tm), but did the
simplification I already pointed out (and which your second version
already had) and rewrote your commit message a bit. So it's now committed
as follows..
Linus
---
commit 494de90098784b8e2797598cefdd34188884ec2e
Author: Mel Gorman <mel@csn.ul.ie>
Date: Thu Jul 3 05:27:51 2008 +0100
Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr
The non-NUMA case of build_zonelist_cache() would initialize the
zlcache_ptr for both node_zonelists[] to NULL.
Which is problematic, since non-NUMA only has a single node_zonelists[]
entry, and trying to zero the non-existent second one just overwrote the
nr_zones field instead.
As kswapd uses this value to determine what reclaim work is necessary,
the result is that kswapd never reclaims. This causes processes to
stall frequently in low-memory situations as they always direct reclaim.
This patch initialises zlcache_ptr correctly.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Dan Williams <dan.j.williams@intel.com>
[ Simplified patch a bit ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
mm/page_alloc.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2f55295..f32fae3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *pgdat)
static void build_zonelist_cache(pg_data_t *pgdat)
{
pgdat->node_zonelists[0].zlcache_ptr = NULL;
- pgdat->node_zonelists[1].zlcache_ptr = NULL;
}
#endif /* CONFIG_NUMA */
--
Side note: master.kernel.org is going through maintenance today, so I haven't actually been able to push it out yet, in case anybody wonders why it's not showing up on the public sites. It will all be out there later today. Linus --
Perfect. The log even looks like it was written by a sane person. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Great news. Dan, thanks a lot for reporting and persisting with the testing of various bits and pieces to get this pinned down. It is greatly appreciated. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
? I have not tested this patch (54a6eb5c) in isolation. The patch that was applied to 2.6.25 was the raid5 performance enhancement patch. It is no longer in the equation since the problem is reproducible with stock raid0. -- Dan --
