kernel 2.6.22: what IS the VM doing?

Previous thread: ethernet driver: hard_start_xmit by Harsha on Thursday, August 30, 2007 - 7:24 am. (1 message)

Next thread: Re: hda: set_drive_speed_status: status=0x51 { DriveReady SeekComplete Error } by John Sigler on Thursday, August 30, 2007 - 8:30 am. (6 messages)
To: Linux kernel Mailing List <linux-kernel@...>
Date: Thursday, August 30, 2007 - 7:54 am

Using SMP kernel 2.6.22.6pre-CFS-v20.5 on Pentium D (IA-32).
I think this bug (or whatever you want to call it) got triggered
when you first allocate several megabytes of memory in a kernel module
and then free them, and then run e.g. X and when memory gets tight,
you end up with this situation...

Top 2 /proc/vmstat Biggest Winners:

pgrefill_normal:49900/second
pgrefill_high:20810/second

$ vmstat 1

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 1 326020 861476 0 7384 236 0 392 0 68 1100 1 1 45 54 0
0 1 326020 861860 0 7216 64 140 64 148 68 1137 0 0 48 51 0
0 1 326020 862696 0 6772 96 0 108 4 55 1069 0 1 48 51 0
0 3 326024 861352 0 8080 752 300 2540 308 191 1202 1 3 37 59 0
0 2 326024 857464 0 9340 2768 0 5364 40 228 1589 0 2 20 77 0
0 3 326396 857772 0 8408 1280 656 2580 664 203 1589 1 3 20 78 0
0 3 326376 855768 0 9864 1448 0 3632 0 167 1488 1 2 11 87 0
0 3 326084 854000 0 10712 1848 0 3228 0 180 1515 1 3 1 95 0
0 1 326084 854204 0 10052 628 0 776 0 98 1324 1 1 38 61 0
0 1 326084 855476 0 8644 100 0 332 52 81 1065 1 1 48 51 0
0 2 326084 856444 0 8180 88 0 396 0 84 1195 1 2 38 60 0
0 2 319072 849068 0 9572 7008 0 8548 0 1820 8460 1 18 15 68 0
2 3 303888 835724 0 8864 14900 0 15100 8 3800 15863 1 27 29 44 0

around here I rand swapoff -a

0 2 289024 822352 0 7724 14688 0 14840 4 3706 17628 1 25 13 61 0
0 4 270728 805104 0 7644 17384 0 17656 8 4388 19984 0 25 19 55 0
0 3 255216 789884 0 7296 15732 0 15892 8 3932 19146 1 22 24 54 0
2 3 241188 776476 0 6748 14192 0 14732 4 3550 16584 1 21 0 78 0
2 4 224708...

To: Linux kernel Mailing List <linux-kernel@...>
Date: Tuesday, September 4, 2007 - 9:37 pm

That means the pageout code was scanning about 70000 pages
per second on your system during peak stress. You may have
run into a scalability problem in the Linux kernel, where it
wants to clear the referenced bit off all the anonymous pages
before swapping something out.

To make matters worse, that unlucky page gets chosen because
it was the page where kswapd started scanning. It has little
to do with being the least recently used page, because every
anonymous page tends to have its referenced bit set by the time
we start scanning.

On truly enormous systems, say with 256GB of memory, kswapd
sometimes needs to scan hundreds of thousands or even millions

Is the system trying to evict pages like crazy when your
system becomes unusable?

If so, I wonder if kswapd is simply doing the wrong thing
and trying to evict data from all zones, simply because the
highmem zone is low on free pages...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
-

To: Linux kernel Mailing List <linux-kernel@...>
Date: Wednesday, September 5, 2007 - 7:45 am

Thanks for analysis...

Why turning off swap did not make any difference?

*shrug*

--
"If we put the Pentagon's personnel managers in charge of the Sahara
Desert, they would run out of sand in five years." -Sayen Report

-

To: Linux kernel Mailing List <linux-kernel@...>
Date: Wednesday, September 5, 2007 - 12:24 pm

The attached patch should make sure kswapd does not free an
excessive number of pages in zone_normal just because the
pages in zone_highmem are difficult to free.

It does give kswapd a large margin to continue putting equal
pressure on all zones in normal situations.

Sami, could you try out this patch to see if it helps your
situation?

Signed-off-by: Rik van Riel <riel@redhat.com>

To: Linux kernel Mailing List <linux-kernel@...>
Date: Wednesday, September 5, 2007 - 1:33 pm

On Wed, Sep 05, 2007 at 12:24:26 -0400, Rik van Riel wrote:

Thanks, Rik. bzImage is ready, I probably reboot inside one
month for a reason or other 8-)

--
Do what you love because life is too short for anything else.

-

To: Linux kernel Mailing List <linux-kernel@...>
Date: Wednesday, September 5, 2007 - 6:48 pm

The more I look at the bug, the more I see that it is probably
not very easy to reproduce on demand. I have, however, a full
explanation on why it happens and why the patch should fix it,
so I will submit it for inclusion in -mm.

Sami, thank you for the detailed bug report.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
-

To: Linux kernel Mailing List <linux-kernel@...>
Cc: Rik van Riel <riel@...>
Date: Friday, September 14, 2007 - 1:17 pm

Well, I now booted to x86_64 kernel.

I can still reproduce this.
When I unload ipset modules, kernel resumes "normal" operation, i.e.,
not swapping like mad.

sysrq-m, normal operation:

[172074.989053] SysRq : Show Memory
[172074.989063] Mem-info:
[172074.989071] DMA per-cpu:
[172074.989078] CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[172074.989083] CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[172074.989089] DMA32 per-cpu:
[172074.989094] CPU 0: Hot: hi: 186, btch: 31 usd: 153 Cold: hi: 62, btch: 15 usd: 10
[172074.989101] CPU 1: Hot: hi: 186, btch: 31 usd: 34 Cold: hi: 62, btch: 15 usd: 14
[172074.989109] Active:123048 inactive:55194 dirty:5 writeback:0 unstable:0
[172074.989111] free:15001 slab:27961 mapped:15063 pagetables:2996 bounce:0
[172074.989118] DMA free:5560kB min:32kB low:40kB high:48kB active:404kB inactive:736kB present:8620kB pages_scanned:0 all_unreclaimable? no
[172074.989124] lowmem_reserve[]: 0 968 968
[172074.989143] DMA32 free:54444kB min:3964kB low:4952kB high:5944kB active:491788kB inactive:220040kB present:991996kB pages_scanned:0 all_unreclaimable? no
[172074.989150] lowmem_reserve[]: 0 0 0
[172074.989166] DMA: 276*4kB 121*8kB 30*16kB 6*32kB 0*64kB 0*128kB 1*256kB 1*512kB 2*1024kB 0*2048kB 0*4096kB = 5560kB
[172074.989205] DMA32: 9467*4kB 1222*8kB 121*16kB 22*32kB 3*64kB 1*128kB 1*256kB 3*512kB 0*1024kB 1*2048kB 0*4096kB = 54444kB
[172074.989249] Swap cache: add 613353, delete 556659, find 441592/473681, race 0+5
[172074.989255] Free swap = 2751640kB
[172074.989260] Total swap = 3911784kB
[172074.989265] Free swap: 2751640kB
[172074.993693] 255744 pages of RAM
[172074.993699] 6060 reserved pages
[172074.993702] 79933 pages shared
[172074.993706] 56719 pages swap cached

then it goes bad:

[172373.542933] net/ipv4/netfilter/ip_set_nethash.c: retry: rehashing of set blockedp2pnew triggered: hashsize grows from 262144 to 288358
[172373.554837] net/...

To: Linux kernel Mailing List <linux-kernel@...>
Cc: Rik van Riel <riel@...>
Date: Friday, November 2, 2007 - 10:36 am

I now have 2GB of extra RAM, so 3GB in total, on x86_64 system.
If ipset tries to allocate 512 KB or more, kernel
goes into swapping frenzy, of which system does not recover
in over 30 minutes unless I press sysrq+k.

some kernel settings...:
vm.dirty_ratio = 4
vm.dirty_background_ratio = 2
vm.swappiness = 10
vm.vfs_cache_pressure = 10
vm.overcommit_memory = 2

SysRq : Show Blocked State
task PC stack pid father
kswapd0 D 0000000000000000 0 258 2
ffff8100be333ca0 0000000000000046 0000000000000000 0000000000000286
ffff8100be333c50 ffffffff00000000 ffff8100be560080 ffffffff807af3c0
0000000000000064 00000001285af6e6 00000000000000ff ffffffff802463f8
Call Trace:
[<ffffffff802463f8>] __mod_timer+0xb8/0xd0
[<ffffffff80657da3>] schedule_timeout+0x63/0xd0
[<ffffffff80246120>] process_timeout+0x0/0x10
[<ffffffff806576f8>] io_schedule_timeout+0x28/0x40
[<ffffffff8027f5fc>] congestion_wait+0x8c/0xb0
[<ffffffff802524e0>] autoremove_wake_function+0x0/0x40
[<ffffffff80279ad4>] throttle_vm_writeout+0x54/0xc0
[<ffffffff8027d6f3>] shrink_zone+0xe3/0x140
[<ffffffff8027dea0>] kswapd+0x510/0x5b0
[<ffffffff802524e0>] autoremove_wake_function+0x0/0x40
[<ffffffff8027d990>] kswapd+0x0/0x5b0
[<ffffffff802520dd>] kthread+0x4d/0x80
[<ffffffff8020cae8>] child_rip+0xa/0x12
[<ffffffff80252090>] kthread+0x0/0x80
[<ffffffff8020cade>] child_rip+0x0/0x12

irqbalance D 0000000000aa7f00 0 2110 1
ffff8100b8c4fcd8 0000000000000082 0000000000000000 0000000000000000
ffff8100a5e57f00 0000000800700006 ffff8100be5f91c0 ffff810060182140
ffff8100b8c4fc88 0000000000000282 ffff8100b8c4fcb8 ffffffff8040e1f6
Call Trace:
[<ffffffff8040e1f6>] __up_read+0x46/0xb0
[<ffffffff806579d8>] io_schedule+0x28/0x40
[<ffffffff802736e7>] sync_page+0x37/0x50
[<ffffffff80657e9e>] __wait_on_bit_lock+0x4e/0x80
[<ffffffff802736b0>] sync...

Previous thread: ethernet driver: hard_start_xmit by Harsha on Thursday, August 30, 2007 - 7:24 am. (1 message)

Next thread: Re: hda: set_drive_speed_status: status=0x51 { DriveReady SeekComplete Error } by John Sigler on Thursday, August 30, 2007 - 8:30 am. (6 messages)