Linux: Inode Cache Performance

Submitted by Jeremy
on December 27, 2004 - 1:33pm

An interesting discussion on the lkml examined the efficiency of the inode cache in the 2.4 Linux kernel [forum], covering several tunables that are primarily helpful to systems serving large NFS or Samba mounts. In particular, a slowdown was reported on such a system, easily reproducible by running find / while cat'ing large files to /dev/null. In a discussion between 2.4 maintainer Marcelo Tosatti [interview], 2.6 maintainer Andrew Morton [interview] and VM maintainer Andrea Arcangeli [interview], it was suggested that this was likely due to an inode cache hash table that is too small, resulting in a large number of collisions. For the workload in question, some tunables look helpful. Going forward, effort might be made in 2.6 or beyond to improve the inode cache.

The inode cache is an in-memory hash table used by the Virtual File System (VFS), defined in fs/inode.c. Each file on a Unix filesystem has a unique inode, a data structure (defined in include/linux/fs.h) containing information about the file such as who owns it, when it was last modified, and which device it resides on. This data is required to navigate a mounted filesystem, so once an inode has been read it is stored in the inode cache for future reference, which speeds up later lookups. As the hash table fills up, multiple inodes are increasingly likely to hash to the same bucket, resulting in collisions. The growing number of collisions ultimately slows lookups, which is Andrea's explanation for the slowdown reported in the following thread.
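To make the collision argument concrete, here is a minimal sketch in C. It is a toy model, not the code in fs/inode.c: the function name and the modulo hash are assumptions made for illustration. It places sequential inode numbers into a fixed number of buckets and reports the longest chain, which is the number of entries a worst-case lookup would have to walk.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model, not kernel code: place inode numbers 1..n_inodes into
 * n_buckets chains and return the length of the longest chain. */
static unsigned long longest_chain(unsigned long n_inodes,
                                   unsigned long n_buckets)
{
    unsigned long *chain_len;
    unsigned long ino, longest = 0;

    assert(n_buckets > 0);
    chain_len = calloc(n_buckets, sizeof(*chain_len));
    if (chain_len == NULL)
        return 0; /* allocation failed; nothing measured */

    for (ino = 1; ino <= n_inodes; ino++) {
        /* Modulo hashing spreads sequential numbers perfectly evenly,
         * so this is the best case for a table of this size. */
        unsigned long bucket = ino % n_buckets;

        if (++chain_len[bucket] > longest)
            longest = chain_len[bucket];
    }
    free(chain_len);
    return longest;
}
```

Even with this ideal spread, a million cached inodes in a 1024-bucket table leave chains of nearly a thousand entries, while a table with about a million buckets keeps every chain at length one. A real hash function will generally do no better than this.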


From: James Pearson [email blocked]
To:  linux-kernel
Subject: Reducing inode cache usage on 2.4?
Date: 	Fri, 17 Dec 2004 17:26:20 +0000

I have an NFS server with 1Gb RAM running a 2.4.26 kernel with 2 XFS 
file systems with about 2 million files in total.

Occasionally I get reports that the server is 'sticky' (slow 
read/writes) and the inode cache appears to consume most of the 
available memory and doesn't appear to reduce - a typical /proc/slabinfo 
output is below.

If I run a simple application that grabs memory on the server, the inode 
and other caches are reduced and the server becomes more responsive 
(i.e. data rates to/from the server are restored to 'normal').

Is there anyway I can purge the cached inode data, or any kernel 
parameters I can tweak to limit the inode cache or flush it more frequently?

Or am I looking in completely the wrong place i.e. the inode cache is 
not the problem?

Thanks

James Pearson

/proc/slabinfo:

slabinfo - version: 1.1 (SMP)
kmem_cache           104    104    148    4    4    1 :  252  126
nfs_write_data         0      0    352    0    0    1 :  124   62
nfs_read_data          0      0    352    0    0    1 :  124   62
nfs_page               0      0     96    0    0    1 :  252  126
ip_fib_hash           10    226     32    2    2    1 :  252  126
clip_arp_cache         0      0    128    0    0    1 :  252  126
ip_mrt_cache           0      0     96    0    0    1 :  252  126
tcp_tw_bucket         40     40     96    1    1    1 :  252  126
tcp_bind_bucket      143    226     32    2    2    1 :  252  126
tcp_open_request      59     59     64    1    1    1 :  252  126
inet_peer_cache       55    236     64    4    4    1 :  252  126
ip_dst_cache         520    520    192   26   26    1 :  252  126
arp_cache             47    210    128    7    7    1 :  252  126
blkdev_requests     5120   5160     96  129  129    1 :  252  126
xfs_chashlist      35838  40560     20  240  240    1 :  252  126
xfs_ili             6664   8652    140  309  309    1 :  252  126
xfs_ifork              0      0     56    0    0    1 :  252  126
xfs_efi_item          15     15    260    1    1    1 :  124   62
xfs_efd_item          15     15    260    1    1    1 :  124   62
xfs_buf_item         130    130    148    5    5    1 :  252  126
xfs_dabuf            202    202     16    1    1    1 :  252  126
xfs_da_state           0      0    336    0    0    1 :  124   62
xfs_trans             81    143    596    9   11    2 :  124   62
xfs_inode         931428 931428    408 103492 103492    1 :  124   62
xfs_btree_cur         58     58    132    2    2    1 :  252  126
xfs_bmap_free_item    252    253     12    1    1    1 :  252  126
page_buf_t           200    200    192   10   10    1 :  252  126
linvfs_icache     931425 931425    352 84675 84675    1 :  124   62
dnotify_cache          0      0     20    0    0    1 :  252  126
file_lock_cache       80     80     96    2    2    1 :  252  126
fasync_cache           0      0     16    0    0    1 :  252  126
uid_cache              8    113     32    1    1    1 :  252  126
skbuff_head_cache    673    680    192   34   34    1 :  252  126
sock                  75     75   1216   25   25    1 :   60   30
sigqueue              58     58    132    2    2    1 :  252  126
kiobuf                 0      0     64    0    0    1 :  252  126
cdev_cache            11    177     64    3    3    1 :  252  126
bdev_cache             5    118     64    2    2    1 :  252  126
mnt_cache             19    177     64    3    3    1 :  252  126
inode_cache          217    217    512   31   31    1 :  124   62
dentry_cache      499222 518850    128 17295 17295    1 :  252  126
dquot                  0      0    128    0    0    1 :  252  126
filp                 486    600    128   20   20    1 :  252  126
names_cache            3      3   4096    3    3    1 :   60   30
buffer_head        31305  34400     96  860  860    1 :  252  126
mm_struct            120    120    160    5    5    1 :  252  126
vm_area_struct       861    880     96   22   22    1 :  252  126
fs_cache             177    177     64    3    3    1 :  252  126
files_cache           63     63    416    7    7    1 :  124   62
signal_act            72     72   1312   24   24    1 :   60   30
size-131072(DMA)       0      0 131072    0    0   32 :    0    0
size-131072            0      0 131072    0    0   32 :    0    0
size-65536(DMA)        0      0  65536    0    0   16 :    0    0
size-65536             0      0  65536    0    0   16 :    0    0
size-32768(DMA)        0      0  32768    0    0    8 :    0    0
size-32768            24     24  32768   24   24    8 :    0    0
size-16384(DMA)        0      0  16384    0    0    4 :    0    0
size-16384            16     18  16384   16   18    4 :    0    0
size-8192(DMA)         0      0   8192    0    0    2 :    0    0
size-8192              7      8   8192    7    8    2 :    0    0
size-4096(DMA)         0      0   4096    0    0    1 :   60   30
size-4096            385    385   4096  385  385    1 :   60   30
size-2048(DMA)         0      0   2048    0    0    1 :   60   30
size-2048           1952   1952   2048  976  976    1 :   60   30
size-1024(DMA)         0      0   1024    0    0    1 :  124   62
size-1024            476    476   1024  119  119    1 :  124   62
size-512(DMA)          0      0    512    0    0    1 :  124   62
size-512             344    344    512   43   43    1 :  124   62
size-256(DMA)          0      0    256    0    0    1 :  252  126
size-256             892   1335    256   89   89    1 :  252  126
size-128(DMA)          0      0    128    0    0    1 :  252  126
size-128            4087   8130    128  271  271    1 :  252  126
size-64(DMA)           0      0     64    0    0    1 :  252  126
size-64            65813  90683     64 1537 1537    1 :  252  126
size-32(DMA)           0      0     32    0    0    1 :  252  126
size-32           421038 421038     32 3726 3726    1 :  252  126

/proc/meminfo:

         total:    used:    free:  shared: buffers:  cached:
Mem:  1057779712 1034821632 22958080        0    36864 136249344
Swap: 2147459072  2015232 2145443840
MemTotal:      1032988 kB
MemFree:         22420 kB
MemShared:           0 kB
Buffers:            36 kB
Cached:         132032 kB
SwapCached:       1024 kB
Active:          29204 kB
Inactive:       113520 kB
HighTotal:      131072 kB
HighFree:         7864 kB
LowTotal:       901916 kB
LowFree:         14556 kB
SwapTotal:     2097128 kB
SwapFree:      2095160 kB
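The slabinfo listing above can be turned into memory figures with a little arithmetic. The sketch below assumes the version 1.1 column layout (name, active objects, total objects, object size, active slabs, total slabs, pages per slab) and a 4 KB page size as on x86; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define ASSUMED_PAGE_SIZE 4096ul /* assumption: 4 KB pages, as on x86 */

/* Parse one "slabinfo - version: 1.1" line and return the number of bytes
 * held by that cache's slabs, or 0 if the line does not parse. */
static unsigned long slab_cache_bytes(const char *line)
{
    char name[64];
    unsigned long active_objs, total_objs, obj_size;
    unsigned long active_slabs, total_slabs, pages_per_slab;

    assert(line != NULL);
    if (sscanf(line, "%63s %lu %lu %lu %lu %lu %lu", name,
               &active_objs, &total_objs, &obj_size,
               &active_slabs, &total_slabs, &pages_per_slab) != 7)
        return 0;

    /* Memory is allocated per slab, so count slabs rather than objects. */
    return total_slabs * pages_per_slab * ASSUMED_PAGE_SIZE;
}
```

Under those assumptions, xfs_inode works out to about 404 MB, linvfs_icache to about 331 MB and dentry_cache to about 68 MB, roughly 800 MB in total on a machine with 1 GB of RAM.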


From: Marcelo Tosatti [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Fri, 17 Dec 2004 13:12:28 -0200

Hi James,

On Fri, Dec 17, 2004 at 05:26:20PM +0000, James Pearson wrote:
> I have an NFS server with 1Gb RAM running a 2.4.26 kernel with 2 XFS
> file systems with about 2 million files in total.
>
> Occasionally I get reports that the server is 'sticky' (slow
> read/writes) and the inode cache appears to consume most of the
> available memory and doesn't appear to reduce - a typical /proc/slabinfo
> output is below.
>
> If I run a simple application that grabs memory on the server, the inode
> and other caches are reduced and the server becomes more responsive
> (i.e. data rates to/from the server are restored to 'normal').
>
> Is there anyway I can purge the cached inode data, or any kernel
> parameters I can tweak to limit the inode cache or flush it more frequently?
>
> Or am I looking in completely the wrong place i.e. the inode cache is
> not the problem?

No, in your case the extreme inode/dcache sizes indeed seem to be a problem.

The default kernel shrinking ratio can be tuned for enhanced reclaim efficiency.

> xfs_inode 931428 931428 408 103492 103492 1 : 124 62
> dentry_cache 499222 518850 128 17295 17295 1 : 252 126

vm_vfs_scan_ratio:
------------------
is what proportion of the VFS queues we will scan in one go.
A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
unused-inode, dentry and dquot caches will be freed during a
normal aging round.
Big fileservers (NFS, SMB etc.) probably want to set this
value to 3 or 2.

The default value is 6.
=============================================================

Tune /proc/sys/vm/vm_vfs_scan_ratio increasing the value to 10 and so on and
examine the results.

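The documentation quoted in this message describes vm_vfs_scan_ratio as a divisor. A minimal sketch of that arithmetic follows, assuming the kernel simply divides the number of unused entries by the ratio; the function name is hypothetical.

```c
#include <assert.h>

/* Number of unused cache entries examined in one aging pass, assuming the
 * pass covers 1/scan_ratio of them as the quoted documentation describes. */
static unsigned long entries_per_pass(unsigned long unused_entries,
                                      unsigned int scan_ratio)
{
    assert(scan_ratio > 0);
    return unused_entries / scan_ratio;
}
```

With the 931428 xfs_inode objects reported above, a ratio of 6 would cover 155238 entries per pass, a ratio of 2 would cover 465714, and a ratio of 10 only 93142. Raising the value therefore scans less per pass, not more, which is the point James raises in his reply.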
From: James Pearson [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Sat, 18 Dec 2004 00:32:54 +0000

Marcelo Tosatti wrote:
>
>>Or am I looking in completely the wrong place i.e. the inode cache is
>>not the problem?
>
>
> No, in your case the extreme inode/dcache sizes indeed seem to be a problem.
>
> The default kernel shrinking ratio can be tuned for enhanced reclaim efficiency.
>
>
>>xfs_inode 931428 931428 408 103492 103492 1 : 124 62
>>dentry_cache 499222 518850 128 17295 17295 1 : 252 126
>
>
> vm_vfs_scan_ratio:
> ------------------
> is what proportion of the VFS queues we will scan in one go.
> A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
> unused-inode, dentry and dquot caches will be freed during a
> normal aging round.
> Big fileservers (NFS, SMB etc.) probably want to set this
> value to 3 or 2.
>
> The default value is 6.
> =============================================================
>
> Tune /proc/sys/vm/vm_vfs_scan_ratio increasing the value to 10 and so on and
> examine the results.

Thanks for the info - but doesn't increasing the value of
vm_vfs_scan_ratio mean that less of the caches will be freed?

Doing a few tests (on another test file system with 2 million or so
files and 1Gb of memory) running 'find $disk -type f', with
vm_vfs_scan_ratio set to 6 (or 10), the first two column values for
xfs_inode, linvfs_icache and dentry_cache in /proc/slabinfo reach about
900000 and stay around that value, but setting vm_vfs_scan_ratio to 1,
then each value still reaches 900000, but then falls to a few thousand
and increases up to 900000 and then drop away again and repeats.

This still happens when I cat many large files (100Mb) to /dev/null at
the same time as running the find i.e. the inode caches can still reach
90% of the memory before being reclaimed (with vm_vfs_scan_ratio set to 1).

If I stop the find process when the inode caches reach about 90% of the
memory, and then start cat'ing the large files, it appears the inode
caches are never reclaimed (or longer than it takes to cat 100Gb of data
to /dev/null) - is this expected behaviour?

It seems the inode cache has priority over cached file data.

What triggers the 'normal ageing round'? Is it possible to trigger this
earlier (at a lower memory usage), or give a higher priority to cached data?

Thanks

James Pearson

From: Andrew Morton [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Fri, 17 Dec 2004 17:21:04 -0800

James Pearson [email blocked] wrote:
>
> It seems the inode cache has priority over cached file data.

It does. If the machine is full of unmapped clean pagecache pages the
kernel won't even try to reclaim inodes. This should help a bit:

--- 24/mm/vmscan.c~a	2004-12-17 17:18:31.660254712 -0800
+++ 24-akpm/mm/vmscan.c	2004-12-17 17:18:41.821709936 -0800
@@ -659,13 +659,13 @@ int fastcall try_to_free_pages_zone(zone
 
 	do {
 		nr_pages = shrink_caches(classzone, gfp_mask, nr_pages, &failed_swapout);
-		if (nr_pages <= 0)
-			return 1;
 		shrink_dcache_memory(vm_vfs_scan_ratio, gfp_mask);
 		shrink_icache_memory(vm_vfs_scan_ratio, gfp_mask);
 #ifdef CONFIG_QUOTA
 		shrink_dqcache_memory(vm_vfs_scan_ratio, gfp_mask);
 #endif
+		if (nr_pages <= 0)
+			return 1;
 		if (!failed_swapout)
 			failed_swapout = !swap_out(classzone);
 	} while (--tries);
_

> What triggers the 'normal ageing round'? Is it possible to trigger this
> earlier (at a lower memory usage), or give a higher priority to cached data?

You could also try lowering /proc/sys/vm/vm_mapped_ratio. That will cause
inodes to be reaped more easily, but will also cause more swapout.

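The effect of moving the early return can be seen in a small model of the loop. This is a toy sketch, not kernel code: the function name and the pages_left array (standing in for the values shrink_caches() would return on successive tries) are assumptions for illustration, and the swap_out() step is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the retry loop in try_to_free_pages_zone(), not kernel code.
 * pages_left[i] stands in for what shrink_caches() returns on try i
 * (<= 0 means enough pagecache was freed). Returns how many times the
 * dentry/inode shrinkers would have been called. */
static int vfs_shrink_calls(int patched, const int *pages_left, size_t tries)
{
    int calls = 0;
    size_t i;

    assert(pages_left != NULL);
    for (i = 0; i < tries; i++) {
        int nr_pages = pages_left[i];

        if (!patched && nr_pages <= 0)
            return calls;   /* original order: return before shrinking */
        calls++;            /* shrink_dcache_memory(), shrink_icache_memory() */
        if (patched && nr_pages <= 0)
            return calls;   /* patched order: shrink first, then return */
    }
    return calls;
}
```

When the first pass over the pagecache already frees enough pages, the original ordering never reaches the VFS shrinkers, while the patched ordering calls them once on every invocation. That matches Andrew's remark that a machine full of clean pagecache will not even try to reclaim inodes.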
From: Marcelo Tosatti [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Sat, 18 Dec 2004 09:02:47 -0200

James,

Can apply Andrew's patch and examine the results?

I've merged it to mainline because it looks sensible.

Thanks Andrew!

From: James Pearson [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Mon, 20 Dec 2004 13:47:46 +0000

I've tested the patch on my test setup - running a 'find $disk -type f'
and a cat of large files to /dev/null at the same time does indeed
reduce the size of the inode and dentry caches considerably - the first
column numbers for fs_inode, linvfs_icache and dentry_cache in
/proc/slabinfo hover at about 400-600 (over 900000 previously).

However, is this going a bit to far the other way? When I boot the
machine with 4Gb RAM, the inode and dentry caches are squeezed to the
same amounts, but it may be the case that it would be more beneficial to
have more in the inode and dentry caches? i.e. I guess some sort of
tunable factor that limits the minimum size of the inode and dentry
caches in this case?

But saying that, I notice my 'find $disk -type f' (with about 2 million
files) runs a lot faster with the smaller inode/dentry caches - about 1
or 2 minutes with the patched kernel compared with about 5 to 7 minutes
with the unpatched kernel - I guess it was taking longer to search the
inode/dentry cache than reading direct from disk.

James Pearson

From: Marcelo Tosatti [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Mon, 20 Dec 2004 10:46:04 -0200

On Mon, Dec 20, 2004 at 01:47:46PM +0000, James Pearson wrote:
> I've tested the patch on my test setup - running a 'find $disk -type f'
> and a cat of large files to /dev/null at the same time does indeed
> reduce the size of the inode and dentry caches considerably - the first
> column numbers for fs_inode, linvfs_icache and dentry_cache in
> /proc/slabinfo hover at about 400-600 (over 900000 previously).
>
> However, is this going a bit to far the other way? When I boot the
> machine with 4Gb RAM, the inode and dentry caches are squeezed to the
> same amounts, but it may be the case that it would be more beneficial to
> have more in the inode and dentry caches? i.e. I guess some sort of
> tunable factor that limits the minimum size of the inode and dentry
> caches in this case?

One can increase vm_vfs_scan_ratio if required, but hopefully this change
will benefit all workloads.

Andrew, Andrea, do you think of any workloads which might be hurt by this change?

> But saying that, I notice my 'find $disk -type f' (with about 2 million
> files) runs a lot faster with the smaller inode/dentry caches - about 1
> or 2 minutes with the patched kernel compared with about 5 to 7 minutes
> with the unpatched kernel - I guess it was taking longer to search the
> inode/dentry cache than reading direct from disk.

Wonderful.

From: Andrea Arcangeli [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Mon, 20 Dec 2004 16:10:45 +0100

On Mon, Dec 20, 2004 at 10:46:04AM -0200, Marcelo Tosatti wrote:
> On Mon, Dec 20, 2004 at 01:47:46PM +0000, James Pearson wrote:
> > I've tested the patch on my test setup - running a 'find $disk -type f'
> > and a cat of large files to /dev/null at the same time does indeed
> > reduce the size of the inode and dentry caches considerably - the first
> > column numbers for fs_inode, linvfs_icache and dentry_cache in
> > /proc/slabinfo hover at about 400-600 (over 900000 previously).
> >
> > However, is this going a bit to far the other way? When I boot the
> > machine with 4Gb RAM, the inode and dentry caches are squeezed to the
> > same amounts, but it may be the case that it would be more beneficial to
> > have more in the inode and dentry caches? i.e. I guess some sort of
> > tunable factor that limits the minimum size of the inode and dentry
> > caches in this case?
>
> One can increase vm_vfs_scan_ratio if required, but hopefully this change
> will benefit all workloads.
>
> Andrew, Andrea, do you think of any workloads which might be hurt by this change?

I wouldn't touch the defaults, but the sysctl is there so if you've a
strange workload you can tune for it.

There's nothing wrong with dcache/icache growing a lot. A cat of a large
file is polluting the cache, so that's not a workload that should shrink
the dcache/icache. I'd prefer a feedback based on a real useful workload
before even considering touching the defaults at this time.

From: Marcelo Tosatti [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Mon, 20 Dec 2004 13:06:34 -0200

On Mon, Dec 20, 2004 at 04:10:45PM +0100, Andrea Arcangeli wrote:
>
> I wouldn't touch the defaults, but the sysctl is there so if you've a
> strange workload you can tune for it.
>
> There's nothing wrong with dcache/icache growing a lot.

The thing is right now we dont try to reclaim from icache/dcache _at all_
if enough clean pagecache pages are found and reclaimed.

Its sounds unfair to me.

> A cat of a large file is polluting the cache, so that's not a workload that should shrink
> the dcache/icache.

Why not? If we have a lot of them they will probably be hurting performace, which seems
to be the case now.

> I'd prefer a feedback based on a real useful workload
> before even considering touching the defaults at this time.

Following this logic any workload which generates pagecache and happen
to, most times, have enough pagecache clean to be reclaimed should not
reclaim the i/dcache's. Which is not right.

But yes, feedback based on other workloads is required. I'm hoping people do test
the next 2.4.29-pre3 and send feedback.

So I'll probably revert the patch if any considerable regression is found.

From: Andrea Arcangeli [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Mon, 20 Dec 2004 18:54:09 +0100

On Mon, Dec 20, 2004 at 01:06:34PM -0200, Marcelo Tosatti wrote:
> The thing is right now we dont try to reclaim from icache/dcache _at all_
> if enough clean pagecache pages are found and reclaimed.
>
> Its sounds unfair to me.

If most ram is in pagecache there's not much point to shrink the dcache.
The more ram goes into dcache/icache, the less ram will be in pagecache,
and the more likely we'll start shrinking dcache/icache. Also keep in
mind in a highmem machine the pagecache will be in highmemory and the
dcache/icache in lowmemory (on very very big boxes the lowmem_reserve
algorithm pratically splits the two in non-overkapping zones), so
especially on a big highmem machine shrinking dcache/icache during a
pagecache allocation (because this is what the workload is doing: only
pagecache allocations) is a worthless effort.

This is the best solution we have right now, but there have been several
discussions in the past on how to shrink dcache/icache. But if we want
to talk on how to change this, we should talk about 2.6/2.7 only IMHO.

> Why not? If we have a lot of them they will probably be hurting performace, which seems
> to be the case now.

The slowdown could be because the icache/dcache hash size is too small.
It signals collisions in the dcache/icache hashtable. 2.6 with bootmem
allocated hashes should be better. Optimizing 2.4 for performance if not
worth the risk IMHO. I would suggest to check if you can reproduce in
2.6, and fix it there, if it's still there.

> Following this logic any workload which generates pagecache and happen
> to, most times, have enough pagecache clean to be reclaimed should not
> reclaim the i/dcache's. Which is not right.

This mostly happens for cache-polluting-workloads like in this testcase.
If the cache would be activated, there would be less pages in the
inactive list and you had a better chance to invoke the dcache/icache
shrinking.

From: Marcelo Tosatti [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Mon, 20 Dec 2004 13:43:34 -0200

On Mon, Dec 20, 2004 at 06:54:09PM +0100, Andrea Arcangeli wrote:
> On Mon, Dec 20, 2004 at 01:06:34PM -0200, Marcelo Tosatti wrote:
> > The thing is right now we dont try to reclaim from icache/dcache _at all_
> > if enough clean pagecache pages are found and reclaimed.
> >
> > Its sounds unfair to me.
>
> If most ram is in pagecache there's not much point to shrink the dcache.
> The more ram goes into dcache/icache, the less ram will be in pagecache,
> and the more likely we'll start shrinking dcache/icache. Also keep in
> mind in a highmem machine the pagecache will be in highmemory and the
> dcache/icache in lowmemory (on very very big boxes the lowmem_reserve
> algorithm pratically splits the two in non-overkapping zones), so
> especially on a big highmem machine shrinking dcache/icache during a
> pagecache allocation (because this is what the workload is doing: only
> pagecache allocations) is a worthless effort.
>
> This is the best solution we have right now, but there have been several
> discussions in the past on how to shrink dcache/icache. But if we want
> to talk on how to change this, we should talk about 2.6/2.7 only IMHO.
>
> > Why not? If we have a lot of them they will probably be hurting performace, which seems
> > to be the case now.
>
> The slowdown could be because the icache/dcache hash size is too small.
> It signals collisions in the dcache/icache hashtable. 2.6 with bootmem
> allocated hashes should be better. Optimizing 2.4 for performance if not
> worth the risk IMHO. I would suggest to check if you can reproduce in
> 2.6, and fix it there, if it's still there.
>
> > Following this logic any workload which generates pagecache and happen
> > to, most times, have enough pagecache clean to be reclaimed should not
> > reclaim the i/dcache's. Which is not right.
>
> This mostly happens for cache-polluting-workloads like in this testcase.
> If the cache would be activated, there would be less pages in the
> inactive list and you had a better chance to invoke the dcache/icache
> shrinking.

OK I buy your arguments

I'll revert Andrew's patch.

From: Andrea Arcangeli [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Mon, 20 Dec 2004 20:20:46 +0100

On Fri, Dec 17, 2004 at 05:21:04PM -0800, Andrew Morton wrote:
> James Pearson <james-p@moving-picture.com> wrote:
> >
> > It seems the inode cache has priority over cached file data.
>
> It does. If the machine is full of unmapped clean pagecache pages the
> kernel won't even try to reclaim inodes. This should help a bit:
>
> --- 24/mm/vmscan.c~a	2004-12-17 17:18:31.660254712 -0800
> +++ 24-akpm/mm/vmscan.c	2004-12-17 17:18:41.821709936 -0800
> @@ -659,13 +659,13 @@ int fastcall try_to_free_pages_zone(zone
>
>  	do {
>  		nr_pages = shrink_caches(classzone, gfp_mask, nr_pages, &failed_swapout);
> -		if (nr_pages <= 0)
> -			return 1;
>  		shrink_dcache_memory(vm_vfs_scan_ratio, gfp_mask);
>  		shrink_icache_memory(vm_vfs_scan_ratio, gfp_mask);
>  #ifdef CONFIG_QUOTA
>  		shrink_dqcache_memory(vm_vfs_scan_ratio, gfp_mask);
>  #endif
> +		if (nr_pages <= 0)
> +			return 1;
>  		if (!failed_swapout)
>  			failed_swapout = !swap_out(classzone);
>  	} while (--tries);

I'm worried this is too aggressive by default and it may hurt stuff. The
real bug is that we don't do anything when too many collisions happens
in the hashtables. That is the thing to work on. We should free
colliding entries in the background after a 'touch' timeout. That should
work pretty well to age the dcache proprerly too. But the above will
just shrink everything all the time and it's going to break stuff.

For 2.6 we can talk about the background shrink based on timeout.

My only suggestion for 2.4 is to try with vm_cache_scan_ratio = 20 or
higher (or alternatively vm_mapped_ratio = 50 or = 20). There's a
reason why everything is tunable by sysctl.

I don't think the vm_lru_balance_ratio is the one he's interested
about. vm_lru_balance_ratio controls how much work is being done at
every dcache/icache shrinking.

His real objective is to invoke the dcache/icache shrinking more
frequently, how much work is being done at each pass is a secondary
issue. If we don't invoke it, nothing will be shrunk, no matter what is
the value of vm_lru_balance_ratio.

Hope this helps funding an optimal tuning for the workload.

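Andrea's proposal of freeing colliding entries after a 'touch' timeout was not accompanied by code. The sketch below is one hypothetical reading of the idea, using a made-up singly linked chain structure rather than any kernel data type: entries that have not been looked up within the timeout are unlinked, but only from buckets that actually hold collisions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache entry on one hash chain; not a kernel structure. */
struct toy_entry {
    unsigned long ino;
    unsigned long last_touch;   /* time of the last lookup hit */
    struct toy_entry *next;
};

/* Unlink entries not touched within `timeout` ticks, but only from chains
 * that hold collisions (two or more entries). Assumes last_touch <= now.
 * Returns the number of entries unlinked; the caller owns their memory. */
static size_t prune_stale(struct toy_entry **head, unsigned long now,
                          unsigned long timeout)
{
    struct toy_entry **link, *e;
    size_t len = 0, removed = 0;

    assert(head != NULL);
    for (e = *head; e != NULL; e = e->next)
        len++;
    if (len < 2)
        return 0;   /* no collision on this chain, leave it alone */

    link = head;
    while ((e = *link) != NULL) {
        if (now - e->last_touch > timeout) {
            *link = e->next;    /* unlink the stale entry */
            e->next = NULL;
            removed++;
        } else {
            link = &e->next;
        }
    }
    return removed;
}
```

A background task could call something like this per bucket, so that long chains shrink by age while buckets with a single entry keep it cached. How the timeout would be chosen, and how locking would work, is left open here just as it is in the thread.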
From: James Pearson [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Tue, 21 Dec 2004 11:33:24 +0000

Andrea Arcangeli wrote:
>
> My only suggestion for 2.4 is to try with vm_cache_scan_ratio = 20 or
> higher (or alternatively vm_mapped_ratio = 50 or = 20). There's a
> reason why everything is tunable by sysctl.
>
> I don't think the vm_lru_balance_ratio is the one he's interested
> about. vm_lru_balance_ratio controls how much work is being done at
> every dcache/icache shrinking.
>
> His real objective is to invoke the dcache/icache shrinking more
> frequently, how much work is being done at each pass is a secondary
> issue. If we don't invoke it, nothing will be shrunk, no matter what is
> the value of vm_lru_balance_ratio.
>
> Hope this helps funding an optimal tuning for the workload.

Setting vm_mapped_ratio to 20 seems to give a 'better' memory usage
using my very contrived test - running a find will result in about 900Mb
of dcache/icache, but then running a cat to /dev/null will shrink the
dcache/icache down to between 100-300Mb - running the find and cat at
the same time results in about the same dcache/icache usage.

I'll give this a go on the production NFS server and I'll see if it
improves things.

Thanks

James Pearson

From: Andrea Arcangeli [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Tue, 21 Dec 2004 14:22:55 +0100

On Tue, Dec 21, 2004 at 11:33:24AM +0000, James Pearson wrote:
> Setting vm_mapped_ratio to 20 seems to give a 'better' memory usage
> using my very contrived test - running a find will result in about 900Mb
> of dcache/icache, but then running a cat to /dev/null will shrink the
> dcache/icache down to between 100-300Mb - running the find and cat at
> the same time results in about the same dcache/icache usage.
>
> I'll give this a go on the production NFS server and I'll see if it
> improves things.

Ok great. If 20 isn't enough just set it to 40, just be careful that if
you set it too high the system may swap a bit too early.

Overall this is still a workaround, real fix would be a background
scanning of the icache/dcache collisions in the hash buckets but that's
not for 2.4 ;).

From: James Pearson [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Tue, 21 Dec 2004 13:59:06 +0000

Andrea Arcangeli wrote:
> On Tue, Dec 21, 2004 at 11:33:24AM +0000, James Pearson wrote:
>
>>Setting vm_mapped_ratio to 20 seems to give a 'better' memory usage
>>using my very contrived test - running a find will result in about 900Mb
>>of dcache/icache, but then running a cat to /dev/null will shrink the
>>dcache/icache down to between 100-300Mb - running the find and cat at
>>the same time results in about the same dcache/icache usage.
>>
>>I'll give this a go on the production NFS server and I'll see if it
>>improves things.
>
>
> Ok great. If 20 isn't enough just set it to 40, just be careful that if
> you set it too high the system may swap a bit too early.

I've changed the value of vm_mapped_ratio to 20 - which has a default
value of 100 - I guess you're talking about vm_cache_scan_ratio?

I've tried changing just vm_cache_scan_ratio to 20, but it doesn't seem
to make any difference - I though a higher vm_cache_scan_ratio value
meant less is scanned?

James Pearson

From: Andrea Arcangeli [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Tue, 21 Dec 2004 15:39:54 +0100

On Tue, Dec 21, 2004 at 01:59:06PM +0000, James Pearson wrote:
> I've changed the value of vm_mapped_ratio to 20 - which has a default
> value of 100 - I guess you're talking about vm_cache_scan_ratio?

yes, I was talking about vm_cache_scan_ratio, you can combine the two
sysctl together just fine.

> I've tried changing just vm_cache_scan_ratio to 20, but it doesn't seem
> to make any difference - I though a higher vm_cache_scan_ratio value
> meant less is scanned?

The less pages are scanned, the more likely you won't free enough
pagecache, the more likely you'll shrink dcache/icache. I see why
vm_mapped_ratio makes most of the difference though and probably it's
the easier fix for your problem (though increasing vm_cache_scan_ratio
sure won't make things worse).

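The tunables discussed here are plain integer files under /proc/sys/vm on 2.4 kernels, so they can be changed by writing a number to them. A minimal C sketch of that follows; the helper names are made up, and writing to the real files requires root and a kernel that provides them.

```c
#include <assert.h>
#include <stdio.h>

/* Write an integer to a sysctl-style file. Returns 0 on success, -1 on
 * error (for example if the file does not exist or is not writable). */
static int write_tunable(const char *path, int value)
{
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    if (fprintf(f, "%d\n", value) < 0) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Read the integer back. Returns 0 on success, -1 on error. */
static int read_tunable(const char *path, int *value)
{
    FILE *f;
    int ok;

    assert(path != NULL && value != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

James's experiment would correspond to write_tunable("/proc/sys/vm/vm_mapped_ratio", 20). Reading the value first makes it easy to restore the previous setting if the change causes the earlier swapping Andrea warns about.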
From: Marcelo Tosatti [email blocked]
Subject: Re: Reducing inode cache usage on 2.4?
Date: Sat, 18 Dec 2004 13:02:26 -0200

On Sat, Dec 18, 2004 at 12:32:54AM +0000, James Pearson wrote:
> Marcelo Tosatti wrote:
> >
> >>Or am I looking in completely the wrong place i.e. the inode cache is
> >>not the problem?
> >
> >
> >No, in your case the extreme inode/dcache sizes indeed seem to be a
> >problem.
> >The default kernel shrinking ratio can be tuned for enhanced reclaim
> >efficiency.
> >
> >
> >>xfs_inode 931428 931428 408 103492 103492 1 : 124 62
> >>dentry_cache 499222 518850 128 17295 17295 1 : 252 126
> >
> >
> >vm_vfs_scan_ratio:
> >------------------
> >is what proportion of the VFS queues we will scan in one go.
> >A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
> >unused-inode, dentry and dquot caches will be freed during a
> >normal aging round.
> >Big fileservers (NFS, SMB etc.) probably want to set this
> >value to 3 or 2.
> >
> >The default value is 6.
> >=============================================================
> >
> >Tune /proc/sys/vm/vm_vfs_scan_ratio increasing the value to 10 and so on
> >and examine the results.
>
> Thanks for the info - but doesn't increasing the value of
> vm_vfs_scan_ratio mean that less of the caches will be freed?

Right - what I said was wrong - its the other way around:

Decreasing the value increases the percentage of VFS caches scanned at each "aging pass".

Now Andrew's changed the ageing round pass. Quoting him "If the machine is full of
unmapped clean pagecache pages the kernel won't even try to reclaim inodes".

vm_vfs_scan_ratio now is more meaningful.

kswapd is awaken as soon as a zone's low watermark is reached, and will work
to free pages until it reaches the zone's high watermark.

There are three zones: DMA (1) , Normal (2) and Highmem (3).

 * On machines where it is needed (eg PCs) we divide physical memory
 * into multiple physical zones. On a PC we have 3 zones:
 *
 * ZONE_DMA       < 16 MB      ISA DMA capable memory
 * ZONE_NORMAL    16-896 MB    direct mapped by the kernel
 * ZONE_HIGHMEM   > 896 MB     only page cache and user processes

So these thresolds are used to calculate each zone's min, low and high watermarks
using the following calculation (mm/page_alloc.c):

	mask = (realsize / zone_balance_ratio[j]);
	if (mask < zone_balance_min[j])
		mask = zone_balance_min[j];
	else if (mask > zone_balance_max[j])
		mask = zone_balance_max[j];
	zone->watermarks[j].min = mask;
	zone->watermarks[j].low = mask*2;
	zone->watermarks[j].high = mask*3;

To trigger the normal aging round earlier the "low" watermark has to be
increased, but you better increase the "high" watermark which makes kswapd
work up longer until such high free page watermark is reached, one can try
for example

	zone->watermarks[j].high = mask*4

But hopefully you wont need such modification (it would be nice if they were
all boot configurable BTW) with Andrew's change.

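The watermark calculation Marcelo quotes from mm/page_alloc.c can be tried out as a standalone function. The sketch below restates it outside the kernel; the struct and function names are made up for illustration, sizes are in pages, and the ratio, minimum and maximum are passed in as parameters rather than read from the kernel's per-zone arrays.

```c
#include <assert.h>

/* Illustrative container for the three per-zone thresholds, in pages. */
struct toy_watermarks {
    unsigned long min, low, high;
};

/* Same arithmetic as the snippet quoted from mm/page_alloc.c: divide the
 * zone size by the balance ratio, clamp the result, then scale it. */
static struct toy_watermarks zone_watermarks(unsigned long realsize,
                                             unsigned long balance_ratio,
                                             unsigned long balance_min,
                                             unsigned long balance_max)
{
    struct toy_watermarks w;
    unsigned long mask;

    assert(balance_ratio > 0);
    mask = realsize / balance_ratio;
    if (mask < balance_min)
        mask = balance_min;
    else if (mask > balance_max)
        mask = balance_max;

    w.min = mask;
    w.low = mask * 2;   /* kswapd is woken at this level */
    w.high = mask * 3;  /* kswapd keeps freeing until this level */
    return w;
}
```

As an example with assumed parameters (ratio 128, minimum 20, maximum 255), an 880 MB zone of 225280 pages is clamped to a mask of 255, giving watermarks of 255, 510 and 765 pages. With 4 KB pages, the high watermark is then only about 3 MB, which shows how little free memory is enough to put kswapd back to sleep on a large zone.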

Related Links:
Attachment    Size
inode.c       34.82 KB
fs.h          59.17 KB

tree instead of hashmap

Mark (not verified)
on December 29, 2004 - 2:59am

Would using a tree instead of a hash map improve this situation? I've always wondered why people use hash maps instead of trees when they know lots of data goes in.

Mark.

O()

Anonymous (not verified)
on December 29, 2004 - 7:32am

Well-balanced Trees take O(log n) to access an element and, IIRC, O(n log n) to insert one (due to necessary rebalancing), while non-crowded hash tables are O(1) for each of these operations.

So, hash tables are almost always faster than trees, especially for a large number of keys, given that (1) you have a good hash function and (2) you know beforehand how many keys you will insert at maximum so you can size the table correctly. If one or both of these conditions are not met, however, hash table performance can degenerate quickly. As with everything, it's a tradeoff: Hash tables are faster, but might deteriorate in corner cases if you're not careful.

ToDo: Timestamp-&-accessescounts--based LRU remover.

Anonymous (not verified)
on December 31, 2004 - 9:16pm

"find /mnt/huge_library/ -iname '*name-to-search*'" removes the most used inodes's keys :(

So we need a fair design for managing dirty entries to minimize this problem.

For example, use RDTSC (available from the i586 onward) for timestamping,
together with this pseudo-algorithm:

unsigned long x = rdtsc();
...
if (is_short_time(x, inode_cache->timestamp)) {
    inode_cache->accessescounts++;
    inode_cache->timestamp = x;
    ...
}
else {
    add_new_inode_cache(x, ...);
    ...
}

If there is no more memory for the hash tables, call wakeup_LRU_remover(...).

open4free ©

Inserting an element into a b

Anonymous (not verified)
on January 11, 2005 - 8:29am

Inserting an element into a balanced tree such as a red-black tree or a 2-3-4 tree takes O(log n), not O(n log n). Skiplists have similar performance but other drawbacks. The speed of hashtables is theoretically O(1), but that is rarely the case in the real world. This is mostly due to bad hash functions. Memory locations are usually very bad hash values, for instance.

And in the case of caches, it is usually pretty hard to know how much the cache will grow, so you may have to use a hash table which can grow. A size change takes O(n) time, and usually the size is changed by a constant factor. If the hash table changes size often you will have O(log n) (average) and O(n) (worst case) IF the hash table truly has O(1) performance, otherwise performance will be even worse.
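The cost of growing by a constant factor can be counted directly. The sketch below is an illustration with a made-up function name, assuming the table doubles whenever it is full and that every resize moves all current elements once; it sums those moves over n inserts.

```c
#include <assert.h>

/* Total element moves caused by resizing while inserting n keys into a
 * table that starts with `initial` slots and doubles whenever it is full. */
static unsigned long total_rehash_moves(unsigned long n, unsigned long initial)
{
    unsigned long capacity = initial;
    unsigned long size = 0, moves = 0, i;

    assert(initial > 0);
    for (i = 0; i < n; i++) {
        if (size == capacity) {
            moves += size;      /* a resize rehashes every stored element */
            capacity *= 2;
        }
        size++;
    }
    return moves;
}
```

Under this doubling policy a million inserts cause 1048575 moves in total, about one extra move per insert on average. The single insert that triggers a resize still pays the full O(n) cost, which is the worst case the comment describes.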

Trees are much slower then go

Anonymous (not verified)
on December 29, 2004 - 6:13pm

Trees are much slower than a good hash implementation. A good hash implementation takes just 2-3 memory references. And if they hit L1 cache, then a hash lookup takes <10 processor cycles, which is impossible for trees.

When choosing a dictionary implementation for some kind of cache it is almost always better to choose a hash, 'cause you may remove elements from it when required.
