Nick Piggin <nickpiggin@yahoo.com.au> writes:

Well what brought this up for me was old user space code using an initial ramdisk. The actual failure that I saw occurred on the read path. And fixing init_page_buffers was the real world fix.

At the moment I'm messing with it because it has become the itch I've decided to scratch. So at the moment I'm having fun, learning the block layer, refreshing my VM knowledge and getting my head around this wreck that we call buffer_heads. The high level concept of buffer_heads may be sane but the implementation seems to export a lot of nasty state.

At this point my concern is what makes a clean code change in the kernel. User space can currently play with buffer_heads by way of the block device and cause lots of havoc (see the recent reiserfs bug in this thread). That is why I increasingly think metadata buffer_heads should not share storage with the block device page cache.

If that change is made then it happens that the current ramdisk would not need to worry about buffer heads and all of that nastiness and could just lock pages in the page cache. It would not be quite as good for testing filesystems but retaining the existing characteristics would be simple.

After having looked a bit deeper, the buffer_heads and the block devices don't look as intricately tied up as I had first thought. We still have the nasty case of:

        if (buffer_new(bh))
                unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);

I don't know how that got merged. But otherwise the caches are fully separate.

So currently it looks to me like there are two big things that will clean up that part of the code a lot:
- Moving the metadata buffer_heads to a magic filesystem inode.
- Using a simpler non-buffer_head returning version of get_block so we can make simple generic code for generating BIOs.

As a meta_data cache manager perhaps, for a translation cache we need 8 bytes per page max.
However all we need for a generic translation cache (assuming we still want one) is an array of sector_t per page.

So what we would want is:

        int blkbits_per_page = PAGE_CACHE_SHIFT - inode->i_blkbits;
        if (blkbits_per_page <= 0)
                blkbits_per_page = 0;
        sector_t *blocks = kmalloc(sizeof(sector_t) << blkbits_per_page, GFP_KERNEL);

And to remember if we have stored the translation:

        #define UNMAPPED_SECTOR ((sector_t)-1)
        ...

The core of all of this being something like:

        #define MAX_BLOCKS_PER_PAGE (1 << (PAGE_CACHE_SHIFT - 9))
        typedef int (page_blocks_t)(struct page *page,
                                    sector_t blocks[MAX_BLOCKS_PER_PAGE],
                                    int create);

Which I can agree with. By definition!

Eric
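
For illustration only, the per-page translation array and the UNMAPPED_SECTOR sentinel described above can be tried out in user space. This is a rough sketch, not kernel code: the sector_t typedef, the 4 KiB page size (SKETCH_PAGE_SHIFT), malloc() standing in for kmalloc(), and the helper names blocks_per_page, alloc_page_blocks, store_translation and is_mapped are all assumptions made for the example and do not come from the original mail.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumptions for this user-space sketch (not kernel definitions): */
typedef uint64_t sector_t;              /* stand-in for the kernel's sector_t */
#define SKETCH_PAGE_SHIFT 12            /* assumes 4 KiB pages */
#define UNMAPPED_SECTOR ((sector_t)-1)  /* sentinel: no translation stored yet */

/* Number of translation slots needed per page for a given block size. */
static unsigned int blocks_per_page(unsigned int blkbits)
{
    int shift = SKETCH_PAGE_SHIFT - (int)blkbits;

    if (shift < 0)      /* block larger than a page: keep a single slot */
        shift = 0;
    return 1u << shift;
}

/* Allocate the per-page array with every slot marked as unmapped. */
static sector_t *alloc_page_blocks(unsigned int blkbits)
{
    unsigned int i, n = blocks_per_page(blkbits);
    sector_t *blocks = malloc(sizeof(sector_t) * n);

    if (!blocks)
        return NULL;
    for (i = 0; i < n; i++)
        blocks[i] = UNMAPPED_SECTOR;
    return blocks;
}

/* Remember the sector that block number idx within the page maps to. */
static void store_translation(sector_t *blocks, unsigned int blkbits,
                              unsigned int idx, sector_t sector)
{
    assert(idx < blocks_per_page(blkbits));
    blocks[idx] = sector;
}

/* Has a translation been stored for block number idx within the page? */
static int is_mapped(const sector_t *blocks, unsigned int idx)
{
    return blocks[idx] != UNMAPPED_SECTOR;
}
```

With 1 KiB blocks (blkbits of 10) the sketch allocates four slots per page, and a slot reads as unmapped until store_translation() fills it in.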
