I'd like to get a first round of review on my AXFS filesystem. This is a simple
read-only compressed filesystem like Squashfs and cramfs. AXFS is special
because it also allows for execute-in-place of your applications. It is a major
improvement over the cramfs XIP patches that have been floating around for ages.
The biggest improvement is in the way AXFS allows for each page to be XIP or
not. First, a user collects information about which pages are accessed on a
compressed image for each mmap()ed region from /proc/axfs/volume0. That
'profile' is used as an input to the image builder. The resulting image has
only the relevant pages uncompressed and XIP. The result is smaller memory
sizes and faster launches.
See http://axfs.sourceforge.net for more info.
fs/Kconfig | 21 +
fs/Makefile | 1
fs/axfs/Makefile | 7
fs/axfs/axfs_bdev.c | 158 ++++++++
fs/axfs/axfs_inode.c | 490 ++++++++++++++++++++++++++
fs/axfs/axfs_mtd.c | 233 ++++++++++++
fs/axfs/axfs_profiling.c | 594 +++++++++++++++++++++++++++++++
fs/axfs/axfs_super.c | 866 ++++++++++++++++++++++++++++++++++++++++++++++
fs/axfs/axfs_uml.c | 47 ++
fs/axfs/axfs_uncompress.c | 97 +++++
include/linux/axfs.h | 358 +++++++++++++++++++
11 files changed, 2872 insertions(+)
--
Jared, nice work! I've also read your paper from the Linux Symposium
(http://ols.fedoraproject.org/OLS/Reprints-2008/hulbert-reprint.pdf).

A few questions:
- How does this benchmark compare to cramfs and squashfs in a NAND-only
  system? Or is it just not a good plan to use this with NAND-only? (Of course
  I won't get XIP with NAND, I understand that.)
- Would axfs be suitable as a filesystem on a RAM disk?

Background for the last question is that if you do not have the memory to
retain all pages uncompressed (as you would with ramfs), this could be a nice
intermediate format. Furthermore, compared to ramfs, a filesystem on a ramdisk
does not need the initialisation during startup (decompressing the cpio file,
creating the files, copying the data), so when it comes to boot times a
filesystem on a ramdisk (e.g. axfs) could be a better choice.

Appreciate your feedback.
Frans.
--
> Jared, nice work! I don't know, I'm interested to find out. I just benchmarked that. Actually it should work very well as a NAND-only fs. Also you do get something like XIP with NAND. If you boot an XIP AXFS image on NAND or a blkdev it will copy that XIP region into RAM and "XIP" it from there. I think this will make it very good for LiveCD's. Though we just (minutes ago) realized our testing of that feature was flawed, so It could be. I plan on implementing support for brd. That might work nicely. --
I was going to take a look at this too. With any luck, it should be little effort required as it looks like you have the block device support in place? This filesystem actually should in theory work fairly well with brd, because then we wouldn't have to bring the data over into pagecache for frequently used pages but we can retain the compressed storage for the infrequently used stuff. I say in theory because I don't know of any serious users (except kernel testing) of brd :) --
FWIW, I'm not sure it's a good idea to name this new filesystem AXFS. People are almost certainly going to confuse it with XFS despite the filesystems being aimed at diametrically opposed ends of the storage spectrum. ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com --
In principle I think you are right. AXFS and XFS are similar names and it could lead to confusion. I think XFS should change its name to prevent confusion. I think by 2 years AXFS will be used in orders of magnitude more machines anyway. ;) About opposite end of the spectrum... Carsten just said AXFS might be nice for s390, so I'm not sure how true that is. I'm kind of attached to the name now. --
Hello Jared, People that care about their filesystem choice know their choices. People that don't care, well they don't care. Maybe AXIPFS would be the close alternative. One question on the use-case profiling and subsequent image rebuild: What if the use-case did not cover all cases of XIP use? If a compressed page is attempted to be executed, will the filesystem fall back to decompression to RAM and execution from RAM, or will this result in a faulty system? The design choices look real good. Congrats on the achievement. Regards, -- Leon --
No, this will not result in a faulty system. It is perfectly acceptable to have all pages in a file XIP, no pages in a file XIP, thanks! --
You probably want to read the paper at http://ols.fedoraproject.org/OLS/Reprints-2008/hulbert-reprint.pdf BTW, I regret now not having attended the OLS presentation, because there was so much emphasis on `XIP' in the description :-) Fortunately it's been recorded: http://free-electrons.com/community/videos/conferences/ so I'm gonna watch it right now... With kind regards, Geert Uytterhoeven Software Architect Sony Techsoft Centre Europe The Corporate Village
I like the general approach of it. It's much more flexible than the ext2 extension I've done, and the possibility to select XIP vs. compression per page is really really neat. I can imagine that people will prefer this over the ext2 implementation on s390. It is unclear to me how the "secondary block device" thing is supposed to work. Could you elaborate a bit on that? --
Agreed. I haven't had a good look through it yet, but at a glance it looks pretty neat. The VM side of things looks pretty reasonable (I fear XIP faulting might have another race or two, but that's a core mm issue rather than filesystem specific). --
Yes, I also like the file system, I guess this is 2.6.28 material and you should have it added to linux-next when you have addressed the comments so far. One thing that would be really nice is if you could add fake-write support in the way that I proposed for cramfs a few months ago. This would make axfs much more interesting for another set of users, and keep cramfs a really simple example file system. Arnd <>< --
No, there were a few remaining issues that I never found the time to work on. Arnd <>< --
How might I design a test to flush those bugs out? We haven't seen any. --
Not quite sure yet. I just fixed a couple of easy ones, but there could be some more lurking. Don't be too worried about it yet, I was just musing to myself there really :) --
First off we don't yet support direct_access(), but I am planning on that soon. Sure. For a system that has, say, a NOR flash and a NAND or an embedded MMC, one can split a filesystem image such that only the XIP parts of the image are on the NOR while the compressed bits are on the NAND / eMMC. The NOR part is accessed as directly addressable memory, while the NAND would use mtd->read() and the eMMC would use the block device access APIs. In this case I would call the NAND or eMMC the "secondary device" because the primary device is the NOR. Assuming my NOR was at /dev/mtd2 and my NAND at /dev/mtd5, I would call the following to mount such a system:
    mount -t axfs -o second_dev=/dev/mtd5 /dev/mtd2 /mnt/axfs
--
Sounds great, really nice idea. How does it fare with no MMU? Can the profiler and image builder lay out the XIP pages in such a way that no-MMU mmaps can map those regions? No complaint if not, it would be a nice bonus though. -- Jamie --
Sorry. I don't believe it will work on no-MMU as is. That said you _could_ tweak the mkfs tool to lay mmap()'ed regions down contiguously but then if you mmap() an unprofiled region, well that would be bad. I suppose you could make axfs_mmap smart enough to handle that. I guess the cleanest way would be to just make files lay down contiguously, you lose some of the space saving but it would work. I'm not planning to get to this anytime soon. But I'd be willing to merge patches. Can anybody convince me offline that working on no-MMU makes financial sense for my employer? This is getting to be a common question. How many noMMU users are out there and why are you so interested? --
Hi Jared, That would be enough I think. If you could manually select which files are contiguous-and-uncompressed that would be One of those unknown factors, how many are there? Who knows, pretty much impossible to tell. One thing for sure is that many people who do non-MMU setups are interested in XIP to get the space savings. These are very often small devices with very constrained RAM and flash. (For whatever it is worth single NOR flash only boards are common in these smaller form factors :-) Regards Greg ------------------------------------------------------------------------ Greg Ungerer -- Chief Software Dude EMAIL: gerg@snapgear.com Secure Computing Corporation PHONE: +61 7 3435 2888 825 Stanley St, FAX: +61 7 3891 3630 Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com --
So.... If you don't have an MMU when do you call ->fault? Does the True. --
Hi Jared,

Sort of. It actually just uses a single ->read to bring in the entire file contents. There are a few limitations on the use of mmap() for non-mmu. Documentation/nommu-mmap.txt gives more details.

With no MMU it does rely on being able to kmalloc()

Regards
Greg
------------------------------------------------------------------------
Greg Ungerer -- Chief Software Dude EMAIL: gerg@snapgear.com
SnapGear -- a Secure Computing Company PHONE: +61 7 3435 2888
825 Stanley St, FAX: +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com
--
That's unfortunate, if you're using FDPIC-ELF or BFLT-XIP, you really want to kmalloc() one region for code (i.e. mmap not the whole file), and one separate for data. Asking for a single larger region sometimes creates much higher memory pressure while kmalloc() attempts to defragment by evicting everything. But that's fiddly to do right in general. The natural thing for AXFS to do to support no-MMU FDPIC-ELF or BFLT-XIP is store the code segment uncompressed and contiguous, and the data segment however the filesystem prefers, and the profiling information to work out where these are is readily available from the mmap() calls, which are always the same when an executable is run. -- Jamie --
Hi Jamie, That is what the BFLT loader does. For the XIP case it mmap()s the text directly from the file, and then mmap()s a second region for the data/bss (reading the data into that region). I was referring to general mmap() of a file case above, not Yep. Regards Greg ------------------------------------------------------------------------ Greg Ungerer -- Chief Software Dude EMAIL: gerg@snapgear.com SnapGear -- a Secure Computing Company PHONE: +61 7 3435 2888 825 Stanley St, FAX: +61 7 3891 3630 Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com --
I'm using XIP on a device with 32MB RAM. The reason I use it is _partly_ to save RAM, partly because programs start about 10 times faster (reading NOR flash is slow and I keep the XIP region in RAM) and partly because it reduces memory fragmentation. -- Jamie --
What kind of NOR are you using? That is not what I measure with fast synchronous burst NORs. --
I think the "fast" in "fast synchronous" gives it away :-) I'm using Spansion MirrorBit S29GL128N, which reads at about 0.6 MByte/s. Not because they're good, but because that's what the board I'm coding for has on it. I presume they were cheap and familiar to the board designers. (There is 32MB of RAM to play with after all.) So starting a sequence of Busybox processes from a shell script is noticeable, if it reads from NOR each time. Oh, and it's a 166MHz ARM, so it's quite capable of decompressing faster than the NOR can deliver. -- Jamie --
By the way, what speeds do you get on fast synchronous burst NORs - and which chips are those? Thanks, -- Jamie --
I am only familiar with the Numonyx product line up. If you are using a GL, you'll probably find our P33 a good fit and at competitive prices to GL as I understand it. That's I think 50MHz. M18 is 100MHz, maybe a little higher. And we just announced our LPDDR 266 part, Velocity LP.

A good way of making a rough estimate of read performance is to measure a cache miss latency and convert that to bandwidth. It's usually fairly close.

    32Byte cache line size / 16 bit bus = 16 word transfers
    memory controller latency (time from Load instruction to bus activity) = ~300ns (up to 450ns for some processors)
    initial latency (time to read first word) = ~100ns (60ns - 120ns)
    clock frequency (time between words) = 50MHz = 20ns per word
    bus clean up = ~50ns

    32Bytes = 300ns + 100ns + (16 - 1) * 20ns + 50ns = 750ns
            = 32B/750ns = ~40MB/s

This is a very simple model and reality is much more nuanced. You also need to check my assumed numbers against the reality of your system. Also this doesn't take copying the data to RAM into account, which is usually what you are really measuring. That's easy to model though. A rule of thumb is to say that copying to RAM will reduce this value by less than 50% because RAM should be at least a little faster than NOR.

Nevertheless, if you can't use a simple calculation like this to explain the numbers then you have poorly configured bus timings or have your cache off. Cache is important because without it the equation would look like this, or worse:

    4Bytes = 300ns + 100ns + 1 * 20ns + 50ns = 470ns
           = 4B/470ns = ~8MB/s

For a PXA270, if you go with the defaults it can look like this:

    2Bytes = 300ns + 250ns + 250ns = 800ns = ~2MB/s

So, if you are only getting 0.6MB/s out of your NOR... you're using it wrong.
--
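For anyone who wants to plug in their own timings, the rough model above can be sketched in a few lines of Python. This is only a back-of-the-envelope calculator under the same assumed numbers (300ns controller latency, 100ns initial access, 50MHz burst clock, 50ns bus clean up, 16-bit bus); the function and parameter names are made up for illustration and do not come from any datasheet or tool.

```python
def access_time_ns(bytes_per_access, bus_width_bits=16,
                   controller_latency_ns=300.0, initial_latency_ns=100.0,
                   clock_mhz=50.0, bus_cleanup_ns=50.0):
    """Time for one burst read, using the simple model from the mail above."""
    words = bytes_per_access * 8 // bus_width_bits   # bus transfers per access
    ns_per_word = 1000.0 / clock_mhz                 # 50 MHz -> 20 ns per word
    # The first word costs the initial latency; each further word costs one clock.
    return (controller_latency_ns + initial_latency_ns
            + (words - 1) * ns_per_word + bus_cleanup_ns)


def bandwidth_mb_s(bytes_per_access, **timings):
    """Convert the per-access time into MB/s (1 byte/ns equals 1000 MB/s)."""
    return bytes_per_access / access_time_ns(bytes_per_access, **timings) * 1000.0


if __name__ == "__main__":
    # Cached: a 32-byte cache line fill. Uncached: a single 4-byte load.
    for size in (32, 4):
        print(size, "bytes:", access_time_ns(size), "ns,",
              round(bandwidth_mb_s(size), 1), "MB/s")
```

With the default numbers this gives 750ns for a 32-byte line fill and 470ns for a 4-byte uncached read, the same totals as in the mail. Changing `controller_latency_ns` or `clock_mhz` to match a particular board shows quickly which term dominates.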
What's a GL? Never heard of it - all I can think of is OpenGL :-) I'm using a Sigma Designs 862x media processor. It clocks at 166MHz to main RAM, has an ARM internally to run Linux, and the intensive work happens in coprocessors. The NOR is not on the RAM bus, it's on a "peripheral bus". About the only thing I know about the bus is it's 16 bits wide - I have the schematic, but only the board supplier has I'm not sure if cache is an option with this device - but would it make a difference anyway? Launching executables like Busybox - those are much larger than the cache anyway, so launch time is dominated by bulk streaming copy speed. Thanks for the idea, I'll look into whether it's possible to access this 'peripheral bus' through the Interesting, thanks. I'm not sure it's possible to change the way NOR is being used with this chip, and it'll be a while before it's economical to replace the board with a new design. This is all very interesting - I had no prior experience with NOR, so didn't know that 0.6MB/s was slow. It's fast compared with older EEPROMs after all, and I had imagined that people wanting fast flash would use NAND. On looking at the datasheet, I see it's quite a lot faster. I'm suspecting the Sigma Designs peripheral bus and the way it's wired up are not doing it any favours. We already have the weirdness that we have to patch the Linux CFI-0002 MTD code: the CPU locks up when polling the erase status byte, until erase is finished. Unfortunately this is difficult to change now - I'm programming hardware which is already out in the field and cannot be redesigned. Thanks for your thoughts. -- Jamie --
Well the first read takes 100ns (plus the other chipset overhead 300ns) but other reads in a page are only an extra 25ns each. So your benefit is not from having the entire executable in cache, it's from having the next 7 instructions in the cacheline for only an extra 25ns. Usually these things can be fixed in the bootloader or by hacking the kernel to tweak the relevant chipset registers. --
I'm using a S29GL064N chip. Going through Linux and /dev/mtd I get 13.5 MB/sec and reading directly from the chips gives 15 MB/sec. I've not mapped the chip cached and I'm not using the page burst mode. That would help a lot certainly, but the current flash speed isn't much of a bottleneck. --
NAND is significantly faster when writing than NOR, read speed is of the same magnitude, possibly slower in many cases. /Ricard -- Ricard Wolf Wanderlöf ricardw(at)axis.com Axis Communications AB, Lund, Sweden www.axis.com Phone +46 46 272 2016 Fax +46 46 13 61 30 "With Free Software you are employing the best programmers on the planet" --
Right. Specifically, read bandwidth is on the same order of magnitude. However the read latency of NAND is a couple orders of magnitude higher (100ns vs 20,000ns) so it depends on what you are doing. --
> I think the "fast" in "fast synchronous" gives it away :-) I think you should get more like an order of magnitude higher.... Get an expert to look at your timings in the bootloader. Make sure things Depends on how you are measuring it. You ought to be able to get at least 2 orders of magnitude higher read speeds with a good sync Flash. Some of the newer stuff is even faster. --
Yes, looking at the Spansion datasheet, if it were interfaced properly it should be quite fast. (25ns access time for in-page 16-bit reads, 100ns for random reads). I'll see if ioremap_cached() makes a difference to streaming read performance. The BSP suppliers have been quite cautious in places, flushing cache a bit too often. (I'm not surprised - we had disk ext3 filesystem Thanks. Oh, how I look forward to the day of working with current kernels and current hardware. -- Jamie --
The key for XIP on noMMU would be the ability to store a file as one complete contiguous chunk. Can AXFS do this? Regards Greg ------------------------------------------------------------------------ Greg Ungerer -- Chief Software Dude EMAIL: gerg@snapgear.com Secure Computing Corporation PHONE: +61 7 3435 2888 825 Stanley St, FAX: +61 7 3891 3630 Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com --
Or more generally, the mmap'd parts of a file. XIP doesn't mmap the whole file, it just maps the code and rodata. The data segment is copied. AXFS's magic for keeping parts of the file uncompressed, but parts compressed, would be good for this - both for space saving, and also because decompressing compressed data from NOR is faster than reading uncompressed data. -- Jamie --
Hi Jared,

The version in SVN seems to be slightly older than the one you submitted?

Which platform(s) do you use for testing?

I gave AxFS a try on PS3 (ppc64, always use big-endian 64-bit for testing new code ;-). When mounting the image, I got the crash below:

| attempt to access beyond end of device
| loop0: rw=0, want=4920, limit=4912
| Unable to handle kernel paging request for data at address 0x00000028
| Faulting instruction address: 0xd000000000037988
| Oops: Kernel access of bad area, sig: 11 [#1]
| SMP NR_CPUS=2 PS3
| Modules linked in: axfs zlib_inflate nfsd exportfs dm_crypt dm_mod sg joydev evdev
| NIP: d000000000037988 LR: d000000000037974 CTR: 0000000000000000
| REGS: c00000000c1e3240 TRAP: 0300 Not tainted (2.6.27-rc4-dirty)
| MSR: 8000000000008032 <EE,IR,DR> CR: 24044482 XER: 20000000
| DAR: 0000000000000028, DSISR: 0000000040000000
| TASK = c0000000068d4e40[1744] 'mount' THREAD: c00000000c1e0000 CPU: 0
| GPR00: d000000000037974 c00000000c1e34c0 d000000000043f30 c00000000c1e36a0
| GPR04: 000000000000013e 000000000000013e c00000000c1e2eb0 0000000000000002
| GPR08: c00000000058de80 0000000000000001 c0000000068d4e40 c00000000c1e34c0
| GPR12: 8000000000008032 c000000000671300 0000000010020000 00000000ff80bec1
| GPR16: 0000000010023dc8 0000000010023db8 00000000ff80bed1 0000000010023e00
| GPR20: 0000000000000001 0000000010023e38 c00000000c1e36a0 c00000000c1d5000
| GPR24: 0000000000000000 0000000000000004 0000000000266000 0000000000000000
| GPR28: 0000000000001000 0000000000000004 d0000000000438e0 c00000000c1e34c0
| NIP [d000000000037988] .axfs_copy_block+0xa0/0x144 [axfs]
| LR [d000000000037974] .axfs_copy_block+0x8c/0x144 [axfs]
| Call Trace:
| [c00000000c1e34c0] [d000000000037974] .axfs_copy_block+0x8c/0x144 [axfs] (unreliable)
| [c00000000c1e3580] [d000000000035f20] .axfs_copy_metadata+0x154/0x1cc [axfs]
| [c00000000c1e3630] [d000000000035fd8] .axfs_verify_eofs_magic+0x40/0xa0 [axfs]
| [c00000000c1e36c0] [d000000000036678] ...
Offset 0x28 is buffer_head->b_data, so it seems like sb_bread returns NULL, which it does for out of range block numbers. I guess axfs_copy_block should check for that condition, as it can happen on malicious file system images. I agree that this is likely to be caused by an endianness bug. A good help for finding endianness bugs is to use __be64-like data types everywhere and test with sparse -D__CHECK_ENDIAN__. Arnd --
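To make the two suggestions above concrete (check the range before touching the data, and always decode on-media fields with an explicit byte order), here is a small userspace illustration in Python. It is not kernel code and not taken from the AXFS sources: `read_be64` is a hypothetical helper, and treating the on-media fields as big-endian is only an assumption inferred from the `__be64` suggestion.

```python
import struct
from typing import Optional


def read_be64(image: bytes, offset: int) -> Optional[int]:
    """Decode a 64-bit field at `offset`, assuming big-endian on-media layout.

    Returns None instead of reading past the end of the image, which is the
    userspace analogue of checking sb_bread() for NULL before using b_data.
    """
    if offset < 0 or offset + 8 > len(image):
        return None                       # out of range: refuse, don't crash
    # ">Q" fixes the byte order regardless of the host CPU's endianness.
    return struct.unpack_from(">Q", image, offset)[0]


if __name__ == "__main__":
    image = struct.pack(">Q", 0x1234)     # an 8-byte "image" holding one field
    print(hex(read_be64(image, 0)))       # decoded correctly on any host
    print(read_be64(image, 4))            # would run past the end, so None
    # Decoding the same bytes as little-endian gives a different (wrong) value:
    print(hex(struct.unpack("<Q", image)[0]))
```

The point of the explicit `>` is the same as the point of `__be64` plus sparse: the byte order is part of the type, so a missed conversion shows up as a check failure rather than as a bogus block number at mount time.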
> The version in SVN seems to be slightly older than the one you submitted? Oops. Okay I must have neglected to sync at the very end. Thanks. I forgot, there is also a git repo at Yeah we've had this problem before. I'm not so sure this is an endian Can you run mkfs.axfs on the same trivial directory on both ia32 and PPC64 and then get me the resulting images? --
Ah, little endian. From your good relationship with the s390 developers, I had hoped you would I'll send them by private email. With kind regards, Geert Uytterhoeven Software Architect Sony Techsoft Centre Europe The Corporate Village
Haha, we let you sort out the endianness issues first and then take the easy path :-). We haven't tried it so far. --
git.infradead.org is a big-endian box, and I know you have an account there... -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation --
Interestingly, it also doesn't work on UserModeLinux (x86, 32-bit):

| attempt to access beyond end of device
| loop0: rw=0, want=24, limit=16
|
| EIP: 0073:[<0811ec67>] CPU: 0 Not tainted ESP: 007b:19515c38 EFLAGS: 00210212
| Not tainted
| EAX: 00000000 EBX: 00001000 ECX: 19484aa0 EDX: 190d9f0c
| ESI: 195ee000 EDI: 19515cd0 EBP: 19515c6c DS: 007b ES: 007b
| 08247af0: [<08069ba3>] show_regs+0xb4/0xb9
| 08247b1c: [<080591ee>] segv+0x222/0x23a
| 08247bbc: [<08059296>] segv_handler+0x90/0x9a
| 08247c68: [<080649b8>] sig_handler_common+0x63/0x72
| 08247ce0: [<08064cac>] sig_handler+0x31/0x3d
| 08247cec: [<08064c0b>] handle_signal+0x4c/0x7a
| 08247d0c: [<08066327>] hard_handler+0xf/0x14
| 08247d1c: [<005c0420>] 0x5c0420
|
| Kernel panic - not syncing: Kernel mode fault at addr 0x14, ip 0x811ec67
|
| EIP: 0073:[<4010a44e>] CPU: 0 Not tainted ESP: 007b:bfa323a0 EFLAGS: 00200246
| Not tainted
| EAX: ffffffda EBX: 080595f8 ECX: 080595c8 EDX: 080595d8
| ESI: c0ed0000 EDI: 00000000 EBP: bfa323d8 DS: 007b ES: 007b
| 08247a5c: [<08069ba3>] show_regs+0xb4/0xb9
| 08247a88: [<08059462>] panic_exit+0x25/0x3b
| 08247a9c: [<08083642>] notifier_call_chain+0x27/0x4c
| 08247ac4: [<0808367e>] __atomic_notifier_call_chain+0x17/0x19
| 08247ad4: [<08083695>] atomic_notifier_call_chain+0x15/0x17
| 08247af0: [<0806fd87>] panic+0x52/0xd8
| 08247b10: [<080591fc>] segv+0x230/0x23a
| 08247bbc: [<08059296>] segv_handler+0x90/0x9a
| 08247c68: [<080649b8>] sig_handler_common+0x63/0x72
| 08247ce0: [<08064cac>] sig_handler+0x31/0x3d
| 08247cec: [<08064c0b>] handle_signal+0x4c/0x7a
| 08247d0c: [<08066327>] hard_handler+0xf/0x14
| 08247d1c: [<005c0420>] 0x5c0420

Commandline is `mount image.axfs /mnt -o loop -t axfs'.

Is there something wrong with the axfs version you submitted, or with mkfs.axfs?

With kind regards,
Geert Uytterhoeven
Software Architect
Sony Techsoft Centre Europe
The Corporate Village
I found what's wrong. The size of an AxFS image created by mkfs.axfs is always n*4096+4 bytes large. So when it wants to check the magic value in the last 4 bytes, the block layer tries to read a whole 512-byte sector, which fails for loop-mounted images. If you test on real FLASH, additional bytes after the end of the AxFS image can be read, hence it works. By padding the image with 508 zero bytes, I can mount it, on both PS3 (ppc64) and UML (ia32). I can even read images created on PS3. However, there still are weird things going on, like `find' not seeing all files and directories, or just aborting, and `ls -lR' showing actual file contents in its output. With kind regards, Geert Uytterhoeven Software Architect Sony Techsoft Centre Europe The Corporate Village
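The arithmetic behind the 508-byte workaround can be sketched as follows. This is just an illustrative Python helper for padding an image file up to a whole number of 512-byte sectors; `padding_needed` and `pad_image` are hypothetical names and are not part of mkfs.axfs.

```python
SECTOR_SIZE = 512  # sector granularity the block layer reads in


def padding_needed(image_size: int, sector_size: int = SECTOR_SIZE) -> int:
    """Zero bytes required to round image_size up to a whole sector."""
    return -image_size % sector_size


def pad_image(data: bytes, sector_size: int = SECTOR_SIZE) -> bytes:
    """Return the image with trailing zero padding to a sector boundary."""
    return data + b"\0" * padding_needed(len(data), sector_size)


if __name__ == "__main__":
    size = 2 * 4096 + 4                    # an n*4096+4 image, here n = 2
    print(size // SECTOR_SIZE)             # whole sectors visible before padding
    print(padding_needed(size))            # bytes to add for the last sector
    print(len(pad_image(b"\0" * size)) % SECTOR_SIZE)
```

For any image of n*4096+4 bytes the helper asks for 508 bytes, and an image with n = 2 exposes only 16 whole sectors before padding, which is consistent with the `limit=16` in the UML log.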
Right. We haven't tested loopback since we added the magic end value. How is one expected to read those last 4 bytes of a loopbacked file? Are they unreadable? We can add the padding. I am just wondering if this is a bug or a known limitation in the loopback handling or if there is a different safer way of reading block devs with truncated Do you see this behavior for all builds or just the PS3? --
Can't you just include the final magic into the last block, thereby making the size a clean multiple of 4k? It looks as if you have some padding before the magic anyway. So you just have to make sure the padding is at least 4 bytes and write the magic to the end of it. Apart from solving this bug, it should also save you some space. ;) Jörn -- Invincibility is in oneself, vulnerability is in the opponent. -- Sun Tzu --
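A minimal sketch of the layout proposed above, assuming a 4-byte end marker and a 4096-byte block size. The marker value `b"EOFS"` and the function name `finalize_image` are placeholders for illustration only; the real AXFS magic and the mkfs.axfs internals are not shown in this thread.

```python
BLOCK_SIZE = 4096
END_MAGIC = b"EOFS"  # hypothetical 4-byte marker, not the real AXFS value


def finalize_image(payload: bytes, block_size: int = BLOCK_SIZE,
                   magic: bytes = END_MAGIC) -> bytes:
    """Pad so that the magic occupies the last bytes of the final block."""
    # Choose padding so payload + padding + magic is a multiple of block_size.
    pad = -(len(payload) + len(magic)) % block_size
    return payload + b"\0" * pad + magic


if __name__ == "__main__":
    image = finalize_image(b"a" * 5000)
    print(len(image), len(image) % BLOCK_SIZE, image[-4:])
```

Because the padding is computed with the magic included, the result is always a clean multiple of the block size and the marker is always the last 4 bytes, so a reader never has to fetch a partial trailing sector.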
I'm going to have to look into this n*4K thing. The image doesn't need to be aligned. There shouldn't be any last block to put the magic in. But I haven't messed with the mkfs.axfs code for a while. --
The `find' issue I saw on both PS3 (ppc64) and UML (ia32). The `ls -lR' I tried on UML (ia32) only for now. With kind regards, Geert Uytterhoeven Software Architect Sony Techsoft Centre Europe The Corporate Village
Geert, Thanks for giving it a spin, especially on a platform as different from ours as the PS3. Before I dig more into what happened, I was wondering if you could tell me a bit more about your environment, particularly how you supplied the filesystem to the kernel and your mount commandline (also, if you used a boot commandline, what it was.) My first guess would be a ppc64 compiled UML session, but I'd like to be a bit more sure. Will Marone --
Nope, I just built axfs as a module and insmoded it. After that
mount image.axfs /mnt -o loop -t axfs
So nothing fancy.
With kind regards,
Geert Uytterhoeven
Software Architect
Sony Techsoft Centre Europe
The Corporate Village 