:Wouldn't making timestamp queries (at least from userland) enforce a
:sync on the volume in question be useful here?
:--
:        Thomas E. Spanjaard
:        tgen@netphreax.net

Making the 'hammer now' command do a sync() is a good idea. I will make
that change right now so it doesn't get lost.

Here's a general overview of the issues involved with having historical
access to the filesystem:

----------------

Recording the timestamps in the in-memory cache, for a finer-grained
snapshot capability, is doable but has its own issues. Here's an
illustration:

    open()      create file
    write()     append 4K (file size now 4K)
    write()     append 4K (file size now 8K)
    write()     append 4K (file size now 12K)
    write()     append 4K (file size now 16K)

Now NONE of this has gone to disk yet; it's entirely in the in-memory
cache. The inode is in the in-memory cache. The data is stored in the
buffer cache. Even the directory entry for the file that we just created
is still in the in-memory cache (HAMMER caches the raw records it intends
to commit later on).

If I wanted to be able to acquire a timestamp between each write and
'see' a snapshot of the file as of any point in the above sequence, then
every write would also have to allocate a copy of the inode (because it
changes size on each write).

The data has the same problem, though with a slightly different example.
Let's say each write() was a seek-write, overwriting the previous data.
Now with every write() I would have to allocate a copy of the data being
overwritten. This is complicated by the fact that the buffer cache has no
clue about 'historical' accesses, so I would not be able to use the
buffer cache to cache the data.

There's also another problem, and that is with the efficiency of the
topology on-disk. Even if I maintained all the copies of the inode and
all the copies of the data in-memory, I would still have to sync all
those copies to disk in order for things to remain historically coherent
(whether it be in-cache or on-disk).
This would result in hundreds or even thousands of copies of the inode
on-disk, not to mention potentially many copies of the data.

I just don't want to do that right now, at least not as a default. A lot
of performance would be lost. Hence a sync() is needed if you want to
create a demark which you can accurately snapshot.

-------------

Here's a quick synopsis of how the cache would operate in a clustered
filesystem:

In order to properly integrate with in-memory caches, a wider cache
coherency infrastructure is needed between machines such that
modifications made on one machine proactively invalidate those portions
of the cache(s) on other machines.

At the same time, any 'dirty' cache data, for example when a file is
created or written to, must lock the cache space in question on all other
machines. The cache space in this case is not just the file data, but
also the related namespaces (for creations, deletions, and renames).

Attempts to access locked spaces from other machines in the cluster would
have to force a flush to the filesystem backing store and lower the cache
states for the affected information on the original machine from dirty to
shared-read-only.

It will be easiest to integrate the cache coherency information into the
buffer cache and namecache themselves.

Once a machine has dirtied an in-memory cache element, for example part
of the namespace when creating a file or chunks of data written within a
file, that machine must have a free hand to make further modifications to
the cache spaces involved without further interaction with other
machines.

-------------

Now, if you think of those two major elements you can see that they
actually fit together quite well. If I were to attempt to maintain
transactional coherency on a per-system-call basis then the cache
granularity between machines would have to be much, much smaller than our
current in-memory caching elements provide. That would become a really
nasty coding problem.
So I don't even want to begin to contemplate transactional coherency at a
finer grain than sync() or fsync() until long after we actually have
clustering working.

-Matt
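
The cost argument in the four-write illustration above can be made
concrete with a toy model. This is only a sketch under stated
assumptions: `ToyFile`, `per_write_history` and `run` are hypothetical
names invented for illustration, the model tracks nothing but the inode
size, and it is not how HAMMER is implemented. It counts how many inode
versions would have to reach disk if every write() were a snapshot point,
compared with using sync() as the only demark.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFile:
    """Toy model of a cached file (hypothetical, not HAMMER code)."""
    per_write_history: bool   # True: every write() is a snapshot point
    size: int = 0             # current in-cache file size in bytes
    dirty: bool = False       # cache holds changes not yet on disk
    pending: list = field(default_factory=list)  # inode sizes queued for disk
    on_disk: list = field(default_factory=list)  # inode sizes committed to disk

    def write(self, nbytes: int) -> None:
        """Append nbytes; nothing reaches disk here."""
        self.size += nbytes
        self.dirty = True
        if self.per_write_history:
            # The size changes on every write, so each write needs its
            # own copy of the inode to stay historically visible.
            self.pending.append(self.size)

    def sync(self) -> None:
        """Flush the cache; this is the demark that can be snapshotted."""
        if not self.per_write_history and self.dirty:
            # Only the state at the sync point becomes a historical version.
            self.pending.append(self.size)
        self.on_disk.extend(self.pending)
        self.pending.clear()
        self.dirty = False


def run(per_write_history: bool) -> list:
    """Replay the illustration: four 4K appends followed by one sync."""
    f = ToyFile(per_write_history)
    for _ in range(4):
        f.write(4096)
    f.sync()
    return f.on_disk


if __name__ == "__main__":
    print(run(True))    # [4096, 8192, 12288, 16384]: one inode copy per write
    print(run(False))   # [16384]: one inode copy per sync
```

In this model the number of on-disk inode versions grows with the number
of writes in the first case and with the number of syncs in the second,
which is the trade-off described above.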
