Every little factor of 25 performance increase really helps.

Ramback is a new virtual device with the ability to back a ramdisk by a real disk, obtaining the performance level of a ramdisk but with the data durability of a hard disk. To work this magic, ramback needs a little help from a UPS. In a typical test, ramback reduced a 25 second file operation[1] to under one second including sync. Even greater gains are possible for seek-intensive applications.

The difference between ramback and an ordinary ramdisk is: when the machine powers down the data does not vanish because it is continuously saved to backing store. When line power returns, the backing store repopulates the ramdisk while allowing application io to proceed concurrently. Once fully populated, a little green light winks on and file operations once again run at ramdisk speed.

So now you can ask some hard questions: what if the power goes out completely or the host crashes or something else goes wrong while critical data is still in the ramdisk? Easy: use reliable components. Don't crash. Measure your UPS window. This is not much to ask in order to transform your mild mannered hard disk into a raging superdisk able to leap tall benchmarks at a single bound.

If line power goes out while ramback is running, the UPS kicks in and a power management script switches the driver from writeback to writethrough mode. Ramback proceeds to save all remaining dirty data while forcing each new application write through to backing store immediately. If UPS power runs out while ramback still holds unflushed dirty data then things get ugly. Hopefully a fsck -f will be able to pull something useful out of the mess. (This is where you might want to be running Ext3.) The name of the game is to install sufficient UPS power to get your dirty ramdisk data onto stable storage this time, every time.

The basic design premise of ramback is alluringly simple: each write to a ramdisk sets a per-chunk ...
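The writeback and writethrough behaviour described in the announcement can be illustrated with a small user-space model. This is only a sketch, not the driver: it assumes the per-chunk state is a single dirty flag, that a linear sweep copies dirty chunks to backing store, and that a power-management hook flips the mode. The class name `RambackSketch` and all of its methods are hypothetical.

```python
class RambackSketch:
    """Toy model: a ramdisk backed by a slower store, tracked by dirty flags."""

    def __init__(self, nchunks, chunk_size=4096):
        self.chunk_size = chunk_size
        self.ram = [bytes(chunk_size) for _ in range(nchunks)]      # fast copy
        self.backing = [bytes(chunk_size) for _ in range(nchunks)]  # durable copy
        self.dirty = [False] * nchunks   # one flag per chunk (assumed layout)
        self.writethrough = False        # False means writeback mode

    def write(self, chunk, data):
        assert len(data) == self.chunk_size
        self.ram[chunk] = data
        if self.writethrough:
            # On UPS power: each new write goes straight to backing store.
            self.backing[chunk] = data
            self.dirty[chunk] = False
        else:
            # On line power: mark the chunk dirty and return immediately.
            self.dirty[chunk] = True

    def read(self, chunk):
        # Reads are always served from the ramdisk copy.
        return self.ram[chunk]

    def sweep(self, limit=None):
        """Flush dirty chunks in ascending order; return how many were flushed."""
        flushed = 0
        for chunk, is_dirty in enumerate(self.dirty):
            if limit is not None and flushed >= limit:
                break
            if is_dirty:
                self.dirty[chunk] = False              # clear before copying
                self.backing[chunk] = self.ram[chunk]
                flushed += 1
        return flushed

    def power_fail(self):
        """What a power-management script might trigger when line power drops."""
        self.writethrough = True
        return self.sweep()
```

In this model the exposure window is exactly the set of chunks whose flag is still set when `power_fail()` is called, which is why the thread keeps returning to how long the UPS can hold the machine up.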
Are you using barriers or ordered disk writes with physical sync in the right moments or something like that? I think this is needed to allow any Thanks, GK --
Usual block device semantics are preserved so long as UPS power does not run out before emergency writeback completes. It is not possible to order writes to the backing store and still deliver ramdisk level write latency to the application. After the emergency writeback completes, ramback is supposed to behave just like a physical disk (with respect to writes - reads will still have ramdisk level latency). No special support is provided for barriers. It is not clear that anything special is needed. Daniel --
Why - your chunks simply become a linked list in write barrier order. Solve your bitmap sweep cost as well. As you are already making a copy before going to backing store you don't have the internal consistency problems of further writes during the I/O. Yes you may need to throttle in the specific case of having too many copies of pages sitting in the queue - but surely that would be the set of pages that are written but not yet committed from a previous store barrier ? BTW: I'm also curious why you made it a block device. What does that offer over say ramfs + dnotify and a userspace daemon or perhaps for big files to work smoothly a ramfs variant that keeps dirty bitmaps on file pages. That way write back would be file level and while you might lose changesets that have not been fsync()'d your underlying disk fs would always be coherent. Alan --
Good idea, it would be nice to offer that operating mode. But linear sweeping is going to put the most data onto rotating media the fastest, thus making the loss-of-line-power flush window as small as possible, which is what the current incarnation of this driver optimizes for. Note that half a TB worth of dirty ramdisk chunks will need 1 GB of linked list storage, so this imposes a limit on total dirty data for

What happens when the linked list gets too long? Fall back to linear sweep? Or accept suboptimal write caching? A linked list would work for linking together dirty bitmap pages, one level up, thus 2**15 rarer. Even there I prefer the linear sweep. I intend to implement a dirty map of the dirty map, at least because I have not seen one of those before, but also because I think it will

Indeed. That is the entire reason I did it that way. In fact ramback used to write the ramdisk and backing store from the same application source, so the writethrough code was significantly shorter. But not

The only time ramback cares about barriers is when it switches to writethrough mode. It would be nice to have a mode where barriers are respected at the backing store level, but there is no way you will get the same write performance. The central idea here is that ramback relies on a UPS to achieve the ultimate in disk performance. I agree that other modes would be very nice, but not necessary for this thing to be actually useful. I suspect that early users will be looking for

As a block device it is very flexible, and as a block device it is fairly simple. As a block device, the only interesting userspace setup is the hookup to power management scripts. Dnotify... probably you meant inotify, and even then it sounds daunting, but maybe somebody

That would be a nice hack, why not take a run at it? Regards, Daniel --
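The storage figures quoted above can be checked with a little arithmetic. The sketch below assumes 4 KiB chunks, one 8-byte link per dirty chunk, and 4 KiB bitmap pages; none of these sizes are stated in the thread, they are simply the values under which "1 GB of linked list storage" and "2**15 rarer" both come out.

```python
TIB = 1 << 40
GIB = 1 << 30
MIB = 1 << 20

CHUNK_SIZE = 4096   # assumed chunk size in bytes
LINK_SIZE = 8       # assumed bytes per list entry (one 64-bit pointer)
PAGE_SIZE = 4096    # assumed size of one dirty-bitmap page in bytes

dirty_bytes = TIB // 2                      # half a TB of dirty data
dirty_chunks = dirty_bytes // CHUNK_SIZE    # number of dirty chunks
list_bytes = dirty_chunks * LINK_SIZE       # one link per dirty chunk
bitmap_bytes = dirty_chunks // 8            # one bit per chunk
chunks_per_bitmap_page = PAGE_SIZE * 8      # chunks covered by one bitmap page

print(list_bytes // GIB, "GiB of list links")
print(bitmap_bytes // MIB, "MiB of dirty bitmap")
print(chunks_per_bitmap_page, "chunks per bitmap page")
```

Under these assumptions a list of dirty bitmap pages needs one entry per 2**15 chunks, which is the "one level up" saving mentioned in the reply.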
You get duplicated blocks though. But yes, I agree - write-backs to the
disk must be ordered, otherwise it's going to be too unreliable in practice.
You could switch from a journal like the above to a bitmap when this
overrun occurs. (Typical problem in replication.) SteelEye holds a
patent on that though, as far as I know.
Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
--
Hi Lars, I disagree with your claim of "too unreliable". If the UPS power does not fail before flushing completes, it is perfectly reliable. Perhaps you need a belt to go with your suspenders? As I wrote earlier, you cannot have optimal writeback speed and ordering at the same time. I can see eventually implementing some kind of ordered writeback mode where completion is signalled to the application before writeback completes. You then get to choose between fastest flush and most paranoid ordering. I guess everybody will choose fastest flush, but I will be happy to accept your patch to see which they actually If you think this is like replication then you have the wrong idea about what is going on. This is a cache consistency algorithm, not a replication algorithm. Regards, Daniel --
Daniel, I'm not saying you don't have a good thing here. Just that for
backing large filesystems, the risk of having to run a full fsck and
finding inconsistent metadata is pretty serious.
If I always assume a reliable shutdown - UPS protected, no crashes, etc
- you're right, but at least my real world has other failure scenarios
as well. In fact, the most common reason for unorderly shutdowns is
kernel crashes, not power failures in my experience.
So "perfectly reliable if UPS power does not fail" seems a bit over the
No disagreement here. The question would be how large the performance
I was trying to prod you into writing the ordered flushing. Maybe
claiming it is too hard will do the trick? ;-)
I see the differences, but I also see the similarities. What you're
doing can also be thought of as replicating from an instant IO store
(local memory) to a high latency, low bandwidth copy (the disk)
asynchronously.
Both obviously need to preserve consistency, the question is whether to
achieve transactional (ordered) consistency or not.
Regards,
Lars
--
What are you doing to your kernel? My desktop at home, which runs my MTA, web site etc, and is subject to regular oom abuse by firefox: $ uptime 04:36:49 up 204 days, 9:38, 8 users, load average: 0.82, 0.43, 0.20 This machine has been up ever since I got a UPS for it, and then it only ever went down due to a blackout or my wife blowing a fuse with the vacuum cleaner. Honestly, I have never seen a machine running Linux 2.6 crash due to a software flaw, except when I caused it myself. I suspect the Linux kernel has a better MTBF than a hard In fact, replicating was one of the strategies I considered for this. But since it is a lot more work and will not perform as well as a simple sweep, I opted for the simple thing. Which turned out to be pretty complex anyway. You have to close all the same nasty races but with a considerably more complex base algorithm. I think that better wait for version 2.0. By the way, I could use a hand debugging this thing. Regards, Daniel --
I have experienced many 2.6 crashes due to software flaws. Hung processes leading to watchdog timeouts, bad kernel pointers, kernel deadlock, etc. When designing for reliable embedded systems it's not enough to handwave away the possibility of software flaws. Chris --
Indeed. You fix them. Which 2.6 kernel version failed for you, what made it fail, and does the latest version still fail? If Linux is not reliable then we are doomed and I do not care about whether my ramback fails because I will just slit my wrists anyway. How about you? Daniel --
Daniel, you're not objective. Simply look at LKML reports. People even report failures with the stable branch, and that's quite expected when more than 5000 patches are merged in two weeks. The question could even be returned to you: what kernel are you using to keep 204 days of uptime doing all that you describe ? Maybe 2.6.16.x would be fine, but even then, a lot of security issues have been fixed since the last 204 days (most of which would require a planned reboot), and also a number of normal bugs, No, I've looked at the violin-memory appliance, and I understand better your goal. Trying to get a secure and reliable generic kernel is a lost game. However, building a very specific kernel for an appliance is often quite achievable, because you know exactly the hardware, usage patterns, etc... which help you stabilize it. I think people who don't agree with you are simply thinking about a generic file server. While I would not like my company's NFS exports to rely on such a technology for the same concerns as exposed here, I would love to have such a beast for logs analysis, build farms, or various computations which require more than an average PC's RAM, and would benefit from the data to remain consistent across planned downtime. If you consider that the risk of a crash is 1/year and that you have to work one day to rebuild everything in case of a crash, it is certainly worth using this technology for many things. But if you consider that your data cannot suffer a loss even at a rate of 1/year, then you have to use something else. BTW, I would say that IMHO nothing here makes RAID impossible to use :-) Just wire 2 of these beasts to a central server with 10 Gbps NICs and you have a nice server :-) Regards, Willy --
So we have a flock of people arguing that you can't trust Linux. Well maybe there are situations where you can't, but what can you trust? Disk firmware? Bios? Big maybes everywhere. In my experience, Linux is very reliable. I think Linus, Andrew and others care an awful lot about that and go to considerable lengths to make it true. Got a list of Linux kernel flaws that bring down a system? Tell me and I will not use that version to run a transaction processing system, or I will fix them or get them fixed. But please do not tell me that Linux is too unreliable to run a transaction processing system. If Linux can't do it, then what can?

By the way, the huge ramdisk that Violin ships runs Linux inside, to manage the raided, hotswappable memory modules. (Even cooler: they run Linux on a soft processor implemented on a big FPGA.) Does anybody think that they did not test to make sure Linux does not compromise their MTBF in any way? In practice, for the week I was able to test the box remotely and the 10 days I had it in my hands, the thing was solid as a rock. Good

Sure. Leaving out dodgy stuff like hald, other bits I could mention, is probably a good idea. Scary thing is, things like hald are actually being run on servers but that is another issue entirely. It wasn't too long ago that NFS client was in the dodgy category, with oops, lockups, whathaveyou. It is pretty solid now, but it takes a while for the bad experiences to fade from memory. On the other hand, knfsd has never been the slightest bit of a problem. Helpful suggestion: don't run NFS client on your transaction processing unit. It may well be solid, but who needs to find out experimentally? Might as well toss gamin, dbus and udev while you are at it, for a further marginal reliability increase. Oh, and alsa, no offense to the great work there, but it just does not belong on a server. Definitely do

I guess I am actually going to run evaluations on some mission critical systems using the ...
You've alluded to NBD needing deadlock fixes quite a few times in the past. I've even had some discussions with you on where you see NBD lacking (userspace nbd-server doesn't lock memory or set PF_MEMALLOC, etc). But I've lost track of what changes you have in mind for NBD. Are you talking about a complete re-write or do you have specific patches that will salvage the existing NBD client and/or server? Has this work already been done and you just need to dust it off? As an aside, using a kernel with the new per bdi dirty page accounting I've not been able to hit any deadlock scenarios with NBD. Am I not trying hard enough? Or are they now mythical? If real, do you have a reproducible scenario that will cause NBD to deadlock? I'm not interested in swap over NBD (e.g. network memory reserves?) because in practice I've found that the VM doesn't allow non-swap NBD use-cases to actually need that "netvm" sophistication... any other workload that deadlocks NBD would be interesting. thanks, Mike --
On Wed, 12 Mar 2008 00:17:56 -0800 The traditional and proven method to constructing a reliable system is to assume that no component can be fully trusted. This is especially true for new code. By being paranoid about everything, failures in one component are usually contained well enough that one failure is not catastrophic. In order for ramback to get appeal with the people who are paranoid about data integrity (probably a vast majority of users), you will need some guarantees about flush order, etc... -- All Rights Reversed --
I disagree. Never mind that it already does provide such guarantees, just echo 1 >/proc/driver/ramback/name. But if you want the full performance you need to satisfy your paranoia at a higher level in the traditional way: by running two in parallel or whatever. Daniel --
I guess I'm being really vicious to them: I expose it to customers and
the real world.
My own servers also have uptimes of >400 days sometimes, and I wonder
what customers do to the poor things.
And yes, I'm not saying I don't see your point for specialised
deployments (filesystems which are easy to rebuild from scratch), but
transactional integrity is a requirement I'd rank really high on the
Where they control the hardware and run a rather specialized OS as well,
I'm afraid with those properties it doesn't really meet my needs :-(
And, wouldn't a simpler way to achieve something similar not be to use
the plain Linux fs caching/buffers, just disabling forced write out
maybe via a mount option? This strikes me as similar to the effect I get
from remounting NFS (a)sync. Make the fs ignore fsync et al.
It would have the advantage of using all memory available for caching
and not otherwise requested, too. (And, of course, the downside of
making it hard to reserve cache space for a given fs explicitly, at
least now. But I'm sure the control group / container folks would love
that feature. ;-)
Regards,
Lars
--
It is ranked high, nonetheless I perceive this spate of sky-is-falling comments as low level FUD. Which of the following do you think is the least reliable component in your transactional system: 1) Linux 2) The computer 3) The hard disk 4) The battery 5) The fan Correct answer: the fan. The rest are roughly a tie (though of course you will find variations) and depend on how much money you spend on each of them. I know I do not have to explain this to you, but the way you calculate reliability for a complete system is to multiply the reliability of each component. The number of nines that drop out of this calculation is your reliability. At the moment, the version of Linux I run is looking like a 1.0, so is the UPS. Already got me through about 4 blackouts and half a dozen vacuum cleaner events. Though obviously neither is 1.0, both are darn close. The hard disk on the other hand... I have a box full of broken ones here, how about you? I have never had a PC go bad on me, ever. Had a couple of fans die, but these days I only buy PCs that run fine without a fan. So your proposition is, I can add nines to this system by introducing atomic update of the backing store. Fine, I agree with you. However if I already sit at six or seven nines then should I be putting my effort there, or where? Also no need to explain: when you introduce two way redundancy, you square the reliability. So have two independent power supplies on two independent UPSes. Sleep easy, plus you gain the ability to do scheduled battery maintenance, so reliability increases by more than the square. No matter how much you fiddle with atomic update of backing store, one disgruntled sysop going postal can still destroy your data with the help of a sledgehammer. You need to get this reliability thing in perspective. So how about you draft a Suse engineer to get working on the atomic backing store update, ETA six months? In the mean time, we can configure a transactional ...
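The reliability arithmetic in this message can be written out as a short worked example. It is a sketch under the usual textbook assumptions: component failures are independent, a chain of components works only if every one works, and a redundant pair fails only if both members fail. Under those assumptions it is the failure probability that gets squared by two-way redundancy. The function names and the sample figures are illustrative, not measurements from the thread.

```python
from math import log10, prod


def series(reliabilities):
    """Reliability of a chain in which every component must work."""
    return prod(reliabilities)


def redundant(reliability, copies=2):
    """Reliability of N independent copies where any one copy suffices."""
    return 1.0 - (1.0 - reliability) ** copies


def nines(reliability):
    """Number of nines, e.g. 0.999 is about 3. Undefined for exactly 1.0."""
    return -log10(1.0 - reliability)


# Illustrative values only: kernel, host, disk, and a single UPS.
single_ups = series([0.9999, 0.9999, 0.999, 0.999])
dual_ups = series([0.9999, 0.9999, 0.999, redundant(0.999)])

print(f"single UPS: {single_ups:.6f} ({nines(single_ups):.2f} nines)")
print(f"dual UPS:   {dual_ups:.6f} ({nines(dual_ups):.2f} nines)")
```

The chain is always dominated by its weakest link, which is the point being argued: adding nines to a component that is already strong buys less than shoring up the weakest one.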
Or to quote a little SciFi, in this case Captain Hunt from Andromeda: Slipstream: it's not the best way to travel faster than light, it's just the only way.[1] At least that's what first crossed my mind after reading the above. ;-) [1]: http://en.wikipedia.org/wiki/Slipstream_%28science_fiction%29#Andromeda Bis denn -- Real Programmers consider "what you see is what you get" to be just as bad a concept in Text Editors as it is in women. No, the Real Programmer wants a "you asked for it, you got it" text editor -- complicated, cryptic, powerful, unforgiving, dangerous. --
By the way are you aware that all you have to do is: echo 1 >/proc/driver/ramback/<name> and ramback already runs in the mode you speak of? Daniel --
Actually, in Centera we use generic hardware with a fairly normal kernel which has strategic backports from upstream (libata, nic drivers, etc). No UPS in the picture. Data integrity is protected by working with the application team to ensure they understand when data is safely on the disk platter and working with IO & FS people to try and make sure we don't lie to them (too much) about that promise. The centera boxes are tested with power failure & error injection and by all of our customers in all those ways customers do ;-) ric --
Hi Ric, Right, so Linux has gotten to the point where it competes with purpose-built embedded software in reliability. Not quite there, but close enough for mission-critical. I was not thinking of Centera when I mentioned the UPS though... Daniel --
This is our case, but we have been working for quite a while to enhance the reliability of the io stack & file systems. It also helps to be very careful to select hardware components with mature, open source & No problem, we certainly have many boxes with built in ups hardware ;-) ric --
A word to the wise indeed. Well I would never suggest that we can rest on our laurels as far as Linux reliability is concerned, only that it is already very reliable or you certainly would not ship products based on it. Daniel --
Nice fiction - stuff crashes eventually - not that this isn't useful. For a long time simply loading a 2-3GB Ramdisk off hard disk has been a good Ext3 is only going to help you if the ramdisk writeback respects barriers Why not - providing you clear the dirty bit before the write and you check it again after ? And on the disk size as you are going to have to suck all the content back in presumably a log structure is not a big If you are prepared to go bigger than the fs chunk size so lose the ordering guarantees your chunk size really ought to be *big* IMHO Alan --
On Mon, 10 Mar 2008 09:22:13 +0000 That could get ugly when ext3 has written to the same block multiple times. To get some level of consistency, ramback would need to keep around the different versions and flush them in order. -- All rights reversed. --
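The suggestion to keep the different versions and flush them in order can be sketched as a queue of chunk copies, in contrast to a dirty bitmap that only remembers the latest contents. This is an illustrative model, not code from ramback: the class name `OrderedWriteback`, the `max_pending` throttle, and the in-memory lists standing in for the ramdisk and the disk are all assumptions.

```python
from collections import deque


class OrderedWriteback:
    """Toy model: backing store receives writes in application order."""

    def __init__(self, nchunks, max_pending=1024):
        self.ram = [b""] * nchunks       # current contents seen by readers
        self.backing = [b""] * nchunks   # contents on stable storage
        self.pending = deque()           # (chunk, data) copies, oldest first
        self.max_pending = max_pending

    def write(self, chunk, data):
        if len(self.pending) >= self.max_pending:
            # Throttle: retire the oldest copy before accepting a new one.
            self.flush(1)
        self.ram[chunk] = data
        # Every version is queued, even if the chunk is already pending.
        self.pending.append((chunk, data))

    def flush(self, count=None):
        """Write out up to `count` queued copies in order; return how many."""
        n = len(self.pending) if count is None else min(count, len(self.pending))
        for _ in range(n):
            chunk, data = self.pending.popleft()
            self.backing[chunk] = data
        return n
```

At any moment the backing store holds a state the application really passed through, at the cost raised earlier in the thread: a chunk rewritten many times is queued and written many times, and the queue needs memory and throttling that a bitmap sweep does not.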
Ah, keep snapshots like ddsnap? Interesting idea. But complex, and ramback will stay perfectly consistent so long as you don't pull the plug on your UPS. I seem to recall that EMC has been peddling SAN storage with similar restrictions for quite some time now. Regards, Daniel --
Hi Alan, Nice to see so many redhatters taking an avid interest in storage :-) Right, and now with ramback you will be able to preserve that state and But that does not satisfy the requirement you snipped: * Applications need to be able to read and write ramback data during More accurately: in general, cannot transfer directly. The ramdisk may be external and not present a memory interface. Even an external ramdisk with a memory interface (the Violin box has this) would require extra programming to maintain cache consistency. Then there is the issue of ramdisks on the way that exceed the 40 bit physical addressing of current generation processors. Even for the simple case where the ramdisk is just part of the kernel unified cache, I would rather not go delving into that code when these transfers are on the slow path anyway. Application IO does its normal single copy_to/from_user thing. If somebody wants to fiddle with vm, the place to attack is right there. The copy_to/from_user can be eliminated (provided alignment requirements are met) using stupid page table tricks. In spite of Linus claiming there is no performance win "640K should be enough for anyone" The finer the granularity the faster the ramdisk syncs to backing store. The only attraction of coarse granularity I know of is shrinking the bitmap, which is currently not so big that it presents a problem. Your comment re fs chunk size reveals that I have failed to communicate the most basic principle of the ramback design: the backing store is not expected to represent a consistent filesystem state during normal operation. Only the ramdisk needs to maintain a consistent state, which I have taken care to ensure. You just need to believe in your battery, Linux and the hardware it runs on. Which of these do you mistrust? Regards, Daniel --
expecting the hw to never fail is unreasonable - it will. it's just a question what happens when (not if) it fails. and it's not about the backing store being inconsistent during normal operation - it's about what you are left with after an unclean shutdown. With your scheme the only time you can trust the on-disk data is when the device is off; when it fails for some reason (batteries do fail, kernel bugs do happen, DOS, overheating etc etc) you can no longer trust any of the data, and no - fsck doesn't help when you have a mix of old data overwritten by new stuff in basically random order. i can't see any scenario when it would make sense to trust the corrupted on-disk fs instead of restoring from backup (or regenerating). So is it just about avoiding repopulating the fs in the (likely) case of normal, clean shutdown? This could be a reasonable application of ramback (OTOH how often will this (shutdown) happen in practice...). IOW you get a ramdisk-based (ie fast) device that is capable of surviving power loss, but that's about it. Now, if you add snapshots to the backing store it suddenly becomes much more interesting -- you no longer need to put so much trust in all the hw. Should the device fail for whatever reason then you just rollback to the last good snapshot upon restart. No corrupted fs, no fsck; you lose some newly written data (that you couldn't recover w/o a snapshot anyway), but can trust the rest of it (assuming you trust the fs and storage hw, but that's no different than w/o ramback). artur --
or you could keep two devices as backing store, use one and switch to the other when the fs is consistent. This could be as simple as noticing zero dirty data in the ramdisk or, if something is constantly writing to it, reacting periodically to some barrier (needs cow/doublebuffering in order to not throttle the writer, but you already do this). Means ramdisk can be as large as 1/2 the stable storage and a bit more i/o (resyncing after switch to the other device), but gives you two copies of the data; one stable and one that can be used to recover newer data should you need to. artur --
Actually no - ramback would be useless to this. You might crash and end Oh you mean "pray hard". e2fsck works well with typical disk style failures, it is not robust against random chunks vanishing. I know this as I've worked on and debugged a case where a raid card rebooted silently I was suggesting that you want log structure for the writeback disk so that you keep coherency and can recover it, an issue you seem intent on No I get that. You've ignored the fact I'm suggesting that design choice In a big critical environment - all three. Alan --
So then you know that people already rely on batteries in critical storage applications. So I do not understand why all the FUD from you. Particularly about Ext2/Ext3, which does recover well from random damage. You seem to be calling Linux unreliable. Daniel --
It's more reliable than many others, but it's not perfect. Besides, there are many failure modes beyond the control of the kernel. Hardware errors can lock up the bus and prevent I/O, RAM modules can go bad, technicians can yank out cards without waiting for the ready light. For certain classes of devices it's necessary to plan for these sorts of things, and a model where the on-disk structures may be inconsistent by design is not going to be very attractive. Chris --
...disks can break, batteries on raid controllers can fail, etc, etc... So you design for the number of nines you need, taking all factors into account, and you design for the performance you need. These are

You are preaching to the converted. Systems consisting of: linux + disks + batteries + ram + network + redundancy can be as reliable as you need. Respectfully, I would like to return to the software engineering problem. This driver solves a problem for certain people. Not niche people to be forgotten about. If it does not solve your problem then please just write a driver that does, meanwhile this one needs some finishing work. Let's get the proverbial thousand eyeballs working. Has anybody besides me compiled this yet? Daniel --
There's no FUD here. The problem is that you didn't say that you've designed this for only a few nines. If you delete fsck from your rationale, simply saying that you rely on UPS to give you time to flush buffers, you have a much better story. Certainly, once you've flushed buffers and degraded to write-through mode, you're obviously as reliable as ext2/3. Your idea seems predicated on throwing large amounts of RAM at the problem. What I want to know is this: Is it really 25 times faster than ext3 with an equally huge buffer cache? --
Fsck was never a part of my rationale. Only reliability of components was and is. Then people jumped in saying Linux is too unreliable to use in a, hmm, storage system. Or transaction processing system. Or whatever. Yes. Regards, Daniel --
Well, that sounds convincing. Not. You know this how? --
By measuring it. time untar -xf linux-2.2.26.tar; time sync Daniel --
No numbers. No specifications. And by doing a sync, you explicitly excluded what I was asking, namely a big buffer cache. You've certainly convinced me; you don't know if your idea is worth a brass razoo. Come back when you've got some hard data. --
...or download the code and try it yourself. --
That's cheating. Your ramback ignores sync. Just time it against ext3 _without_ doing the sync. That's still more reliable than what you have. Heck, comment out sync and fsync from your kernel. You'll likely be 10 times normal speed, and still more reliable than ramback. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
No, that allows ext3 to cheat, because ext3 does not supply any means of flushing its cached data to disk in response to loss of line power, and then continuing on in a "safe" mode until line power comes back. Fix that and you will have a replacement for ramback, arguably a more efficient one for this specialized application (it will not work for an external ramdisk). Until you do that, ramback is the only game in town to get these transaction speeds together with data durability. I have mentioned a number of times, that you _already_ rely on an equivalent scheme to ramback if you are using a battery-backed raid controller. Somehow, posters to this thread keep glossing over that and going back to the sky-is-falling argument. Daniel --
Ok, it seems like "ignore sync/fsync unless on UPS power" is what you really want? That should be easy enough to implement, either in kernel or as a LD_PRELOAD hack. So... untar with sync is fair benchmark against ramback on UPS power and untar without sync is fair benchmark against ramback on AC power. But you did untar with sync against ramback on AC power. That's wrong. Pavel --
Sure, let's try it and then we will have a race. I would be happy to It is consistent and correct. You need to supply the missing features that ramback supplies before you have a filesystem-level solution. I really encourage you to try it, then we can compare the two approaches with both of them fully working. Daniel --
this I don't understand. what makes your approach 25x faster? looking at the comparison of a 500G filesystem with 500G of ram allocated for a buffer cache. yes, initially it will be a bit slower (until the files get into the buffer cache), and if fsync is disabled all writes will go to the buffer cache (until writeout hits) I may be able to see room for a few percent difference, but not 2x, let alone 25x. David Lang --
My test ran 25 times faster because it was write intensive and included sync. It did not however include seeks, which can cause an even bigger performance gap. The truth is, my system has _more_ cache available for file buffering than I used for the ramdisk, and almost every file operation I do (typically dozens of tree diffs, hundreds of compiles per day) goes _way_ faster on the ram disk. Really, really a lot faster. Because frankly, Linux is not very good at using its file cache these days. Somebody ought to fix that. (I am busy fixing other things.)

In other, _real world_ NFS file serving tests, we have seen 20 - 200 times speedup in serving snapshotted volumes via NFS, using ddsnap for snapshots and replication. While it is true that ddsnap will eventually be optimized to improve performance on spinning media, I seriously doubt it will ever get closer than a factor of 20 or so, with a typical read/write mix. But that is just the pragmatic reality of machines everybody has these days, let us not get too wrapped up in that.

Think about the Violin box. How are you going to put 504 gigabytes of data in buffer cache? Tell me how a transaction processing system is going to run with latency measured in microseconds, backed by hard disk, ever? Really guys, ramdisks are fast. Admit it, they are really really fast. So I provide a way to make them persistent also. For free, I might add. Why am I reminded of old arguments like "if men were meant to fly, God would have given them wings"? Please just give me your microsecond scale transaction processing solution and I will be impressed and grateful. Until then... here is mine. Service with a smile. Daniel --
if you are not measuring the time to get from ram to disk (which you are not doing in your ramback device) syncs are meaningless. seeks should only be a factor in the process of populating the buffer cache. both systems need to read the data from disk to the cache, they can either fault the data in as it's accessed, or run a process to read it all so you are saying that when the buffer cache stores the data from your ram disk it will slow down. that sounds like it equalizes the performance and it all depends on how you define the term 'backed by hard disk' if you don't write to the hard disk and just dirty pages in ram you can easily hit that sort of latency. I don't understand why you say it's so hard to put 504G of data into the buffer cache, you just read it and it's in the except that you are redefining the terms 'persistent' and 'free' to mean if you don't have to worry about unclean shutdowns then your system is not needed. all you need to do is to create a ramdisk that you populate with dd at boot time and save to disk with dd at shutdown. problem solved in a couple lines of shell scripts and no kernel changes needed. if you want the data to be safe in the face of unclean shutdowns and crashes, then you need to figure out how to make the image on disk consistant, and at this point you have basicly said that you don't think that it's a problem. so we're back to what you can do today with a couple lines of scripting. David Lang --
There was a time when punchcards ruled and everybody was nervous about storing their data on magnetic media. I remember it well, you may not. But you are repeating that bit of history, there is a proverb in there Feel free. You use your script, and somebody with a reliable UPS or two can use my driver, once it is stabilized of course. Just don't be in business against them if being a few milliseconds slower on the uptake means money lost. Daniel --
so now you are saying that you are faster than a ramdisk????? either you are completely out of touch or you misunderstood what I was saying. if you have a reliable UPS and are willing to rely on it to save your data take the identical hardware to what you are planning to use, but instead of using your driver just create a ramdisk and load it on boot and save the contents on shutdown. in this case you are doing zero disk I/O during normal operation, you only touch the disk during startup and shutdown. with your proposal the system will be copying chunks of data from the ramdisk to the hard disk at random times, and you are claiming that doing so makes you faster than a ramdisk???? I'll say it again. if you trust your UPS and don't care about unclean shutdowns (say for example that you trust that linux is just never going to crash) there's no need to write parts of the ramdisk to the hard disk during normal operation, you can wait until you are told that you are going to shut down to do the data save. now there's no driver needed, just a couple lines of init scripts. David Lang --
No, I am saying that my driver is faster than any script you can write. Your script will not be able to give access to data while the ramdisk is being populated, nor will it be able to save efficiently exactly what is dirty in the ramdisk. (Explained in my original post if you Aha! You are getting close. Really, that is all ramback does. It just handles some very difficult related issues efficiently, in such a way as to minimize any denial of service from complete loss of UPS power. This is all just about using power management in a new way that gets higher performance. But your battery power has to be reliable. Just make it so. It is not difficult these days, or even particularly expensive. I calculated somewhere along the line that it would take something like 17 minutes to populate the big Violin ramdisk initially, and 17 minutes to save it during a loss of line power event, during which UPS power must not run out before ramback achieves disk sync or you will get file corruption. (This rule was mentioned in my original post.) All well and good, you can in fact do that with a pretty simple script. But in the initial 17 minutes your application may not read or write the ramdisk data and in the closing 17 minutes it may not write. That knocks your system down to 4 nines, given one planned shutdown per year. Not good, not good at all. See, ramback is entirely about _not_ getting knocked down to 4 nines. It wants to stay above 6, given system components that satisfy that goal, comprising:
* Linux
* Processor, memory, motherboard etc
* Dual power supplies with independent UPS backup
* Ramback driver
My proposition is, you can go out and purchase hardware right now that delivers 6 nines (30 seconds downtime/year) and yes, it will cost you, but if that worries you then set up two (much) cheaper ones and set them up as a failover cluster. (Helps that the Violin box can connect via PCI-e to two servers at the same time.) I say you can do this ...
you just use a redundant system and you no longer care how long it takes to shutdown or startup a system. if you're running a datacenter that cares about uptime to the point of counting 9's or buying a Violin box you are it also takes the faith that you will never have any unplanned shutdowns, since your system will lose massive amounts of data if they happen. nobody who worries about 9's will buy into that argument. you achieve 9's by figuring that things don't always work, and as a result you figure out how to engineer around the failures so that when they happen you stay up. manufacturers have been trying to promise that their boxes are so reliable that they won't go down for decades, and they haven't succeeded yet. David Lang --
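The "nines" figures in this message can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a 365-day year, and the function names are hypothetical. It takes the 17 + 17 minutes of restricted access quoted above as the yearly downtime and compares it with the six-nines budget.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def nines(downtime_seconds):
    """Number of whole nines of availability for a given yearly downtime."""
    unavailability = downtime_seconds / SECONDS_PER_YEAR
    return math.floor(-math.log10(unavailability))


def downtime_budget(n):
    """Yearly downtime, in seconds, allowed by n nines of availability."""
    return SECONDS_PER_YEAR * 10 ** -n


# 17 minutes to populate plus 17 minutes to save, once per year.
script_downtime = (17 + 17) * 60
print(nines(script_downtime))   # 4
print(downtime_budget(6))       # roughly 31.5 seconds
```

With these assumptions the script-based approach lands at four nines, and six nines allows about half a minute per year, in line with the "30 seconds downtime/year" figure in the message.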
The period where you cannot access the data is downtime. If your script just does a cp from a disk array to the ram device you cannot just read from the backing store in that period because you will need to fail over to the ramdisk at some point, and you cannot just read from the ramdisk because it is not populated yet. My point is, you cannot implement ramback as a two line script and expect to achieve anything resembling continuous data availability. I interpret your point about the script as, Ramback is trivial and easy to implement. That is kind of true and kind of untrue, because of the Never is not the right word, but indeed that is why I wrote the story about the rocket ship. If you want the performance that ramback delivers then you cover the risk of hardware failure by other, Why would you assume the data is not mirrored or replicated with a All true. Now what about the punchcard versus magnetic media story? There was a time when magnetic domains were considered less reliable than holes in paper cards, ironically we now think the opposite. So some people will have a hard time with the idea that a battery is reliable enough to get your important cached data on to hard disk when necessary, or that Linux is reliable enough to trust data to it, or whatever. They will get over it. Battery backed data will become a normal part of your life as progress marches on. Daniel --
Wouldn't a raid-1 set comprising disk + ramdisk do that with no downtime? --
In raid1, write completion has to wait for write completion on all mirror members, so writes run at disk speed. Reads run at ramdisk speed, so your proposal sounds useful, but ramback aims for high write performance as well. Daniel --
Ramback could be an interesting building block. Consider using a couple of systems exporting Ramback devices via Evgeniy's distributed storage target (or something similar). In this case, you can have as many Ramback devices as you want comprise your mirror set to meet your availability requirements. Perhaps people are looking at this too much as an entire solution as opposed to a piece of a bigger puzzle. I think the idea has merit. Cheers, Jeff --
raid1 + kflushd tweak? special raid1 mode that signals completion when it hits _one_ of the drives, and does sync when the slower drive is idle? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
raid1 already supports marking member(s) as write-mostly. Any write-mostly member can also make use of write-behind mode (provided you have a write intent bitmap). --
Feel free :-) This is very close to how ramback already works. One subtlety is that ramback does not write twice from the same application data source, which could allow the data on the backing device to differ from the ramdisk if the user changes it during the write. I don't know how important it is to protect against this bug actually, but there you have it. Ramback can easily be changed to write twice from the same source just like a raid1 (in fact it originally was that way) which would make it even more like raid1. Adding ramback-like functionality to raid1 would be a nice contribution. I would fully support that but I do not have time to do it myself. Daniel --
Hmm, what happens if applications keep dirtying so much data you miss your 17 minute deadline? Anyway... ext2 + lots of memory + tweaked settings of kflushd (only write data older than 10 years) + just not using sync/fsync except during shutdown + find / | xargs cat ...is ramback, right? Should have same performance, and you can still read/write during that 17+17 minutes. Ok, find | xargs might be slower... but we probably want to fix that anyway.... It has a big advantage: if you only tell kflushd to hold up writes for an hour, you lose a little in performance and gain a lot in reliability... (If ext2+tweaks is slower than ramback, we have a bug to fix, I'm afraid). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Hi Pavel, Ramback is supposed to prevent that by allowing only a limited amount of application IO during flush mode. Currently this is accomplished by making each application write wait synchronously on the one before it, until flushing completes. This allows only a small amount of application traffic, something like 5% bandwidth. This solution is admittedly crude, and over time it will be improved to look more like a realtime scheduler, because this is in fact a realtime scheduling problem. Once flushing completes, application writes are still serialized and thus slow, which is a stronger condition than necessary to maintain transactional integrity for the filesystem. Eventually this will be optimized. For now, the maximum flush is only a few hundred MB on my workstation, which leaves a huge safety margin even with my $100 UPS. And the risk, however small, of having to run a lossy e2fsck because the battery got old and the power did run out, is mitigated by the fact that ramback runs on my kernel hacking partition, and everything unique there just gets uploaded to the internet regularly anyway. This serves as my replication algorithm. Note: I strongly recommend that any critical data entrusted to ramback be replicated to mitigate the risk of system No, you are missing some essential pieces. Ramback has two operating modes: 1) writeback (when ups-backed line power is available) 2) writethrough (when running on ups power) Plus, it has the daemon driven flushing for ups mode, and daemon driven one-pass populating for startup mode. That is all ramback is, but you do not quite get there with your solution above. Also, ramback works with generic block devices, opening up a wide range I hope that my work inspires other people like you to go in and work on some of the VM/VFS/BIO brokenness that helps make ramback such a big win. In the meantime, it is useful to be clear on just what we have here, and why some people care about it a lot. Daniel --
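The two operating modes and the daemon-driven flush described in this message can be modelled in a few lines. This is a toy sketch, not the ramback driver: the class, its method names and the per-chunk dirty set are all assumptions made for illustration, and the startup populate pass and the write throttling during flush are left out.

```python
class RambackModel:
    """Toy model: a RAM copy, a backing-store copy, and a set of dirty chunks."""

    def __init__(self, nchunks):
        self.ram = [b""] * nchunks
        self.disk = [b""] * nchunks
        self.dirty = set()          # chunks newer in RAM than on disk
        self.writethrough = False   # False = writeback (line power present)

    def write(self, chunk, data):
        self.ram[chunk] = data
        if self.writethrough:
            # On UPS power: force the write through to backing store at once.
            self.disk[chunk] = data
            self.dirty.discard(chunk)
        else:
            # On line power: complete at RAM speed, remember the chunk is dirty.
            self.dirty.add(chunk)

    def read(self, chunk):
        return self.ram[chunk]

    def flush_some(self, limit):
        """One daemon step: write back up to `limit` dirty chunks.

        Returns the number of chunks still dirty afterwards.
        """
        for chunk in sorted(self.dirty)[:limit]:
            self.disk[chunk] = self.ram[chunk]
            self.dirty.discard(chunk)
        return len(self.dirty)

    def on_line_power_lost(self):
        self.writethrough = True

    def on_line_power_restored(self):
        self.writethrough = False
```

In this model the UPS window has to be long enough for repeated `flush_some` calls to drive the dirty set to empty after `on_line_power_lost`, which is the rule the thread keeps returning to.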
Sorry, I missed that the first time. See, it is 504G, not 504M. Daniel --
By "recover well", you must mean "loses massive swaths of data, leaving the system unbootable and with enormous numbers of user files missing." My experience. Expecting fsck to cover for missed writes is stupid. --
Whatever it can get off the disk it gets. It does a good job. If you don't think so, then don't tell me, tell Ted. Daniel --
On Wed, 12 Mar 2008 22:14:16 -0800 He knows. Ext3 cannot recover well from massive loss of intermediate writes. It isn't a normal failure mode and there isn't sufficient fs metadata robustness for this. A log structured backing store would deal with that but all you apparently want to do is scream FUD at anyone who doesn't agree with you. Alan --
Scream is an exaggeration, and FUD only applies to somebody who consistently overlooks the primary proposition in this design: that the battery backed power supply, computer hardware and Linux are reliable enough to entrust your data to them. I say this is practical, you say it is impossible, I say FUD. All you are proposing is that nobody can entrust their data to any hardware. Good point. There is no absolute reliability, only degrees of it. Many raid controllers now have battery backed writeback cache, which is exactly the same reliability proposition as ramback, on a smaller scale. Do you refuse to entrust your corporate data to such controllers? Daniel --
On Thu, 13 Mar 2008 11:14:39 -0800 That's a reasonable enough assumption, to anyone who has never dealt with software before, or whose data is just not important. People who have dealt with computers for longer will know that anything can fail at any time, and usually does unexpectedly and at bad moments. Some defensive programming to deal with random failures could make your project appealing to a lot more people than it would appeal to in its current state. -- All Rights Reversed --
In its current state it has bugs and so should appeal only to programmers who like to work with cutting edge stuff. So long as you keep insisting it has to have some kind of slow transactional sync to disk in order to be reliable enough for enterprise use, I have to leave you in my FUD filter. Did you read Ric's post where he mentions the UPS in some EMC products? Ask yourself, what is the UPS for? Then ask yourself if EMC makes billions of dollars selling those things to enterprise clients. Daniel --
I'd like to see some science. I'd like to know how much faster it really is, and for that proper testing needs to be done. Since Daniel's scheme uses the same amount of RAM as disk, an appropriate test would be to pin (at least) that amount of RAM to buffer cache, and then to fill the cache with the contents of the disk (i.e. cat /dev/disk > /dev/null.) This sets the stage for tests, which tests should not include the sync operation. I'd like to see actual numbers against such a setup versus Daniel's scheme. Since buffer cache is shared by all disks, obviously the test must not access any other drive. One thing I will admit: RAM disks are fast. What I don't know is how much work there is to access blocks that are already in the buffer cache. In principle I suppose it should be a little slower, but not much. I'd like to know, though. I'd do the test myself if I had a machine with enough RAM, but I don't. Daniel (apparently) does... --
There is a correctable flaw in your experiment: loading the disk into buffer cache does not make the cached data available to the page cache. Maybe it should (good summer project there for somebody) but for now you need to tar the filesystem to /dev/null or similar. Note that, because of poor cross-directory readahead, traversing a disk like that will not be as fast as reading it linearly. On the other hand, you will not have to read any free space into cache, which ramback does because it does not know what is free space (or care, really...) You are probably OK. I used a 150 MB ramdisk, of which I used only 100 MB. That is why I used a 2.2 kernel for my tests. Daniel --
>>>>> "Daniel" == Daniel Phillips <phillips@phunq.net> writes: Daniel> In its current state it has bugs and so should appeal only to Daniel> programmers who like to work with cutting edge stuff. As a prof SysAdmin and long time lurker, I feel I can chime in here a bit. No one is arguing that your code isn't neat, or have a feature which would be nice to have. They are arguing that your failure mode (when, not if, it fails for some reason) is horrible. Who remembers the NFS PrestoServer NFS accelerator cards? You could buy this PCI (or was it TurboChannel back then?) card for your DEC Alphas. It came with 4Mb of battery backed RAM so that NFS writes could be ack'd before being written to disk. We had just completed moving all the user home directories to this system that week, say around 4gb of data? Remember, this was around 94 sometime at a University. We were also using Advfs on DEC OSF/1, probably v1.2, maybe v1.3. Anyway, I came into work Thursday night to pick up something I had forgotten before I took a three day weekend. The operator on duty asked me to look at the server since it had crashed and wasn't coming up properly. I ended up staying there until 9am the next morning working on it. Turned out to be both user and hardware error. We had forgotten to remove the piece of plastic to enable the battery on the card, but the circuits on the card lied and said battery voltage was fine no matter what the battery really was. So the system crashed. 4Mb of data from the filesystem went bye-bye. Can you say oops? What a total pain to diagnose. But even on a log structured filesystem, having 4Mb of data just get wiped out was enough to destroy all the filesystem. We ended up rolling back to the original server and junking the week of changes that users had made, and restoring chunks for users as they requested it. Luckily, it was early in the semester and not a lot of stuff had gotten done yet. Now do you see why people are a bit hesitant of ...
RAID controllers do not have half a terabyte of RAM. Also, you are always invited to choose between speed (write back) and reliability (write through). Also, please note that the problem here is not related to the number of nines of availability. This number only counts the ratio between uptime and downtime. We're more facing a problem of MTBF, where the consequences of a failure are hard to predict. What I'm thinking about is that considering the fact that storage technologies are moving towards SSD (and I think 2008 will be the year of SSD), you should implement ordered writes (I've not said write through) since there's no seek time on those devices. Thus you will have the speed of RAM with the reliability of a properly synced FS. If your system crashes once a week, it will not be a problem anymore. Willy --
And? Either you have battery backed ram with critical data in it or That is why I keep recommending that a ramback setup be replicated or mirrored, which people in this thread keep glossing over. When replicated or mirrored, you still get the microsecond-level transaction times, and you get the safety too. Then there is a big class of applications where the data on the ramdisk can be reconstructed, it is just a pain and reduces uptime. These are potential ramback users, and in fact I will be one of those, using it There will be a whole bunch of patches from me that are SSD oriented, over time. The fact is, enterprise scale ramdisks are here now, while enterprise scale flash is not. Getting close, but not here. And flash does not approach the write performance of RAM, not now and probably not ever. Daniel --
It makes a lot of difference, and in addition raid controllers (good ones) respect barrier ordering in their RAM cache so they'll take tags or Either you keep a mirror in sync and get normal data rates or you keep the mirror out of sync and then you need to sort your writeback process out to preserve ordering. If you want ramback to be taken seriously then that is the interesting problem to solve and clearly has multiple solutions if you would start to take an objective look at your work. --
Ramback should obviously respect barriers, and it does, though at present only in the crude, default way of letting the block layer handle it. But interpreting a barrier to mean flush through to rotating media... performance will drop to the millisecond per transaction zone, like a normal disk. Not what ramback users want in normal operating mode. Flush mode, yes. Even raid controllers... so you agree that some of them just don't respond conservatively to tagged commands, either because the engineers don't know how to implement that (unlikely) or because they want to win the performance benchmarks, and they do trust their battery? "Some raid controllers" is just as good for my argument as "all raid controllers". Nobody is telling you which raid controller to use in your own personal system. I will pick the fast one and you can pick Ramback already is taken seriously, just not by you. That is fine, you apparently do not need or want the speed. Anyway, please do not get the impression that I am ignoring your ideas. There are some nice, intermediate modes that ramback could and in my opinion, should implement, to give users more options on how to trade off performance against resilience. I just need to make it clear that ramback, as conceived, already gives system builders the capability they need to achieve microsecond level transaction throughput and data safety at the same time... given a reliable battery, which is where we started. Daniel --
That isn't anything to do with what was being proposed. *ORDERING* not The ones that don't respect tagged ordering are the ultra cheap nasty things you buy down the local computer store that come with a 2 page manual in something vaguely like English. The stuff used for real work is I want the speed and reliability. Without that ramback is a distraction You have no guarantee of commit to stable storage so your use of the word "transaction" is a bit farcical. There are a whole variety of ways to get far better results than "whoops bang there goes the file system". Log structured backing media is one, even snapshots. That way you'd quantify that for the cost of more rotating storage (which is cheap) you can only lose "x" minutes of data and will lose everything from a defined consistent point. File based backing store also has similar properties done right, but needs some higher level care to track closure and dirty blocks on a per inode basis. Alan --
This is where you have made a fundamental mistake in your proposal. Suppose you have a steady, heavy write load onto ramback. Eventually, the entire ramdisk will be dirty and you have to drop back to disk speed, right? My design does not suffer from that problem, but your proposal does. It gets worse than that. Suppose somebody writes the same region twice, how do you order that? Do you try to store that new data somewhere, keeping in mind that we are already at terabyte scale? Is Somebody has. But please feel free to solve some other problem. I The UPS provides a guarantee of commit to stable storage. No amount of FUD will change that. But please go ahead and calculate the risks involved. I am confident you will admit that there are standard techniques available to ameliorate risk, which may be applied _on top of_ ramback, thus not destroying its microsecond-level transaction performance as you propose. Daniel --
You only have to care about ordering if there is a store barrier between the two (not usual). You only have to care about filling if you generate enough dirty blocks at a very high rate (which is unusual for most workloads). If you don't care about those then we have ramdisk already and if you want to write a ramdisk driver for external ramdisk great. You'd also fix the layering violations then by allowing device mapper to implement things like snapshotting and writeback separated from your driver. Even in the extreme case that you propose there are trivial ways of getting coherency. Simple example - if you can sweep all the data out in say 10 minutes then you can buy twice the physical media and ensure that one of the two sets of disk backups is genuinely store barrier consistent to some snapshot time (say every 30 minutes but obviously user tunable). If you at least had some kind of credible snapshotting you'd find people Stable storage to most people means "won't go away on a bad happening". Transaction likewise has a specific meaning in terms of an event occurring once only and either being recorded before or after the transaction occurred. Alan --
Hi Alan, According to you. A more accurate statement: if you have the ramdisk on the host, then the host is assumed to be reliable. If the ramdisk is external (http://www.violin-memory.com/products/violin1010.html) then your statement is untrue in every sense. But you did not address the logic of my statement above: that your fundamental design prevents you from operating at ramdisk speed during No wait, it is completely normal. There is a barrier on every journal Exactly the purpose for which this driver was written. And as a bonus it happens to be useful for internal ramdisk applications as well. (It Device mapper already can, so I do not get your point. Also, what is Hostility does not equate to accuracy. Galileo comes to mind. I see people arguing that a server+linux+batteries+mirroring+replication cannot achieve enterprise grade reliability. Balderdash. Regards, Daniel --
> Hostility does not equate to accuracy. Galileo comes to mind. I see no attempt to even discuss the use of two sets of physical storage to maintain coherent snapshots, just comments about hostility. That's a fairly poor way to repay people who spend a lot of time working with enterprise customers and are interested in solutions using things like giant ramdisks and are putting in time to discuss I look forward to seeing your constructive detailed analysis of failure modes based upon actual statistical data from real data centres. Unless you can produce that nobody is going to take you seriously, which is bad luck for the poor folks at violin if they are relying on you. Alan --
> You did not explain how your proposal will avoid dropping the transaction
Here is a simple but high physical storage using approach (but hey disks are cheap) You walk across the ram dirty table writing out chunks to backing store 0. At some point in time you want a consistent snapshot so you pick the next write barrier point after this time and begin committing blocks dirtied after that moment to store 1 (with blocks before that moment being written to both). You don't permit more than one snapshot to be in progress at once so at some point you clear all the blocks for store 0. Your snapshotting interval is bounded by the time to write out the store, nor do you have to throttle writes to the ramdisk. You now have a consistent snapshot in store 0. At the next time interval we finish off store 1 and spew new blocks to store 2, after 2 is complete we go with 2, 0 and then 1 as the stable store. The only other real trick needed then is metadata, but you don't have to update that on disk too often and you only need two bits for each of the pages in RAM. For any page it is either
00 Clean on stable store
01 Clean on current writing snapshot
10 Dirty on stable store (and thus both)
11 Dirty on current writing snapshot (but clean, old on stable)
Pages go 00->11 or 01->11 when they are touched, 11->01 or 10->01 when they are written back. At the point we freeze a snapshot we move 01->00 11->10 00->11 and there are no pages in 10. And of course we don't update the big tables at this instant instead we store the page state as (value - cycle_count)&3 with each freeze moment doing cycle_count++; The 00->11 is perhaps not obvious but the logic is fairly simple. The snapshot we are building does not magically contain the stable data from a previous snapshot.
Say 0 is our stable snapshot
snapshot 0 page 0 contains the stable copy of a page
snapshot 1 is currently being updated
if we touched the page during the lifetime of snapshot 1 the newer ...
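The two-bit page states and the lazy freeze described in this message can be tried out in a small model. This is a sketch under stated assumptions, not Alan's code: the class and method names are hypothetical, and the sign of the cycle offset is chosen here so that a single `cycle += 1` produces exactly the listed freeze transitions (01->00, 11->10, 00->11). Touching a page in state 10 or 11 is left unchanged because the excerpt lists no transition for it.

```python
CLEAN_STABLE, CLEAN_CURRENT, DIRTY_STABLE, DIRTY_CURRENT = 0b00, 0b01, 0b10, 0b11


class SnapshotStates:
    def __init__(self, npages):
        self.raw = [0] * npages  # stored two-bit values, offset by the cycle
        self.cycle = 0           # number of freezes so far

    def get(self, page):
        # Effective state: undo the cycle offset (sign convention assumed).
        return (self.raw[page] - self.cycle) & 3

    def _set(self, page, state):
        self.raw[page] = (state + self.cycle) & 3

    def touch(self, page):
        # 00->11 and 01->11; other states are left alone (assumption).
        if self.get(page) in (CLEAN_STABLE, CLEAN_CURRENT):
            self._set(page, DIRTY_CURRENT)

    def writeback(self, page):
        # 11->01 and 10->01.
        if self.get(page) in (DIRTY_CURRENT, DIRTY_STABLE):
            self._set(page, CLEAN_CURRENT)

    def freeze(self, check=True):
        # The message says no page is in state 10 at a freeze. The scan is
        # only a debugging aid; the freeze itself is the single increment.
        if check and any(self.get(p) == DIRTY_STABLE for p in range(len(self.raw))):
            raise ValueError("pages still dirty on stable store")
        self.cycle += 1  # every effective state steps down by one, mod 4
```

The point of the offset is that freezing a snapshot costs one increment rather than a pass over the whole table.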
What about system crashes? They guarantee that data will be lost. I know opinions are divided on the subject of crashes: You say Linux doesn't; everybody else says it does. I side with experience. (It does.) --
Not if it is mirrored and replicated. Also nice if crashes are very I say it does not crash often, to the point where I have not seen it crash once for any reason I did not create myself (I tend to wait for the occasional brown bag release to fade away before shifting development We do get quite a few reports of less mature systems like hald and usb causing problems, and not too long ago NFS client was very crash happy. I did see some of those myself two years ago, and fixed them. On the whole, Linux is very reliable. Very very reliable. Now mirror that, replicate it, add in 2 x 2 redundant power supplies backed by independent UPS units so you can do regular preemptive maintenance on the batteries, and you have a sweet enterprise transaction processing system. All set for a faster than light moon shot :-) Daniel --
if you are depending on replication over the network you have just limited your throughput to your network speed and latency. on an enterprise level machine the network can frequently be significantly slower than the disk array that you are so frantic to avoid waiting for. David Lang --
Replication does not work that way. On each replication cycle, the differences between the most recent two volume snapshots go over the network. This strategy has the nice effect of consolidating rewrites. There are also excellent delta compression opportunities. In the worst case, with insufficient bandwidth for the churn rate of the volume, replication rate increases to the time for replicating the full volume. Again, at worst, this would require extra storage for the snapshot to be replicated equivalent to the original volume size, so that the primary volume is not forced to wait synchronously for a replication cycle to complete. Mirroring on the other hand, makes a realtime copy of a volume, that is never out of date. Frantic... your word. Designing for dependably high transaction rates requires a different mode of thinking that some traditionalists seem to be having some trouble with. Daniel --
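The rewrite-consolidating effect of snapshot-based replication can be shown with a toy model. This is not how ddsnap is implemented: snapshots are modelled here as equal-length lists of chunks, and both function names are hypothetical.

```python
def snapshot_delta(old, new):
    """Chunks that differ between two equal-length snapshots, keyed by index."""
    return {i: data for i, data in enumerate(new) if old[i] != data}


def apply_delta(replica, delta):
    """Bring a replica holding the old snapshot up to the new one."""
    for i, data in delta.items():
        replica[i] = data


old = [b"a", b"b", b"c"]
volume = list(old)
for data in (b"x", b"y", b"z"):   # chunk 1 is rewritten three times
    volume[1] = data
new = list(volume)                # next snapshot

delta = snapshot_delta(old, new)  # only the final contents of chunk 1
replica = list(old)
apply_delta(replica, delta)
```

However many times a chunk is rewritten between two snapshots, only its final contents cross the network in that cycle.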
so just mirror to a local disk array then. a local disk array has more write bandwidth than a network connection to a remote machine, so if you can mirror to a remote machine you can mirror to if by traditionalists you mean everyone who makes a living keeping systems running you are right. we want sane failure modes as much as we want performance. there will be times when we decide to go for speed at the expense of safety, but we want to do it knowingly, not when someone is promising both and only provides speed. and by the way, if the Violin box uses your software they have just moved from a resource for me to tap when needed to something that I will advise my company to avoid at all costs. David Lang --
Great idea. Except that the disk array has millisecond level latency, So you could potentially connect to a _huge_ disk array and write deltas to it. The disk array would have to support roughly 3 Gbytes/second of write bandwidth to keep up with the Violin ramdisk. Doable, but you are now in the serious heavy iron zone. Personally, I like my nice simple design a lot more. Just mirror it, as many times as you need to satisfy your paranoia. Or how about go write your own? Daniel --
Just a point of information, most of the mid-tier and above disk arrays can do replication/mirroring behind the scene (i.e., you write to one array and it takes care of replicating your write to one or more other arrays). This behind the scene replication can be over various types of connections - IP or fibre channel probably are the two most common paths. That will still leave you with the normal latency for a small write to an array which is (when you hit cache) order of 1-2 ms... ric --
your network will do less then 1 Gbit/sec, so to mirror in real-time (what you claim is trivial) you would need at least 24 network connections in parallel. that's a LOT harder to setup then a high performance disk array. David Lang --
by the way, the only way to get this much bandwidth between two machines is to directly connect PCI-e/16 card slots together. this is definitely not commodity hardware anymore (if it's even possible, PCI-e has some very short distance limitations) David Lang --
You can do that with 3 10GE NICS, though in practise that's not easy. Willy --
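The link counts quoted in the last few messages follow from simple arithmetic on the 3 Gbytes/second figure. A small sketch, assuming ideal links with no protocol overhead (real links deliver less, hence "at least"); `links_needed` is a hypothetical helper name.

```python
import math


def links_needed(payload_gbytes_per_s, link_gbits_per_s):
    """Number of parallel links needed to carry a payload, ignoring all overhead."""
    required_gbits_per_s = payload_gbytes_per_s * 8  # bytes -> bits
    return math.ceil(required_gbits_per_s / link_gbits_per_s)


gige_links = links_needed(3, 1)       # gigabit ethernet
ten_ge_links = links_needed(3, 10)    # 10GE
```

3 Gbytes/second is 24 Gbit/second, so 24 gigabit links or 3 10GE links at theoretical line rate.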
I think you've just tried to obfuscate the truth. As you have described, replication does not provide full protection against data loss; it loses all changes since last cycle. Recall that it was you who introduced the word "replication", in the context of guaranteeing no loss of data. Then you ignored David's point about the relatively low speed of networks, remarking only that mirroring is real-time. Reading between your words makes clear that "mirroring and replication" does You've rather under-valued dependability, though. Even your idea of mirroring systems is incomplete, because failure of the principal system requires transparent fail-over to the redundant system, which is actually quite challenging, especially with commodity systems cobbled together in the way you promote. Remember that you claimed microsecond-level transaction times, and 6-nines of availability. The former seems unlikely with replicated systems and, in the event of a failure, you won't achieve the latter. You still haven't investigated the benefit of your idea over a whopping great buffer cache. What's the point in all of this if it turns out, as Alan hinted should be the case, that a big buffer cache gives much the same performance? You appear to have gone to a great deal of effort without having performed quite simple yet obvious experiments. --
You are twisting words. I may have said that replication provides a point-in-time copy of a volume, which is exactly what it does, no more, A big buffer cache does not provide a guarantee that the dirty cache data is saved to disk when line power is lost. If you would like to add that feature to the Linux buffer cache, then please do it, or make whichever other contribution you wish to make. If you just want to explain to me one more time that Linux, batteries, whatever, cannot be relied on, then please do not include me in the CC list. Daniel --
You said that you could achieve a certain performance, and later you said that for reliability you could use mirroring and replication but you never said that would lead to a performance hit. In fact you don't seem to be able to offer performance AND robustness; for performance you can only offer that level of robustness attainable on a single system, which means I think even you agreed was really not up to snuff for But the filesystem does offer a minimum level of consistency, which is missing from what you propose. You propose writing nothing unless line-power fails. The big buffer cache gives you all of the robustness of the underlying filesystem and including dirty buffer writes at some I haven't said that at all, other than as an axiom (which even you have agreed is fair) leading to comments on the results when something does fail. You keep saying that it won't ever fail, then that it will but that you can mitigate using redundant systems; and then you gloss over or refuse to face the attendant performance hit. Finally, you still have no idea whether your idea really does achieve a massive performance boost. You've never compared like amounts of RAM, nor the unsynced updates that most closely resemble your idea. In short, you've leaped on what seems to you to be a good idea and steadfastly refused to conduct even basic research. What's the point? You say don't cc you; I say go away, do that basic research, and come back when you have hard data. I really don't think you can ask for fairer than that. --
on_battery_power:
        sync
        mount / -oremount sync

...will of course work okay on any reasonable system. Not on yours, because you have to do

        echo i_really_mean_sync_when_i_say_sync > /hidden/file/somewhere
        sync

(...which also shows that you are cheating). Now, will you either do your homework and show that page cache is somehow unsuitable for your job, or just stop wasting the bandwidth with useless rants? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Speaking of useless rants... You need to go read the whole thread again, you missed the main bit. Daniel --
It completely changes the method to power it and the time the data may remain in RAM. The Smart 3200 I have right here simply has lithium batteries directly connected to the static RAM chips. Very low risk of power failure. The way you presented your work shows it relies on a UPS to sustain the PC's power supply, which in turn keeps the PC alive, which in turn tries not to reboot to keep its RAM consistent. There are a lot of reasons here to get a failure. Don't get me wrong, I still think your project has a lot of uses. But you have to admit that there are huge differences between using it in an appliance with battery-backed RAM which is able to recover data after a system crash, power outage or anything, and the average Joe's PC set up as an NFS server for the company with a cheap UPS to try not to lose the data should a power outage occur. I agree, but in this case, you should present it this way. You have been insisting too much on the average PC's reliability, the fact that no kernel ever crashed for you, etc... So you are demonstrating that your product is good provided that everything goes perfectly. All people who have experienced software or hardware problems in the past (ie mostly everyone here) will not trust your code because it relies on pre-requisites they know they do not My goal is not to replace RAM with flash, but disk with flash. You are against ordered writes for a performance reason. Use SSD instead of hard drives and it will be as fast as sequential writes. Also, when you say that enterprise scale flash is not there, I don't agree. You can already afford hundreds of gigs of flash in 3.5" form factor. A 1.6 TB SSD has even been presented at CES2008, with sales announced for Q3. So clearly this will replace your hard drives soon, very soon. Even if it costs $5k, that's a very acceptable solution to replace a disk in a RAM-speed appliance. Willy --
It already has ordered write when it is in flush mode. OK, I hear you. There will be an ordered write mode that uses barriers to decide the ordering. It will greatly reduce the speed at which ramback can flush dirty data because of the need to wait synchronously on every barrier, of which there are many. And thus will widen out the window during which UPS power must remain available if power goes out, in order to get all acknowledged transactions on to stable media. The advantage is, the stable media always has a point-in-time version of the filesystem. Don't expect this mode in the immediate future though, there are bugs to fix in the current driver, which already implements the required That would have been a miscommunication then. I see arguments coming in that suggest embedded solutions, EMC for example, are inherently more reliable than a Linux based solution. Well guess what? Some of those embedded solutions already use Linux. Also, peecees are much more reliable than people give them credit for, especially if you harden up the obvious points of failure such as fans and spinning disks. Once you have your system all hardened up, then you _still_ better replicate your important data. Perhaps I should not admit this, but I simply fail to do that on the machine from which I am posting right now, which also runs my web server and mail system. That is because I would have to reboot it to install ddsnap so I can replicate properly, and because the thing is so darn reliable that I just have not gotten around to it. I do copy off the important files from time to time though, and do various other things to ameliorate the risk. If Exactly what I mean: close but not there. Those gigantic RAM boxes are shipping now, and the same company has got a 5 TB flash box coming down the pipe, and sooner than Q3. But the RAM box will always outperform the flash box. You just keep throwing writes at it until all available flash is in erase mode, and the thing slows down. ...
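The ordered write mode described at the top of the message above can be pictured roughly as follows. This is only a sketch under assumptions: the write log, the `BARRIER` marker, and the `split_epochs` and `flush_ordered` helpers are all hypothetical, and the `power_budget` parameter simply stands in for the UPS running out after some number of barrier waits. It is not the ramback driver's code.

```python
BARRIER = object()  # hypothetical marker separating groups of writes in the log


def split_epochs(log):
    """Group (chunk, data) writes into epochs delimited by barriers."""
    epochs, current = [], {}
    for entry in log:
        if entry is BARRIER:
            if current:
                epochs.append(current)
                current = {}
        else:
            chunk, data = entry
            current[chunk] = data  # a later write to the same chunk in an epoch wins
    if current:
        epochs.append(current)
    return epochs


def flush_ordered(log, backing, power_budget=None):
    """Flush epochs strictly in order; stop early if the power budget runs out.

    Writes inside one epoch may go out in any order (here: sorted by chunk),
    but the next epoch only starts once the previous one is fully on disk.
    Returns the number of epochs that reached the backing store.
    """
    flushed = 0
    for epoch in split_epochs(log):
        if power_budget is not None and flushed >= power_budget:
            break
        for chunk in sorted(epoch):
            backing[chunk] = epoch[chunk]
        flushed += 1  # the synchronous wait on the barrier would happen here
    return flushed


log = [(0, "a1"), (1, "b1"), BARRIER, (0, "a2"), BARRIER, (2, "c1")]
backing = {}
done = flush_ordered(log, backing, power_budget=2)
```

With power for only two epochs, the backing store ends up as the volume looked at the second barrier: the last write is lost, but what is on disk is a consistent point-in-time image. The cost is the one named in the message, a synchronous wait per barrier, which lengthens the window the UPS has to cover.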
it will mean that the window is larger, but it will also mean that if something else goes wrong and that window is not available the data that was written out will be useable (recent data will be lost, but older data will still be available)

as for things that can go wrong

  the UPS battery can go bad
  you can have multiple power failures in a short time so your battery is not fully charged
  capacitors in the UPS can go bad
  capacitors in the power supply can go bad
  capacitors on the motherboard can go bad
  a kernel bug can crash the system
  a bug in a device driver (say nvidia graphics driver) can crash the system
  a card in the system can lock up the system bus
  the system power supply can die
  the system fans can die and cause the system to overheat
  cooling in the room the system is in can fail and cause the system to overheat
  airflow to the computer can get blocked and cause the system to overheat
  some other component in the computer can short out and cause the system to lose power internally

I have had every single one of these things happen to me over the years. Some on personal equipment, some on work equipment. At work I recently had a series of disasters where capacitors in a 7 figure UPS blew up, and a few days later during a power outage when we were running on generator, a fuel company made a mistake while adding fuel to the generator and knocked it out. Even if you spend millions on equipment and professionals to set it up and maintain it, you can still go down. You may not care about it on your system (because you copy data elsewhere and don't change it rapidly), but most people do. with your current approach you are slightly better than a couple shell scripts from an availability point of view, you are no better in performance, but your failure mode is complete disaster. comparing you to 'cp drive ramdisk' at startup and 'rsync ramdisk drive' periodically and at shutdown you are faster at startup, close enough at shutdown as to be in the noise ...
Actually modern DRAM can be put into "self refresh" mode which doesn't need (nor allow) any external accesses. Not very practical in a typical PC case, though I think suspend to RAM uses it. Could be used for battery-backed RAID/disk controllers as well. Obviously it changes nothing WRT ramback. -- Krzysztof Halasa --
But their RAM does not depend on a lot of factors to remain valid and Securing every component simply reduces the risk of a loss of service. What is important with data is to know the consequences of loss of service. If that only means that no one can work and that the last second of work is lost, it's generally acceptable. If it means everything is lost to a corrupted No, you're replacing disk activity with RAM activity. But you keep disk as Sorry if I was not clear. I was not speaking about replacing the RAM with flash, but only the disks. You keep the RAM for the speed, and use flash for permanent storage instead of disks. No seek time, average RW speed now slightly better than disks, that combined with your ramdisk and ordered write-backs writes will have the best of both worlds : RAM speed and flash reliability. Willy --
For example? Anecdote time. Remember there used to be "brand name" floppy disks and generic floppy disks, and the brand name ones cost a lot more because they were supposedly safer? Well, big secret, studies were done and the no-name disks came out better. Why? Because selling at commodity prices the generic makers could not afford returns. So they made them well. It is like that with PCs. Supposedly you get a lot more reliability when you spend more money and buy all high end near-custom gear. In fact, the cheap stuff just keeps on chugging, because those guys can't afford to have it break. So please don't underestimate the reliability of a PC. There are bits of Linux that are undeniably dodgy. We get a lot of bug reports about usb for example, keyboards just quitting and it's not the keyboard's fault. Just say no to usb in a server, at least until some fundamental cleanup happens there. The worst bug I've seen in a server this year? A buggy bios in a Dell server that would issue a keyboard error and sit and wait for somebody to press F1 when there was no keyboard attached. That is embedded software for you. Personally, I think we do way better than that in Yes. Dual power supplies are highly recommended for this application. With dual power supplies you can carry out preemptive maintenance on So mirror two of them, I keep saying. If that is not good enough for you, then make it three way, and replicate for good measure. The thing is, none of that hurts the microsecond level performance, and it gets you whatever data security you desire. Whereas anything that requires waiting on disk transactions does hurt performance. Since my interest currently lies in high performance, that is where my effort goes. And do I need to say it: patches gratefully accepted. For my immediate application... hacking the kernel in comfort... just Right. What we are talking about is filling in a missing level in the cache hierarchy, something like: L1 .3 ...
I strongly disagree. Cheap PC hardware is not even close to the quality of a serious, branded machine. Often capacitors are missing from power lines, and the ones that are installed fail sooner. Cooling fans are lower quality and fail much sooner. Timing issues abound. There's a reason why an IBM is a better machine than a "Black-n-Gold": IBM value their name so when you have a problem, they have a problem. Buy generic and when you get a problem they already have your money and since they have no investment in their name, they have nothing more to care about. --
That's just nonsense in a consolidated market. You change to IBM, then to Dell, then to HP then again to IBM. Maybe you even try Sun. That causes you more grief than any one of them. I have seen people doing that in all industry branches and even privately. If you love brands, then your choice becomes very limited. That's the real reason for them being much more expensive. If you think machines and specs, then you have a much more clear picture. After a while you even have your own measures for failure rates of those components and can handle it. No matter which brand :-) Best Regards Ingo Oeser --
What I mean is that in a PC, RAM contents are very fragile :
 - weak batteries in your UPS => end of game
 - loose power cable between UPS and PC => end of game (BTW I have a customer who had such a problem, both cables had disconnected because of their own weight).
 - kernel panic => end of game
 - user error during planned maintenance => end of game
 - flaky driver writing to wrong memory location => can't trust your data
In a normal PC, even if the RAM itself is a reliable component (ECC, ...) a lot of such problems which may happen will render it unusable. If you have to reboot, your BIOS will clean it up for you. That's why people are trying to explain to you that linux is not reliable enough to work like this. Now if you have all your RAM on a PCI-E board with a battery and which is not initialized by the BIOS so that it survives reboots, it changes a LOT of things, because all the problems mentioned above go away. Let me repeat it, the problem is not that those components are too unreliable to build a transactional system, it is that used in this manner, a very simple failure of any of them is enough to lose/corrupt all of your data. That was not my experience when I was a student. We would buy very cheap diskettes which were only sold by 100. 20% of them were already defective, and 20% of the remaining ones could not keep our data till the next morning! I knew guys who finally stopped copying games due to those diskettes, so If you have understood what I explained above, now you'll understand that I'm not underestimating the reliability of my PC, just the fact that keeping access to my RAM contents involves a lot of components, any of which will I thought this stupidity disappeared about 5 years ago ? I was about to build PIC-based PS/2 "terminators" to plug into machines to avoid this I never spoke about waiting for disk transactions. The RAM must be the only source and target of user data. Disk is there for permanent storage and should be written ...
Not sure if things like SLR-2 or so are still available, except second hand. But they at least provide compatibility for some time. -- Krzysztof Halasa --
They don't care if it breaks after 12 months, and for components and addons they don't care if it breaks, they just blame the end user for mis-installation or 'incompatibility'. There is a huge difference in Perhaps. But if your cache can destroy the contents of the layer below in situations that do occur it isn't useful. If you can fix that then it obviously has a lot of potential. Alan --
Actually, it's worse than that. Users have been trained that when a
computer bluescreens and loses all of their data, it's either (a)
just the way things are, or (b) it's microsoft's fault. Worse yet,
thanks to things like PC benchmarks, hard drive manufacturers have in
the past been encouraged to do things like lie to the OS about when
things had hit the hard drive platter just to score higher numbers on
winbench.
All of this is why I've in the past summed all of this up as Ted's law
of PC class hardware, which is that PC class hardware is cr*p. :-)
- Ted
--
I don't think so. I remember we had much more problems with noname disks. And yes, certain brands had been problematic too, but most The real life can't agree with this at all. The servers keep working for years and the cheap stuff quit fast (if initially working, which Most BIOS (all I've seen in this Millennium) have an option to disable that. On a server board you can usually have a remote console, how could We already have RAM between L3 and Flash. The problem is flushing L1 to disk/flash takes time. -- Krzysztof Halasa --
Besides, some SAN Storage Devices do have that amount of RAM. However it is better protected than in your typical PC. With mirroring, it can be removed (including the battery packs) - and there is a procedure to actually replay the buffers once the new devices are in place. But that's not an argument against or in favor of Ramback, it's just two different things. You would be surprised how many databases run on write back mode disks without fdsync() and nobody cares :) Greetings Bernd --
Do you mean it should be replicated with a second ramback? That would be pretty pointless, since all failure modes would affect both. It's not like one ramback will survive a crash when the other doesn't. --
A second machine running a second ramback, on a second UPS pair. I thought that was obvious. Daniel --
It could, in a bit different location maybe, but it isn't a substitute for ordered writes. -- Krzysztof Halasa --
Not sure if I understand the question correctly but obviously a pair (mirror) of servers running "dangerous" ramback would survive a crash of one machine and we could practically eliminate the probability of both (all) machines crashing simultaneously. However, there are cheaper ways to achieve similar performance and even better reliability - including those battery-backed (RAI)Disk controllers. -- Krzysztof Halasa --
OK, so we are only searching for the cheapest way to achieve these kinds of speeds, for some given uptime and risk level requirements. That is a really interesting subject, but can we please leave it for a while so I can get some work done on the code itself? Thanks, Daniel --
The write back ones are also battery backed properly, and will switch to write through (flushing out the cache) on the first sniff of a low battery signal. The decent ones (the kind used in serious business) also let you swap the battery backed RAM module to another card in the event of a failure of a card so you can complete recovery. --
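The switch from write back to write through on a low battery signal, as described for these controllers, can be sketched as a toy model. `WriteCache` and its `on_low_battery` method are hypothetical names; dictionaries stand in for the RAM cache and the disk, and no real controller or driver interface is implied.

```python
class WriteCache:
    """Toy write cache that runs in writeback mode until a low-battery signal."""

    def __init__(self):
        self.ram = {}            # cached chunk contents
        self.disk = {}           # stand-in for stable storage
        self.dirty = set()       # chunks in ram that are newer than disk
        self.writethrough = False

    def write(self, chunk, data):
        self.ram[chunk] = data
        if self.writethrough:
            # Writethrough: the write reaches stable storage before we return.
            self.disk[chunk] = data
            self.dirty.discard(chunk)
        else:
            # Writeback: remember the chunk is dirty and return immediately.
            self.dirty.add(chunk)

    def on_low_battery(self):
        """Switch modes first, then flush everything that is still dirty."""
        self.writethrough = True
        for chunk in sorted(self.dirty):
            self.disk[chunk] = self.ram[chunk]
        self.dirty.clear()


cache = WriteCache()
cache.write(1, "x")                 # fast path: nothing on disk yet
disk_before = dict(cache.disk)
cache.on_low_battery()              # flush and change mode
cache.write(2, "y")                 # now goes straight through
```

Switching the mode before flushing means no new dirty data can accumulate while the backlog drains. Note that this flush writes dirty chunks in index order, not in the order they were written, which is the ordering objection raised in the following messages.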
Right, just like the Violin 1010, whose PCI-e cable can be hotplugged into a different server. Or plugged into two servers at the same time, because each 1010 has two PCI-e interfaces, so this can be done without manual intervention. See, we really are talking about the same thing. Except that ramback does it bigger and faster. Daniel --
On Sat, 15 Mar 2008 13:25:48 -0800 No because you don't honour the ordering and tag boundaries as they do. Alan --
Sophism. The statement was "battery backed properly" and "switch on first sniff", which is exactly how ramback works. Daniel --
So you apparently want three things:
a) ignoring fsync() and co on this device
b) disabling all write throttling on this device
c) never discarding cached data from this device
anything else i'm missing?
Alan already suggested the ramfs+writeback thread approach (possibly
with a little bit of help from the fs which could report just the dirty
regions), but i'm not sure even that is necessary.
(a) can be easily done (fixing the app, LD_PRELOAD or fs extension etc)
(b) couldn't the per-device write throttling be used to achieve this?
(c) shouldn't be impossible either, eg sticking PG_writeback comes to mind,
just the mm accounting needs to remain sane.
IOW can't this be done in a more generic way (and w/o a ramdisk in the
apples to oranges. what are the numbers for a nonjournalled disk-backed
fs and _without_ the sync? (You're not committing to stable storage anyway
so the sync is useless and if you don't respect the ordering so is the
journal)
artur
--
So that's what I've been doing wrong for all these years... -- Chris --
/proc is so 1990's. As your code has nothing to do with processes, please don't add new files in /proc/. sysfs is there for you to do Use debugfs for stuff like debug info like this. thanks, greg k-h --
Demonstrate some advantage and I will think about it. Daniel --
use of /proc is discouraged, if you insist on sticking with it in the face of opposition you will seriously hurt the chance of your patches being accepted. David Lang --
Again, as your code has nothing to do with "processes", please do not add new files to /proc. As you are a filesystem, why not /sys/fs/ ? It ends up with smaller code than procfs stuff as well, a good and nice advantage. thanks, greg k-h --
What about doing a similar thing as a device mapper target? Have a look at dm-cache, I know that development of that has stopped but it doesn't mean it couldn't be resurrected. It has an advantage that it is generic (any two block devices will do) and you don't need to populate the "cache" on start-up - it happens automatically through cache misses. Another use could be a flash based disk accelerator which may be pretty popular nowadays. Tvrtko --
It is a device mapper target (though there is no real advantage in that other than having a handy plug-in api). It does handle any two block devices, and it does populate on cache miss. But also has daemon-driven population, since it never makes sense to leave the backing disk idle then have to incur read latency because of that later. Regards, Daniel --
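The two population paths mentioned above, populate on cache miss and daemon-driven population, can be sketched together. `PopulatingCache`, `populate_step` and the `backing_reads` counter are hypothetical names for illustration; a list stands in for the backing disk, and this is not the actual device mapper target.

```python
class PopulatingCache:
    """Toy ramdisk that fills from a backing store on miss and in the background."""

    def __init__(self, backing):
        self.backing = backing                    # list of chunks on the slow device
        self.ram = [None] * len(backing)
        self.populated = [False] * len(backing)
        self.next_scan = 0                        # where the daemon resumes scanning
        self.backing_reads = 0                    # slow reads issued so far

    def _load(self, chunk):
        self.ram[chunk] = self.backing[chunk]
        self.populated[chunk] = True
        self.backing_reads += 1

    def read(self, chunk):
        """Application read: a miss pulls the chunk in, a hit is served from RAM."""
        if not self.populated[chunk]:
            self._load(chunk)
        return self.ram[chunk]

    def populate_step(self):
        """Daemon work item: load the next unpopulated chunk, if any remains."""
        while (self.next_scan < len(self.backing)
               and self.populated[self.next_scan]):
            self.next_scan += 1
        if self.next_scan == len(self.backing):
            return False                          # fully populated
        self._load(self.next_scan)
        return True

    def fully_populated(self):
        return all(self.populated)


cache = PopulatingCache(["a", "b", "c", "d"])
first = cache.read(2)               # miss: served from the backing store
while cache.populate_step():        # daemon fills the rest while the disk is idle
    pass
second = cache.read(0)              # hit: no further backing read
```

The daemon skips chunks already pulled in by misses, so each chunk is read from the backing device once, and application reads never wait on it once population completes.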
