Every little factor of 25 performance increase really helps. Ramback is a new virtual device with the ability to back a ramdisk by a real disk, obtaining the performance level of a ramdisk but with the data durability of a hard disk. To work this magic, ramback needs a little help from a UPS. In a typical test, ramback reduced a 25 second file operation[1] to under one second including sync. Even greater gains are possible for seek-intensive applications. The difference between ramback and an ordinary ramdisk is: when the machine powers down the data does not vanish because it is continuously saved to backing store. When line power returns, the backing store repopulates the ramdisk while allowing application io to proceed concurrently. Once fully populated, a little green light winks on and file operations once again run at ramdisk speed. So now you can ask some hard questions: what if the power goes out completely or the host crashes or something else goes wrong while critical data is still in the ramdisk? Easy: use reliable components. Don't crash. Measure your UPS window. This is not much to ask in order to transform your mild mannered hard disk into a raging superdisk able to leap tall benchmarks at a single bound. If line power goes out while ramback is running, the UPS kicks in and a power management script switches the driver from writeback to writethrough mode. Ramback proceeds to save all remaining dirty data while forcing each new application write through to backing store immediately. If UPS power runs out while ramback still holds unflushed dirty data then things get ugly. Hopefully a fsck -f will be able to pull something useful out of the mess. (This is where you might want to be running Ext3.) The name of the game is to install sufficient UPS power to get your dirty ramdisk data onto stable storage this time, every time. The basic design premise of ramback is alluringly simple: each write to a ramdisk sets a per-chunk dirty ...
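The premise above (a per-chunk dirty flag set on every write, background saving to backing store, and a switch from writeback to writethrough when line power is lost) can be sketched as a toy model. This is only an illustration under stated assumptions, not the ramback driver: the class name `RambackSketch`, its methods, and the in-memory list standing in for the disk are all hypothetical.

```python
class RambackSketch:
    """Toy model of a RAM copy in front of a slower backing store (hypothetical names)."""

    def __init__(self, num_chunks, chunk_size=4096):
        self.chunk_size = chunk_size
        self.ram = [bytes(chunk_size) for _ in range(num_chunks)]
        self.backing = [bytes(chunk_size) for _ in range(num_chunks)]  # stands in for the disk
        self.dirty = [False] * num_chunks  # one dirty flag per chunk
        self.writethrough = False          # False = writeback (normal mode)

    def write(self, chunk, data):
        """Application write: always lands in RAM first."""
        self.ram[chunk] = data
        if self.writethrough:
            # On UPS power: force the write through to backing store immediately.
            self.backing[chunk] = data
            self.dirty[chunk] = False
        else:
            # Normal mode: just mark the chunk dirty and return at RAM speed.
            self.dirty[chunk] = True

    def flush_one(self):
        """Background writeback step: save one dirty chunk, if there is one."""
        for chunk, is_dirty in enumerate(self.dirty):
            if is_dirty:
                self.dirty[chunk] = False
                self.backing[chunk] = self.ram[chunk]
                return chunk
        return None

    def on_line_power_lost(self):
        """What a power management script would trigger: switch mode, then drain."""
        self.writethrough = True
        while self.flush_one() is not None:
            pass

    def dirty_count(self):
        return sum(self.dirty)
```

In this model the amount of data at risk at any moment is `dirty_count()` chunks, which is the quantity the UPS window has to cover.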
What about doing a similar thing as a device mapper target? Have a look at dm-cache. I know that development of that has stopped, but it doesn't mean it couldn't be resurrected. It has an advantage that it is generic (any two block devices will do) and you don't need to populate the "cache" on start-up - it happens automatically through cache misses. Another use could be a flash based disk accelerator which may be pretty popular nowadays. Tvrtko --
It is a device mapper target (though there is no real advantage in that other than having a handy plug-in api). It does handle any two block devices, and it does populate on cache miss. But it also has daemon-driven population, since it never makes sense to leave the backing disk idle then have to incur read latency because of that later. Regards, Daniel --
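The two population paths mentioned here, on-demand population on a cache miss and daemon-driven population while the backing disk would otherwise be idle, might look roughly like this. It is a sketch with hypothetical names (`PopulatingCache`, `daemon_step`) and a plain list standing in for the backing device, not the actual device mapper target.

```python
class PopulatingCache:
    """Toy model: RAM cache filled both on read misses and by a background daemon."""

    def __init__(self, backing):
        self.backing = backing              # list of chunks standing in for the disk
        self.ram = [None] * len(backing)    # None means "not yet populated"
        self.next_unpopulated = 0           # daemon's scan position

    def read(self, chunk):
        """Application read: populate on miss, then serve from RAM."""
        if self.ram[chunk] is None:
            self.ram[chunk] = self.backing[chunk]
        return self.ram[chunk]

    def daemon_step(self):
        """Load one not-yet-populated chunk; return False once nothing is left."""
        while self.next_unpopulated < len(self.ram):
            chunk = self.next_unpopulated
            self.next_unpopulated += 1
            if self.ram[chunk] is None:     # skip chunks already filled by a miss
                self.ram[chunk] = self.backing[chunk]
                return True
        return False

    def fully_populated(self):
        return all(c is not None for c in self.ram)
```

Application reads never wait for the daemon to finish; they only pay disk latency for chunks the daemon has not reached yet.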
/proc is so 1990's. As your code has nothing to do with processes, please don't add new files in /proc/. sysfs is there for you to do Use debugfs for stuff like debug info like this. thanks, greg k-h --
Demonstrate some advantage and I will think about it. Daniel --
Again, as your code has nothing to do with "processes", please do not add new files to /proc. As you are a filesystem, why not /sys/fs/ ? It ends up with smaller code than procfs stuff as well, a good and nice advantage. thanks, greg k-h --
use of /proc is discouraged, if you insist on sticking with it in the face of opposition you will seriously hurt the chance of your patches being accepted. David Lang --
So that's what I've been doing wrong for all these years... -- Chris --
So you apparently want three things:
a) ignoring fsync() and co on this device
b) disabling all write throttling on this device
c) never discarding cached data from this device
anything else i'm missing?
Alan already suggested the ramfs+writeback thread approach (possibly
with a little bit of help from the fs which could report just the dirty
regions), but i'm not sure even that is necessary.
(a) can be easily done (fixing the app, LD_PRELOAD or fs extension etc)
(b) couldn't the per-device write throttling be used to achieve this?
(c) shouldn't be impossible either, eg sticking PG_writeback comes to mind,
just the mm accounting needs to remain sane.
IOW can't this be done in a more generic way (and w/o a ramdisk in the
apples to oranges. what are the numbers for a nonjournalled disk-backed
fs and _without_ the sync? (You're not committing to stable storage anyway
so the sync is useless and if you don't respect the ordering so is the
journal)
artur
--
Nice fiction - stuff crashes eventually - not that this isn't useful. For a long time simply loading a 2-3GB Ramdisk off hard disk has been a good Ext3 is only going to help you if the ramdisk writeback respects barriers Why not - providing you clear the dirty bit before the write and you check it again after ? And on the disk size as you are going to have to suck all the content back in presumably a log structure is not a big If you are prepared to go bigger than the fs chunk size so lose the ordering guarantees your chunk size really ought to be *big* IMHO Alan --
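One reading of the "clear the dirty bit before the write and check it again after" remark, whose quoted context is lost above, is the usual pattern for flushing a chunk that may be rewritten while the flush is in progress. The sketch below is an assumption about what was meant; `flush_chunk` and the `during_write` hook (which stands in for a concurrent application write) are hypothetical.

```python
def flush_chunk(ram, backing, dirty, chunk, during_write=None):
    """Flush one chunk; return True only if it is clean on backing store afterwards."""
    dirty[chunk] = False          # clear the flag BEFORE starting the disk write
    data = ram[chunk]             # the version being written out
    if during_write is not None:
        during_write()            # simulate an application write racing with the flush
    backing[chunk] = data         # the (slow) write to backing store
    # Check again AFTER the write: if the flag was set in the meantime, the chunk
    # holds newer data than what just reached the disk and must be flushed again.
    return not dirty[chunk]
```

Clearing the flag after the disk write instead would risk losing the record of a write that arrived mid-flush.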
Hi Alan, Nice to see so many redhatters taking an avid interest in storage :-) Right, and now with ramback you will be able to preserve that state and But that does not satisfy the requirement you snipped: * Applications need to be able to read and write ramback data during More accurately: in general, cannot transfer directly. The ramdisk may be external and not present a memory interface. Even an external ramdisk with a memory interface (the Violin box has this) would require extra programming to maintain cache consistency. Then there is the issue of ramdisks on the way that exceed the 40 bit physical addressing of current generation processors. Even for the simple case where the ramdisk is just part of the kernel unified cache, I would rather not go delving into that code when these transfers are on the slow path anyway. Application IO does its normal single copy_to/from_user thing. If somebody wants to fiddle with vm, the place to attack is right there. The copy_to/from_user can be eliminated (provided alignment requirements are met) using stupid page table tricks. In spite of Linus claiming there is no performance win "640K should be enough for anyone" The finer the granularity the faster the ramdisk syncs to backing store. The only attraction of coarse granularity I know of is shrinking the bitmap, which is currently not so big that it presents a problem. Your comment re fs chunk size reveals that I have failed to communicate the most basic principle of the ramback design: the backing store is not expected to represent a consistent filesystem state during normal operation. Only the ramdisk needs to maintain a consistent state, which I have taken care to ensure. You just need to believe in your battery, Linux and the hardware it runs on. Which of these do you mistrust? Regards, Daniel --
Actually no - ramback would be useless to this. You might crash and end Oh you mean "pray hard". e2fsck works well with typical disk style failures, it is not robust against random chunks vanishing. I know this as I've worked on and debugged a case where a raid card rebooted silently I was suggesting that you want log structure for the writeback disk so that you keep coherency and can recover it, an issue you seem intent on No I get that. You've ignored the fact I'm suggesting that design choice In a big critical environment - all three. Alan --
So then you know that people already rely on batteries in critical storage applications. So I do not understand why all the FUD from you. Particularly about Ext2/Ext3, which does recover well from random damage. You seem to be calling Linux unreliable. Daniel --
By "recover well", you must mean "loses massive swaths of data, leaving the system unbootable and with enormous numbers of user files missing." My experience. Expecting fsck to cover for missed writes is stupid. --
Whatever it can get off the disk it gets. It does a good job. If you don't think so, then don't tell me, tell Ted. Daniel --
On Wed, 12 Mar 2008 22:14:16 -0800 He knows. Ext3 cannot recover well from massive loss of intermediate writes. It isn't a normal failure mode and there isn't sufficient fs metadata robustness for this. A log structured backing store would deal with that but all you apparently want to do is scream FUD at anyone who doesn't agree with you. Alan --
Scream is an exaggeration, and FUD only applies to somebody who consistently overlooks the primary proposition in this design: that the battery backed power supply, computer hardware and Linux are reliable enough to entrust your data to them. I say this is practical, you say it is impossible, I say FUD. All you are proposing is that nobody can entrust their data to any hardware. Good point. There is no absolute reliability, only degrees of it. Many raid controllers now have battery backed writeback cache, which is exactly the same reliability proposition as ramback, on a smaller scale. Do you refuse to entrust your corporate data to such controllers? Daniel --
RAID controllers do not have half a terabyte of RAM. Also, you are always invited to choose between speed (write back) and reliability (write through). Also, please note that the problem here is not related to the number of nines of availability. This number only counts the ratio between uptime and downtime. We're more facing a problem of MTBF, where the consequences of a failure are hard to predict. What I'm thinking about is that considering the fact that storage technologies are moving towards SSD (and I think 2008 will be the year of SSD), you should implement ordered writes (I've not said write through) since there's no seek time on those devices. Thus you will have the speed of RAM with the reliability of a properly synced FS. If your system crashes once a week, it will not be a problem anymore. Willy --
The write back ones are also battery backed properly, and will switch to write through (flushing out the cache) on the first sniff of a low battery signal. The decent ones (the kind used in serious business) also let you swap the battery backed RAM module to another card in the event of a failure of a card so you can complete recovery. --
Right, just like the Violin 1010, whose PCI-e cable can be hotplugged into a different server. Or plugged into two servers at the same time, because each 1010 has two PCI-e interfaces, so this can be done without manual intervention. See, we really are talking about the same thing. Except that ramback does it bigger and faster. Daniel --
On Sat, 15 Mar 2008 13:25:48 -0800 No because you don't honour the ordering and tag boundaries as they do. Alan --
Sophism. The statement was "battery backed properly" and "switch on first sniff", which is exactly how ramback works. Daniel --
And? Either you have battery backed ram with critical data in it or That is why I keep recommending that a ramback setup be replicated or mirrored, which people in this thread keep glossing over. When replicated or mirrored, you still get the microsecond-level transaction times, and you get the safety too. Then there is a big class of applications where the data on the ramdisk can be reconstructed, it is just a pain and reduces uptime. These are potential ramback users, and in fact I will be one of those, using it There will be a whole bunch of patches from me that are SSD oriented, over time. The fact is, enterprise scale ramdisks are here now, while enterprise scale flash is not. Getting close, but not here. And flash does not approach the write performance of RAM, not now and probably not ever. Daniel --
Do you mean it should be replicated with a second ramback? That would be pretty pointless, since all failure modes would affect both. It's not like one ramback will survive a crash when the other doesn't. --
It could, in a bit different location maybe, but it isn't a substitute for ordered writes. -- Krzysztof Halasa --
How so? Daniel --
Not sure if I understand the question correctly but obviously a pair (mirror) of servers running "dangerous" ramback would survive a crash of one machine and we could practically eliminate the probability of both (all) machines crashing simultaneously. However, there are cheaper ways to achieve similar performance and even better reliability - including those battery-backed (RAI)Disk controllers. -- Krzysztof Halasa --
OK, so we are only searching for the cheapest way to achieve these kinds of speeds, for some given uptime and risk level requirements. That is a really interesting subject, but can we please leave it for a while so I can get some work done on the code itself? Thanks, Daniel --
A second machine running a second ramback, on a second UPS pair. I thought that was obvious. Daniel --
Besides, some SAN storage devices do have that amount of RAM. However, it is better protected than in your typical PC. With mirroring, it can be removed (including the battery packs), and there is a procedure to actually replay the buffers once the new devices are in place. But that's not an argument against or in favor of Ramback, it's just two different things. You would be surprised how many databases run on write back mode disks without fdsync() and nobody cares :) Greetings Bernd --
It completely changes the method to power it and the time the data may remain in RAM. The Smart 3200 I have right here simply has lithium batteries directly connected to the static RAM chips. Very low risk of power failure. The way you presented your work shows it relies on a UPS to sustain the PC's power supply, which in turn keeps the PC alive, which in turn tries not to reboot to keep its RAM consistent. There are a lot of reasons here to get a failure. Don't get me wrong, I still think your project has a lot of uses. But you have to admit that there are huge differences between using it in an appliance with battery-backed RAM which is able to recover data after a system crash, power outage or anything, and the average Joe's PC set up as an NFS server for the company with a cheap UPS to try not to lose the data should a power outage occur. I agree, but in this case, you should present it this way. You have been insisting too much on the average PC's reliability, the fact that no kernel ever crashed for you, etc... So you are demonstrating that your product is good provided that everything goes perfectly. All people who have experienced software or hardware problems in the past (ie mostly everyone here) will not trust your code because it relies on pre-requisites they know they do not My goal is not to replace RAM with flash, but disk with flash. You are against ordered writes for a performance reason. Use SSD instead of hard drives and it will be as fast as sequential writes. Also, when you say that enterprise scale flash is not there, I don't agree. You can already afford hundreds of gigs of flash in 3,5" form factor. A 1.6 TB SSD has even been presented at CES2008, with sales announced for Q3. So clearly this will replace your hard drives soon, very soon. Even if it costs $5k, that's a very acceptable solution to replace a disk in a RAM-speed appliance. Willy --
It already has ordered write when it is in flush mode. OK, I hear you. There will be an ordered write mode that uses barriers to decide the ordering. It will greatly reduce the speed at which ramback can flush dirty data because of the need to wait synchronously on every barrier, of which there are many. And thus will widen out the window during which UPS power must remain available if power goes out, in order to get all acknowledged transactions on to stable media. The advantage is, the stable media always has a point-in-time version of the filesystem. Don't expect this mode in the immediate future though, there are bugs to fix in the current driver, which already implements the required That would have been a miscommunication then. I see arguments coming in that suggest embedded solutions, EMC for example, are inherently more reliable than a Linux based solution. Well guess what? Some of those embedded solutions already use Linux. Also, peecees are much more reliable than people give them credit for, especially if you harden up the obvious points of failure such as fans and spinning disks. Once you have your system all hardened up, then you _still_ better replicate your important data. Perhaps I should not admit this, but I simply fail to do that on the machine from which I am posting right now, which also runs my web server and mail system. That is because I would have to reboot it to install ddsnap so I can replicate properly, and because the thing is so darn reliable that I just have not gotten around to it. I do copy off the important files from time to time though, and do various other things to ameliorate the risk. If Exactly what I mean: close but not there. Those gigantic RAM boxes are shipping now, and the same company has got a 5 TB flash box coming down the pipe, and sooner than Q3. But the RAM box will always outperform the flash box. You just keep throwing writes at it until all available flash is in erase mode, and the thing slows down. If ...
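The trade-off described here, that waiting synchronously on every barrier lengthens the flush and so widens the window the UPS must cover, can be put into rough arithmetic. The model below is a deliberately simple assumption (flush time = dirty data divided by sustained disk throughput, plus a fixed wait per barrier); the function names and every number in the usage note are hypothetical and not measurements of ramback.

```python
def flush_seconds(dirty_bytes, disk_bytes_per_sec, barriers=0, barrier_wait_sec=0.0):
    """Estimated time to get all dirty data onto stable storage.

    barriers / barrier_wait_sec model an ordered mode that waits on each barrier;
    leave them at zero for an unordered flush.
    """
    return dirty_bytes / disk_bytes_per_sec + barriers * barrier_wait_sec


def ups_is_sufficient(ups_runtime_sec, flush_sec, safety_factor=2.0):
    """True if the UPS runtime covers the flush with the given safety margin."""
    return ups_runtime_sec >= flush_sec * safety_factor
```

For instance, with an assumed 8 GiB of dirty data and 64 MiB/s of sustained disk throughput, `flush_seconds(8 * 2**30, 64 * 2**20)` gives 128 seconds; adding an assumed 10,000 barriers at 10 ms each raises that to about 228 seconds, which a UPS good for 300 seconds would no longer cover at a 2x safety factor.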
But their RAM does not depend on a lot of factors to remain valid and Securing every component simply reduces the risk of a loss of service. What is important with data is to know the consequences of loss of service. If that only means that no one can work and that the last second of work is lost, it's generally acceptable. If it means everything is lost to a corrupted No, you're replacing disk activity with RAM activity. But you keep disk as Sorry if I was not clear. I was not speaking about replacing the RAM with flash, but only the disks. You keep the RAM for the speed, and use flash for permanent storage instead of disks. No seek time, and average RW speed now slightly better than disks; that, combined with your ramdisk and ordered write-backs, will give the best of both worlds: RAM speed and flash reliability. Willy --
For example? Anecdote time. Remember there used to be "brand name" floppy disks and generic floppy disks, and the brand name ones cost a lot more because they were supposedly safer? Well, big secret, studies were done and the no-name disks came out better. Why? Because selling at commodity prices the generic makers could not afford returns. So they made them well. It is like that with PCs. Supposedly you get a lot more reliability when you spend more money and buy all high end near-custom gear. In fact, the cheap stuff just keeps on chugging, because those guys can't afford to have it break. So please don't underestimate the reliability of a PC. There are bits of Linux that are undeniably dodgy. We get a lot of bug reports about usb for example, keyboards just quitting and it's not the keyboard's fault. Just say no to usb in a server, at least until some fundamental cleanup happens there. The worst bug I've seen in a server this year? A buggy bios in a Dell server that would issue a keyboard error and sit and wait for somebody to press F1 when there was no keyboard attached. That is embedded software for you. Personally, I think we do way better than that in Yes. Dual power supplies are highly recommended for this application. With dual power supplies you can carry out preemptive maintenance on So mirror two of them, I keep saying. If that is not good enough for you, then make it three way, and replicate for good measure. The thing is, none of that hurts the microsecond level performance, and it gets you whatever data security you desire. Whereas anything that requires waiting on disk transactions does hurt performance. Since my interest currently lies in high performance, that is where my effort goes. And do I need to say it: patches gratefully accepted. For my immediate application... hacking the kernel in comfort... just Right. What we are talking about is filling in a missing level in the cache hierarchy, something like: L1 .3 ns ...
I don't think so. I remember we had many more problems with noname disks. And yes, certain brands had been problematic too, but most Real life can't agree with this at all. The servers keep working for years and the cheap stuff quits fast (if initially working, which Most BIOSes (all I've seen in this Millennium) have an option to disable that. On a server board you can usually have a remote console, how could We already have RAM between L3 and Flash. The problem is flushing L1 to disk/flash takes time. -- Krzysztof Halasa --
They don't care if it breaks after 12 months, and for components and addons they don't care if it breaks, they just blame the end user for mis-installation or 'incompatibility'. There is a huge difference in Perhaps. But if your cache can destroy the contents of the layer below in situations that do occur it isn't useful. If you can fix that then it obviously has a lot of potential. Alan --
Actually, it's worse than that. Users have been trained that when a
computer bluescreens and loses all of their data, it's either (a)
just the way things are, or (b) it's microsoft's fault. Worse yet,
thanks to things like PC benchmarks, hard drive manufacturers have in
the past been encouraged to do things like lie to the OS about when
things had hit the hard drive platter just to score higher numbers on
winbench.
All of this is why I've in the past summed all of this up as Ted's law
of PC class hardware, which is that PC class hardware is cr*p. :-)
- Ted
--
What I mean is that in a PC, RAM contents are very fragile:
- weak batteries in your UPS => end of game
- loose power cable between UPS and PC => end of game (BTW I have a customer who had such a problem, cables had both disconnected because of their own weight).
- kernel panic => end of game
- user error during planned maintenance => end of game
- flaky driver writing to wrong memory location => can't trust your data
In a normal PC, even if the RAM itself is a reliable component (ECC, ...) a lot of such problems which may happen will render it unusable. If you have to reboot, your BIOS will clean it up for you. That's why people are trying to explain to you that linux is not reliable enough to work like this. Now if you have all your RAM on a PCI-E board with a battery and which is not initialized by the BIOS so that it survives reboots, it changes a LOT of things, because all the problems mentioned above go away. Let me repeat it, the problem is not that those components are too unreliable to build a transactional system, it is that used in this manner, a very simple failure of any of them is enough to lose/corrupt all of your data. That was not my experience when I was a student. We would buy very cheap diskettes which were only sold by 100. 20% of them were already defective, and 20% of the remaining ones could not keep our data till the next morning! I knew guys who finally stopped copying games due to those diskettes, so If you have understood what I explained above, now you'll understand that I'm not underestimating the reliability of my PC, just the fact that keeping access to my RAM contents involves a lot of components, any of which will I thought this stupidity disappeared about 5 years ago ? I was about to build PIC-based PS/2 "terminators" to plug into machines to avoid this I never spoke about waiting for disk transactions. The RAM must be the only source and target of user data. Disk is there for permanent storage and should ...
Not sure if things like SLR-2 or so are still available, except second hand. But they at least provide compatibility for some time. -- Krzysztof Halasa --
I strongly disagree. Cheap PC hardware is not even close to the quality of a serious, branded machine. Often capacitors are missing from power lines, and the ones that are installed fail sooner. Cooling fans are lower quality and fail much sooner. Timing issues abound. There's a reason why an IBM is a better machine than a "Black-n-Gold": IBM value their name so when you have a problem, they have a problem. Buy generic and when you get a problem they already have your money and since they have no investment in their name, they have nothing more to care about. --
That's just nonsense in a consolidated market. You change to IBM, then to Dell, then to HP then again to IBM. Maybe you even try Sun. That causes you more grief than any one of them. I have seen people doing that in all industry branches and even privately. If you love brands, then your choice becomes very limited. That's the real reason for them being much more expensive. If you think machines and specs, then you have a much more clear picture. After a while you even have your own measures for failure rates of those components and can handle it. No matter which brand :-) Best Regards Ingo Oeser --
it will mean that the window is larger, but it will also mean that if something else goes wrong and that window is not available the data that was written out will be useable (recent data will be lost, but older data will still be available)
as for things that can go wrong
- the UPS battery can go bad
- you can have multiple power failures in a short time so your battery is not fully charged
- capacitors in the UPS can go bad
- capacitors in the power supply can go bad
- capacitors on the motherboard can go bad
- a kernel bug can crash the system
- a bug in a device driver (say nvidia graphics driver) can crash the system
- a card in the system can lock up the system bus
- the system power supply can die
- the system fans can die and cause the system to overheat
- cooling in the room the system is in can fail and cause the system to overheat
- airflow to the computer can get blocked and cause the system to overheat
- some other component in the computer can short out and cause the system to lose power internally
I have had every single one of these things happen to me over the years. Some on personal equipment, some on work equipment. At work I recently had a series of disasters where capacitors in a 7 figure UPS blew up, and a few days later during a power outage when we were running on generator, a fuel company made a mistake while adding fuel to the generator and knocked it out. Even if you spend millions on equipment and professionals to set it up and maintain it, you can still go down. You may not care about it on your system (because you copy data elsewhere and don't change it rapidly), but most people do.
with your current approach you are slightly better than a couple shell scripts from an availability point of view, you are no better in performance, but your failure mode is complete disaster. comparing you to 'cp drive ramdisk' at startup and 'rsync ramdisk drive' periodically and at shutdown you are faster at startup, close enough at shutdown as to be in the noise (eit...
Actually modern DRAM can be put into "self refresh" mode which doesn't need (nor allow) any external accesses. Not very practical in typical PC case, though I think suspend to RAM uses it. Could be used for battery - backed RAID/disk controller as well. Obviously it changes nothing WRT ramback. -- Krzysztof Halasa --
It makes a lot of difference, and in addition raid controllers (good ones) respect barrier ordering in their RAM cache so they'll take tags or Either you keep a mirror in sync and get normal data rates or you keep the mirror out of sync and then you need to sort your writeback process out to preserve ordering. If you want ramback to be taken seriously then that is the interesting problem to solve and clearly has multiple solutions if you would start to take an objective look at your work. --
Ramback should obviously respect barriers, and it does, though at present only in the crude, default way of letting the block layer handle it. But interpreting a barrier to mean flush through to rotating media... performance will drop to the millisecond per transaction zone, like a normal disk. Not what ramback users want in normal operating mode. Flush mode, yes. Even raid controllers... so you agree that some of them just don't respond conservatively to tagged commands, either because the engineers don't know how to implement that (unlikely) or because they want to win the performance benchmarks, and they do trust their battery? "Some raid controllers" is just as good for my argument as "all raid controllers". Nobody is telling you which raid controller to use in your own personal system. I will pick the fast one and you can pick Ramback already is taken seriously, just not by you. That is fine, you apparently do not need or want the speed. Anyway, please do not get the impression that I am ignoring your ideas. There are some nice, intermediate modes that ramback could and in my opinion, should implement, to give users more options on how to trade off performance against resilience. I just need to make it clear that ramback, as conceived, already gives system builders the capability they need to achieve microsecond level transaction throughput and data safety at the same time... given a reliable battery, which is where we started. Daniel --
That isn't anything to do with what was being proposed. *ORDERING* not The ones that don't respect tagged ordering are the ultra cheap nasty things you buy down the local computer store that come with a 2 page manual in something vaguely like English. The stuff used for real work is I want the speed and reliability. Without that ramback is a distraction You have no guarantee of commit to stable storage so your use of the word "transaction" is a bit farcical. There are a whole variety of ways to get far better results than "whoops bang there goes the file system". Log structured backing media is one, even snapshots. That way you'd quantify that for the cost of more rotating storage (which is cheap) you can only lose "x" minutes of data and will lose everything from a defined consistent point. File based backing store also has similar properties done right, but needs some higher level care to track closure and dirty blocks on a per inode basis. Alan --
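The property claimed for a log structured backing store, that after a failure everything is lost only back to a defined consistent point, can be illustrated with a generic append-only log and checkpoint markers. This is a textbook-style sketch and an assumption about the general idea, not a design proposed in this thread; `CHECKPOINT`, `recover`, and the record layout are hypothetical.

```python
CHECKPOINT = object()  # marker appended to the log once a consistent point is on disk


def recover(log, num_chunks, blank=b""):
    """Rebuild a volume from an append-only log of (chunk, data) records.

    Records after the last CHECKPOINT may be incomplete or unordered, so they
    are ignored: recovery returns the state at the last consistent point.
    """
    last = max((i for i, rec in enumerate(log) if rec is CHECKPOINT), default=-1)
    volume = [blank] * num_chunks
    for rec in log[:last + 1]:
        if rec is not CHECKPOINT:
            chunk, data = rec
            volume[chunk] = data
    return volume
```

For example, `recover([(0, b"a"), (1, b"b"), CHECKPOINT, (0, b"c")], 2)` yields the volume as of the checkpoint; the trailing write to chunk 0 is the bounded loss.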
This is where you have made a fundamental mistake in your proposal. Suppose you have a steady, heavy write load onto ramback. Eventually, the entire ramdisk will be dirty and you have to drop back to disk speed, right? My design does not suffer from that problem, but your proposal does. It gets worse than that. Suppose somebody writes the same region twice, how do you order that? Do you try to store that new data somewhere, keeping in mind that we are already at terabyte scale? Is Somebody has. But please feel free to solve some other problem. I The UPS provides a guarantee of commit to stable storage. No amount of FUD will change that. But please go ahead and calculate the risks involved. I am confident you will admit that there are standard techniques available to ameliorate risk, which may be applied _on top of_ ramback, thus not destroying its microsecond-level transaction performance as you propose. Daniel --
What about system crashes? They guarantee that data will be lost. I know opinions are divided on the subject of crashes: You say Linux doesn't; everybody else says it does. I side with experience. (It does.) --
Not if it is mirrored and replicated. Also nice if crashes are very I say it does not crash often, to the point where I have not seen it crash once for any reason I did not create myself (I tend to wait for the occasional brown bag release to fade away before shifting development We do get quite a few reports of less mature systems like hald and usb causing problems, and not too long ago NFS client was very crash happy. I did see some of those myself two years ago, and fixed them. On the whole, Linux is very reliable. Very very reliable. Now mirror that, replicate it, add in 2 x 2 redundant power supplies backed by independent UPS units so you can do regular preemptive maintenance on the batteries, and you have a sweet enterprise transaction processing system. All set for a faster than light moon shot :-) Daniel --
if you are depending on replication over the network you have just limited your throughput to your network speed and latency. on an enterprise level machine the network can frequently be significantly slower than the disk array that you are so frantic to avoid waiting for. David Lang --
Replication does not work that way. On each replication cycle, the differences between the most recent two volume snapshots go over the network. This strategy has the nice effect of consolidating rewrites. There are also excellent delta compression opportunities. In the worst case, with insufficient bandwidth for the churn rate of the volume, replication rate increases to the time for replicating the full volume. Again, at worst, this would require extra storage for the snapshot to be replicated equivalent to the original volume size, so that the primary volume is not forced to wait synchronously for a replication cycle to complete. Mirroring on the other hand, makes a realtime copy of a volume, that is never out of date. Frantic... your word. Designing for dependably high transaction rates requires a different mode of thinking that some traditionalists seem to be having some trouble with. Daniel --
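As a rough illustration of the replication cycle described above, here is a minimal Python sketch of shipping only the difference between two volume snapshots. It is not ramback's code or any real replication tool's code: the chunk-level model, the function names (`snapshot`, `snapshot_delta`, `apply_delta`) and the sample writes are all assumptions made for illustration.

```python
# Hypothetical model: a volume is a list of fixed-size chunks.

def snapshot(volume):
    """Return a point-in-time copy of the volume (assumed cheap here)."""
    return list(volume)

def snapshot_delta(old, new):
    """Chunks that differ between two snapshots, keyed by chunk index."""
    return {i: new[i] for i in range(len(new)) if old[i] != new[i]}

def apply_delta(replica, delta):
    """Bring a remote replica up to date with one replication cycle."""
    for index, data in delta.items():
        replica[index] = data

volume = [b"\0" * 4 for _ in range(8)]
replica = snapshot(volume)    # remote copy, in sync at the start
previous = snapshot(volume)   # snapshot taken at the last cycle

# Four writes hit the volume between cycles; chunk 2 is rewritten three times.
writes = [(2, b"aaaa"), (2, b"bbbb"), (5, b"cccc"), (2, b"dddd")]
for index, data in writes:
    volume[index] = data

current = snapshot(volume)
delta = snapshot_delta(previous, current)
apply_delta(replica, delta)

print(len(writes), "writes,", len(delta), "chunks sent over the network")
```

Only the final contents of each changed chunk cross the network, which is the rewrite consolidation mentioned above. Any write made after `current` was taken is not on the replica until the next cycle.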
I think you've just tried to obfuscate the truth. As you have described, replication does not provide full protection against data loss; it loses all changes since last cycle. Recall that it was you who introduced the word "replication", in the context of guaranteeing no loss of data. Then you ignored David's point about the relatively low speed of networks, remarking only that mirroring is real-time. Reading between your words makes clear that "mirroring and replication" does You've rather under-valued dependability, though. Even your idea of mirroring systems is incomplete, because failure of the principal system requires transparent fail-over to the redundant system, which is actually quite challenging, especially with commodity systems cobbled together in the way you promote. Remember that you claimed microsecond-level transaction times, and 6-nines of availability. The former seems unlikely with replicated systems and, in the event of a failure, you won't achieve the latter. You still haven't investigated the benefit of your idea over a whopping great buffer cache. What's the point in all of this if it turns out, as Alan hinted should be the case, that a big buffer cache gives much the same performance? You appear to have gone to a great deal of effort without having performed quite simple yet obvious experiments. --
You are twisting words. I may have said that replication provides a point-in-time copy of a volume, which is exactly what it does, no more, A big buffer cache does not provide a guarantee that the dirty cache data is saved to disk when line power is lost. If you would like to add that feature to the Linux buffer cache, then please do it, or make whichever other contribution you wish to make. If you just want to explain to me one more time that Linux, batteries, whatever, cannot be relied on, then please do not include me in the CC list. Daniel --
on_battery_power:
    sync
    mount / -oremount sync

...will of course work okay on any reasonable system. Not on yours, because you have to do

    echo i_really_mean_sync_when_i_say_sync > /hidden/file/somewhere
    sync

(...which also shows that you are cheating). Now, will you either do your homework and show that page cache is somehow unsuitable for your job, or just stop wasting the bandwidth with useless rants? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Speaking of useless rants... You need to go read the whole thread again, you missed the main bit. Daniel --
You said that you could achieve a certain performance, and later you said that for reliability you could use mirroring and replication but you never said that would lead to a performance hit. In fact you don't seem to be able to offer performance AND robustness; for performance you can only offer that level of robustness attainable on a single system, which means I think even you agreed was really not up to snuff for But the filesystem does offer a minimum level of consistency, which is missing from what you propose. You propose writing nothing unless line-power fails. The big buffer cache gives you all of the robustness of the underlying filesystem and including dirty buffer writes at some I haven't said that at all, other than as an axiom (which even you have agreed is fair) leading to comments on the results when something does fail. You keep saying that it won't ever fail, then that it will but that you can mitigate using redundant systems; and then you gloss over or refuse to face the attendant performance hit. Finally, you still have no idea whether your idea really does achieve a massive performance boost. You've never compared like amounts of RAM, nor the unsynced updates that most closely resemble your idea. In short, you've leaped on what seems to you to be a good idea and steadfastly refused to conduct even basic research. What's the point? You say don't cc you; I say go away, do that basic research, and come back when you have hard data. I really don't think you can ask for fairer than that. --
so just mirror to a local disk array then. a local disk array has more write bandwidth than a network connection to a remote machine, so if you can mirror to a remote machine you can mirror to if by traditionalists you mean everyone who makes a living keeping systems running you are right. we want sane failure modes as much as we want performance. there will be times when we decide to go for speed at the expense of safety, but we want to do it knowingly, not when someone is promising both and only provides speed. and by the way, if the violin box uses your software they have just moved from a resource for me to tap when needed to something that I will advise my company to avoid at all costs. David Lang --
Great idea. Except that the disk array has millisecond level latency, So you could potentially connect to a _huge_ disk array and write deltas to it. The disk array would have to support roughly 3 Gbytes/second of write bandwidth to keep up with the Violin ramdisk. Doable, but you are now in the serious heavy iron zone. Personally, I like my nice simple design a lot more. Just mirror it, as many times as you need to satisfy your paranoia. Or how about go write your own? Daniel --
your network will do less than 1 Gbit/sec, so to mirror in real-time (what you claim is trivial) you would need at least 24 network connections in parallel. that's a LOT harder to set up than a high performance disk array. David Lang --
by the way, the only way to get this much bandwidth between two machines is to directly connect PCI-e/16 card slots together. this is definitely not commodity hardware anymore (if it's even possible, PCI-e has some very short distance limitations) David Lang --
You can do that with 3 10GE NICS, though in practise that's not easy. Willy --
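The link counts quoted in the last few messages follow from unit arithmetic on the roughly 3 Gbytes/second figure given earlier for the Violin ramdisk. The Python sketch below reproduces them under stated assumptions: decimal units, every link running at its full nominal rate, and no protocol overhead. The helper name `links_needed` is hypothetical.

```python
import math

# Figure taken from the thread: roughly 3 Gbytes/second of ramdisk writes.
RAMDISK_WRITE_GBYTES_PER_SEC = 3
BITS_PER_BYTE = 8

# Assumption: each link delivers its nominal line rate with no overhead.
required_gbit_per_sec = RAMDISK_WRITE_GBYTES_PER_SEC * BITS_PER_BYTE

def links_needed(required_gbit, link_gbit):
    """Smallest number of parallel links whose combined rate covers the load."""
    return math.ceil(required_gbit / link_gbit)

gige_links = links_needed(required_gbit_per_sec, 1)       # 1 Gbit/s Ethernet
ten_gige_links = links_needed(required_gbit_per_sec, 10)  # 10 Gbit/s Ethernet

print(required_gbit_per_sec, "Gbit/s to mirror in real time")
print(gige_links, "x 1GE links or", ten_gige_links, "x 10GE links")
```

Real links deliver less than line rate, so these counts are lower bounds rather than a sizing recommendation.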
Just a point of information, most of the mid-tier and above disk arrays can do replication/mirroring behind the scenes (i.e., you write to one array and it takes care of replicating your write to one or more other arrays). This behind-the-scenes replication can be over various types of connections - IP or fibre channel probably are the two most common paths. That will still leave you with the normal latency for a small write to an array which is (when you hit cache) order of 1-2 ms... ric --
So we've all noticed Alan --
You only have to care about ordering if there is a store barrier between the two (not usual). You only have to care about filling if you generate enough dirty blocks at a very high rate (which is unusual for most workloads). If you don't care about those then we have ramdisk already and if you want to write a ramdisk driver for external ramdisk great. You'd also fix the layering violations then by allowing device mapper to implement things like snapshotting and writeback separated from your driver. Even in the extreme case that you propose there are trivial ways of getting coherency. Simple example - if you can sweep all the data out in say 10 minutes then you can buy twice the physical media and ensure that one of the two sets of disk backups is genuinely store barrier consistent to some snapshot time (say every 30 minutes but obviously user tunable). If you at least had some kind of credible snapshotting you'd find people Stable storage to most people means "won't go away on a bad happening". Transaction likewise has a specific meaning in terms of an event occurring once only and either being recorded before or after the transaction occurred. Alan --
Hi Alan, According to you. A more accurate statement: if you have the ramdisk on the host, then the host is assumed to be reliable. If the ramdisk is external (http://www.violin-memory.com/products/violin1010.html) then your statement is untrue in every sense. But you did not address the logic of my statement above: that your fundamental design prevents you from operating at ramdisk speed during No wait, it is completely normal. There is a barrier on every journal Exactly the purpose for which this driver was written. And as a bonus it happens to be useful for internal ramdisk applications as well. (It Device mapper already can, so I do not get your point. Also, what is Hostility does not equate to accuracy. Galileo comes to mind. I see people arguing that a server+linux+batteries+mirroring+replication cannot achieve enterprise grade reliability. Balderdash. Regards, Daniel --
> Hostility does not equate to accuracy. Galileo comes to mind. I see no attempt to even discuss the use of two sets of physical storage to maintain coherent snapshots, just comments about hostility. That's a fairly poor way to repay people who spend a lot of time working with enterprise customers and are interested in solutions using things like giant ramdisks and are putting in time to discuss I look forward to seeing your constructive detailed analysis of failure modes based upon actual statistical data from real data centres. Unless you can produce that nobody is going to take you seriously, which is bad luck for the poor folks at violin if they are relying on you. Alan --
> You did not explain how your proposal will avoid dropping the transaction

Here is a simple but high physical storage using approach (but hey disks are cheap)

You walk across the ram dirty table writing out chunks to backing store 0. At some point in time you want a consistent snapshot so you pick the next write barrier point after this time and begin committing blocks dirtied after that moment to store 1 (with blocks before that moment being written to both). You don't permit more than one snapshot to be in progress at once so at some point you clear all the blocks for store 0. Your snapshotting interval is bounded by the time to write out the store, nor do you have to throttle writes to the ramdisk. You now have a consistent snapshot in store 0.

At the next time interval we finish off store 1 and spew new blocks to store 2, after 2 is complete we go with 2, 0 and then 1 as the stable store.

The only other real trick needed then is metadata, but you don't have to update that on disk too often and you only need two bits for each of the pages in RAM. For any page it is either

00 Clean on stable store
01 Clean on current writing snapshot
10 Dirty on stable store (and thus both)
11 Dirty on current writing snapshot (but clean, old on stable)

Pages go 00->11 or 01->11 when they are touched, 11->01 or 10->01 when they are written back. At the point we freeze a snapshot we move 01->00 11->10 00->11 and there are no pages in 10. And of course we don't update the big tables at this instant instead we store the page state as (value - cycle_count)&3 with each freeze moment doing cycle_count++;

The 00->11 is perhaps not obvious but the logic is fairly simple. The snapshot we are building does not magically contain the stable data from a prev
