Every little factor of 25 performance increase really helps.
Ramback is a new virtual device with the ability to back a ramdisk
by a real disk, obtaining the performance level of a ramdisk but with
the data durability of a hard disk. To work this magic, ramback needs
a little help from a UPS. In a typical test, ramback reduced a 25
second file operation[1] to under one second including sync. Even
greater gains are possible for seek-intensive applications.

The difference between ramback and an ordinary ramdisk is: when the
machine powers down the data does not vanish because it is continuously
saved to backing store. When line power returns, the backing store
repopulates the ramdisk while allowing application io to proceed
concurrently. Once fully populated, a little green light winks on and
file operations once again run at ramdisk speed.

So now you can ask some hard questions: what if the power goes out
completely or the host crashes or something else goes wrong while
critical data is still in the ramdisk? Easy: use reliable components.
Don't crash. Measure your UPS window. This is not much to ask in
order to transform your mild mannered hard disk into a raging superdisk
able to leap tall benchmarks at a single bound.

If line power goes out while ramback is running, the UPS kicks in and a
power management script switches the driver from writeback to
writethrough mode. Ramback proceeds to save all remaining dirty data
while forcing each new application write through to backing store
immediately.

If UPS power runs out while ramback still holds unflushed dirty data
then things get ugly. Hopefully a fsck -f will be able to pull
something useful out of the mess. (This is where you might want to be
running Ext3.) The name of the game is to install sufficient UPS power
to get your dirty ramdisk data onto stable storage this time, every
time.

The basic design premise of ramback is alluringly simple: each write to
a ramdisk sets a per-chunk dirty ...
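
A minimal sketch of the behaviour described above, written as a toy Python model rather than the actual driver code. The class name, method names and the 4 KiB chunk size are all assumptions made for illustration. Writes land in RAM and mark a per-chunk dirty flag, and a power-fail hook switches to writethrough and drains whatever is still dirty.

```python
class RambackSketch:
    """Toy model (hypothetical, not the ramback driver): a ramdisk whose
    chunks are saved to a slower backing store, tracked by dirty flags."""

    def __init__(self, num_chunks, chunk_size=4096):
        self.chunk_size = chunk_size
        self.ram = [bytes(chunk_size) for _ in range(num_chunks)]
        self.backing = [bytes(chunk_size) for _ in range(num_chunks)]
        self.dirty = [False] * num_chunks
        self.writethrough = False  # False means writeback mode

    def write(self, chunk, data):
        if len(data) != self.chunk_size:
            raise ValueError("data must be exactly one chunk")
        self.ram[chunk] = data
        if self.writethrough:
            # Writethrough: the backing store is updated immediately.
            self.backing[chunk] = data
            self.dirty[chunk] = False
        else:
            # Writeback: only remember that this chunk needs saving.
            self.dirty[chunk] = True

    def read(self, chunk):
        return self.ram[chunk]

    def flush_one(self):
        """Save one dirty chunk; return False when nothing is dirty."""
        for chunk, is_dirty in enumerate(self.dirty):
            if is_dirty:
                self.dirty[chunk] = False  # clear before copying
                self.backing[chunk] = self.ram[chunk]
                return True
        return False

    def on_power_fail(self):
        """Called when line power is lost: switch mode, then drain."""
        self.writethrough = True
        while self.flush_one():
            pass

    def dirty_count(self):
        return sum(self.dirty)
```

In this model the UPS window has to cover the time `on_power_fail` needs to drain every dirty chunk, which is the quantity the post asks you to measure.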
What about doing a similar thing as a device mapper target? Have a look at
dm-cache. I know that development of that has stopped, but that doesn't mean
it couldn't be resurrected. It has the advantage that it is generic (any
two block devices will do) and you don't need to populate the "cache" on
start-up: it happens automatically through cache misses.

Another use could be a flash based disk accelerator, which may be pretty
popular nowadays.

Tvrtko
--
It is a device mapper target (though there is no real advantage in that
other than having a handy plug-in api). It does handle any two block
devices, and it does populate on cache miss. But it also has daemon-driven
population, since it never makes sense to leave the backing disk idle
and then incur read latency because of that later.

Regards,
Daniel
--
/proc is so 1990's. As your code has nothing to do with processes,
please don't add new files in /proc/. sysfs is there for you to do

Use debugfs for stuff like debug info like this.
thanks,
greg k-h
--
Demonstrate some advantage and I will think about it.
Daniel
--
Again, as your code has nothing to do with "processes", please do not
add new files to /proc.

As you are a filesystem, why not /sys/fs/ ?
It ends up with smaller code than procfs stuff as well, a good and nice
advantage.

thanks,
greg k-h
--
Use of /proc is discouraged. If you insist on sticking with it in the face
of opposition, you will seriously hurt the chance of your patches being
accepted.

David Lang
--
So that's what I've been doing wrong for all these years...
-- Chris
--
So you apparently want three things:
a) ignoring fsync() and co on this device
b) disabling all write throttling on this device
c) never discarding cached data from this device

anything else i'm missing?
Alan already suggested the ramfs+writeback thread approach (possibly
with a little bit of help from the fs which could report just the dirty
regions), but i'm not sure even that is necessary.

(a) can be easily done (fixing the app, LD_PRELOAD or fs extension etc)
(b) couldn't the per-device write throttling be used to achieve this?
(c) shouldn't be impossible either, eg sticking PG_writeback comes to mind,
just the mm accounting needs to remain sane.

IOW can't this be done in a more generic way (and w/o a ramdisk in the
apples to oranges. what are the numbers for a nonjournalled disk-backed
fs and _without_ the sync? (You're not committing to stable storage anyway
so the sync is useless and if you don't respect the ordering so is the
journal)

artur
--
Nice fiction - stuff crashes eventually - not that this isn't useful. For
a long time simply loading a 2-3GB Ramdisk off hard disk has been a good

Ext3 is only going to help you if the ramdisk writeback respects barriers
Why not - providing you clear the dirty bit before the write and you
check it again after ? And on the disk size as you are going to have to
suck all the content back in presumably a log structure is not a big

If you are prepared to go bigger than the fs chunk size so lose the
ordering guarantees your chunk size really ought to be *big* IMHO

Alan
--
Hi Alan,
Nice to see so many redhatters taking an avid interest in storage :-)
Right, and now with ramback you will be able to preserve that state and
But that does not satisfy the requirement you snipped:
* Applications need to be able to read and write ramback data during
More accurately: in general, cannot transfer directly. The ramdisk may
be external and not present a memory interface. Even an external
ramdisk with a memory interface (the Violin box has this) would require
extra programming to maintain cache consistency. Then there is the
issue of ramdisks on the way that exceed the 40 bit physical addressing
of current generation processors.

Even for the simple case where the ramdisk is just part of the kernel
unified cache, I would rather not go delving into that code when these
transfers are on the slow path anyway. Application IO does its normal
single copy_to/from_user thing. If somebody wants to fiddle with vm,
the place to attack is right there. The copy_to/from_user can be
eliminated (provided alignment requirements are met) using stupid page
table tricks. In spite of Linus claiming there is no performance win

"640K should be enough for anyone"
The finer the granularity the faster the ramdisk syncs to backing
store. The only attraction of coarse granularity I know of is
shrinking the bitmap, which is currently not so big that it presents
a problem.

Your comment re fs chunk size reveals that I have failed to
communicate the most basic principle of the ramback design: the
backing store is not expected to represent a consistent filesystem
state during normal operation. Only the ramdisk needs to maintain a
consistent state, which I have taken care to ensure. You just need
to believe in your battery, Linux and the hardware it runs on. Which
of these do you mistrust?

Regards,
Daniel
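
For a rough sense of the granularity trade-off described in the message above (finer chunks sync faster, coarser chunks shrink the bitmap), here is a back-of-the-envelope calculation. The 512 GiB ramdisk size and the three chunk sizes are illustrative assumptions, not figures taken from the ramback code.

```python
def bitmap_bytes(ramdisk_bytes, chunk_bytes):
    """Bytes needed for a one-bit-per-chunk dirty bitmap."""
    chunks = -(-ramdisk_bytes // chunk_bytes)  # ceiling division
    return -(-chunks // 8)                     # 8 bits per byte, rounded up

GiB = 1 << 30
KiB = 1 << 10

# Hypothetical 512 GiB ramdisk at three assumed chunk sizes.
for chunk in (4 * KiB, 64 * KiB, 1024 * KiB):
    size = bitmap_bytes(512 * GiB, chunk)
    print(f"chunk {chunk // KiB:5d} KiB -> bitmap {size // KiB:6d} KiB")
```

Under these assumptions even 4 KiB chunks need only a 16 MiB bitmap, which is small next to the ramdisk it describes.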
--
Actually no - ramback would be useless to this. You might crash and end
Oh you mean "pray hard". e2fsck works well with typical disk style
failures, it is not robust against random chunks vanishing. I know this
as I've worked on and debugged a case where a raid card rebooted silently

I was suggesting that you want log structure for the writeback disk so
that you keep coherency and can recover it, an issue you seem intent on

No I get that. You've ignored the fact I'm suggesting that design choice
In a big critical environment - all three.
Alan
--
So then you know that people already rely on batteries in critical storage
applications. So I do not understand why all the FUD from you.

Particularly about Ext2/Ext3, which does recover well from random damage.
You seem to be calling Linux unreliable.
Daniel
--
By "recover well", you must mean "loses massive swabs of data, leaving
the system unbootable and with enormous numbers of user files missing."
My experience.

Expecting fsck to cover for missed writes is stupid.
--
Whatever it can get off the disk it gets. It does a good job. If you
don't think so, then don't tell me, tell Ted.

Daniel
--
On Wed, 12 Mar 2008 22:14:16 -0800
He knows. Ext3 cannot recover well from massive loss of intermediate
writes. It isn't a normal failure mode and there isn't sufficient fs
metadata robustness for this. A log structured backing store would deal
with that but all you apparently want to do is scream FUD at anyone who
doesn't agree with you.

Alan
--
Scream is an exaggeration, and FUD only applies to somebody who
consistently overlooks the primary proposition in this design: that the
battery backed power supply, computer hardware and Linux are reliable
enough to entrust your data to them. I say this is practical, you say
it is impossible, I say FUD.

All you are proposing is that nobody can entrust their data to any
hardware. Good point. There is no absolute reliability, only degrees
of it.

Many raid controllers now have battery backed writeback cache, which
is exactly the same reliability proposition as ramback, on a smaller
scale. Do you refuse to entrust your corporate data to such
controllers?

Daniel
--
RAID controllers do not have half a terabyte of RAM. Also, you are always
invited to choose between speed (write back) and reliability (write through).

Also, please note that the problem here is not related to the number of
nines of availability. This number only counts the ratio between uptime
and downtime. We're more facing a problem of MTBF, where the consequences
of a failure are hard to predict.

What I'm thinking about is that considering the fact that storage
technologies are moving towards SSD (and I think 2008 will be the
year of SSD), you should implement ordered writes (I've not said
write through) since there's no seek time on those devices. Thus
you will have the speed of RAM with the reliability of a properly
synced FS. If your system crashes once a week, it will not be a
problem anymore.

Willy
--
The write back ones are also battery backed properly, and will switch
to write through (flushing out the cache) on the first sniff of a low
battery signal.

The decent ones (the kind used in serious business) also let you swap the
battery backed RAM module to another card in the event of a failure of a
card so you can complete recovery.
--
Right, just like the Violin 1010, whose PCI-e cable can be hotplugged
into a different server. Or plugged into two servers at the same time,
because each 1010 has two PCI-e interfaces, so this can be done without
manual intervention.

See, we really are talking about the same thing. Except that ramback
does it bigger and faster.

Daniel
--
On Sat, 15 Mar 2008 13:25:48 -0800
No because you don't honour the ordering and tag boundaries as they do.
Alan
--
Sophism. The statement was "battery backed properly" and "switch on
first sniff", which is exactly how ramback works.

Daniel
--
And? Either you have battery backed ram with critical data in it or
That is why I keep recommending that a ramback setup be replicated or
mirrored, which people in this thread keep glossing over. When
replicated or mirrored, you still get the microsecond-level transaction
times, and you get the safety too.

Then there is a big class of applications where the data on the ramdisk
can be reconstructed, it is just a pain and reduces uptime. These are
potential ramback users, and in fact I will be one of those, using it

There will be a whole bunch of patches from me that are SSD oriented,
over time. The fact is, enterprise scale ramdisks are here now, while
enterprise scale flash is not. Getting close, but not here. And flash
does not approach the write performance of RAM, not now and probably
not ever.

Daniel
--
Do you mean it should be replicated with a second ramback? That would
be pretty pointless, since all failure modes would affect both. It's
not like one ramback will survive a crash when the other doesn't.
--
It could, in a bit different location maybe, but it isn't a substitute
for ordered writes.
--
Krzysztof Halasa
--
How so?
Daniel
--
Not sure if I understand the question correctly but obviously a pair
(mirror) of servers running "dangerous" ramback would survive a crash
of one machine and we could practically eliminate the probability of
both (all) machines crashing simultaneously. However, there are
cheaper ways to achieve similar performance and even better
reliability - including those battery-backed (RAI)Disk controllers.
--
Krzysztof Halasa
--
OK, so we are only searching for the cheapest way to achieve these
kinds of speeds, for some given uptime and risk level requirements.
That is a really interesting subject, but can we please leave it for a
while so I can get some work done on the code itself?

Thanks,
Daniel
--
A second machine running a second ramback, on a second UPS pair.
I thought that was obvious.

Daniel
--
Besides, some SAN storage devices do have that amount of RAM. However, it is
better protected than in your typical PC. With mirroring, it can be removed
(including the battery packs), and there is a procedure to actually replay
the buffers once the new devices are in place.

But that's not an argument against or in favor of ramback, it's just two
different things. You would be surprised how many databases run on write back
mode disks without fdsync() and nobody cares :)

Greetings
Bernd
--
It completely changes the method to power it and the time the data may
remain in RAM. The Smart 3200 I have right here simply has lithium
batteries directly connected to the static RAM chips. Very low risk of
power failure. The way you presented your work shows it relies on a UPS
to sustain the PC's power supply, which in turn keeps the PC alive,
which in turn tries not to reboot to keep its RAM consistent. There are
a lot of reasons here to get a failure.

Don't get me wrong, I still think your project has a lot of uses. But
you have to admit that there are huge differences between using it in
an appliance with battery-backed RAM which is able to recover data after
a system crash, power outage or anything, and the average Joe's PC setup
as an NFS server for the company with a cheap UPS to try not to lose the
data should a power outage occur.

I agree, but in this case, you should present it this way. You have been
insisting too much on the average PC's reliability, the fact that no kernel
ever crashed for you, etc... So you are demonstrating that your product is
good provided that everything goes perfectly. All people who have experienced
software or hardware problems in the past (ie mostly everyone here) will not
trust your code because it relies on pre-requisites they know they do not

My goal is not to replace RAM with flash, but disk with flash. You are
against ordered writes for a performance reason. Use SSD instead of
hard drives and it will be as fast as sequential writes. Also, when
you say that enterprise scale flash is not there, I don't agree. You
can already afford hundreds of gigs of flash in 3.5" form factor. A
1.6 TB SSD has even been presented at CES2008, with sales announced
for Q3. So clearly this will replace your hard drives soon, very soon.
Even if it costs $5k, that's a very acceptable solution to replace a
disk in a RAM-speed appliance.

Willy
--
It already has ordered write when it is in flush mode.
OK, I hear you. There will be an ordered write mode that uses barriers
to decide the ordering. It will greatly reduce the speed at which
ramback can flush dirty data because of the need to wait synchronously
on every barrier, of which there are many. And thus will widen out the
window during which UPS power must remain available if power goes out,
in order to get all acknowledged transactions on to stable media. The
advantage is, the stable media always has a point-in-time version of
the filesystem.

Don't expect this mode in the immediate future though, there are bugs
to fix in the current driver, which already implements the required

That would have been a miscommunication then. I see arguments coming
in that suggest embedded solutions, EMC for example, are inherently more
reliable than a Linux based solution. Well guess what? Some of those
embedded solutions already use Linux.

Also, peecees are much more reliable than people give them credit for,
especially if you harden up the obvious points of failure such as fans
and spinning disks. Once you have your system all hardened up, then
you _still_ better replicate your important data. Perhaps I should not
admit this, but I simply fail to do that on the machine from which I am
posting right now, which also runs my web server and mail system. That
is because I would have to reboot it to install ddsnap so I can replicate
properly, and because the thing is so darn reliable that I just have
not gotten around to it. I do copy off the important files from time
to time though, and do various other things to ameliorate the risk. If

Exactly what I mean: close but not there. Those gigantic RAM boxes are
shipping now, and the same company has got a 5 TB flash box coming down
the pipe, and sooner than Q3. But the RAM box will always outperform
the flash box. You just keep throwing writes at it until all available
flash is in erase mode, and the thing slows down. If ...
But their RAM does not depend on a lot of factors to remain valid and
Securing every component simply reduces the risk of a loss of service.
What is important with data is to know the consequences of loss of service.
If that only means that no one can work and that the last second of work is
lost, it's generally acceptable. If it means everything is lost to a corrupted

No, you're replacing disk activity with RAM activity. But you keep disk as
Sorry if I was not clear. I was not speaking about replacing the RAM with
flash, but only the disks. You keep the RAM for the speed, and use flash
for permanent storage instead of disks. No seek time, average RW speed now
slightly better than disks: combined with your ramdisk and ordered
write-backs, that will give the best of both worlds: RAM speed and flash
reliability.

Willy
--
For example?
Anecdote time. Remember there used to be "brand name" floppy disks and
generic floppy disks, and the brand name ones cost a lot more because
they were supposedly safer? Well, big secret, studies were done and
the no-name disks came out better. Why? Because selling at commodity
prices the generic makers could not afford returns. So they made them
well.

It is like that with PCs. Supposedly you get a lot more reliability
when you spend more money and buy all high end near-custom gear. In
fact, the cheap stuff just keeps on chugging, because those guys can't
afford to have it break.

So please don't underestimate the reliability of a PC.
There are bits of Linux that are undeniably dodgy. We get a lot of bug
reports about usb for example, keyboards just quitting and it's not the
keyboard's fault. Just say no to usb in a server, at least until some
fundamental cleanup happens there.

The worst bug I've seen in a server this year? A buggy bios in a Dell
server that would issue a keyboard error and sit and wait for somebody
to press F1 when there was no keyboard attached. That is embedded
software for you. Personally, I think we do way better than that in

Yes. Dual power supplies are highly recommended for this application.
With dual power supplies you can carry out preemptive maintenance on

So mirror two of them, I keep saying. If that is not good enough for
you, then make it three way, and replicate for good measure. The thing
is, none of that hurts the microsecond level performance, and it gets
you whatever data security you desire. Whereas anything that requires
waiting on disk transactions does hurt performance. Since my interest
currently lies in high performance, that is where my effort goes. And
do I need to say it: patches gratefully accepted.

For my immediate application... hacking the kernel in comfort... just
Right. What we are talking about is filling in a missing level in the
cache hierarchy, something like:

L1 .3 ns
...
I don't think so. I remember we had much more problems with noname
disks. And yes, certain brands had been problematic too, but most

The real life can't agree with this at all. The servers keep working
for years and the cheap stuff quit fast (if initially working, which

Most BIOS (all I've seen in this Millennium) have an option to disable
that.

On a server board you can usually have a remote console, how could
We already have RAM between L3 and Flash.
The problem is flushing L1 to disk/flash takes time.
--
Krzysztof Halasa
--
They don't care if it breaks after 12 months, and for components and
addons they don't care if it breaks, they just blame the end user for
mis-installation or 'incompatibility'. There is a huge difference in

Perhaps. But if your cache can destroy the contents of the layer below in
situations that do occur it isn't useful. If you can fix that then it
obviously has a lot of potential.

Alan
--
Actually, it's worse than that. Users have been trained that when a
computer bluescreens and loses all of their data, it's either (a)
just the way things are, or (b) it's microsoft's fault. Worse yet,
thanks to things like PC benchmarks, hard drive manufacturers have in
the past been encouraged to do things like lie to the OS about when
things had hit the hard drive platter just to score higher numbers on
winbench.

All of this is why I've in the past summed all of this up as Ted's law
of PC class hardware, which is that PC class hardware is cr*p. :-)

- Ted
--
What I mean is that in a PC, RAM contents are very fragile :
- weak batteries in your UPS => end of game
- loose power cable between UPS and PC => end of game (BTW I have a customer
who had such a problem, cables had both disconnected because of their own
weight).
- kernel panic => end of game
- user error during planned maintenance => end of game
- flaky driver writing to wrong memory location => can't trust your data

In a normal PC, even if the RAM itself is a reliable component (ECC, ...)
a lot of such problems which may happen will render it unusable. If you
have to reboot, your BIOS will clean it up for you. That's why people are
trying to explain to you that linux is not reliable enough to work like
this.

Now if you have all your RAM on a PCI-E board with a battery and which is
not initialized by the BIOS so that it survives reboots, it changes a LOT
of things, because all the problems mentioned above go away. Let me
repeat it, the problem is not that those components are too unreliable
to build a transactional system, it is that used in this manner, a very
simple failure of any of them is enough to lose/corrupt all of your data.

That was not my experience when I was a student. We would buy very cheap
diskettes which were only sold by 100. 20% of them were already defective,
and 20% of the remaining ones could not keep our data till the next morning!
I knew guys who finally stopped copying games due to those diskettes, so

If you have understood what I explained above, now you'll understand that
I'm not underestimating the reliability of my PC, just the fact that keeping
access to my RAM contents involves a lot of components, any of which will

I thought this stupidity disappeared about 5 years ago ? I was about to
build PIC-based PS/2 "terminators" to plug into machines to avoid this

I never spoke about waiting for disk transactions. The RAM must be the
only source and target of user data. Disk is there for permanent storage
and should ...
Not sure if things like SLR-2 or so are still available, except second
hand. But they at least provide compatibility for some time.
--
Krzysztof Halasa
--
I strongly disagree. Cheap PC hardware is not even close to the quality
of a serious, branded machine. Often capacitors are missing from power
lines, and the ones that are installed fail sooner. Cooling fans are
lower quality and fail much sooner. Timing issues abound.

There's a reason why an IBM is a better machine than a "Black-n-Gold":
IBM value their name so when you have a problem, they have a problem.
Buy generic and when you get a problem they already have your money and
since they have no investment in their name, they have nothing more to
care about.
--
That's just nonsense in a consolidated market.
You change to IBM, then to Dell, then to HP
then again to IBM. Maybe you even try Sun.

That causes you more grief than any one of them.
I have seen people doing that in all industry branches
and even privately.

If you love brands, then your choice becomes very limited.
That's the real reason for them being much more expensive.

If you think machines and specs, then you have a much more clear
picture. After a while you even have your own measures for failure
rates of those components and can handle it. No matter which brand :-)

Best Regards
Ingo Oeser
--
it will mean that the window is larger, but it will also mean that if
something else goes wrong and that window is not available the data that
was written out will be useable (recent data will be lost, but older data
will still be available)

as for things that can go wrong
the UPS battery can go bad
you can have multiple power failures in a short time so your battery is not fully charged
capacitors in the UPS can go bad
capacitors in the power supply can go bad
capacitors on the motherboard can go bad
a kernel bug can crash the system
a bug in a device driver (say nvidia graphics driver) can crash the system
a card in the system can lock up the system bus
the system power supply can die
the system fans can die and cause the system to overheat
cooling in the room the system is in can fail and cause the system to overheat
airflow to the computer can get blocked and cause the system to overheat
some other component in the computer can short out and cause the system to lose power internally

I have had every single one of these things happen to me over the years.
Some on personal equipment, some on work equipment. At work I recently had
a series of disasters where capacitors in a 7 figure UPS blew up, and a
few days later during a power outage when we were running on generator, a
fuel company made a mistake while adding fuel to the generator and knocked
it out.

Even if you spend millions on equipment and professionals to set it up and
maintain it, you can still go down.

You may not care about it on your system (because you copy data elsewhere
and don't change it rapidly), but most people do. with your current
approach you are slightly better than a couple shell scripts from an
availability point of view, you are no better in performance, but your
failure mode is complete disaster.

comparing you to 'cp drive ramdisk' at startup and 'rsync ramdisk drive'
periodically and at shutdown you are faster at startup, close enough at
shutdown as to be in the noise (eit...
Actually modern DRAM can be put into "self refresh" mode which doesn't
need (nor allow) any external accesses. Not very practical in typical
PC case, though I think suspend to RAM uses it. Could be used for
battery-backed RAID/disk controller as well.

Obviously it changes nothing WRT ramback.
--
Krzysztof Halasa
--
It makes a lot of difference, and in addition raid controllers (good
ones) respect barrier ordering in their RAM cache so they'll take tags or

Either you keep a mirror in sync and get normal data rates or you keep
the mirror out of sync and then you need to sort your writeback process
out to preserve ordering.

If you want ramback to be taken seriously then that is the interesting
problem to solve and clearly has multiple solutions if you would start to
take an objective look at your work.
--
Ramback should obviously respect barriers, and it does, though at
present only in the crude, default way of letting the block layer
handle it.

But interpreting a barrier to mean flush through to rotating media...
performance will drop to the millisecond per transaction zone, like a
normal disk. Not what ramback users want in normal operating mode.
Flush mode, yes.

Even raid controllers... so you agree that some of them just don't
respond conservatively to tagged commands, either because the engineers
don't know how to implement that (unlikely) or because they want to win
the performance benchmarks, and they do trust their battery?

"Some raid controllers" is just as good for my argument as "all raid
controllers". Nobody is telling you which raid controller to use in
your own personal system. I will pick the fast one and you can pick

Ramback already is taken seriously, just not by you. That is fine, you
apparently do not need or want the speed.

Anyway, please do not get the impression that I am ignoring your ideas.
There are some nice, intermediate modes that ramback could and in my
opinion, should implement, to give users more options on how to trade
off performance against resilience. I just need to make it clear that
ramback, as conceived, already gives system builders the capability
they need to achieve microsecond level transaction throughput and data
safety at the same time... given a reliable battery, which is where we
started.

Daniel
--
That isn't anything to do with what was being proposed. *ORDERING* not
The ones that don't respect tagged ordering are the ultra cheap nasty
things you buy down the local computer store that come with a 2 page
manual in something vaguely like English. The stuff used for real work is

I want the speed and reliability. Without that ramback is a distraction
You have no guarantee of commit to stable storage so your use of the word
"transaction" is a bit farcical.

There are a whole variety of ways to get far better results than "whoops
bang there goes the file system". Log structured backing media is one,
even snapshots. That way you'd quantify that for the cost of more
rotating storage (which is cheap) you can only lose "x" minutes of data
and will lose everything from a defined consistent point. File based
backing store also has similar properties done right, but needs some
higher level care to track closure and dirty blocks on a per inode basis.

Alan
--
This is where you have made a fundamental mistake in your proposal.
Suppose you have a steady, heavy write load onto ramback. Eventually,
the entire ramdisk will be dirty and you have to drop back to disk
speed, right? My design does not suffer from that problem, but your
proposal does.

It gets worse than that. Suppose somebody writes the same region
twice, how do you order that? Do you try to store that new data
somewhere, keeping in mind that we are already at terabyte scale? Is

Somebody has. But please feel free to solve some other problem. I
The UPS provides a guarantee of commit to stable storage. No amount of
FUD will change that. But please go ahead and calculate the risks
involved. I am confident you will admit that there are standard
techniques available to ameliorate risk, which may be applied _on top of_
ramback, thus not destroying its microsecond-level transaction
performance as you propose.

Daniel
--
What about system crashes? They guarantee that data will be lost. I
know opinions are divided on the subject of crashes: You say Linux
doesn't; everybody else says it does. I side with experience. (It does.)
--
Not if it is mirrored and replicated. Also nice if crashes are very
I say it does not crash often, to the point where I have not seen it
crash once for any reason I did not create myself (I tend to wait for
the occasional brown bag release to fade away before shifting development
We do get quite a few
reports of less mature systems like hald and usb causing problems, and
not too long ago NFS client was very crash happy. I did see some of
those myself two years ago, and fixed them.

On the whole, Linux is very reliable. Very very reliable. Now mirror
that, replicate it, add in 2 x 2 redundant power supplies backed by
independent UPS units so you can do regular preemptive maintenance on
the batteries, and you have a sweet enterprise transaction processing
system. All set for a faster than light moon shot :-)

Daniel
--
if you are depending on replication over the network you have just limited
your throughput to your network speed and latency. on an enterprise level
machine the network can frequently be significantly slower than the disk
array that you are so frantic to avoid waiting for.
David Lang
--
Replication does not work that way. On each replication cycle, the
differences between the most recent two volume snapshots go over the
network. This strategy has the nice effect of consolidating rewrites.
There are also excellent delta compression opportunities.

In the worst case, with insufficient bandwidth for the churn rate of
the volume, the replication interval stretches to the time needed to
replicate the full volume. Again, at worst, this would require extra
storage for the snapshot to be replicated equivalent to the original
volume size, so that the primary volume is not forced to wait
synchronously for a replication cycle to complete.

Mirroring, on the other hand, makes a realtime copy of a volume that is
never out of date.

Frantic... your word. Designing for dependably high transaction rates
requires a different mode of thinking that some traditionalists seem to
be having some trouble with.
Daniel
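As a rough illustration of why replicating the difference between two
snapshots consolidates rewrites, the sketch below counts the chunks that
would cross the network in one cycle. The chunk numbers, the write log and
the function name replication_delta are invented for illustration; this is
not ddsnap code.

```python
def replication_delta(write_log):
    """Return the chunks that differ between two consecutive snapshots.

    write_log lists the chunk numbers written between the two snapshots,
    in order. A chunk rewritten many times still differs only once, so it
    is sent only once per replication cycle.
    """
    return sorted(set(write_log))


# Hypothetical workload: chunk 7 is rewritten four times in one cycle.
writes = [7, 3, 7, 7, 12, 3, 7]
delta = replication_delta(writes)
print(len(writes), "writes ->", len(delta), "chunks sent:", delta)
```

The flip side, as noted in the thread, is that anything written after the
most recent snapshot is not yet on the remote side.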
--
I think you've just tried to obfuscate the truth. As you have
described, replication does not provide full protection against data
loss; it loses all changes since last cycle. Recall that it was you who
introduced the word "replication", in the context of guaranteeing no
loss of data. Then you ignored David's point about the relatively low
speed of networks, remarking only that mirroring is real-time. Reading
between your words makes clear that "mirroring and replication" does

You've rather under-valued dependability, though. Even your idea of
mirroring systems is incomplete, because failure of the principal system
requires transparent fail-over to the redundant system, which is
actually quite challenging, especially with commodity systems cobbled
together in the way you promote. Remember that you claimed
microsecond-level transaction times, and 6-nines of availability. The
former seems unlikely with replicated systems and, in the event of a
failure, you won't achieve the latter.

You still haven't investigated the benefit of your idea over a whopping
great buffer cache. What's the point in all of this if it turns out, as
Alan hinted should be the case, that a big buffer cache gives much the
same performance? You appear to have gone to a great deal of effort
without having performed quite simple yet obvious experiments.
--
You are twisting words. I may have said that replication provides a
point-in-time copy of a volume, which is exactly what it does, no more,

A big buffer cache does not provide a guarantee that the dirty cache
data is saved to disk when line power is lost. If you would like to
add that feature to the Linux buffer cache, then please do it, or make
whichever other contribution you wish to make. If you just want to
explain to me one more time that Linux, batteries, whatever, cannot
be relied on, then please do not include me in the CC list.
Daniel
--
on_battery_power:
    sync
    mount -o remount,sync /

...will of course work okay on any reasonable system. Not on yours,
because you have to do

    echo i_really_mean_sync_when_i_say_sync > /hidden/file/somewhere
    sync

(...which also shows that you are cheating).
Now, will you either do your homework and show that page cache is
somehow unsuitable for your job, or just stop wasting the bandwidth
with useless rants?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Speaking of useless rants...
You need to go read the whole thread again, you missed the main bit.
Daniel
--
You said that you could achieve a certain performance, and later you
said that for reliability you could use mirroring and replication but
you never said that would lead to a performance hit. In fact you don't
seem to be able to offer performance AND robustness; for performance you
can only offer that level of robustness attainable on a single system,
which means I think even you agreed was really not up to snuff for

But the filesystem does offer a minimum level of consistency, which is
missing from what you propose. You propose writing nothing unless
line-power fails. The big buffer cache gives you all of the robustness
of the underlying filesystem and including dirty buffer writes at some

I haven't said that at all, other than as an axiom (which even you have
agreed is fair) leading to comments on the results when something does
fail. You keep saying that it won't ever fail, then that it will but
that you can mitigate using redundant systems; and then you gloss over
or refuse to face the attendant performance hit. Finally, you still
have no idea whether your idea really does achieve a massive performance
boost. You've never compared like amounts of RAM, nor the unsynced
updates that most closely resemble your idea. In short, you've leaped
on what seems to you to be a good idea and steadfastly refused to
conduct even basic research. What's the point?

You say don't cc you; I say go away, do that basic research, and come
back when you have hard data. I really don't think you can ask for
fairer than that.
--
so just mirror to a local disk array then.
a local disk array has more write bandwidth than a network connection to a
remote machine, so if you can mirror to a remote machine you can mirror to

if by traditionalists you mean everyone who makes a living keeping systems
running you are right. we want sane failure modes as much as we want
performance.

there will be times when we decide to go for speed at the expense of
safety, but we want to do it knowingly, not when someone is promising both
and only provides speed.

and by the way, if the violin box uses your software they have just moved
from a resource for me to tap when needed to something that I will advise
my company to avoid at all costs.
David Lang
--
Great idea. Except that the disk array has millisecond level latency,
So you could potentially connect to a _huge_ disk array and write deltas
to it. The disk array would have to support roughly 3 Gbytes/second of
write bandwidth to keep up with the Violin ramdisk. Doable, but you are
now in the serious heavy iron zone.

Personally, I like my nice simple design a lot more. Just mirror it, as
many times as you need to satisfy your paranoia. Or how about go write
your own?
Daniel
--
your network will do less than 1 Gbit/sec, so to mirror in real-time (what
you claim is trivial) you would need at least 24 network connections in
parallel. that's a LOT harder to set up than a high performance disk array.
David Lang
--
by the way, the only way to get this much bandwidth between two machines
is to directly connect PCI-e/16 card slots together. this is definitely
not commodity hardware anymore (if it's even possible, PCI-e has some very
short distance limitations)
David Lang
--
You can do that with 3 10GE NICS, though in practise that's not easy.
Willy
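The link counts quoted above follow from simple arithmetic, sketched here.
The function name links_needed is invented for illustration, and the
calculation assumes raw line rate with no protocol overhead, so real
deployments would need some headroom on top of these figures.

```python
import math


def links_needed(target_gbytes_per_s, link_gbits_per_s):
    """Number of parallel links needed to carry a given write rate."""
    target_gbits_per_s = target_gbytes_per_s * 8  # bytes -> bits
    return math.ceil(target_gbits_per_s / link_gbits_per_s)


# 3 Gbytes/second is the Violin write bandwidth quoted in the thread.
print("1 Gbit/s links: ", links_needed(3, 1))
print("10 Gbit/s links:", links_needed(3, 10))
```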
--
Just a point of information, most of the mid-tier and above disk arrays
can do replication/mirroring behind the scene (i.e., you write to one
array and it takes care of replicating your write to one or more other
arrays). This behind the scene replication can be over various types of
connections - IP or fibre channel probably are the two most common paths.

That will still leave you with the normal latency for a small write to
an array which is (when you hit cache) order of 1-2 ms...
ric
--
So we've all noticed
Alan
--
You only have to care about ordering if there is a store barrier between
the two (not usual). You only have to care about filling if you generate
enough dirty blocks at a very high rate (which is unusual for most
workloads). If you don't care about those then we have ramdisk already and
if you want to write a ramdisk driver for external ramdisk great. You'd
also fix the layering violations then by allowing device mapper to
implement things like snapshotting and writeback separated from your
driver.

Even in the extreme case that you propose there are trivial ways of
getting coherency. Simple example - if you can sweep all the data out in
say 10 minutes then you can buy twice the physical media and ensure that
one of the two sets of disk backups is genuinely store barrier consistent
to some snapshot time (say every 30 minutes but obviously user tunable).
If you at least had some kind of credible snapshotting you'd find people

Stable storage to most people means "won't go away on a bad happening".
Transaction likewise has a specific meaning in terms of an event occurring
once only and either being recorded before or after the transaction
occurred.
Alan
--
Hi Alan,
According to you. A more accurate statement: if you have the ramdisk
on the host, then the host is assumed to be reliable. If the ramdisk
is external (http://www.violin-memory.com/products/violin1010.html)
then your statement is untrue in every sense.

But you did not address the logic of my statement above: that your
fundamental design prevents you from operating at ramdisk speed during

No wait, it is completely normal. There is a barrier on every journal

Exactly the purpose for which this driver was written. And as a bonus
it happens to be useful for internal ramdisk applications as well. (It

Device mapper already can, so I do not get your point. Also, what is

Hostility does not equate to accuracy. Galileo comes to mind.

I see people arguing that a server+linux+batteries+mirroring+replication
cannot achieve enterprise grade reliability. Balderdash.

Regards,
Daniel
--
> Hostility does not equate to accuracy. Galileo comes to mind.
I see no attempt to even discuss the use of two sets
of physical storage to maintain coherent snapshots, just comments about
hostility. That's a fairly poor way to repay people who spend a lot of
time working with enterprise customers and are interested in solutions
using things like giant ramdisks and are putting in time to discuss

I look forward to seeing your constructive detailed analysis of failure
modes based upon actual statistical data from real data centres. Unless
you can produce that nobody is going to take you seriously, which is bad
luck for the poor folks at violin if they are relying on you.
Alan
--
> You did not explain how your proposal will avoid dropping the transaction
Here is a simple but high physical storage using approach (but hey disks
are cheap)

You walk across the ram dirty table writing out chunks to backing
store 0.

At some point in time you want a consistent snapshot so you pick the next
write barrier point after this time and begin committing blocks dirtied
after that moment to store 1 (with blocks before that moment being
written to both). You don't permit more than one snapshot to be in
progress at once so at some point you clear all the blocks for store 0.
Your snapshotting interval is bounded by the time to write out the store,
nor do you have to throttle writes to the ramdisk.

You now have a consistent snapshot in store 0. At the next time interval
we finish off store 1 and spew new blocks to store 2, after 2 is complete
we go with 2, 0 and then 1 as the stable store.

The only other real trick needed then is metadata, but you don't have to
update that on disk too often and you only need two bits for each of the
page in RAM.

For any page it is either

    00 Clean on stable store
    01 Clean on current writing snapshot
    10 Dirty on stable store (and thus both)
    11 Dirty on current writing snapshot (but clean, old on stable)

Pages go 00->11 or 01->11 when they are touched, 11->01 or 10->01 when
they are written back.

At the point we freeze a snapshot we move 01->00 11->10 00->11 and there
are no pages in 10. And of course we don't update the big tables at this
instant instead we store the page state as

    (value - cycle_count)&3

with each freeze moment doing

    cycle_count++;

The 00->11 is perhaps not obvious but the logic is fairly simple. The
snapshot we are building does not magically contain the stable data from
a previous snapshot.

Say 0 is our stable snapshot

    snapshot 0 page 0 contains the stable copy of a page
    snapshot 1 is currently being updated

if we touched the page during the lifetime of s...
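The cycle_count trick described above can be sketched as follows. This is
only one reading of the scheme: the class name, method names and state
constants are invented for illustration. The point it shows is that
bumping one counter moves every page 01->00, 11->10 and 00->11 at once,
without rewriting the per-page table.

```python
# Two-bit page states, using the encodings listed above (names are made up).
CLEAN_STABLE, CLEAN_CURRENT, DIRTY_STABLE, DIRTY_CURRENT = 0b00, 0b01, 0b10, 0b11


class PageStates:
    def __init__(self, npages):
        self.cycle_count = 0
        self.table = [0] * npages  # raw stored two-bit values, one per page

    def get(self, page):
        # Effective state is the stored value adjusted by the cycle count.
        return (self.table[page] - self.cycle_count) & 3

    def set(self, page, state):
        # Store the value so that get() returns `state` in the current cycle.
        self.table[page] = (state + self.cycle_count) & 3

    def touch(self, page):
        self.set(page, DIRTY_CURRENT)   # 00->11 or 01->11

    def written_back(self, page):
        self.set(page, CLEAN_CURRENT)   # 11->01 or 10->01

    def freeze(self):
        # Every effective state drops by one modulo 4 in a single step.
        self.cycle_count += 1


states = PageStates(3)
states.set(0, CLEAN_CURRENT)
states.touch(1)
states.freeze()
print([format(states.get(p), "02b") for p in range(3)])
```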
On Thu, 13 Mar 2008 11:14:39 -0800
That's a reasonable enough assumption, to anyone who has never dealt
with software before, or whose data is just not important.

People who have dealt with computers for longer will know that anything
can fail at any time, and usually does unexpectedly and at bad moments.

Some defensive programming to deal with random failures could make your
project appealing to a lot more people than it would appeal to in its
current state.
--
All Rights Reversed
--
In its current state it has bugs and so should appeal only to
programmers who like to work with cutting edge stuff.

So long as you keep insisting it has to have some kind of slow
transactional sync to disk in order to be reliable enough for
enterprise use, I have to leave you in my FUD filter. Did you
read Ric's post where he mentions the UPS in some EMC products?
Ask yourself, what is the UPS for? Then ask yourself if EMC
makes billions of dollars selling those things to enterprise
clients.
Daniel
--
>>>>> "Daniel" == Daniel Phillips <phillips@phunq.net> writes:
Daniel> In its current state it has bugs and so should appeal only to
Daniel> programmers who like to work with cutting edge stuff.

As a prof SysAdmin and long time lurker, I feel I can chime in here a
bit. No one is arguing that your code isn't neat, or doesn't have a
feature which would be nice to have. They are arguing that your failure
mode (when, not if, it fails for some reason) is horrible.

Who remembers the NFS PrestoServer NFS accelerator cards? You could
buy this PCI (or was it TurboChannel back then?) card for your DEC
Alphas. It came with 4Mb of battery backed RAM so that NFS writes
could be ack'd before being written to disk. We had just completed
moving all the user home directories to this system that week, say
around 4gb of data? Remember, this was around 94 sometime at a
University. We were also using Advfs on DEC OSF/1, probably v1.2,
maybe v1.3.

Anyway, I came into work thursday night to pick up something I had
forgotten before I took a three day weekend. The operator on duty
asked me to look at the server since it had crashed and wasn't coming
up properly.

I ended up staying there until 9am the next morning working on it.
Turned out to be both user and hardware error. We had forgotten to
remove the piece of plastic to enable the battery on the card, but the
circuits on the card lied and said battery voltage was fine no matter
what the battery really was.

So the system crashed. 4Mb of data from the filesystem went bye-bye.
Can you say oops? What a total pain to diagnose. But even on a log
structured filesystem, having 4Mb of data just get wiped out was
enough to destroy all the filesystem.

We ended up rolling back to the original server and junking the week
of changes that users had made, and restoring chunks for users as they
requested it. Luckily, it was early in the semester and not a lot of
stuff had gotten done yet.

Now do you see why people...
I'd like to see some science. I'd like to know how much faster it
really is, and for that proper testing needs to be done. Since Daniel's
scheme uses the same amount of RAM as disk, an appropriate test would be
to pin (at least) that amount of RAM to buffer cache, and then to fill
the cache with the contents of the disk (i.e. cat /dev/disk >
/dev/null.) This sets the stage for tests, which tests should not
include the sync operation. I'd like to see actual numbers against such
a setup versus Daniel's scheme. Since buffer cache is shared by all
disks, obviously the test must not access any other drive.

One thing I will admit: RAM disks are fast. What I don't know is how
much work there is to access blocks that are already in the buffer
cache. In principle I suppose it should be a little slower, but not
much. I'd like to know, though. I'd do the test myself if I had a
machine with enough RAM, but I don't. Daniel (apparently) does...
--
There is a correctable flaw in your experiment: loading the disk into
buffer cache does not make the cached data available to the page
cache. Maybe it should (good summer project there for somebody) but
for now you need to tar the filesystem to /dev/null or similar. Note
that, because of poor cross-directory readahead, traversing a disk
like that will not be as fast as reading it linearly. On the other
hand, you will not have to read any free space into cache, which
ramback does because it does not know what is free space (or care,
really...)

You are probably OK. I used a 150 MB ramdisk, of which I used only
100 MB. That is why I used a 2.2 kernel for my tests.
Daniel
--
It's more reliable than many others, but it's not perfect.
Besides, there are many failure modes beyond the control of the kernel.
Hardware errors can lock up the bus and prevent I/O, RAM modules can
go bad, technicians can yank out cards without waiting for the ready
light.

For certain classes of devices it's necessary to plan for these sorts of
things, and a model where the on-disk structures may be inconsistent by
design is not going to be very attractive.
Chris
--
...disks can break, batteries on raid controllers can fail, etc, etc...
So you design for the number of nines you need, taking all factors
into account, and you design for the performance you need. These are

You are preaching to the converted. Systems consisting of:
linux + disks + batteries + ram + network + redundancy
can be as reliable as you need. Respectfully, I would like to return
to the software engineering problem. This driver solves a problem for
certain people. Not niche people to be forgotten about. If it does
not solve your problem then please just write a driver that does,
meanwhile this one needs some finishing work. Let's get the proverbial
thousand eyeballs working. Has anybody besides me compiled this yet?
Daniel
--
There's no FUD here. The problem is that you didn't say that you've
designed this for only a few nines. If you delete fsck from your
rationale, simply saying that you rely on UPS to give you time to flush
buffers, you have a much better story. Certainly, once you've flushed
buffers and degraded to write-through mode, you're obviously as reliable
as ext2/3.

Your idea seems predicated on throwing large amounts of RAM at the
problem. What I want to know is this: Is it really 25 times faster than
ext3 with an equally huge buffer cache?
--
Fsck was never a part of my rationale. Only reliability of components
was and is. Then people jumped in saying Linux is too unreliable to
use in a, hmm, storage system. Or transaction processing system. Or
whatever.

Yes.
Regards,
Daniel
--
this I don't understand. what makes your approach 25x faster?
looking at the comparison of a 500G filesystem with 500G of ram allocated
for a buffer cache.

yes, initially it will be a bit slower (until the files get into the
buffer cache), and if fsync is disabled all writes will go to the buffer
cache (until writeout hits)

I may be able to see room for a few percent difference, but not 2x, let
alone 25x.
David Lang
--
My test ran 25 times faster because it was write intensive and included
sync. It did not however include seeks, which can cause an even bigger
performance gap.

The truth is, my system has _more_ cache available for file buffering
than I used for the ramdisk, and almost every file operation I do
(typically dozens of tree diffs, hundreds of compiles per day) goes
_way_ faster on the ram disk. Really, really a lot faster. Because
frankly, Linux is not very good at using its file cache these days.
Somebody ought to fix that. (I am busy fixing other things.)

In other, _real world_ NFS file serving tests, we have seen 20 - 200
times speedup in serving snapshotted volumes via NFS, using ddsnap
for snapshots and replication. While it is true that ddsnap will
eventually be optimized to improve performance on spinning media,
I seriously doubt it will ever get closer than a factor of 20 or so,
with a typical read/write mix.

But that is just the pragmatic reality of machines everybody has these
days, let us not get too wrapped up in that. Think about the Violin
box. How are you going to put 504 gigabytes of data in buffer cache?
Tell me how a transaction processing system is going to run with
latency measured in microseconds, backed by hard disk, ever?

Really guys, ramdisks are fast. Admit it, they are really really fast.
So I provide a way to make them persistent also. For free, I might
add.

Why am I reminded of old arguments like "if men were meant to fly, God
would have given them wings"? Please just give me your microsecond
scale transaction processing solution and I will be impressed and
grateful. Until then... here is mine. Service with a smile.
Daniel
--
if you are not measuring the time to get from ram to disk (which you are
not doing in your ramback device) syncs are meaningless.

seeks should only be a factor in the process of populating the buffer
cache. both systems need to read the data from disk to the cache, they can
either fault the data in as it's accessed, or run a process to read it all

so you are saying that when the buffer cache stores the data from your ram
disk it will slow down. that sounds like it equalizes the performance and

it all depends on how you define the term 'backed by hard disk' if you
don't write to the hard disk and just dirty pages in ram you can easily
hit that sort of latency. I don't understand why you say it's so hard to
put 504G of data into the buffer cache, you just read it and it's in the

except that you are redefining the terms 'persistent' and 'free' to mean

if you don't have to worry about unclean shutdowns then your system is not
needed. all you need to do is to create a ramdisk that you populate with
dd at boot time and save to disk with dd at shutdown. problem solved in a
couple lines of shell scripts and no kernel changes needed.

if you want the data to be safe in the face of unclean shutdowns and
crashes, then you need to figure out how to make the image on disk
consistent, and at this point you have basically said that you don't think
that it's a problem. so we're back to what you can do today with a couple
lines of scripting.
David Lang
--
Sorry, I missed that the first time. See, it is 504G, not 504M.
Daniel
--
There was a time when punchcards ruled and everybody was nervous about
storing their data on magnetic media. I remember it well, you may not.
But you are repeating that bit of history, there is a proverb in there

Feel free. You use your script, and somebody with a reliable UPS or
two can use my driver, once it is stabilized of course. Just don't be
in business against them if being a few milliseconds slower on the
uptake means money lost.
Daniel
--
so now you are saying that you are faster than a ramdisk?????
either you are completely out of touch or you misunderstood what I was
saying.

if you have a reliable UPS and are willing to rely on it to save your data
take the identical hardware to what you are planning to use, but instead
of using your driver just create a ramdisk and load it on boot and save
the contents on shutdown.

in this case you are doing zero disk I/O during normal operation, you only
touch the disk during startup and shutdown.

with your proposal the system will be copying chunks of data from the
ramdisk to the hard disk at random times, and you are claiming that doing
so makes you faster than a ramdisk????

I'll say it again. if you trust your UPS and don't care about unclean
shutdowns (say for example that you trust that linux is just never going
to crash) there's no need to write parts of the ramdisk to the hard disk
during normal operation, you can wait until you are told that you are
going to shutdown to do the data save.

now there's no driver needed, just a couple lines of init scripts.
David Lang
--
No, I am saying that my driver is faster than any script you can write.
Your script will not be able to give access to data while the ramdisk
is being populated, nor will it be able to save efficiently exactly
what is dirty in the ramdisk. (Explained in my original post if you

Aha! You are getting close. Really, that is all ramback does. It
just handles some very difficult related issues efficiently, in such a
way as to minimize any denial of service from complete loss of UPS
power. This is all just about using power management in a new way that
gets higher performance. But your battery power has to be reliable.
Just make it so. It is not difficult these days, or even particularly
expensive.

I calculated somewhere along the line that it would take something like
17 minutes to populate the big Violin ramdisk initially, and 17 minutes
to save it during a loss of line power event, during which UPS power
must not run out before ramback achieves disk sync or you will get
file corruption. (This rule was mentioned in my original post.)

All well and good, you can in fact do that with a pretty simple script.
But in the initial 17 minutes your application may not read or write
the ramdisk data and in the closing 17 minutes it may not write. That
knocks your system down to 4 nines, given one planned shutdown per year.
Not good, not good at all.

See, ramback is entirely about _not_ getting knocked down to 4 nines.
It wants to stay above 6, given system components that satisfy that goal,
comprising:

* Linux
* Processor, memory, motherboard etc
* Dual power supplies with independent UPS backup
* Ramback driver

My proposition is, you can go out and purchase hardware right now that
delivers 6 nines (30 seconds downtime/year) and yes, it will cost you,
but if that worries you then set up two (much) cheaper ones and set
them up as a failover cluster. (Helps that the Violin box can connect
via PCI-e to two servers at the same time.)

I say you can do this reliabl...
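The availability figures above can be checked with a few lines of
arithmetic. This is a minimal sketch assuming a 365-day year and exactly
one planned shutdown per year costing 17 + 17 minutes; the function names
are invented for illustration.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def availability(downtime_seconds_per_year):
    """Fraction of the year the system is available."""
    return 1 - downtime_seconds_per_year / SECONDS_PER_YEAR


def nines(avail):
    """Whole number of nines, e.g. 0.9999 -> 4 (small epsilon for rounding)."""
    return math.floor(-math.log10(1 - avail) + 1e-9)


def allowed_downtime(n):
    """Seconds of downtime per year permitted at n nines."""
    return SECONDS_PER_YEAR * 10 ** -n


scripted = availability((17 + 17) * 60)  # populate + save with a plain script
print(f"scripted ramdisk: {scripted:.5%}, {nines(scripted)} nines")
print(f"6 nines allows about {allowed_downtime(6):.1f} seconds per year")
```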
Hmm, what happens if applications keep dirtying so much data you miss
your 17 minute deadline?

Anyway...

    ext2
    + lots of memory
    + tweaked settings of kflushd (only write data older than 10 years)
    + just not using sync/fsync except during shutdown
    + find / | xargs cat

...is ramback, right? Should have same performance, and you can still
read/write during that 17+17 minutes.

Ok, find | xargs might be slower... but we probably want to fix that
anyway....

It has big advantage: if you only tell kflushd to hold up writes for
an hour, you lose a little in performance and gain a lot in
reliability...

(If ext2+tweaks is slower than ramback, we have a bug to fix, I'm
afraid).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Hi Pavel,
Ramback is supposed to prevent that by allowing only a limited amount
of application IO during flush mode. Currently this is accomplished
by making each application write wait synchronously on the one before
it, until flushing completes. This allows only a small amount of
application traffic, something like 5% bandwidth. This solution is
admittedly crude, and over time it will be improved to look more like
a realtime scheduler, because this is in fact a realtime scheduling
problem.

Once flushing completes, application writes are still serialized and
thus slow, which is a stronger condition than necessary to maintain
transactional integrity for the filesystem. Eventually this will be
optimized.

For now, the maximum flush is only a few hundred MB on my workstation,
which leaves a huge safety margin even with my $100 UPS. And the risk,
however small, of having to run a lossy e2fsck because the battery got
old and the power did run out, is mitigated by the fact that ramback
runs on my kernel hacking partition, and everything unique there just
gets uploaded to the internet regularly anyway. This serves as my
replication algorithm. Note: I strongly recommend that any critical
data entrusted to ramback be replicated to mitigate the risk of system

No, you are missing some essential pieces. Ramback has two operating
modes:

1) writeback (when ups-backed line power is available)
2) writethrough (when running on ups power)

Plus, it has the daemon driven flushing for ups mode, and daemon driven
one-pass populating for startup mode. That is all ramback is, but you
do not quite get there with your solution above.

Also, ramback works with generic block devices, opening up a wide range

I hope that my work inspires other people like you to go in and work
on some of the VM/VFS/BIO brokenness that helps make ramback such a
big win. In the meantime, it is useful to be clear on just what we
have here, and why some people care about it a lot.
Daniel
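The two operating modes and the flush on loss of line power can be modelled
in a few lines. This is a toy user-space model under stated assumptions,
not the ramback driver: the class name, the method names and the use of
dictionaries for the ramdisk and backing store are all invented for
illustration.

```python
class RambackSketch:
    """Toy model: dict-backed 'ramdisk' and 'disk', one dirty flag per chunk."""

    def __init__(self):
        self.ram = {}
        self.disk = {}
        self.dirty = set()
        self.writethrough = False  # writeback while line power is available

    def write(self, chunk, data):
        self.ram[chunk] = data
        if self.writethrough:
            self.disk[chunk] = data   # forced through to backing store
            self.dirty.discard(chunk)
        else:
            self.dirty.add(chunk)     # saved later by the flush daemon

    def read(self, chunk):
        # During populate, a chunk not yet in ram is served from backing store.
        if chunk not in self.ram and chunk in self.disk:
            self.ram[chunk] = self.disk[chunk]
        return self.ram.get(chunk)

    def flush_one(self):
        """Write one dirty chunk to backing store; return False when clean."""
        if not self.dirty:
            return False
        chunk = self.dirty.pop()
        self.disk[chunk] = self.ram[chunk]
        return True

    def on_line_power_lost(self):
        # Switch modes first so new writes cannot add to the dirty set.
        self.writethrough = True
        while self.flush_one():
            pass

    def on_line_power_restored(self):
        self.writethrough = False


dev = RambackSketch()
dev.write(1, b"a")
dev.write(2, b"b")
dev.on_line_power_lost()
dev.write(3, b"c")
print(sorted(dev.disk), "dirty:", sorted(dev.dirty))
```

The model leaves out the part the thread argues about most: what state the
backing store is in if power runs out before the flush loop finishes.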
--
you just use a redundant system and you no longer care how long it takes
to shutdown or startup a system. if you're running a datacenter that cares
about uptime to the point of counting 9's or buying a Violin box you are

it also takes the faith that you will never have any unplanned shutdowns,
since your system will lose massive amounts of data if they happen.
nobody who worries about 9's will buy into that argument. you achieve 9's
by figuring that things don't always work, and as a result you figure out
how to engineer around the failures so that when they happen you stay up.
manufacturers have been trying to promise that their boxes are so reliable
that they won't go down for decades, and they haven't succeeded yet.
David Lang
--
The period where you cannot access the data is downtime. If your script
just does a cp from a disk array to the ram device you cannot just read
from the backing store in that period because you will need to fail over
to the ramdisk at some point, and you cannot just read from the ramdisk
because it is not populated yet. My point is, you cannot implement
ramback as a two line script and expect to achieve anything resembling
continuous data availability.

I interpret your point about the script as, Ramback is trivial and easy
to implement. That is kind of true and kind of untrue, because of the

Never is not the right word, but indeed that is why I wrote the story
about the rocket ship. If you want the performance that ramback
delivers then you cover the risk of hardware failure by other,

Why would you assume the data is not mirrored or replicated with a

All true. Now what about the punchcard versus magnetic media story?
There was a time when magnetic domains were considered less reliable
than holes in paper cards, ironically we now think the opposite. So
some people will have a hard time with the idea that a battery is
reliable enough to get your important cached data on to hard disk when
necessary, or that Linux is reliable enough to trust data to it, or
whatever. They will get over it. Battery backed data will become a
normal part of your life as progress marches on.
Daniel
--
Wouldn't a raid-1 set comprising disk + ramdisk do that with no downtime?
--
In raid1, write completion has to wait for write completion on all
mirror members, so writes run at disk speed. Reads run at ramdisk
speed, so your proposal sounds useful, but ramback aims for high
write performance as well.
Daniel
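The raid1 point above reduces to a max versus a min over the mirror
members, sketched below. The latency figures are assumed round numbers
chosen only to show the shape of the result, not measurements, and the
function names are invented for illustration.

```python
def raid1_write_latency(member_latencies_us):
    """Plain raid1 acknowledges a write only after every member completes."""
    return max(member_latencies_us)


def raid1_read_latency(member_latencies_us):
    """A read can be served by the fastest member."""
    return min(member_latencies_us)


RAMDISK_US, DISK_US = 5, 8000  # assumed latencies in microseconds

members = [RAMDISK_US, DISK_US]
print("raid1 write:", raid1_write_latency(members), "us")
print("raid1 read: ", raid1_read_latency(members), "us")
```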
--
raid1 + kflushd tweak?
special raid1 mode that signals completion when it hits _one_ of the
drives, and does sync when the slower drive is idle?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Feel free :-)
This is very close to how ramback already works. One subtlety is that
ramback does not write twice from the same application data source,
which could allow the data on the backing device to differ from the
ramdisk if the user changes it during the write. I don't know how
important it is to protect against this bug actually, but there you
have it. Ramback can easily be changed to write twice from the same
source just like a raid1 (in fact it originally was that way) which
would make it even more like raid1.

Adding ramback-like functionality to raid1 would be a nice
contribution. I would fully support that but I do not have time to do
it myself.
Daniel
--
raid1 already supports marking member(s) as write-mostly. Any
write-mostly member can also make use of write-behind mode (provided
you have a write intent bitmap).
--
Ramback could be an interesting building block. Consider using a
couple of systems exporting Ramback devices via Evgeniy's distributed
storage target (or something similar). In this case, you can have as
many Ramback devices as you want comprise your mirror set to meet your
availability requirements. Perhaps people are looking at this too
much as an entire solution as opposed to a piece of a bigger puzzle.
I think the idea has merit.

Cheers,
Jeff
--
Well, that sounds convincing. Not. You know this how?
--
By measuring it. time tar -xf linux-2.2.26.tar; time sync
Daniel
--
That's cheating. Your ramback ignores sync.
Just time it against ext3 _without_ doing the sync. That's still more
reliable than what you have.

Heck, comment out sync and fsync from your kernel. You'll likely be 10
times normal speed, and still more reliable than ramback.
Pavel
--
--
No, that allows ext3 to cheat, because ext3 does not supply any means
of flushing its cached data to disk in response to loss of line power,
and then continuing on in a "safe" mode until line power comes back.

Fix that and you will have a replacement for ramback, arguably a more
efficient one for this specialized application (it will not work for an
external ramdisk). Until you do that, ramback is the only game in town
to get these transaction speeds together with data durability.

I have mentioned a number of times, that you _already_ rely on an
equivalent scheme to ramback if you are using a battery-backed raid
controller. Somehow, posters to this thread keep glossing over that
and going back to the sky-is-falling argument.

Daniel
--
Ok, it seems like "ignore sync/fsync unless on UPS power" is what you
really want? That should be easy enough to implement, either in
kernel or as an LD_PRELOAD hack.

So... untar with sync is a fair benchmark against ramback on UPS power
and untar without sync is a fair benchmark against ramback on AC power.

But you did untar with sync against ramback on AC power.
That's wrong.
Pavel
--
help save the Klanovice forest: http://www.ujezdskystrom.info/
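The "ignore sync/fsync unless on UPS power" policy proposed above can be sketched at application level. This is a hypothetical illustration only: the class name and the `on_battery` flag are assumptions, and a real version would sit in the kernel or in a preloaded shim and would be driven by a power management script.

```python
import os
import tempfile

class PowerAwareSync:
    """Hypothetical policy object: skip fsync on line power, honour it on UPS."""

    def __init__(self):
        self.on_battery = False   # assumed to be set by a power management script
        self.skipped = 0
        self.performed = 0

    def fsync(self, fd):
        if self.on_battery:
            os.fsync(fd)          # real flush to stable storage
            self.performed += 1
        else:
            self.skipped += 1     # deliberately ignored: data stays in cache

policy = PowerAwareSync()
with tempfile.TemporaryFile() as f:
    f.write(b"transaction 1\n")
    f.flush()
    policy.fsync(f.fileno())      # line power: ignored
    policy.on_battery = True      # line power lost, UPS takes over
    f.write(b"transaction 2\n")
    f.flush()
    policy.fsync(f.fileno())      # UPS power: honoured
print(policy.skipped, policy.performed)
```

The sketch leaves out the step raised earlier in the thread as the missing feature: flushing everything already cached at the moment line power is lost.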
--
Sure, let's try it and then we will have a race. I would be happy to
It is consistent and correct. You need to supply the missing features
that ramback supplies before you have a filesystem-level solution. I
really encourage you to try it, then we can compare the two approaches
with both of them fully working.

Daniel
--
No numbers. No specifications. And by doing a sync, you explicitly
excluded what I was asking, namely a big buffer cache. You've certainly
convinced me; you don't know if your idea is worth a brass razoo.

Come back when you've got some hard data.
--
...or download the code and try it yourself.
--
expecting the hw to never fail is unreasonable - it will. it's just a
question what happens when (not if) it fails.
and it's not about the backing store being inconsistent during normal
operation - it's about what you are left with after an unclean shutdown.
With your scheme the only time you can trust the on-disk data is when
the device is off; when it fails for some reason (batteries do fail,
kernel bugs do happen, DOS, overheating etc etc) you can no longer
trust any of the data, and no - fsck doesn't help when you have a mix
of old data overwritten by new stuff in basically random order. i can't
see any scenario when it would make sense to trust the corrupted on-disk
fs instead of restoring from backup (or regenerating). So is it just
about avoiding repopulating the fs in the (likely) case of normal,
clean shutdown? This could be a reasonable application of ramback (OTOH
how often will this (shutdown) happen in practice...). IOW you get
a ramdisk-based (ie fast) device that is capable of surviving power loss,
but that's about it.

Now, if you add snapshots to the backing store it suddenly becomes much
more interesting -- you no longer need to put so much trust in all the
hw. Should the device fail for whatever reason then you just rollback to
the last good snapshot upon restart. No corrupted fs, no fsck; you lose
some newly written data (that you couldn't recover w/o a snapshot anyway),
but can trust the rest of it (assuming you trust the fs and storage hw,
but that's no different than w/o ramback).

artur
--
or you could keep two devices as backing store, use one and switch to the
other when the fs is consistent. This could be as simple as noticing zero
dirty data in the ramdisk or, if something is constantly writing to it,
reacting periodically to some barrier (needs cow/doublebuffering in order
to not throttle the writer, but you already do this). Means ramdisk can
be as large as 1/2 the stable storage and a bit more i/o (resyncing after
switch to the other device), but gives you two copies of the data; one
stable and one that can be used to recover newer data should you need to.

artur
--
On Mon, 10 Mar 2008 09:22:13 +0000
That could get ugly when ext3 has written to the same block multiple
times. To get some level of consistency, ramback would need to keep
around the different versions and flush them in order.
--
All rights reversed.
--
Ah, keep snapshots like ddsnap? Interesting idea. But complex, and
ramback will stay perfectly consistent so long as you don't pull the
plug on your UPS. I seem to recall that EMC has been peddling SAN
storage with similar restrictions for quite some time now.

Regards,
Daniel
--
Are you using barriers or ordered disk writes with physical sync in the
right moments or something like that? I think this is needed to allow any

Thanks,
GK
--
Usual block device semantics are preserved so long as UPS power does
not run out before emergency writeback completes. It is not possible
to order writes to the backing store and still deliver ramdisk level
write latency to the application.

After the emergency writeback completes, ramback is supposed to
behave just like a physical disk (with respect to writes - reads will
still have ramdisk level latency). No special support is provided
for barriers. It is not clear that anything special is needed.

Daniel
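The behaviour described in this thread (per-chunk dirty flags, a linear sweep, and a switch to writethrough on loss of line power) can be modelled in a few lines. This is a toy sketch under stated assumptions: the class and method names are hypothetical and none of this is the driver's code.

```python
class ToyRamback:
    """Toy model: a ramdisk with a per-chunk dirty flag and a backing store.
    All names here are hypothetical; this is not the driver's code."""

    def __init__(self, chunks):
        self.ram = [b""] * chunks
        self.disk = [b""] * chunks
        self.dirty = [False] * chunks
        self.writethrough = False

    def write(self, chunk, data):
        self.ram[chunk] = data
        if self.writethrough:
            self.disk[chunk] = data      # forced through before completion
            self.dirty[chunk] = False
        else:
            self.dirty[chunk] = True     # completion signalled at ramdisk speed

    def sweep(self, budget):
        """Flush up to `budget` dirty chunks in linear (address) order,
        which is not the order in which they were written."""
        flushed = 0
        for chunk, is_dirty in enumerate(self.dirty):
            if flushed == budget:
                break
            if is_dirty:
                self.disk[chunk] = self.ram[chunk]
                self.dirty[chunk] = False
                flushed += 1
        return flushed

    def power_fail(self):
        """Line power lost: switch to writethrough, then flush everything."""
        self.writethrough = True
        while self.sweep(budget=len(self.dirty)):
            pass

dev = ToyRamback(chunks=8)
dev.write(5, b"journal")     # written first
dev.write(1, b"metadata")    # written second
dev.sweep(budget=1)          # linear sweep flushes chunk 1 before chunk 5
print(dev.disk[1], dev.disk[5], dev.dirty.count(True))
dev.power_fail()
dev.write(3, b"late")        # now goes straight through
print(dev.disk == dev.ram, any(dev.dirty))
```

The first print shows why write ordering is the point of contention: after a partial sweep the backing store holds the later write but not the earlier one. Once `power_fail()` has completed, the backing store matches the ramdisk and later writes behave as on an ordinary disk.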
--
Why - your chunks simply become a linked list in write barrier order.
Solve your bitmap sweep cost as well. As you are already making a copy
before going to backing store you don't have the internal consistency
problems of further writes during the I/O.

Yes you may need to throttle in the specific case of having too many
copies of pages sitting in the queue - but surely that would be the set of
pages that are written but not yet committed from a previous store
barrier?

BTW: I'm also curious why you made it a block device. What does that
offer over say ramfs + dnotify and a userspace daemon or perhaps for big
files to work smoothly a ramfs variant that keeps dirty bitmaps on file
pages. That way write back would be file level and while you might lose
changesets that have not been fsync()'d your underlying disk fs would
always be coherent.

Alan
--
You get duplicated blocks though. But yes, I agree - write-backs to the
disk must be ordered, otherwise it's going to be too unreliable in practice.

You could switch from a journal like the above to a bitmap when this
overrun occurs. (Typical problem in replication.) SteelEye holds a
patent on that though, as far as I know.

Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
--
Hi Lars,
I disagree with your claim of "too unreliable". If the UPS power does
not fail before flushing completes, it is perfectly reliable. Perhaps
you need a belt to go with your suspenders?

As I wrote earlier, you cannot have optimal writeback speed and ordering
at the same time. I can see eventually implementing some kind of ordered
writeback mode where completion is signalled to the application before
writeback completes. You then get to choose between fastest flush and
most paranoid ordering. I guess everybody will choose fastest flush,
but I will be happy to accept your patch to see which they actually

If you think this is like replication then you have the wrong idea
about what is going on. This is a cache consistency algorithm, not
a replication algorithm.

Regards,
Daniel
--
Daniel, I'm not saying you don't have a good thing here. Just that for
backing large filesystems, the risk of having to run a full fsck and
finding inconsistent metadata is pretty serious.

If I always assume a reliable shutdown - UPS protected, no crashes, etc
- you're right, but at least my real world has other failure scenarios
as well. In fact, the most common reason for unorderly shutdowns are
kernel crashes, not power failures in my experience.

So "perfectly reliable if UPS power does not fail" seems a bit over the
No disagreement here. The question would be how large the performance
I was trying to prod you into writing the ordered flushing. Maybe
claiming it is too hard will do the trick? ;-)

I see the differences, but I also see the similarities. What you're
doing can also be thought of as replicating from an instant IO store
(local memory) to a high latency, low bandwidth copy (the disk)
asynchronously.

Both obviously need to preserve consistency, the question is whether to
achieve transactional (ordered) consistency or not.

Regards,
Lars
--
What are you doing to your kernel?
My desktop at home, which runs my MTA, web site etc, and is subject to
regular oom abuse by firefox:

$ uptime
04:36:49 up 204 days, 9:38, 8 users, load average: 0.82, 0.43, 0.20

This machine has been up ever since I got a UPS for it, and then it
only ever went down due to a blackout or my wife blowing a fuse with
the vacuum cleaner. Honestly, I have never seen a machine running
Linux 2.6 crash due to a software flaw, except when I caused it
myself. I suspect the Linux kernel has a better MTBF than a hard

In fact, replicating was one of the strategies I considered for this.
But since it is a lot more work and will not perform as well as a
simple sweep, I opted for the simple thing. Which turned out to
be pretty complex anyway. You have to close all the same nasty
races but with a considerably more complex base algorithm. I think
that had better wait for version 2.0.

By the way, I could use a hand debugging this thing.
Regards,
Daniel
--
I guess I'm being really vicious to them: I expose it to customers and
the real world.

My own servers also have uptimes of >400 days sometimes, and I wonder
what customers do to the poor things.

And yes, I'm not saying I don't see your point for specialised
deployments (filesystems which are easy to rebuild from scratch), but
transactional integrity is a requirement I'd rank really high on the

Where they control the hardware and run a rather specialized OS as well,
I'm afraid with those properties it doesn't really meet my needs :-(
And, wouldn't a simpler way to achieve something similar not be to use
the plain Linux fs caching/buffers, just disabling forced write out
maybe via a mount option? This strikes me as similar to the effect I get
from remounting NFS (a)sync. Make the fs ignore fsync et al.

It would have the advantage of using all memory available for caching
and not otherwise requested, too. (And, of course, the downside of
making it hard to reserve cache space for a given fs explicitly, at
least now. But I'm sure the control group / container folks would love
that feature. ;-)

Regards,
Lars
--
Actually, in Centera we use generic hardware with a fairly normal kernel
which has strategic backports from upstream (libata, nic drivers, etc).

No UPS in the picture. Data integrity is protected by working with the
application team to ensure they understand when data is safely on the
disk platter and working with IO & FS people to try and make sure we
don't lie to them (too much) about that promise.

The centera boxes are tested with power failure & error injection and by
all of our customers in all those ways customers do ;-)

ric
--
Hi Ric,
Right, so Linux has gotten to the point where it competes with purpose-
built embedded software in reliability. Not quite there, but close
enough for mission-critical.

I was not thinking of Centera when I mentioned the UPS though...
Daniel
--
This is our case, but we have been working for quite a while to enhance
the reliability of the io stack & file systems. It also helps to be very
careful to select hardware components with mature, open source &

No problem, we certainly have many boxes with built in ups hardware ;-)
ric
--
A word to the wise indeed. Well I would never suggest that we can rest
on our laurels as far as Linux reliability is concerned, only that it is
already very reliable or you certainly would not ship products based on
it.

Daniel
--
By the way are you aware that all you have to do is:
echo 1 >/proc/driver/ramback/<name>
and ramback already runs in the mode you speak of?
Daniel
--
It is ranked high, nonetheless I perceive this spate of sky-is-falling
comments as low level FUD. Which of the following do you think is the
least reliable component in your transactional system:

1) Linux
2) The computer
3) The hard disk
4) The battery
5) The fan

Correct answer: the fan. The rest are roughly a tie (though of course
you will find variations) and depend on how much money you spend on
each of them. I know I do not have to explain this to you, but the way
you calculate reliability for a complete system is to multiply the
reliability of each component. The number of nines that drop out of
this calculation is your reliability.

At the moment, the version of Linux I run is looking like a 1.0, so is
the UPS. Already got me through about 4 blackouts and half a dozen
vacuum cleaner events. Though obviously neither is 1.0, both are darn
close. The hard disk on the other hand... I have a box full of broken
ones here, how about you?

I have never had a PC go bad on me, ever. Had a couple of fans die,
but these days I only buy PCs that run fine without a fan.

So your proposition is, I can add nines to this system by introducing
atomic update of the backing store. Fine, I agree with you. However
if I already sit at six or seven nines then should I be putting my
effort there, or where?

Also no need to explain: when you introduce two way redundancy, you
square the reliability. So have two independent power supplies on two
independent UPSes. Sleep easy, plus you gain the ability to do
scheduled battery maintenance, so reliability increases by more than
the square.

No matter how much you fiddle with atomic update of backing store, one
disgruntled sysop going postal can still destroy your data with the
help of a sledgehammer. You need to get this reliability thing in
perspective.

So how about you draft a Suse engineer to get working on the atomic
backing store update, ETA six months? In the mean time, we can
configure a transactional system...
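The reliability arithmetic in the message above can be worked through numerically. This is a sketch with assumed per-component figures, not data from the thread. Note that with two-way redundancy it is the failure probability that gets squared, which is how the number of nines doubles.

```python
import math

def series_reliability(components):
    """All components must work: multiply the individual reliabilities."""
    return math.prod(components)

def redundant_reliability(r, copies=2):
    """Independent redundant copies fail only if every copy fails."""
    return 1.0 - (1.0 - r) ** copies

def nines(r):
    """Number of nines, e.g. 0.999 is about 3."""
    return -math.log10(1.0 - r)

# Assumed per-component figures, for illustration only.
parts = {"linux": 0.9999, "computer": 0.9999, "disk": 0.999, "ups": 0.9999}
system = series_reliability(parts.values())
print(f"series system: {system:.6f} ({nines(system):.2f} nines)")

dual_ups = redundant_reliability(parts["ups"])
print(f"single UPS: {nines(parts['ups']):.1f} nines, "
      f"dual UPS: {nines(dual_ups):.1f} nines")
```

In a series system the weakest component dominates the result, so with these assumed figures the hard disk, not the UPS, sets the overall number of nines.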
Or to quote a little SciFi, in this case Captain Hunt from Andromeda:
Slipstream: it's not the best way to travel faster than light, it's just
the only way.[1]

At least that's what first crossed my mind after reading the above. ;-)
[1]: http://en.wikipedia.org/wiki/Slipstream_%28science_fiction%29#Andromeda
See you
--
Real Programmers consider "what you see is what you get" to be just as
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated,
cryptic, powerful, unforgiving, dangerous.
--
I have experienced many 2.6 crashes due to software flaws. Hung
processes leading to watchdog timeouts, bad kernel pointers, kernel
deadlock, etc.

When designing for reliable embedded systems it's not enough to handwave
away the possibility of software flaws.

Chris
--
Indeed. You fix them. Which 2.6 kernel version failed for you, what
made it fail, and does the latest version still fail?

If Linux is not reliable then we are doomed and I do not care about
whether my ramback fails because I will just slit my wrists anyway.
How about you?

Daniel
--
Daniel, you're not objective. Simply look at LKML reports. People even
report failures with the stable branch, and that's quite expected when
more than 5000 patches are merged in two weeks. The question could even
be returned to you: what kernel are you using to keep 204 days of uptime
doing all that you describe ? Maybe 2.6.16.x would be fine, but even then,
a lot of security issues have been fixed since the last 204 days (most of
which would require a planned reboot), and also a number of normal bugs,

No, I've looked at the violin-memory appliance, and I understand better
your goal. Trying to get a secure and reliable generic kernel is a lost
game. However, building a very specific kernel for an appliance is often
quite achievable, because you know exactly the hardware, usage patterns,
etc... which help you stabilize it. I think people who don't agree with
you are simply thinking about a generic file server. While I would not
like my company's NFS exports to rely on such a technology for the same
concerns as exposed here, I would love to have such a beast for logs
analysis, build farms, or various computations which require more than
an average PC's RAM, and would benefit from the data to remain consistent
across planned downtime.

If you consider that the risk of a crash is 1/year and that you have to
work one day to rebuild everything in case of a crash, it is certainly
worth using this technology for many things. But if you consider that
your data cannot suffer a loss even at a rate of 1/year, then you have
to use something else.

BTW, I would say that IMHO nothing here makes RAID impossible to use :-)
Just wire 2 of these beasts to a central server with 10 Gbps NICs and
you have a nice server :-)

Regards,
Willy
--
So we have a flock of people arguing that you can't trust Linux. Well
maybe there are situations where you can't, but what can you trust?
Disk firmware? Bios? Big maybes everywhere. In my experience, Linux
is very reliable. I think Linus, Andrew and others care an awful lot
about that and go to considerable lengths to make it true. Got a list
of Linux kernel flaws that bring down a system? Tell me and I will not
use that version to run a transaction processing system, or I will fix
them or get them fixed.

But please do not tell me that Linux is too unreliable to run a
transaction processing system. If Linux can't do it, then what can?

By the way, the huge ramdisk that Violin ships runs Linux inside, to
manage the raided, hotswappable memory modules. (Even cooler: they
run Linux on a soft processor implemented on a big FPGA.) Does anybody
think that they did not test to make sure Linux does not compromise
their MTBF in any way?

In practice, for the week I was able to test the box remotely and the
10 days I had it in my hands, the thing was solid as a rock. Good

Sure. Leaving out dodgy stuff like hald, other bits I could mention,
is probably a good idea. Scary thing is, things like hald are actually
being run on servers but that is another issue entirely.

It wasn't too long ago that NFS client was in the dodgy category, with
oops, lockups, whathaveyou. It is pretty solid now, but it takes a
while for the bad experiences to fade from memory. On the other hand,
knfsd has never been the slightest bit of a problem. Helpful
suggestion: don't run NFS client on your transaction processing unit.
It may well be solid, but who needs to find out experimentally? Might
as well toss gamin, dbus and udev while you are at it, for a further
marginal reliability increase. Oh, and alsa, no offense to the great
work there, but it just does not belong on a server. Definitely do

I guess I am actually going to run evaluations on some mission critical
systems using the arrangem...
On Wed, 12 Mar 2008 00:17:56 -0800
The traditional and proven method of constructing a reliable system is
to assume that no component can be fully trusted. This is especially
true for new code.

By being paranoid about everything, failures in one component are
usually contained well enough that one failure is not catastrophic.

In order for ramback to get appeal with the people who are paranoid
about data integrity (probably a vast majority of users), you will
need some guarantees about flush order, etc...
--
All Rights Reversed
--
I disagree. Never mind that it already does provide such guarantees,
just echo 1 >/proc/driver/ramback/name. But if you want the full
performance you need to satisfy your paranoia at a higher level in
the traditional way: by running two in parallel or whatever.

Daniel
--
You've alluded to NBD needing deadlock fixes quite a few times in the
past. I've even had some discussions with you on where you see NBD
lacking (userspace nbd-server doesn't lock memory or set PF_MEMALLOC,
etc). But I've lost track of what changes you have in mind for NBD.
Are you talking about a complete re-write or do you have specific
patches that will salvage the existing NBD client and/or server? Has
this work already been done and you just need to dust it off?

As an aside, using a kernel with the new per bdi dirty page accounting
I've not been able to hit any deadlock scenarios with NBD. Am I not
trying hard enough? Or are they now mythical? If real, do you have a
reproducible scenario that will cause NBD to deadlock?

I'm not interested in swap over NBD (e.g. network memory reserves?)
because in practice I've found that the VM doesn't allow non-swap NBD
use-cases to actually need that "netvm" sophistication... any other
workload that deadlocks NBD would be interesting.

thanks,
Mike
--
Good idea, it would be nice to offer that operating mode. But linear
sweeping is going to put the most data onto rotating media the fastest,
thus making the loss-of-line-power flush window as small as possible,
which is what the current incarnation of this driver optimizes for.

Note that half a TB worth of dirty ramdisk chunks will need 1 GB of
linked list storage, so this imposes a limit on total dirty data for

What happens when the linked list gets too long? Fall back to linear
sweep? Or accept suboptimal write caching?

A linked list would work for linking together dirty bitmap pages, one
level up, thus 2**15 rarer. Even there I prefer the linear sweep. I
intend to implement a dirty map of the dirty map, at least because I
have not seen one of those before, but also because I think it will

Indeed. That is the entire reason I did it that way. In fact ramback
used to write the ramdisk and backing store from the same application
source, so the writethrough code was significantly shorter. But not

The only time ramback cares about barriers is when it switches to
writethrough mode. It would be nice to have a mode where barriers are
respected at the backing store level, but there is no way you will get
the same write performance. The central idea here is that ramback
relies on a UPS to achieve the ultimate in disk performance. I agree
that other modes would be very nice, but not necessary for this thing
to be actually useful. I suspect that early users will be looking for

As a block device it is very flexible, and as a block device it is
fairly simple. As a block device, the only interesting userspace setup
is the hookup to power management scripts. Dnotify... probably you
meant inotify, and even then it sounds daunting, but maybe somebody

That would be a nice hack, why not take a run at it?
Regards,
Daniel
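The storage figures quoted in this message can be reproduced with a short calculation. The chunk size and per-chunk link size below are assumptions chosen because they match the quoted numbers; the message itself does not state them.

```python
CHUNK_BYTES = 4096          # assumed chunk size (4 KiB)
LINK_BYTES = 8              # assumed cost of one list link per dirty chunk

def list_storage(dirty_bytes):
    """Bytes of link storage needed if every dirty chunk carries one link."""
    return (dirty_bytes // CHUNK_BYTES) * LINK_BYTES

def bitmap_storage(total_bytes):
    """Bytes needed for a dirty bitmap with one bit per chunk."""
    return (total_bytes // CHUNK_BYTES) // 8

half_tb = 2**39
print(list_storage(half_tb) / 2**30, "GiB of links for half a TB of dirty data")
print(bitmap_storage(half_tb) / 2**20, "MiB of bitmap for the same data")

# One bitmap page (assumed to be one 4 KiB chunk) holds this many dirty bits,
# so linking bitmap pages instead of chunks needs 2**15 times fewer links.
bits_per_bitmap_page = CHUNK_BYTES * 8
print(bits_per_bitmap_page == 2**15)
```

Under these assumptions the per-chunk list costs 1 GiB where the bitmap costs 16 MiB, which is the trade-off behind preferring the linear sweep.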
--