Disk image formats have nothing to do with the I/O interface.

I believe Gerd added it for running unmodified Xen guests in qemu, but he can explain more of it. I've only mentioned it here because it's the one I easily have access to. Given Xen's about 4 different I/O backends and the various forked

If you pass it on it has the same semantics, but given that you'll usually end up having multiple guest disks on a single volume using lvm or similar you'll end up draining even more I/O, as there is one queue for all of them. That way you can easily have one guest starve others.

Note that we're going to get rid of the draining for common cases anyway, but that's a separate discussion thread, the "relaxed barriers"

If they are in Linux/POSIX userspace they can't, because there are no system calls to achieve that. And then again there really is no need to implement all this in the host anyway - the draining is something we enforced on ourselves in Linux without good reason,

Just read Documentation/block/barriers.txt, it's very well described there. Even the naming of the various ORDERED constants should

It's one of the many backends written to the protocol specification, I don't think it's fair to call it irrelevant. And as mentioned before I'd be very surprised if the other backends all get it right. If you send me pointers to one or two backends you consider "relevant" I'm happy to look at them.
--
Well, you can boot pv kernels with upstream qemu. qemu must be compiled with xen support enabled, you need xen underneath and xenstored must run, but nothing else (xend, tapdisk, ...) is required. qemu will call xen libraries to build the domain and run the pv kernel. qemu provides backends for console, framebuffer, network and disk. There was also the plan to allow xen being emulated, so you could run pv kernels in qemu without xen (using tcg or kvm). Basically xenner merged into qemu. That project was never finished though and I didn't spend any time on it for at least one year ... Hope this clarifies, Gerd --
That one is read and well understood.

I presently don't see a point in having the frontend perform its own pre or post flushes as long as there's a single queue in the block layer. But if the kernel drops the plain _TAG mode, there is no problem with that. Essentially the frontend may drain the queue as much as it wants. It just won't buy you much if the backend I/O was actually buffered, other than adding latency to the transport.

The only thing which matters is that the frontend lld gets to see the actual barrier point, anything else needs to be sorted out next to the physical layer anyway, so it's better left to the backends.

Not sure if I understand your above comment regarding the flush and fua bits. Did you mean to indicate that _TAG on the frontend's request_queue is presently not coming up with the empty barrier request to make _explicit_ cache flushes happen? That would be something which definitely needs a workaround in the frontend then. In that case, would PRE/POSTFLUSH help, to get a call into prepare_flush_fn, which might insert the tag itself then? It sounds a bit over the top to combine this with a queue drain on the transport, but I'm rather after correctness.

Regarding the potential starvation problems when accessing shared physical storage you mentioned above: Yes, good point, we discussed that too, although only briefly, and it's a todo which I don't think has been solved in any present backend. But again, scheduling/merging drain/flush/fua on shared physical nodes more carefully would be something better *enforced*. The frontend can't even avoid it.

I wonder if there's a userspace solution for that. Does e.g. fdatasync() deal with independent invocations other than by serializing them? Couldn't find anything which indicates that, but I might not have looked hard enough.

The blktap userspace component presently doesn't buffer, so a _DRAIN is sufficient. But if it did, then it'd be kinda cool if that were handled more carefully.
If the kernel does it, all the ...
Given that xen blkfront does not actually implement cache flushes

You do need the _FLUSH or _FUA modes (either with TAGs or DRAIN) to get the block layer to send you pure cache flush requests (aka "empty barriers"); without this they don't work. The way the current barrier code is implemented means you will always get manual cache flushes before the actual barrier requests once you implement that. By using the _FUA mode you can still do your own post flush.

I've been through doing all this, and given how hard it is to do a semi-efficient drain in a backend driver, and given that non-Linux guests don't even benefit from it, just leaving the draining to the guest is the easiest solution. If you already have the draining around and are confident that it gets all corner cases right you can of course keep it and use the QUEUE_ORDERED_TAG_FLUSH/QUEUE_ORDERED_TAG_FUA modes. But from dealing with data integrity issues in virtualized environments I'm not confident that things will just work, both on the backend side, especially if image formats are around, and also on the

Doesn't buffer as in using O_SYNC/O_DSYNC or O_DIRECT? You still need to call fdatasync for the latter, to flush out transactions for block allocations in sparse / fallocated images and to flush the volatile write cache of the host disks.
--
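To make the O_SYNC/O_DSYNC versus O_DIRECT distinction above concrete, here is a minimal userspace sketch. It only models the decision the paragraph describes: a descriptor opened without O_SYNC/O_DSYNC (including an O_DIRECT one) still needs an explicit fdatasync() when the guest requests a cache flush. The names `needs_explicit_flush` and `complete_guest_flush` are hypothetical and are not taken from blktap, blkback or qemu.

```python
import os

# O_DIRECT is Linux-specific; treat it as 0 where the platform lacks it.
O_DIRECT = getattr(os, "O_DIRECT", 0)
# Fall back to fsync() on platforms without fdatasync().
_datasync = getattr(os, "fdatasync", os.fsync)


def needs_explicit_flush(open_flags):
    """True unless the image was opened with O_SYNC or O_DSYNC.

    Assumption taken from the discussion: O_DIRECT alone bypasses the
    host page cache but does not flush allocation metadata or the
    disk's volatile write cache.
    """
    return (open_flags & (os.O_SYNC | os.O_DSYNC)) == 0


def complete_guest_flush(fd, open_flags, sync=_datasync):
    """Hypothetical backend hook run when the guest asks for a cache flush.

    Returns True if an explicit sync call was issued.
    """
    if needs_explicit_flush(open_flags):
        sync(fd)
        return True
    return False
```

A backend that opens its image with `os.O_RDWR | O_DIRECT` would issue the sync on every guest flush in this model, while one opened with `os.O_DSYNC` would not.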
Well, given the stuff below I might actually go and read it again, maybe

Okay. Well that's a frontend thing, let's absolutely fix according to

Stop, now that's a different thing, if we want to keep stuff simple (we really want) this asks for making draining the default mode for everyone? You basically want everybody to commit to a preflush, right? Only? Is that everything? Does the problem relate to merging barrier points, grouping frontends on shared storage, or am I missing something more general? Because otherwise it still sounds relatively straightforward.

I wouldn't be against defaulting frontend barrier users to DRAIN if it's clearly beneficial for any kind of backend involved. For present blkback it's a no-brainer because we just map to the blk queue, boring as we are. But even considering group synchronization, instead of just dumb serialization, the need for a guest queue drain doesn't look obvious. The backend would have to drain everybody else on its own terms anyway to find a good merge point. So I'm still wondering. Can you explain a little more what makes your backend depend on it?

Otherwise one could always go and impose a couple extra flags on frontend authors, provided there's a benefit and it doesn't result in just mapping the entire QUEUE_ORDERED set into the control interface. :P But either way, that doesn't sound like a preferable solution if we can

The image format stuff in Xen servers is not sitting immediately behind the frontend ring anymore. This one indeed orders with a drain, but again we use the block layer to take care of that. Works, cheaply for

Well, I understand that _TAG is the only model in there which doesn't map easily to the concept of a full cache flush on the normal data path, after all it's the only one in there where the device wants to deal with it alone. Which happens to be exactly the reason why we wanted it in xen-blkfront. If it doesn't really work like that for a Linux guest, tough luck. It ...
With the barrier model we have in current kernels you basically need to
a) do a drain (typically inside the guest) and you need to have a cache
flush command if you have volatile write cache semantics. The cache
flush command will be used for pre-flushes, standalone flushes and
Which backend? Currently filesystems can in theory rely on the ordering
semantics, although very few do. And we've not seen a working
implementation except for draining for it - the _TAG versions exist,
but they are basically untested, and no one has solved the issues of
Basically the only thing you need is a cache flush command right now,
that solves everything the Linux kernel needs, as does Windows or
possibly other guests. The draining is something imposed on us by
the current Linux barrier semantics, and I'm confident it will be a
I think you're another victim of the overloaded barrier concept. What
the Linux barrier flag does is two only slightly related things:
a) give the filesystem a way to flush volatile write caches and thus
guarantee data integrity
b) provide block level ordering loosely modeled after the SCSI ordered
tag model
Practice has shown that we just need (a) in most cases, there's only
two filesystems that can theoretically take advantage of (b), and even
there I'm sure we could do better without the block level draining.
The _TAG implementation of barriers is related to (b) only - the pure
QUEUE_ORDERED_TAG is only safe if you do not have a volatile write
cache - to do cache flushing you need the QUEUE_ORDERED_TAG_FLUSH or
QUEUE_ORDERED_TAG_FUA modes. In theory we could also add another
mode that not only integrates the post-flush in the _FUA mode but
also a pre-flush into a single command, but so far there wasn't
any demand, most likely because no on-the-wire storage protocol
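The split between ordering (b) and cache flushing (a) in the modes named above can be written down as a small model. This is only an illustrative sketch: the flag names echo the QUEUE_ORDERED_* naming, but the bit values are arbitrary, the DRAIN variants are assumed to mirror the TAG ones, and nothing here is copied from the kernel headers. Documentation/block/barriers.txt remains the reference.

```python
from enum import Flag


class Ordered(Flag):
    """Illustrative building blocks; bit values are arbitrary for this model."""
    NONE = 0
    BY_DRAIN = 1 << 0      # ordering by draining the queue
    BY_TAG = 1 << 1        # ordering by ordered tags
    DO_BAR = 1 << 2        # the barrier request itself
    DO_PREFLUSH = 1 << 3   # cache flush before the barrier
    DO_POSTFLUSH = 1 << 4  # cache flush after the barrier
    DO_FUA = 1 << 5        # barrier written with FUA instead of a post-flush


_DRAIN = Ordered.BY_DRAIN | Ordered.DO_BAR
_TAG = Ordered.BY_TAG | Ordered.DO_BAR

# Assumed composition of the modes discussed in this thread.
MODES = {
    "QUEUE_ORDERED_NONE": Ordered.NONE,
    "QUEUE_ORDERED_DRAIN": _DRAIN,
    "QUEUE_ORDERED_DRAIN_FLUSH": _DRAIN | Ordered.DO_PREFLUSH | Ordered.DO_POSTFLUSH,
    "QUEUE_ORDERED_DRAIN_FUA": _DRAIN | Ordered.DO_PREFLUSH | Ordered.DO_FUA,
    "QUEUE_ORDERED_TAG": _TAG,
    "QUEUE_ORDERED_TAG_FLUSH": _TAG | Ordered.DO_PREFLUSH | Ordered.DO_POSTFLUSH,
    "QUEUE_ORDERED_TAG_FUA": _TAG | Ordered.DO_PREFLUSH | Ordered.DO_FUA,
}


def flushes_cache(mode):
    """True if the mode includes any cache flush or FUA step, i.e. part (a)."""
    flush_bits = Ordered.DO_PREFLUSH | Ordered.DO_POSTFLUSH | Ordered.DO_FUA
    return bool(mode & flush_bits)


def safe_with_volatile_cache(name):
    """Plain ordering modes are only safe without a volatile write cache."""
    return flushes_cache(MODES[name])
```

In this model `safe_with_volatile_cache("QUEUE_ORDERED_TAG")` is false, while the `_TAG_FLUSH` and `_TAG_FUA` variants are true, which is the point made about plain QUEUE_ORDERED_TAG.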
The typical one is f(data)sync for the case where there have been no
modifications of metadata, or when using an external log device.
No metadata modifications are quite ...

That's all true at the physical layer. I'm rather talking about the virtual one -- what constitutes the transport between the frontend and backend.

So if the block queue above xen-blkfront wants to jump through a couple extra loops, such as declaring TAG_FUA mode, to realize proper out-of-band cache flushing, fine. Underneath a backend, whether that's blkback or qemu, that draining and flushing will happen at the physical layer, too. Agreed.

That still doesn't mean you have to impose a drain on the transport in between. The block layer underneath the backend does all the draining necessary, with a request stream just submitted in-order and barrier bits set where the guest saw fit. Including an empty one for an explicit cache flush.

Neither does a backend want to know how the physical layer will deal with it in detail, or can. Except for the NONE case, of course. And I still don't see where any backends can claim overall benefit from requiring the guest to drain.

At that level, a "TAG" is the much simpler and more efficient one. Even if it neither applies to a Linux guest, nor a caching disk. Especially the ones far below, underneath some image format. It maps well to the bio layer, it even maps well to a trivial datasync() implementation in userspace, and I don't see why it wouldn't map well to a non-trivial one either. These aren't just two shorted Linux block layers.

So far I'd suggest we keep the ring model as TAG vs. NONE, fix xen-blkfront to keep the empty barrier stuff going, and keep additional details where they belong, which is on either end, respectively.

On the Linux frontend side, does TAG_FUA sound about right to you? Because to me that appears to be the one with the least noise around the actual barrier request. According to barriers.txt, then I guess we will map the flush to an empty barrier on the ring and in turn drop a gratuitous empty barrier following that (?). I obviously didn't try that out yet.
Please absolutely correct me so we maybe get it right ...
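For illustration, the "trivial datasync() implementation in userspace" mentioned above could look roughly like the sketch below. It is a model under stated assumptions, not code from any existing backend: `Request` is a made-up stand-in for a ring request (not the real blkif layout), requests are handled strictly in submission order, and a barrier is implemented in the most conservative way, with a data sync before the barrier write and another after it. An empty barrier degenerates into a pure cache flush.

```python
import os
from dataclasses import dataclass

# Fall back to fsync() on platforms without fdatasync().
_datasync = getattr(os, "fdatasync", os.fsync)


@dataclass
class Request:
    """Hypothetical ring request: a write, optionally flagged as a barrier."""
    offset: int
    data: bytes = b""
    barrier: bool = False


def _pwrite_all(fd, data, offset):
    """Write the whole buffer, retrying after short writes."""
    view = memoryview(data)
    while view:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


def process_in_order(fd, requests, sync=_datasync):
    """Serve requests in ring order; barriers become data syncs."""
    for req in requests:
        if req.barrier:
            sync(fd)  # everything submitted before the barrier is stable
        if req.data:
            _pwrite_all(fd, req.data, req.offset)
        if req.barrier and req.data:
            sync(fd)  # the barrier write itself is stable before continuing
```

The guest never has to drain its own queue in this model; serialization happens entirely inside the backend loop, at the price of the backend being fully synchronous.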
Yes, that's unfortunate. In the normal case the IO streams would
actually be independent so they wouldn't need to be serialized with
respect to each other. But I don't know if that kind of partial-order
Userspace might not be relying on the kernel to do storage (it might
I've gone over it a few times. Since the blkback barriers do both
ordering and flushing, it seems to me that plain _TAG is the right
choice; we don't need _TAG_FLUSH or _TAG_FUA. I still don't understand
what you mean about "explicit cache flush (aka empty barrier) is
silently dropped". Who drops it where? Do you mean the block subsystem
will drop an empty write, even if it has a barrier associated with it,
but if I set PREFLUSH and POSTFLUSH/FUA then those will still come
You can see the current state in
git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git
xen/dom0/backend/blkback is the actual backend part. It can either
attach directly to a file/device, or go via blktap for usermode processing.
J
--
