Re: Distributed storage. Move away from char device ioctls.

Previous thread: [PATCH] add consts where appropriate in sound/pci/hda/* by Denys Vlasenko on Friday, September 14, 2007 - 1:48 pm. (13 messages)

Next thread: [PATCH] Refactor hypercall infrastructure by Anthony Liguori on Friday, September 14, 2007 - 3:45 pm. (29 messages)
To: <netdev@...>
Cc: <linux-kernel@...>, <linux-fsdevel@...>
Date: Friday, September 14, 2007 - 2:54 pm

Hi.

I'm pleased to announce fourth release of the distributed storage
subsystem, which allows to form a storage on top of remote and local
nodes, which in turn can be exported to another storage as a node to
form tree-like storages.

This release includes new configuration interface (kernel connector over
netlink socket) and number of fixes of various bugs found during move
to it (in error path).

Further TODO list includes:
* implement optional saving of mirroring/linear information on the remote
nodes (simple)
* new redundancy algorithm (complex)
* some thoughts about distributed filesystem tightly connected to DST
(far-far planes so far)

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 0000000..bfc6984
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has own start and size,
+are placed into storage by appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+Simple approach of concatenating storages into single device with
+increased size is used in this algorithm. Essentially new device
+has size equal to sum of sizes of underlying nodes and nodes are
+placed one after another.
+
+ /----- Node 1 ---\ /------ Node 3 ----\
+start end start end
+ |==================|========================|==================|
+ | start end |
+ | \------- Node 2 ---------/ ...

To: Evgeniy Polyakov <johnpol@...>
Cc: <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Friday, September 14, 2007 - 3:07 pm

My thoughts. But first a disclaimer: Perhaps you will recall me as
one of the people who really reads all your patches, and examines your
code and proposals closely. So, with that in mind...

I question the value of distributed block services (DBS), whether its
your version or the others out there. DBS are not very useful, because
it still relies on a useful filesystem sitting on top of the DBS. It
devolves into one of two cases: (1) multi-path much like today's SCSI,
with distributed filesystem arbitrarion to ensure coherency, or (2) the
filesystem running on top of the DBS is on a single host, and thus, a
single point of failure (SPOF).

It is quite logical to extend the concepts of RAID across the network,
but ultimately you are still bound by the inflexibility and simplicity
of the block device.

In contrast, a distributed filesystem offers far more scalability,
eliminates single points of failure, and offers more room for
optimization and redundancy across the cluster.

A distributed filesystem is also much more complex, which is why
distributed block devices are so appealing :)

With a redundant, distributed filesystem, you simply do not need any
complexity at all at the block device level. You don't even need RAID.

It is my hope that you will put your skills towards a distributed
filesystem :) Of the current solutions, GFS (currently in kernel)
scales poorly, and NFS v4.1 is amazingly bloated and overly complex.

I've been waiting for years for a smart person to come along and write a
POSIX-only distributed filesystem.

Jeff

-

To: Jeff Garzik <jeff@...>
Cc: Evgeniy Polyakov <johnpol@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Saturday, September 15, 2007 - 9:56 am

it's called Lustre.
works well, scales well, is widely used, is GPL.
sadly it's not in mainline.

cheers,
robin
-

To: Robin Humble <rjh@...>
Cc: Evgeniy Polyakov <johnpol@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Saturday, September 15, 2007 - 10:35 am

Lustre is tilted far too much towards high-priced storage, and needs
improvement before it could be considered for mainline.

Jeff

-

To: Jeff Garzik <jeff@...>
Cc: Evgeniy Polyakov <johnpol@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Saturday, September 15, 2007 - 12:20 pm

many (most?) Lustre deployments are with SATA and md raid5 and GigE -
can't get much cheaper than that.

if you want storage node failover capabilities (which larger sites often
do) or want to saturate an IB link then the price of the storage goes
up but this is a consequence of wanting more reliability or performance,
not anything to do with lustre.

interestingly, one of the ways to provide dual-attached storage behind
a failover pair of lustre servers (apart from buying SAS) would be via
a networked-raid-1 device like Evgeniy's, so I don't see distributed
block devices and distributed filesystems as being mutually exclusive.

quite likely.
from what I understand (hopefully I am mistaken) they consider a merge
task to be too daunting as the number of kernel subsystems that any
scalable distributed filesystem touches is necessarily large.

roadmaps indicate that parts of lustre are likely to move to userspace
(partly to ease solaris and ZFS ports) so perhaps those performance
critical parts that remain kernel space will be easier to merge.

cheers,
robin
-

To: Robin Humble <rjh@...>
Cc: Jeff Garzik <jeff@...>, Evgeniy Polyakov <johnpol@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Saturday, September 15, 2007 - 1:51 pm

I have to agree - while Lustre CAN scale up to huge servers and fat pipes,
it can definitely also scale down (which is a LOT easier to do :-). I can
run a client + MDS + 5 OSTs in a single UML instance using loop devices

That is definitely true, and there are a number of users who run in
this mode. We're also working to make Lustre handle the replication
internally (RAID5/6+ at the OST level) so you wouldn't need any kind of
block-level redundancy at all. I suspect some sites may still use RAID5/6
back-ends anyways to avoid performance loss from taking out a whole OST

It's definitely true, and we are always working at improving it. It
used to be in the past that one of the reasons we DIDN'T want to go
into mainline was because this would restrict our ability to make
network protocol changes. Because our install base is large enough
and many of the large sites with mutliple supercomputers mounting
multiple global filesystems we aren't at liberty to change the network
protocol at will anymore. That said, we also have network protocol
versioning that is akin to the ext3 COMPAT/INCOMPAT feature flags, so

That's partly true - Lustre has its own RDMA RPC mechanism, but it does
not need kernel patches anymore (we removed the zero-copy callback and
do this at the protocol level because there was too much resistance to it).
We are now also able to run a client filesystem that doesn't require any
kernel patches, since we've given up on trying to get the intents and
raw operations into the VFS, and have worked out other ways to improve
the performance to compensate. Likewise with parallel directory operations.

It's a bit sad, in a way, because these are features that other filesystems

This is also true - when that is done the only parts that will remain
in the kernel are the network drivers. With some network stacks there
is even direct userspace acceleration. We'll use RDMA and direct IO to
avoid doing any user<->kernel data copies.

Cheers, Andreas
--
Andreas Dilge...

To: Jeff Garzik <jeff@...>
Cc: <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Saturday, September 15, 2007 - 8:29 am

Hi Jeff.

Yes, block device itself is not able to scale well, but it is the place
for redundancy, since filesystem will just fail if underlying device
does not work correctly and FS actually does not know about where it
should place redundancy bits - it might happen to be the same broken
disk, so I created a low-level device which distribute requests itself.
It is not allowed to mount it via multiple points, that is where
distributed filesystem must enter the show - multiple remote nodes
export its devices via network, each client gets address of the remote
node to work with, connect to it and process requests. All those bits
are already in the DST, next logical step is to connect it with

Well, originally (about half a year ago) I started to draft a generic
filesystem which would be just superior to existing designs, not
overbloated like zfs, and just faster.
I do believe it can be implemented.
Further I added network capabilities (since what I saw that time
(AFS was proposed) I did not like - I'm not saying it is bad or
something like that at all, but I would implement things differently)
into design drafts.

When Chris Mason announced btrfs, I found that quite a few new ideas
are already implemented there, so I postponed project (although
direction of the developement of the btrfs seems to move to the zfs side
with some questionable imho points, so I think I can jump to the wagon
of new filesystems right now).

DST is low level for my (theoretical so far) filesystem (actually its
network part) like kevent was a low level system for network AIO (originally).

No matter what filesystem works with network it implements some kind
of logic completed in DST.
Sometimes it is very simple, sometimes a bit more complex, but
eventually it is a network entity with parts of stuff I put into DST.
Since I postponed the project (looking at btrfs and its results), I
completed DST as a standalone block device.

So, essentially, a filesystem with simple distributed facilities is on...

To: Evgeniy Polyakov <johnpol@...>
Cc: Jeff Garzik <jeff@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Saturday, September 15, 2007 - 1:24 pm

I actually think there is a place for this - and improvements are
definitely welcome. Even Lustre needs block-device level redundancy
currently, though we will be working to make Lustre-level redundancy
available in the future (the problem is WAY harder than it seems at

This is an area I'm always a bit sad about in OSS development - the need
everyone has to make a new {fs, editor, gui, etc} themselves instead of
spending more time improving the work we already have. Imagine where the
internet would be (or not) if there were 50 different network protocols
instead of TCP/IP? If you don't like some things about btrfs, maybe you
can fix them?

To be honest, developing a new filesystem that is actually widely useful
and used is a very time consuming task (see Reiserfs and Reiser4). It
takes many years before the code is reliable enough for people to trust it,
so most likely any effort you put into this would be wasted unless you can
come up with something that is dramatically better than something existing.

The part that bothers me is that this same effort could have been used to
improve something that more people would use (btrfs in this case). Of
course, sometimes the new code is substantially better than what currently
exists, and I think btrfs may have laid claim to the current generation of
filesystems.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-

To: Andreas Dilger <adilger@...>
Cc: Jeff Garzik <jeff@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Sunday, September 16, 2007 - 9:43 am

If that would be true, we would be still in the stone age.
Or not, actually I think the first cell in the universe would not bother
itself dividing into the two just because it could spent infinite time

When some idea is implemented it is virtually impossible to change it,
only recreate new one with fixed issues. So, we have multiple ext,
reiser and many others. I do not say btrfs is broken or has design
problems, it is really interesting filesystem, but all we have our own
opinions about how things should be done, that's it.

Btw, we do have so many network protocols for different purposes, that
number of (storage) filesystems is negligebly small compared to it.
Internet as is popular today is just a subset of where network is used.

And we do invent new protocols each time we need something new, which
does not fit into existing models (for example TCP by design can not
work with very long-distance links with tooo long RTT). We have sctp to
fix some tcp issues. Number of IP layer 'neighbours' is even more.
Physical media layer has many different protocols too.

Yep, I know.

Call me greedy bastard, but I do not care about world happiness, it is
just impossible to achieve. So I like what I do right now.
If it will be rest under the layer of dust I do not care, I like the

--
Evgeniy Polyakov
-

To: Andreas Dilger <adilger@...>
Cc: Evgeniy Polyakov <johnpol@...>, Jeff Garzik <jeff@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Sunday, September 16, 2007 - 3:07 am

I really think that to get proper non-block-device-level filesystem
redundancy you need to base it on something similar to the GIT
model. Data replication is done in specific-sized chunks indexed by
SHA-1 sum and you actually have a sort of "merge algorithm" for when
local and remote changes differ. The OS would only implement a very
limited list of merge algorithms, IE one of:

(A) Don't merge, each client gets its own branch and merges are manual
(B) Most recent changed version is made the master every X-seconds/
open/close/write/other-event.
(C) The tree at X (usually a particular client/server) is always
used as the master when there are conflicts.

This lets you implement whatever replication policy you want: You
can require that some files are replicated (cached) on *EVERY*
system, you can require that other files are cached on at least X
systems. You can say "this needs to be replicated on at least X% of
the online systems, or at most Y". Moreover, the replication could
be done pretty easily from userspace via a couple syscalls. You also
automatically keep track of history with some default purge policy.

The main point is that for efficiency and speed things are *not*
always replicated; this also allows for offline operation. You would
of course have "userspace" merge drivers which notice that the tree
on your laptop is not a subset/superset of the tree on your desktop
and do various merges based on per-file metadata. My address-book,
for example, would have a custom little merge program which knows
about how to merge changes between two address book files, asking me
useful questions along the way. Since a lot of this merging is
mechanical, some of the code from GIT could easily be made into a
"merge library" which knows how to do such things.

Moreover, this would allow me to have a "shared" root filesystem on
my laptop and desktop. It would have 'sub-project'-type trees, so
that "/" would be an independent br...

To: Kyle Moffett <mrmacman_g4@...>
Cc: Andreas Dilger <adilger@...>, Jeff Garzik <jeff@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Friday, October 26, 2007 - 6:44 am

Returning back to this, since block based storage, which can act as a
shared storage/transport layer, is ready with 5'th release of the DST.

My couple of notes on proposed data distribution algorithm in FS.

This looks like a good way to work with offline clients (where I recall
Coda), after offline node modified data, it should be merged back to the
cluster with different algorithms.

Data (supposed to be) written to the failed node during its offline time
will be resynced from other nodes when failed one is online, there are
no problems and/or special algorithms to be used here.

Filesystem replication is not a 100% 'git way' - git tree contains
already combined objects - i.e. the last blob for given path does not
contain its history, only ready to be used data, while filesystem,
especially that one which requires simultaneous write from different
threads/nodes, should implement copy-on-write semantics, essentially
putting all new data (git commit) to the new location and then collect
it from different extents to present a ready file.

At least that is how I see the filesystem I'm working on.

Git semantics and copy-on-write has actually quite a lot in common (on
high enough level of abstraction), but SHA1 index is not a possible
issue in filesystem - even besides amount of data to be hashed before
key is ready. Key should also contain enough information about what
underlying data is - git does not store that information (tree, blob or
whatever) in its keys, since it does not require it. At least that is
how I see it to be implemented.

Overall I see this new project as a true copy-on-write FS.

--
Evgeniy Polyakov
-

To: Jeff Garzik <jeff@...>
Cc: Evgeniy Polyakov <johnpol@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Friday, September 14, 2007 - 10:54 pm

This distributed storage is very much needed; even if it were to act
as a more capable/performant replacement for NBD (or MD+NBD) in the
near term. Many high availability applications don't _need_ all the
additional complexity of a full distributed filesystem. So given
that, its discouraging to see you trying to gently push Evgeniy away
from all the promising work he has published.

Evgeniy, please continue your current work.

Mike
-

To: Mike Snitzer <snitzer@...>
Cc: Jeff Garzik <jeff@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Saturday, September 15, 2007 - 8:34 am

Hi Mike.

Thanks Mike, I work on this and will until feel it is completed.

Distributed filesystem is a logical continuation of the whoe idea of
storing data on the several remote nodes - DST and FS must exist
together for the maximum performance, but that does not mean that block
layer itself should be abandoned. As you probably noticed from my mail
to Jeff, distributed storage was originally part of the overall
filesystem design.

--
Evgeniy Polyakov
-

To: Jeff Garzik <jeff@...>
Cc: Evgeniy Polyakov <johnpol@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Friday, September 14, 2007 - 5:12 pm

What exactly do you mean by "POSIX-only"?

--b.
-

To: J. Bruce Fields <bfields@...>
Cc: Evgeniy Polyakov <johnpol@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Friday, September 14, 2007 - 5:14 pm

Don't bother supporting attributes, file modes, and other details not
supported by POSIX. The prime example being NFSv4, which is larded down
with Windows features.

NFSv4.1 adds to the fun, by throwing interoperability completely out the
window.

Jeff

-

To: Jeff Garzik <jeff@...>
Cc: Evgeniy Polyakov <johnpol@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Friday, September 14, 2007 - 5:18 pm

I am sympathetic.... Cutting those out may still leave you with

What parts are you worried about in particular?

--b.
-

To: J. Bruce Fields <bfields@...>
Cc: Evgeniy Polyakov <johnpol@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Friday, September 14, 2007 - 6:32 pm

I'm not worried; I'm stating facts as they exist today (draft 13):

NFS v4.1 does something completely without precedent in the history of
NFS: the specification is defined such that interoperability is
-impossible- to guarantee.

pNFS permits private and unspecified layout types. This means it is
impossible to guarantee that one NFSv4.1 implementation will be able to
talk another NFSv4.1 implementation.

Even if Linux supports the entire NFSv4.1 RFC (as it stands in draft 13
anyway), there is no guarantee at all that Linux will be able to store
and retrieve data, since it's entirely possible that a proprietary
protocol is required to access your data.

NFSv4.1 is no longer a completely open architecture.

Jeff

-

To: Jeff Garzik <jeff@...>
Cc: Evgeniy Polyakov <johnpol@...>, <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Friday, September 14, 2007 - 6:42 pm

No, servers are required to support ordinary nfs operations to the
metadata server.

At least, that's the way it was last I heard, which was a while ago. I
agree that it'd stink (for any number of reasons) if you ever *had* to
get a layout to access some file.

Was that your main concern?

--b.
-

To: J. Bruce Fields <bfields@...>
Cc: <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Saturday, September 15, 2007 - 12:08 am

I just sorta assumed you could fall back to the NFSv4.0 mode of
operation, going through the metadata server for all data accesses.

But look at that choice in practice: you can either ditch pNFS
completely, or use a proprietary solution. The market incentives are
CLEARLY tilted in favor of makers of proprietary solutions. But it's a
poor choice (really little choice at all).

Overall, my main concern is that NFSv4.1 is no longer an open
architecture solution. The "no-pNFS or proprietary platform" choice
merely illustrate one of many negative aspects of this architecture.

One of NFS's biggest value propositions is its interoperability. To
quote some Wall Street guys, "NFS is like crack. It Just Works. We
love it."

Now, for the first time in NFS's history (AFAIK), the protocol is no
longer completely specified, completely known. No longer a "closed
loop." Private layout types mean that it is _highly_ unlikely that any
OS or appliance or implementation will be able to claim "full NFS
compatibility."

And when the proprietary portion of the spec involves something as basic
as accessing one's own data, I consider that a fundamental flaw. NFS is
no longer completely open.

Jeff

-

To: Jeff Garzik <jeff@...>
Cc: <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Saturday, September 15, 2007 - 12:40 am

Right. So any two pNFS implementations *will* be able to talk to each
other; they just may not be able to use the (possibly higher-bandwidth)

I doubt somebody would go to all the trouble to implement pNFS and then
present their customers with that kind of choice. But maybe I'm missing
something. What market incentives do you see that would make that more
attractive than either 1) using a standard fully-specified layout type,

It's always been possible to extend NFS in various ways if you want.
You could use sideband protocols with v2 and v3, for example. People
have done that. Some of them have been standardized and widely
implemented, some haven't. You could probably add your own compound ops
to v4 if you wanted, I guess.

And there's advantages to experimenting with extensions first and then
standardizing when you figure out what works. I wish it happened that

Do you know of any such "private layout types"?

This is kind of a boring argument, isn't it? I'd rather hear whatever
ideas you have for a new distributed filesystem protocol.

--b.
-

To: Jeff Garzik <jeff@...>, Evgeniy Polyakov <johnpol@...>
Cc: <netdev@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Friday, September 14, 2007 - 4:46 pm

This http://lkml.org/lkml/2007/8/12/159 may provide a fast-path to reaching
that goal.

Thanks!

--
Al
-

Previous thread: [PATCH] add consts where appropriate in sound/pci/hda/* by Denys Vlasenko on Friday, September 14, 2007 - 1:48 pm. (13 messages)

Next thread: [PATCH] Refactor hypercall infrastructure by Anthony Liguori on Friday, September 14, 2007 - 3:45 pm. (29 messages)