Distributed storage. Move away from char device ioctls.

Previous thread: [PATCH] add consts where appropriate in sound/pci/hda/* by Denys Vlasenko on Friday, September 14, 2007 - 10:48 am. (13 messages)

Next thread: [PATCH] Refactor hypercall infrastructure by Anthony Liguori on Friday, September 14, 2007 - 12:45 pm. (29 messages)
From: Evgeniy Polyakov
Date: Friday, September 14, 2007 - 11:54 am

Hi.

I'm pleased to announce fourth release of the distributed storage
subsystem, which allows to form a storage on top of remote and local
nodes, which in turn can be exported to another storage as a node to
form tree-like storages.

This release includes new configuration interface (kernel connector over
netlink socket) and number of fixes of various bugs found during move 
to it (in error path).

Further TODO list includes:
* implement optional saving of mirroring/linear information on the remote
	nodes (simple)
* new redundancy algorithm (complex)
* some thoughts about distributed filesystem tightly connected to DST
	(far-far planes so far)

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 0000000..bfc6984
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has own start and size,
+are placed into storage by appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+Simple approach of concatenating storages into single device with
+increased size is used in this algorithm. Essentially new device
+has size equal to sum of sizes of underlying nodes and nodes are
+placed one after another.
+
+  /----- Node 1 ---\                         /------ Node 3 ----\
+start              end                     start               end
+ |==================|========================|==================|
+ |                start                     end                 |
+ |                  \------- Node 2 ---------/                  ...
From: Jeff Garzik
Date: Friday, September 14, 2007 - 12:07 pm

My thoughts.  But first a disclaimer:   Perhaps you will recall me as 
one of the people who really reads all your patches, and examines your 
code and proposals closely.  So, with that in mind...

I question the value of distributed block services (DBS), whether its 
your version or the others out there.  DBS are not very useful, because 
it still relies on a useful filesystem sitting on top of the DBS.  It 
devolves into one of two cases:  (1) multi-path much like today's SCSI, 
with distributed filesystem arbitrarion to ensure coherency, or (2) the 
filesystem running on top of the DBS is on a single host, and thus, a 
single point of failure (SPOF).

It is quite logical to extend the concepts of RAID across the network, 
but ultimately you are still bound by the inflexibility and simplicity 
of the block device.

In contrast, a distributed filesystem offers far more scalability, 
eliminates single points of failure, and offers more room for 
optimization and redundancy across the cluster.

A distributed filesystem is also much more complex, which is why 
distributed block devices are so appealing :)

With a redundant, distributed filesystem, you simply do not need any 
complexity at all at the block device level.  You don't even need RAID.

It is my hope that you will put your skills towards a distributed 
filesystem :)  Of the current solutions, GFS (currently in kernel) 
scales poorly, and NFS v4.1 is amazingly bloated and overly complex.

I've been waiting for years for a smart person to come along and write a 
POSIX-only distributed filesystem.

	Jeff



-

From: Al Boldi
Date: Friday, September 14, 2007 - 1:46 pm

This http://lkml.org/lkml/2007/8/12/159 may provide a fast-path to reaching 
that goal.


Thanks!

--
Al
-

From: J. Bruce Fields
Date: Friday, September 14, 2007 - 2:12 pm

What exactly do you mean by "POSIX-only"?

--b.
-

From: Jeff Garzik
Date: Friday, September 14, 2007 - 2:14 pm

Don't bother supporting attributes, file modes, and other details not 
supported by POSIX.  The prime example being NFSv4, which is larded down 
with Windows features.

NFSv4.1 adds to the fun, by throwing interoperability completely out the 
window.

	Jeff



-

From: J. Bruce Fields
Date: Friday, September 14, 2007 - 2:18 pm

I am sympathetic....  Cutting those out may still leave you with

What parts are you worried about in particular?

--b.
-

From: Jeff Garzik
Date: Friday, September 14, 2007 - 3:32 pm

I'm not worried; I'm stating facts as they exist today (draft 13):

NFS v4.1 does something completely without precedent in the history of 
NFS:  the specification is defined such that interoperability is 
-impossible- to guarantee.

pNFS permits private and unspecified layout types.  This means it is 
impossible to guarantee that one NFSv4.1 implementation will be able to 
talk another NFSv4.1 implementation.

Even if Linux supports the entire NFSv4.1 RFC (as it stands in draft 13 
anyway), there is no guarantee at all that Linux will be able to store 
and retrieve data, since it's entirely possible that a proprietary 
protocol is required to access your data.

NFSv4.1 is no longer a completely open architecture.

	Jeff




-

From: J. Bruce Fields
Date: Friday, September 14, 2007 - 3:42 pm

No, servers are required to support ordinary nfs operations to the
metadata server.

At least, that's the way it was last I heard, which was a while ago.  I
agree that it'd stink (for any number of reasons) if you ever *had* to
get a layout to access some file.

Was that your main concern?

--b.
-

From: Jeff Garzik
Date: Friday, September 14, 2007 - 9:08 pm

I just sorta assumed you could fall back to the NFSv4.0 mode of 
operation, going through the metadata server for all data accesses.

But look at that choice in practice:  you can either ditch pNFS 
completely, or use a proprietary solution.  The market incentives are 
CLEARLY tilted in favor of makers of proprietary solutions.  But it's a 
poor choice (really little choice at all).

Overall, my main concern is that NFSv4.1 is no longer an open 
architecture solution.  The "no-pNFS or proprietary platform" choice 
merely illustrate one of many negative aspects of this architecture.

One of NFS's biggest value propositions is its interoperability.  To 
quote some Wall Street guys, "NFS is like crack.  It Just Works.  We 
love it."

Now, for the first time in NFS's history (AFAIK), the protocol is no 
longer completely specified, completely known.  No longer a "closed 
loop."  Private layout types mean that it is _highly_ unlikely that any 
OS or appliance or implementation will be able to claim "full NFS 
compatibility."

And when the proprietary portion of the spec involves something as basic 
as accessing one's own data, I consider that a fundamental flaw.  NFS is 
no longer completely open.

	Jeff



-

From: J. Bruce Fields
Date: Friday, September 14, 2007 - 9:40 pm

Right.  So any two pNFS implementations *will* be able to talk to each
other; they just may not be able to use the (possibly higher-bandwidth)

I doubt somebody would go to all the trouble to implement pNFS and then
present their customers with that kind of choice.  But maybe I'm missing
something.  What market incentives do you see that would make that more
attractive than either 1) using a standard fully-specified layout type,

It's always been possible to extend NFS in various ways if you want.
You could use sideband protocols with v2 and v3, for example.  People
have done that.  Some of them have been standardized and widely
implemented, some haven't.  You could probably add your own compound ops
to v4 if you wanted, I guess.

And there's advantages to experimenting with extensions first and then
standardizing when you figure out what works.  I wish it happened that

Do you know of any such "private layout types"?

This is kind of a boring argument, isn't it?  I'd rather hear whatever
ideas you have for a new distributed filesystem protocol.

--b.
-

From: Mike Snitzer
Date: Friday, September 14, 2007 - 7:54 pm

This distributed storage is very much needed; even if it were to act
as a more capable/performant replacement for NBD (or MD+NBD) in the
near term.  Many high availability applications don't _need_ all the
additional complexity of a full distributed filesystem.  So given
that, its discouraging to see you trying to gently push Evgeniy away
from all the promising work he has published.

Evgeniy, please continue your current work.

Mike
-

From: Evgeniy Polyakov
Date: Saturday, September 15, 2007 - 5:34 am

Hi Mike.


Thanks Mike, I work on this and will until feel it is completed.

Distributed filesystem is a logical continuation of the whoe idea of
storing data on the several remote nodes - DST and FS must exist
together for the maximum performance, but that does not mean that block
layer itself should be abandoned. As you probably noticed from my mail
to Jeff, distributed storage was originally part of the overall
filesystem design.

-- 
	Evgeniy Polyakov
-

From: Evgeniy Polyakov
Date: Saturday, September 15, 2007 - 5:29 am

Hi Jeff.


Yes, block device itself is not able to scale well, but it is the place
for redundancy, since filesystem will just fail if underlying device
does not work correctly and FS actually does not know about where it
should place redundancy bits - it might happen to be the same broken 
disk, so I created a low-level device which distribute requests itself.
It is not allowed to mount it via multiple points, that is where
distributed filesystem must enter the show - multiple remote nodes
export its devices via network, each client gets address of the remote
node to work with, connect to it and process requests. All those bits
are already in the DST, next logical step is to connect it with

Well, originally (about half a year ago) I started to draft a generic
filesystem which would be just superior to existing designs, not
overbloated like zfs, and just faster.
I do believe it can be implemented.
Further I added network capabilities (since what I saw that time 
(AFS was proposed) I did not like - I'm not saying it is bad or 
something like that at all, but I would implement things differently) 
into design drafts.

When Chris Mason announced btrfs, I found that quite a few new ideas 
are already implemented there, so I postponed project (although
direction of the developement of the btrfs seems to move to the zfs side
with some questionable imho points, so I think I can jump to the wagon
of new filesystems right now). 

DST is low level for my (theoretical so far) filesystem (actually its 
network part) like kevent was a low level system for network AIO (originally).

No matter what filesystem works with network it implements some kind 
of logic completed in DST.
Sometimes it is very simple, sometimes a bit more complex, but
eventually it is a network entity with parts of stuff I put into DST.
Since I postponed the project (looking at btrfs and its results), I
completed DST as a standalone block device.

So, essentially, a filesystem with simple distributed facilities ...
From: Andreas Dilger
Date: Saturday, September 15, 2007 - 10:24 am

I actually think there is a place for this - and improvements are
definitely welcome.  Even Lustre needs block-device level redundancy
currently, though we will be working to make Lustre-level redundancy
available in the future (the problem is WAY harder than it seems at

This is an area I'm always a bit sad about in OSS development - the need
everyone has to make a new {fs, editor, gui, etc} themselves instead of
spending more time improving the work we already have.  Imagine where the
internet would be (or not) if there were 50 different network protocols
instead of TCP/IP?  If you don't like some things about btrfs, maybe you
can fix them?

To be honest, developing a new filesystem that is actually widely useful
and used is a very time consuming task (see Reiserfs and Reiser4).  It
takes many years before the code is reliable enough for people to trust it,
so most likely any effort you put into this would be wasted unless you can
come up with something that is dramatically better than something existing.

The part that bothers me is that this same effort could have been used to
improve something that more people would use (btrfs in this case).  Of
course, sometimes the new code is substantially better than what currently
exists, and I think btrfs may have laid claim to the current generation of
filesystems.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-

From: Kyle Moffett
Date: Sunday, September 16, 2007 - 12:07 am

I really think that to get proper non-block-device-level filesystem  
redundancy you need to base it on something similar to the GIT  
model.  Data replication is done in specific-sized chunks indexed by  
SHA-1 sum and you actually have a sort of "merge algorithm" for when  
local and remote changes differ.  The OS would only implement a very  
limited list of merge algorithms, IE one of:

(A)  Don't merge, each client gets its own branch and merges are manual
(B)  Most recent changed version is made the master every X-seconds/ 
open/close/write/other-event.
(C)  The tree at X (usually a particular client/server) is always  
used as the master when there are conflicts.

This lets you implement whatever replication policy you want:  You  
can require that some files are replicated (cached) on *EVERY*  
system, you can require that other files are cached on at least X  
systems.  You can say "this needs to be replicated on at least X% of  
the online systems, or at most Y".  Moreover, the replication could  
be done pretty easily from userspace via a couple syscalls.  You also  
automatically keep track of history with some default purge policy.

The main point is that for efficiency and speed things are *not*  
always replicated; this also allows for offline operation.  You would  
of course have "userspace" merge drivers which notice that the tree  
on your laptop is not a subset/superset of the tree on your desktop  
and do various merges based on per-file metadata.  My address-book,  
for example, would have a custom little merge program which knows  
about how to merge changes between two address book files, asking me  
useful questions along the way.  Since a lot of this merging is  
mechanical, some of the code from GIT could easily be made into a  
"merge library" which knows how to do such things.

Moreover, this would allow me to have a "shared" root filesystem on  
my laptop and desktop.  It would have 'sub-project'-type trees, so  
that "/" would be an ...
From: Evgeniy Polyakov
Date: Friday, October 26, 2007 - 3:44 am

Returning back to this, since block based storage, which can act as a
shared storage/transport layer, is ready with 5'th release of the DST.

My couple of notes on proposed data distribution algorithm in FS.


This looks like a good way to work with offline clients (where I recall
Coda), after offline node modified data, it should be merged back to the
cluster with different algorithms.

Data (supposed to be) written to the failed node during its offline time
will be resynced from other nodes when failed one is online, there are
no problems and/or special algorithms to be used here.

Filesystem replication is not a 100% 'git way' - git tree contains
already combined objects - i.e. the last blob for given path does not
contain its history, only ready to be used data, while filesystem,
especially that one which requires simultaneous write from different
threads/nodes, should implement copy-on-write semantics, essentially
putting all new data (git commit) to the new location and then collect
it from different extents to present a ready file.

At least that is how I see the filesystem I'm working on.


Git semantics and copy-on-write has actually quite a lot in common (on
high enough level of abstraction), but SHA1 index is not a possible 
issue in filesystem - even besides amount of data to be hashed before
key is ready. Key should also contain enough information about what
underlying data is - git does not store that information (tree, blob or
whatever) in its keys, since it does not require it. At least that is
how I see it to be implemented.

Overall I see this new project as a true copy-on-write FS.


-- 
	Evgeniy Polyakov
-

From: Evgeniy Polyakov
Date: Sunday, September 16, 2007 - 6:43 am

If that would be true, we would be still in the stone age. 
Or not, actually I think the first cell in the universe would not bother 
itself dividing into the two just because it could spent infinite time 

When some idea is implemented it is virtually impossible to change it,
only recreate new one with fixed issues. So, we have multiple ext,
reiser and many others. I do not say btrfs is broken or has design
problems, it is really interesting filesystem, but all we have our own 
opinions about how things should be done, that's it.

Btw, we do have so many network protocols for different purposes, that
number of (storage) filesystems is negligebly small compared to it. 
Internet as is popular today is just a subset of where network is used.

And we do invent new protocols each time we need something new, which
does not fit into existing models (for example TCP by design can not
work with very long-distance links with tooo long RTT). We have sctp to
fix some tcp issues. Number of IP layer 'neighbours' is even more.
Physical media layer has many different protocols too.

Yep, I know. 

Call me greedy bastard, but I do not care about world happiness, it is
just impossible to achieve. So I like what I do right now.
If it will be rest under the layer of dust I do not care, I like the

-- 
	Evgeniy Polyakov
-

From: Robin Humble
Date: Saturday, September 15, 2007 - 6:56 am

it's called Lustre.
works well, scales well, is widely used, is GPL.
sadly it's not in mainline.

cheers,
robin
-

From: Jeff Garzik
Date: Saturday, September 15, 2007 - 7:35 am

Lustre is tilted far too much towards high-priced storage, and needs 
improvement before it could be considered for mainline.

	Jeff



-

From: Robin Humble
Date: Saturday, September 15, 2007 - 9:20 am

many (most?) Lustre deployments are with SATA and md raid5 and GigE -
can't get much cheaper than that.

if you want storage node failover capabilities (which larger sites often
do) or want to saturate an IB link then the price of the storage goes
up but this is a consequence of wanting more reliability or performance,
not anything to do with lustre.

interestingly, one of the ways to provide dual-attached storage behind
a failover pair of lustre servers (apart from buying SAS) would be via
a networked-raid-1 device like Evgeniy's, so I don't see distributed
block devices and distributed filesystems as being mutually exclusive.

quite likely.
from what I understand (hopefully I am mistaken) they consider a merge
task to be too daunting as the number of kernel subsystems that any
scalable distributed filesystem touches is necessarily large.

roadmaps indicate that parts of lustre are likely to move to userspace
(partly to ease solaris and ZFS ports) so perhaps those performance
critical parts that remain kernel space will be easier to merge.

cheers,
robin
-

From: Andreas Dilger
Date: Saturday, September 15, 2007 - 10:51 am

I have to agree - while Lustre CAN scale up to huge servers and fat pipes,
it can definitely also scale down (which is a LOT easier to do :-).  I can
run a client + MDS + 5 OSTs in a single UML instance using loop devices

That is definitely true, and there are a number of users who run in
this mode.  We're also working to make Lustre handle the replication
internally (RAID5/6+ at the OST level) so you wouldn't need any kind of
block-level redundancy at all.  I suspect some sites may still use RAID5/6
back-ends anyways to avoid performance loss from taking out a whole OST

It's definitely true, and we are always working at improving it.  It
used to be in the past that one of the reasons we DIDN'T want to go
into mainline was because this would restrict our ability to make
network protocol changes.  Because our install base is large enough
and many of the large sites with mutliple supercomputers mounting
multiple global filesystems we aren't at liberty to change the network
protocol at will anymore.  That said, we also have network protocol
versioning that is akin to the ext3 COMPAT/INCOMPAT feature flags, so

That's partly true - Lustre has its own RDMA RPC mechanism, but it does
not need kernel patches anymore (we removed the zero-copy callback and
do this at the protocol level because there was too much resistance to it).
We are now also able to run a client filesystem that doesn't require any
kernel patches, since we've given up on trying to get the intents and
raw operations into the VFS, and have worked out other ways to improve
the performance to compensate.  Likewise with parallel directory operations.

It's a bit sad, in a way, because these are features that other filesystems

This is also true - when that is done the only parts that will remain
in the kernel are the network drivers.  With some network stacks there
is even direct userspace acceleration.  We'll use RDMA and direct IO to
avoid doing any user<->kernel data copies.

Cheers, Andreas
--
Andreas ...
Previous thread: [PATCH] add consts where appropriate in sound/pci/hda/* by Denys Vlasenko on Friday, September 14, 2007 - 10:48 am. (13 messages)

Next thread: [PATCH] Refactor hypercall infrastructure by Anthony Liguori on Friday, September 14, 2007 - 12:45 pm. (29 messages)