From: Evgeniy Polyakov <johnpol@...>
Subject: Re: Distributed storage. Move away from char device ioctls.
Date: Sep 15, 8:29 am 2007
Hi Jeff.
On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik (jeff@garzik.org) wrote:
> >Further TODO list includes:
> >* implement optional saving of mirroring/linear information on the remote
> > nodes (simple)
> >* new redundancy algorithm (complex)
> >* some thoughts about distributed filesystem tightly connected to DST
> > (far-far plans so far)
> >
> >Homepage:
> >http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
> >
> >Signed-off-by: Evgeniy Polyakov
>
> My thoughts. But first a disclaimer: Perhaps you will recall me as
> one of the people who really reads all your patches, and examines your
> code and proposals closely. So, with that in mind...
:)
> I question the value of distributed block services (DBS), whether it's
> your version or the others out there. DBS are not very useful, because
> it still relies on a useful filesystem sitting on top of the DBS. It
> devolves into one of two cases: (1) multi-path much like today's SCSI,
> with distributed filesystem arbitration to ensure coherency, or (2) the
> filesystem running on top of the DBS is on a single host, and thus, a
> single point of failure (SPOF).
>
> It is quite logical to extend the concepts of RAID across the network,
> but ultimately you are still bound by the inflexibility and simplicity
> of the block device.
Yes, block device itself is not able to scale well, but it is the place
for redundancy, since filesystem will just fail if underlying device
does not work correctly and FS actually does not know about where it
should place redundancy bits - it might happen to be the same broken
disk, so I created a low-level device which distributes requests itself.
It is not allowed to mount it via multiple points, that is where a
distributed filesystem must enter the show - multiple remote nodes
export their devices via network, each client gets the address of the
remote node to work with, connects to it and processes requests. All those
bits are already in DST, the next logical step is to connect it with
a higher-layer filesystem.
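
[Illustration, not part of the original mail.] The mirroring/linear
split mentioned in the TODO list can be sketched as a request-mapping
step. This is only a minimal sketch under stated assumptions: the
function names (linear_map, mirror_map), the tuple layout and the
"read from the first live replica" policy are hypothetical and are not
taken from the DST code.

```python
from bisect import bisect_right
from itertools import accumulate


def linear_map(sizes, start, length):
    """Split request [start, start + length) over nodes laid end to end.

    sizes  -- capacity of each remote node, in sectors (hypothetical unit)
    Returns a list of (node_index, local_offset, chunk_length) tuples.
    """
    ends = list(accumulate(sizes))  # cumulative end offset of each node
    if start < 0 or length <= 0 or start + length > ends[-1]:
        raise ValueError("request outside the device")
    parts = []
    node = bisect_right(ends, start)  # first node whose end lies past start
    while length > 0:
        node_start = ends[node] - sizes[node]
        local = start - node_start
        chunk = min(length, sizes[node] - local)
        parts.append((node, local, chunk))
        start += chunk
        length -= chunk
        node += 1
    return parts


def mirror_map(alive, start, length, write):
    """Writes go to every live replica; reads go to the first live one."""
    nodes = [i for i, ok in enumerate(alive) if ok]
    if not nodes:
        raise RuntimeError("no live replica left")
    targets = nodes if write else nodes[:1]
    return [(n, start, length) for n in targets]
```

A request that crosses a node boundary in linear mode is split into one
chunk per node, while in mirror mode a failed node is simply skipped,
which is the redundancy the filesystem above never has to know about.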
> In contrast, a distributed filesystem offers far more scalability,
> eliminates single points of failure, and offers more room for
> optimization and redundancy across the cluster.
>
> A distributed filesystem is also much more complex, which is why
> distributed block devices are so appealing :)
>
> With a redundant, distributed filesystem, you simply do not need any
> complexity at all at the block device level. You don't even need RAID.
>
> It is my hope that you will put your skills towards a distributed
> filesystem :) Of the current solutions, GFS (currently in kernel)
> scales poorly, and NFS v4.1 is amazingly bloated and overly complex.
>
> I've been waiting for years for a smart person to come along and write a
> POSIX-only distributed filesystem.
Well, originally (about half a year ago) I started to draft a generic
filesystem which would be just superior to existing designs, not
overbloated like zfs, and just faster.
I do believe it can be implemented.
Further I added network capabilities into the design drafts, since I did
not like what I saw at that time (AFS was proposed) - I'm not saying it
is bad or something like that at all, but I would implement things
differently.
When Chris Mason announced btrfs, I found that quite a few new ideas
are already implemented there, so I postponed project (although
direction of the development of the btrfs seems to move to the zfs side
with some questionable imho points, so I think I can jump to the wagon
of new filesystems right now).
DST is low level for my (theoretical so far) filesystem (actually its
network part) like kevent was a low level system for network AIO (originally).
No matter what filesystem works with network it implements some kind
of logic completed in DST.
Sometimes it is very simple, sometimes a bit more complex, but
eventually it is a network entity with parts of stuff I put into DST.
Since I postponed the project (looking at btrfs and its results), I
completed DST as a standalone block device.
So, essentially, a filesystem with simple distributed facilities is on
(my) radar, but so far you are first who requested it :)
--
Evgeniy Polyakov
-
From: Andreas Dilger <adilger@...>
Subject: Re: Distributed storage. Move away from char device ioctls.
Date: Sep 15, 1:24 pm 2007
On Sep 15, 2007 16:29 +0400, Evgeniy Polyakov wrote:
> Yes, block device itself is not able to scale well, but it is the place
> for redundancy, since filesystem will just fail if underlying device
> does not work correctly and FS actually does not know about where it
> should place redundancy bits - it might happen to be the same broken
> disk, so I created a low-level device which distributes requests itself.
I actually think there is a place for this - and improvements are
definitely welcome. Even Lustre needs block-device level redundancy
currently, though we will be working to make Lustre-level redundancy
available in the future (the problem is WAY harder than it seems at
first glance, if you allow writeback caches at the clients and servers).
> When Chris Mason announced btrfs, I found that quite a few new ideas
> are already implemented there, so I postponed project (although
> direction of the development of the btrfs seems to move to the zfs side
> with some questionable imho points, so I think I can jump to the wagon
> of new filesystems right now).
This is an area I'm always a bit sad about in OSS development - the need
everyone has to make a new {fs, editor, gui, etc} themselves instead of
spending more time improving the work we already have. Imagine where the
internet would be (or not) if there were 50 different network protocols
instead of TCP/IP? If you don't like some things about btrfs, maybe you
can fix them?
To be honest, developing a new filesystem that is actually widely useful
and used is a very time consuming task (see Reiserfs and Reiser4). It
takes many years before the code is reliable enough for people to trust it,
so most likely any effort you put into this would be wasted unless you can
come up with something that is dramatically better than something existing.
The part that bothers me is that this same effort could have been used to
improve something that more people would use (btrfs in this case). Of
course, sometimes the new code is substantially better than what currently
exists, and I think btrfs may have laid claim to the current generation of
filesystems.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
-
From: Evgeniy Polyakov <johnpol@...>
Subject: Re: Distributed storage. Move away from char device ioctls.
Date: Sep 16, 9:43 am 2007
On Sat, Sep 15, 2007 at 11:24:46AM -0600, Andreas Dilger (adilger@clusterfs.com) wrote:
> > When Chris Mason announced btrfs, I found that quite a few new ideas
> > are already implemented there, so I postponed project (although
> > direction of the development of the btrfs seems to move to the zfs side
> > with some questionable imho points, so I think I can jump to the wagon
> > of new filesystems right now).
>
> This is an area I'm always a bit sad about in OSS development - the need
> everyone has to make a new {fs, editor, gui, etc} themselves instead of
> spending more time improving the work we already have. Imagine where the
If that were true, we would still be in the stone age.
Or not, actually I think the first cell in the universe would not have
bothered dividing itself in two, just because it could spend infinite
time trying to make itself better.
> internet would be (or not) if there were 50 different network protocols
> instead of TCP/IP? If you don't like some things about btrfs, maybe you
> can fix them?
When some idea is implemented it is virtually impossible to change it,
only to create a new one with the issues fixed. So, we have multiple ext,
reiser and many others. I do not say btrfs is broken or has design
problems, it is a really interesting filesystem, but we all have our own
opinions about how things should be done, that's it.
Btw, we do have so many network protocols for different purposes, that
number of (storage) filesystems is negligibly small compared to it.
Internet as is popular today is just a subset of where network is used.
And we do invent new protocols each time we need something new, which
does not fit into existing models (for example TCP by design can not
work with very long-distance links with too long RTT). We have sctp to
fix some tcp issues. Number of IP layer 'neighbours' is even more.
Physical media layer has many different protocols too.
And that is just what exists in the linux tree...
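
[Illustration, not part of the original mail.] One way to see the RTT
remark: a sender can have at most one window of unacknowledged data in
flight per round trip, so throughput is bounded by window / RTT. The
sketch below assumes the classic 65535-byte window without window
scaling; the function name and the sample RTT values are chosen only for
illustration.

```python
def max_tcp_throughput(window_bytes, rtt_seconds):
    """Upper bound on throughput in bytes/s: one window per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("RTT must be positive")
    return window_bytes / rtt_seconds


if __name__ == "__main__":
    window = 65535  # assumption: 16-bit window field, no window scaling
    for rtt in (0.001, 0.1, 0.5, 10.0):  # sample round-trip times, seconds
        bound = max_tcp_throughput(window, rtt)
        print(f"RTT {rtt:>6} s -> at most {bound:,.0f} bytes/s")
```

The bound falls in direct proportion to the RTT, regardless of how much
bandwidth the link itself offers.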
> To be honest, developing a new filesystem that is actually widely useful
> and used is a very time consuming task (see Reiserfs and Reiser4). It
> takes many years before the code is reliable enough for people to trust it,
> so most likely any effort you put into this would be wasted unless you can
> come up with something that is dramatically better than something existing.
Yep, I know.
Wasting my time is one of the most pleasant things I ever tried in my life.
> The part that bothers me is that this same effort could have been used to
> improve something that more people would use (btrfs in this case). Of
> course, sometimes the new code is substantially better than what currently
> exists, and I think btrfs may have laid claim to the current generation of
> filesystems.
Call me greedy bastard, but I do not care about world happiness, it is
just impossible to achieve. So I like what I do right now.
If it ends up resting under a layer of dust I do not care, I like the
process of creating, so if it fails, I will just get new knowledge.
:)
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
--
Evgeniy Polyakov
-
When Chris Mason announced
It would be interesting to hear Chris's actual criticism towards btrfs (and ZFS); it's definitely not too late to change anything in btrfs yet.
GlusterFS
An alternative to Lustre (if you want a "POSIX-only distributed filesystem") is GlusterFS.
http://gluster.org/docs/index.php/GlusterFS
(Disclosure: I'm one of the developers)
I respect that you're one of the glusterfs developers, but I can't personally recommend that fuse file system yet. It doesn't handle much of a load, bugs creep in very easily with every update, and it suffers from severe protocol bloat and serious instability.
Glusterfs is a great idea, but the implementation currently seems brittle.