Debating Distributed Block Devices

Submitted by Jeremy
on September 17, 2007 - 11:44pm

"I'm pleased to announce [the] fourth release of the distributed storage subsystem, which allows [you] to form a storage [block device] on top of remote and local nodes, which in turn can be exported to another storage [block device] as a node to form tree-like storage [block devices]," Evgeniy Polyakov stated on the Linux Kernel mailing list. The new release includes a new configuration interface and several bug fixes.
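To make the "tree-like" composition concrete, here is a minimal user-space sketch in Python. It is not DST's code (DST is a kernel subsystem written in C), and the class names `MemoryNode`, `Linear` and `Mirror` are hypothetical, chosen only to echo the "mirroring/linear" modes mentioned in the announcement's TODO list. It assumes each node exposes byte-offset `read` and `write` calls, and that any composed storage can itself be used as a node of another storage.

```python
class MemoryNode:
    """Stand-in for a local disk or a remote node: just bytes in memory."""

    def __init__(self, size):
        self.size = size
        self.data = bytearray(size)
        self.failed = False  # flip to True to simulate a dead node

    def _check(self, offset, length):
        if self.failed:
            raise OSError("node unavailable")
        if offset < 0 or offset + length > self.size:
            raise ValueError("request outside of node")

    def read(self, offset, length):
        self._check(offset, length)
        return bytes(self.data[offset:offset + length])

    def write(self, offset, payload):
        self._check(offset, len(payload))
        self.data[offset:offset + len(payload)] = payload


class Linear:
    """Concatenates children: child 0 holds the first bytes, child 1 the next."""

    def __init__(self, children):
        self.children = children
        self.size = sum(c.size for c in children)

    def _split(self, offset, length):
        # Yield (child, offset_inside_child, chunk_length) for each child touched.
        if offset < 0 or offset + length > self.size:
            raise ValueError("request outside of storage")
        start = 0
        for child in self.children:
            end = start + child.size
            if length > 0 and offset < end:
                chunk = min(length, end - offset)
                yield child, offset - start, chunk
                offset += chunk
                length -= chunk
            start = end

    def read(self, offset, length):
        return b"".join(c.read(o, n) for c, o, n in self._split(offset, length))

    def write(self, offset, payload):
        pos = 0
        for child, child_offset, n in self._split(offset, len(payload)):
            child.write(child_offset, payload[pos:pos + n])
            pos += n


class Mirror:
    """Writes to every child; reads from the first child that still answers."""

    def __init__(self, children):
        self.children = children
        self.size = min(c.size for c in children)

    def read(self, offset, length):
        for child in self.children:
            try:
                return child.read(offset, length)
            except OSError:
                continue  # try the next copy
        raise OSError("all mirrors failed")

    def write(self, offset, payload):
        written = 0
        for child in self.children:
            try:
                child.write(offset, payload)
                written += 1
            except OSError:
                pass  # a real implementation would have to resync this copy later
        if written == 0:
            raise OSError("all mirrors failed")


# A small tree: a mirror whose first copy is itself a linear storage.
a, b, c = MemoryNode(8), MemoryNode(8), MemoryNode(16)
storage = Mirror([Linear([a, b]), c])
storage.write(6, b"DATA")   # spans the a/b boundary on the first copy
a.failed = True             # lose one node
print(storage.read(6, 4))   # still readable, served from the copy on c
```

The sketch deliberately ignores everything that makes the real problem hard: networking, request queues, resynchronising a copy that missed writes, and concurrent access from more than one host, which is the point Jeff raises below.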

Network device driver and SATA subsystem maintainer, Jeff Garzik, was not impressed with the concept, "[distributed block devices] are not very useful, because it still relies on a useful filesystem sitting on top of the DBS." He went on to explain the problem, "it devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitration to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF)." He proposed instead that time would be better spent developing a POSIX-only distributed filesystem, "in contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster." Jeff went on to caution, "a distributed filesystem is also much more complex, which is why distributed block devices are so appealing :)" When Lustre was pointed out as an existing option, Jeff noted, "Lustre is tilted far too much towards high-priced storage, and needs improvement before it could be considered for mainline."

Evgeniy explained that he has bigger ideas for filesystem development, "originally (about half a year ago) I started to draft a generic filesystem which would be just superior to existing designs, not overbloated like zfs, and just faster. I do believe it can be implemented." He continued:

"When Chris Mason announced btrfs, I found that quite a few new ideas are already implemented there, so I postponed [the] project (although direction of the development of the btrfs seems to move to the zfs side with some questionable imho points, so I think I can jump to the wagon of new filesystems right now)."

He went on to explain that his distributed storage subsystem was the first step toward a bigger goal, "so, essentially, a filesystem with simple distributed facilities is on (my) radar, but so far you are first who requested it :)"

Andreas Dilger expressed that he'd prefer to see Evgeniy work to improve existing code, such as btrfs, rather than starting something new, "to be honest, developing a new filesystem that is actually widely useful and used is a very time consuming task (see Reiserfs and Reiser4). It takes many years before the code is reliable enough for people to trust it, so most likely any effort you put into this would be wasted unless you can come up with something that is dramatically better than something existing."

Evgeniy made it clear that he was content working on his distributed storage subsystem no matter the outcome, humorously quipping, "wasting my time is one of the most pleasant things I ever tried in my life." He went on to add that his motivation is to learn, "I like what I do right now. If it will be [laid to] rest under [a] layer of dust I do not care, I like the process of creating, so if it will fail, I just will get new knowledge."


From: Evgeniy Polyakov <johnpol@...>
Subject: Distributed storage. Move away from char device ioctls.
Date: Sep 14, 2:54 pm 2007



From: Jeff Garzik <jeff@...>
Subject: Re: Distributed storage. Move away from char device ioctls.
Date: Sep 14, 3:07 pm 2007

Evgeniy Polyakov wrote:
> Hi.
>
> I'm pleased to announce fourth release of the distributed storage
> subsystem, which allows to form a storage on top of remote and local
> nodes, which in turn can be exported to another storage as a node to
> form tree-like storages.
>
> This release includes new configuration interface (kernel connector over
> netlink socket) and number of fixes of various bugs found during move
> to it (in error path).
>
> Further TODO list includes:
> * implement optional saving of mirroring/linear information on the remote
> nodes (simple)
> * new redundancy algorithm (complex)
> * some thoughts about distributed filesystem tightly connected to DST
> (far-far planes so far)
>
> Homepage:
> http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
>
> Signed-off-by: Evgeniy Polyakov

My thoughts. But first a disclaimer: Perhaps you will recall me as
one of the people who really reads all your patches, and examines your
code and proposals closely. So, with that in mind...

I question the value of distributed block services (DBS), whether it's
your version or the others out there. DBS are not very useful, because
it still relies on a useful filesystem sitting on top of the DBS. It
devolves into one of two cases: (1) multi-path much like today's SCSI,
with distributed filesystem arbitration to ensure coherency, or (2) the
filesystem running on top of the DBS is on a single host, and thus, a
single point of failure (SPOF).

It is quite logical to extend the concepts of RAID across the network,
but ultimately you are still bound by the inflexibility and simplicity
of the block device.

In contrast, a distributed filesystem offers far more scalability,
eliminates single points of failure, and offers more room for
optimization and redundancy across the cluster.

A distributed filesystem is also much more complex, which is why
distributed block devices are so appealing :)

With a redundant, distributed filesystem, you simply do not need any
complexity at all at the block device level. You don't even need RAID.

It is my hope that you will put your skills towards a distributed
filesystem :) Of the current solutions, GFS (currently in kernel)
scales poorly, and NFS v4.1 is amazingly bloated and overly complex.

I've been waiting for years for a smart person to come along and write a
POSIX-only distributed filesystem.

Jeff

-


From: Robin Humble <rjh@...>
Subject: Re: Distributed storage. Move away from char device ioctls.
Date: Sep 15, 9:56 am 2007

On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
>It is my hope that you will put your skills towards a distributed
>filesystem :) Of the current solutions, GFS (currently in kernel)
>scales poorly, and NFS v4.1 is amazingly bloated and overly complex.
>
>I've been waiting for years for a smart person to come along and write a
>POSIX-only distributed filesystem.

it's called Lustre.
works well, scales well, is widely used, is GPL.
sadly it's not in mainline.

cheers,
robin
-


From: Jeff Garzik <jeff@...>
Subject: Re: Distributed storage. Move away from char device ioctls.
Date: Sep 15, 10:35 am 2007

Robin Humble wrote:
> On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
>> It is my hope that you will put your skills towards a distributed
>> filesystem :) Of the current solutions, GFS (currently in kernel)
>> scales poorly, and NFS v4.1 is amazingly bloated and overly complex.
>>
>> I've been waiting for years for a smart person to come along and write a
>> POSIX-only distributed filesystem.
>
> it's called Lustre.
> works well, scales well, is widely used, is GPL.
> sadly it's not in mainline.

Lustre is tilted far too much towards high-priced storage, and needs
improvement before it could be considered for mainline.

Jeff

-

From: Evgeniy Polyakov <johnpol@...>
Subject: Re: Distributed storage. Move away from char device ioctls.
Date: Sep 15, 8:29 am 2007

Hi Jeff.

On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik (jeff@garzik.org) wrote:
> >Further TODO list includes:
> >* implement optional saving of mirroring/linear information on the remote
> > nodes (simple)
> >* new redundancy algorithm (complex)
> >* some thoughts about distributed filesystem tightly connected to DST
> > (far-far planes so far)
> >
> >Homepage:
> >http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
> >
> >Signed-off-by: Evgeniy Polyakov
>
> My thoughts. But first a disclaimer: Perhaps you will recall me as
> one of the people who really reads all your patches, and examines your
> code and proposals closely. So, with that in mind...

:)

> I question the value of distributed block services (DBS), whether it's
> your version or the others out there. DBS are not very useful, because
> it still relies on a useful filesystem sitting on top of the DBS. It
> devolves into one of two cases: (1) multi-path much like today's SCSI,
> with distributed filesystem arbitration to ensure coherency, or (2) the
> filesystem running on top of the DBS is on a single host, and thus, a
> single point of failure (SPOF).
>
> It is quite logical to extend the concepts of RAID across the network,
> but ultimately you are still bound by the inflexibility and simplicity
> of the block device.

Yes, block device itself is not able to scale well, but it is the place
for redundancy, since filesystem will just fail if underlying device
does not work correctly and FS actually does not know about where it
should place redundancy bits - it might happen to be the same broken
disk, so I created a low-level device which distribute requests itself.
It is not allowed to mount it via multiple points, that is where
distributed filesystem must enter the show - multiple remote nodes
export its devices via network, each client gets address of the remote
node to work with, connect to it and process requests. All those bits
are already in the DST, next logical step is to connect it with
higher-layer filesystem.

> In contrast, a distributed filesystem offers far more scalability,
> eliminates single points of failure, and offers more room for
> optimization and redundancy across the cluster.
>
> A distributed filesystem is also much more complex, which is why
> distributed block devices are so appealing :)
>
> With a redundant, distributed filesystem, you simply do not need any
> complexity at all at the block device level. You don't even need RAID.
>
> It is my hope that you will put your skills towards a distributed
> filesystem :) Of the current solutions, GFS (currently in kernel)
> scales poorly, and NFS v4.1 is amazingly bloated and overly complex.
>
> I've been waiting for years for a smart person to come along and write a
> POSIX-only distributed filesystem.

Well, originally (about half a year ago) I started to draft a generic
filesystem which would be just superior to existing designs, not
overbloated like zfs, and just faster.
I do believe it can be implemented.
Further I added network capabilities (since what I saw that time
(AFS was proposed) I did not like - I'm not saying it is bad or
something like that at all, but I would implement things differently)
into design drafts.

When Chris Mason announced btrfs, I found that quite a few new ideas
are already implemented there, so I postponed project (although
direction of the development of the btrfs seems to move to the zfs side
with some questionable imho points, so I think I can jump to the wagon
of new filesystems right now).

DST is low level for my (theoretical so far) filesystem (actually its
network part) like kevent was a low level system for network AIO (originally).

No matter what filesystem works with network it implements some kind
of logic completed in DST.
Sometimes it is very simple, sometimes a bit more complex, but
eventually it is a network entity with parts of stuff I put into DST.
Since I postponed the project (looking at btrfs and its results), I
completed DST as a standalone block device.

So, essentially, a filesystem with simple distributed facilities is on
(my) radar, but so far you are first who requested it :)

--
Evgeniy Polyakov
-


From: Andreas Dilger <adilger@...>
Subject: Re: Distributed storage. Move away from char device ioctls.
Date: Sep 15, 1:24 pm 2007

On Sep 15, 2007 16:29 +0400, Evgeniy Polyakov wrote:
> Yes, block device itself is not able to scale well, but it is the place
> for redundancy, since filesystem will just fail if underlying device
> does not work correctly and FS actually does not know about where it
> should place redundancy bits - it might happen to be the same broken
> disk, so I created a low-level device which distribute requests itself.

I actually think there is a place for this - and improvements are
definitely welcome. Even Lustre needs block-device level redundancy
currently, though we will be working to make Lustre-level redundancy
available in the future (the problem is WAY harder than it seems at
first glance, if you allow writeback caches at the clients and servers).

> When Chris Mason announced btrfs, I found that quite a few new ideas
> are already implemented there, so I postponed project (although
> direction of the development of the btrfs seems to move to the zfs side
> with some questionable imho points, so I think I can jump to the wagon
> of new filesystems right now).

This is an area I'm always a bit sad about in OSS development - the need
everyone has to make a new {fs, editor, gui, etc} themselves instead of
spending more time improving the work we already have. Imagine where the
internet would be (or not) if there were 50 different network protocols
instead of TCP/IP? If you don't like some things about btrfs, maybe you
can fix them?

To be honest, developing a new filesystem that is actually widely useful
and used is a very time consuming task (see Reiserfs and Reiser4). It
takes many years before the code is reliable enough for people to trust it,
so most likely any effort you put into this would be wasted unless you can
come up with something that is dramatically better than something existing.

The part that bothers me is that this same effort could have been used to
improve something that more people would use (btrfs in this case). Of
course, sometimes the new code is substantially better than what currently
exists, and I think btrfs may have laid claim to the current generation of
filesystems.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-


From: Evgeniy Polyakov <johnpol@...>
Subject: Re: Distributed storage. Move away from char device ioctls.
Date: Sep 16, 9:43 am 2007

On Sat, Sep 15, 2007 at 11:24:46AM -0600, Andreas Dilger (adilger@clusterfs.com) wrote:
> > When Chris Mason announced btrfs, I found that quite a few new ideas
> > are already implemented there, so I postponed project (although
> > direction of the development of the btrfs seems to move to the zfs side
> > with some questionable imho points, so I think I can jump to the wagon
> > of new filesystems right now).
>
> This is an area I'm always a bit sad about in OSS development - the need
> everyone has to make a new {fs, editor, gui, etc} themselves instead of
> spending more time improving the work we already have. Imagine where the

If that would be true, we would be still in the stone age.
Or not, actually I think the first cell in the universe would not bother
itself dividing into the two just because it could spent infinite time
trying to make itself better.

> internet would be (or not) if there were 50 different network protocols
> instead of TCP/IP? If you don't like some things about btrfs, maybe you
> can fix them?

When some idea is implemented it is virtually impossible to change it,
only recreate new one with fixed issues. So, we have multiple ext,
reiser and many others. I do not say btrfs is broken or has design
problems, it is really interesting filesystem, but all we have our own
opinions about how things should be done, that's it.

Btw, we do have so many network protocols for different purposes, that
number of (storage) filesystems is negligibly small compared to it.
Internet as is popular today is just a subset of where network is used.

And we do invent new protocols each time we need something new, which
does not fit into existing models (for example TCP by design can not
work with very long-distance links with tooo long RTT). We have sctp to
fix some tcp issues. Number of IP layer 'neighbours' is even more.
Physical media layer has many different protocols too.
And that is just what exists in the linux tree...

> To be honest, developing a new filesystem that is actually widely useful
> and used is a very time consuming task (see Reiserfs and Reiser4). It
> takes many years before the code is reliable enough for people to trust it,
> so most likely any effort you put into this would be wasted unless you can
> come up with something that is dramatically better than something existing.

Yep, I know.
Wasting my time is one of the most pleasant things I ever tried in my life.

> The part that bothers me is that this same effort could have been used to
> improve something that more people would use (btrfs in this case). Of
> course, sometimes the new code is substantially better than what currently
> exists, and I think btrfs may have laid claim to the current generation of
> filesystems.

Call me greedy bastard, but I do not care about world happiness, it is
just impossible to achieve. So I like what I do right now.
If it will be rest under the layer of dust I do not care, I like the
process of creating, so if it will fail, I just will get new knowledge.

:)

> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.

--
Evgeniy Polyakov
-


When Chris Mason announced

Anonymous (not verified)
on September 18, 2007 - 7:11am

When Chris Mason announced btrfs, I found that quite a few new ideas are already implemented there, so I postponed project (although direction of the development of the btrfs seems to move to the zfs side with some questionable imho points, so I think I can jump to the wagon of new filesystems right now).

It would be interesting to hear Chris's actual criticism towards btrfs (and ZFS); it's definitely not too late to change anything in btrfs yet.

GlusterFS

Vikas Gorur (not verified)
on September 18, 2007 - 4:23pm

An alternative to Lustre (if you want a "POSIX-only distributed filesystem") is GlusterFS.

http://gluster.org/docs/index.php/GlusterFS

(Disclosure: I'm one of the developers)

I respect that you're one of

Anonymous (not verified)
on September 18, 2007 - 10:00pm

I respect that you're one of the GlusterFS developers, but I can't personally recommend that FUSE filesystem yet. It doesn't handle much of a load, bugs creep in very easily with every update, the protocol is severely bloated, and it is seriously unstable.

Glusterfs is a great idea, but the implementation currently seems brittle.
