login
Header Space

 
 

Re: POHMELFS high performance network filesystem. Transactions, failover, performance.

Previous thread: [BUG] remount ro on loopback mount leaves unmountable filesystem behind by David Chinner on Tuesday, May 13, 2008 - 4:50 am. (6 messages)

Next thread: dfs path construction fix - send dfs paths on all path based operations on share in dfs by Steve French on Tuesday, May 13, 2008 - 6:45 pm. (3 messages)
To: Evgeniy Polyakov <johnpol@...>
Cc: <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Tuesday, May 13, 2008 - 3:09 pm

This continues to be a neat and interesting project :)

Where is the best place to look at client&lt;-&gt;server protocol?

Are you planning to support the case where the server filesystem dataset 
does not fit entirely on one server?

What is your opinion of the Paxos algorithm?

	Jeff



--
To: Jeff Garzik <jeff@...>
Cc: <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Tuesday, May 13, 2008 - 4:51 pm

Hi.



Hmm, in sources I think, I need to kick myself to write a somewhat good
spec for the next release.

Basically protocol contains of fixed sized header (struct netfs_cmd) and
attached data, which size is embedded into above header. Simple commands
are finished here (essentially all except write/create commands), you
can check them in approrpiate address space/inode operations.
Transactions follow netlink (which is very ugly but exceptionally
extendible) protocol: there is main header (above structure), which
holds size of the embedded data, which can be dereferenced as header/data
parts, where each inner header corresponds to any command (except
transaction header). So one can pack (upto 90 pages of data or different
commands on x86, this is limit of the page size devoted to headers)
requested number of commands into single 'frame' and submit it to
system, which will care about atomicity of that request in regards of

Sure. First by allowing whole object to be placed on different servers
(i.e. one subdir is on server1 and another on server2), probably in the
future there will be added support for the same object being distributed
to different servers (i.e. half of the big file on server1 and another

It is slow. But it does solve failure cases.
So far POHMELFS does not work as distributed filesystem, so it should
not care about it at all, i.e. at most in the very nearest future it
will just have number of acceptors in paxos terminology (metadata
servers in others) without need for active dynamical reconfiguration,
so protocol will be greatly reduced, with addition of dynamical
metadata cluster extension protocol will have to be extended.

As practice shows, the smaller and simpler initial steps are, the better
results eventually become :)

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 9:35 am

For writes, Paxos is actually more or less optimal (in the non-failure 
cases, at least).  Reads are trickier, but there are ways to keep that 
fast as well.  FWIW, Ceph extends basic Paxos with a leasing mechanism to 
keep reads fast, consistent, and distributed.  It's only used for cluster 
state, though, not file data.

I think the larger issue with Paxos is that I've yet to meet anyone who 
wants their data replicated 3 ways (this despite newfangled 1TB+ disks not 
having enough bandwidth to actualy _use_ the data they store).  
Similarly, if only 1 out of 3 replicas is surviving, most people want to 
be able to read their data, while Paxos demands a majority to ensure it is 
correct.  (This is why Paxos is typically used only for critical cluster 
configuration/state, not regular data.)

sage
--
To: Sage Weil <sage@...>
Cc: Evgeniy Polyakov <johnpol@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 2:24 pm

I've seen clusters in the field that planned for this -- they don't want 

This isn't necessarily true -- it's quite easy for most applications to 
come up with an alternate method for ensuring correctness of retrieved 
data, if one assumes Paxos consensus was achieved during the write-data 
phase earlier in time.  Checksumming is a common solution, but not the 
only one.  Domain- or app-specific solution, as noted, of course.


Yep, I'm working on a config daemon a la Chubby or zookeeper, based on 
Paxos, that does just this.  :)

	Jeff


--
To: Jeff Garzik <jeff@...>
Cc: Evgeniy Polyakov <johnpol@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 4:00 pm

You mean if, say, some verifiable metadata or a trusted third party stores 
that checksum?  Sure.  This is just pushing the what-has-committed 
information to some other party, though, who will presumably face the same 
problem of requiring a majority for verifiable correctness.  This is more 
or less what most people do in practice... using Paxos for critical state 

Cool.  Do you have a URL?  I'd be interested in seeing how you diverge 
from classic paxos.  For Ceph's monitor daemon, the main requirements 
(besides strict correctness guarantees) were scalable (distributed) read 
access, and a history of state changes.  Nothing too unusual.

sage
--
To: Sage Weil <sage@...>
Cc: Evgeniy Polyakov <johnpol@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 5:49 pm

More like receiving a guarantee of consensus (just like any signature on 

Is there a URL?  Yes.  http://linux.yyz.us/projects/cld.html

It it useful?  No.  It's just a skeleton code right now.  I am 
experimenting with various Paxos algorithms as we speak, which is why 
it's fresh in my mind at the moment.

I also forgot to mention hyperspace, which is another up-and-coming 
player in this area, alongside Chubby and zookeeper.

	Jeff



--
To: Jeff Garzik <jeff@...>
Cc: Evgeniy Polyakov <johnpol@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 6:26 pm

It's the 'single node' part that concerns me.  As long as that node is 
ensuring there is consensus behind the scenes before handing out said 
signature.  Otherwise you can't be sure you're not getting an old 
signature for old data..

This is more or less what I ended up doing.  Since the workload is 
mostly-read, the paxos leader gives non-leaders leases to process reads in 
parallel, and new elections or state changes wait if necessary to ensure 
old leases are revoked or expire before any new leases on new values are 
issued.

sage
--
To: Sage Weil <sage@...>
Cc: Jeff Garzik <jeff@...>, Evgeniy Polyakov <johnpol@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 6:35 pm

My design has something similar, but I think more like MOESI protocol
analogous to CPUs.  There are read leases which are revoked by
explicit confirmation or waiting for them to expire, but that's only
required when serialisation forces a particular access order, and it
can be speculated around.  Like MOESI, it adapts between mostly-read
and mostly-write workloads.

-- Jamie
--
To: Sage Weil <sage@...>
Cc: Evgeniy Polyakov <johnpol@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 10:09 am

For critical metadata which is needed to access a lot of data, it's
done: even ext3 replicates superblocks.

These days there are content and search indexes, and journals.  They
aren't replication but are related in some ways since parts of the
data are duplicated and voting protocols can feed into that.

There's also RAID6 and similar parity/coding.  The data is not fully
replicated, saving space, but the coordination is similar to N&gt;=3 way
replication.  Now apply that over a network.  Or even local disks, if

(Generalising to any "quorum" (majority vote) protocol).

That's true if you require that all results are guaranteed consistent
or blocked, in the event of any kind of network failure.

But if you prefer incoherent results in the event of a network split
(and those are often mergable later), and only want to protect against
media/node failures to the best extent possible at any given time,
then quorum protocols can gracefully degrade so you still have access
without a majority of working nodes.

That is a very useful property.  (I think it more closely mimics the
way some human organisations work too: we try to coordinate, but when
communications are down, we do the best we can and sync up later.)

In that model, neighbour sensing is used to find the largest coherency
domains fitting a set of parameters (such as "replicate datum X to N
nodes with maximum comms latency T").  If the parameters are able to
be met, quorum gives you the desired robustness in the event of
node/network failures.  During any time while the coherency parameters
cannot be met, the robustness reduces to the best it can do
temporarily, and recovers when possible later.  As a bonus, you have
some timing guarantees if they are more important.

This is pretty much the same as RAID durability.  You have robustness
against failures, still have access in the event of disk failures, and
degraded robustness (and performance) temporarily while awaiting a new
disk and resynchronising it.

-- Jamie
-...
To: Jamie Lokier <jamie@...>
Cc: Evgeniy Polyakov <johnpol@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 12:09 pm

Right.  In my case, I require guaranteed consistent results for critical 
cluster state, and use (slightly modified) Paxos for that.  For file data, 
I leverage that cluster state to still maintain perfect consistency in 
most failure scenarios, while also degrading gracefully to a read/write 
access to a single replica.

When problem situations arise (e.g., replicating to A+B, A fails, 
read/write to just B for a while, B fails, A recovers), an administrator 
can step in and explicitly indicate we want to relax consistency to 
continue (e.g., if B is found to be unsalvageable and a stale A is the 

Anything that silently relaxes consistency like that scares me.  Does 
anybody really do that in practice?

sage
--
To: Sage Weil <sage@...>
Cc: Evgeniy Polyakov <johnpol@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 5:19 pm

I'm doing it on a 2000 node system across a country.  There are so
many links down at any given time, we have to handle long stretches of
inconsistency, and have strategies for merging local changes when
possible to reduce manual overhead.  But we like opportunistic
consistency so that people at site A can phone people at site B and
view/change the same things in real time if a path between them is up
and fast enough (great for support and demos), otherwise their actions
are queued or refused depending on policy.

It makes sense to configure which data and/or operations require
global consistency or block, and which data it's ok to modify locally
and merge automatically in a netsplit scenario.  Think DVCS during
splits and coherent when possible.

E.g. as a filesystem, during netsplits you might configure the system
to allow changes to /home/* locally if global coherency is down.  If
all changes (or generally, transaction traces) to /home/user1 are in
just one coherent subgroup, on recovery they can be distributed
silently to the others, unaffected by changes to /home/user2
elsewhere.  But if multiple separated coherent subgroups all change
/home/user1, recovery might be configured to flag them as conflicts,
queue them for manual inspection, and maybe have a policy for the
values used until a person gets involved.

Or instead of paths you might distinguish on user ids, or by explicit
flags in requests (you should really allow that anyway).  Or by
tracing causal relationships requiring programs to follow some rules
(see "virtual synchrony"; the rule is "don't depend on hidden
communications").

That's a policy choice, but in some systems, typically those with many
nodes and fluctuating communications, it's really worth it.  It
increases some kinds of robustness, at cost of others.

-- Jamie
--
To: Sage Weil <sage@...>
Cc: Jamie Lokier <jamie@...>, Evgeniy Polyakov <johnpol@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 3:11 pm

Well, there's Amazon Dynamo, a distributed system that places most 
importance on writes succeeding, if inconsistent.  They choose to relax 
consistency up front, and on the backend absorb the cost of merging 
multiple versions of objects:

http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
(full paper)

	Jeff



--
To: Sage Weil <sage@...>
Cc: Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 9:52 am

Hi Sage.


Well, it depends... If we are talking about single node perfromance,
then any protocol, which requries to wait for authorization (or any
approach, which waits for acknowledge just after data was sent) is slow.

If we are talking about agregate parallel perfromance, then its basic
protocol with 2 messages is (probably) optimal, but still I'm not

I.e. having more than single node to be failed? Google uses 3-way
replication, but I can not see any factor, which will force people from
lowering failure recovering expectations.

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 3:03 pm

Quite true, but IMO single-node performance is largely an academic 
exercise today.  What production system is run without backups or 

I think part of Paxos' attraction is that it is provably correct for the 
chosen goal, which historically has not been true for hand-rolled 
consensus algorithms often found these days.

There are a bunch of variants (fast paxos, byzantine paxos, fast 
byzantine paxos, etc., etc.) based on Classical Paxos which make 
improvements in the performance/latency areas.  There is even a Paxos 
Commit which appears to be more efficient than the standard transaction 
two-phase commit used by several existing clustered databases.

	Jeff



--
To: Jeff Garzik <jeff@...>
Cc: Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 3:38 pm

If cluster is made out of 2-3-4-10 machines, it does want to get maximum
single node performance. But I agree that in some cases we have to
sacrifice of something in order to find something new. And the larger
cluster becomes, for more things we can close eyes on.
 
-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: Jeff Garzik <jeff@...>, Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 5:57 pm

With the right topology and hardware, you can get _faster_ than single
node performance with as many nodes as you like, except when there is
a node/link failure and the network pauses briefly to reorganise - and
even that is solvable.

Consider:

    Client &lt;-&gt; A &lt;-&gt; B &lt;-&gt; C &lt;-&gt; D

A to D are servers.  &lt;-&gt; are independent network links.  Each server
has hardware which can forward a packet at the same time it's being
received like the best switches (wormhole routing), while performing
minor transformations on it (I did say the right hardware ;-)

Client sends a request message.  It is forwarded along the whole
chain, and reaches D with just a few microseconds of delay compared
with A.

All servers process the message, and produce a response in about the
same time.  However, (think of RAID) they don't all process all data
in the message, just part they are responsible for, so they might do
it faster than a single node would processing the whole message.

The aggregate response is a function of all of them.  D sends its
response.  C forwards that packet while modifying the answer to
include its own response.  B, A do the same.  The answer at Client
arrives just a few microseconds later than it would have with just a
single server.

If desired, arrange it in a tree to reduce even the microseconds.

Such network hardware is quite feasible, indeed quite easy with an
FPGA based NIC.

Enjoy the speed :-)

-- Jamie
--
To: Jamie Lokier <jamie@...>
Cc: Jeff Garzik <jeff@...>, Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 6:32 pm

And if client-server link is fully saturated by messages
we do not win :) We also lose if client-server is slower than
server-server... I completely agree that there are cases where each
approach is more beneficial, and likely client-to-many os better in
terms of management and/or failover, but for speed there is always a
different side of the coin.

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>, Jamie Lokier <jamie@...>, Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 6:06 pm

This is the core reason why I am so interested in distributed storage... 
  a single storage device is usually slower than network wire speed. 
Multiple nodes helps remove that limitation and max out the network.

I want to be able to stream data _faster_ than a single hard drive can 
handle :)

	Jeff



--
To: Jeff Garzik <jeff@...>
Cc: Jamie Lokier <jamie@...>, Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 6:41 pm

In that case yes, network will _not_ be saturated and multiple
simultaneous streams will have a win, but POHMELFS client design was
specially created to increase network performance as much as possible,
since we can increase storage speed (add more drives, more RAM
for caches, better hardware), but can not easily increase network
bandwidth.

But, as was already noted, even being network bound, client-to-many
is likely a better solution from other points of view (like management
and/ro failure cases).

-- 
	Evgeniy Polyakov
--
To: Jeff Garzik <jeff@...>
Cc: Jamie Lokier <jamie@...>, Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 6:50 pm

Actually experiments with async processing in POHMELFS lead to the
conclusion, that if protocol waits until request is completed and does
not proceed with the next one (like CIFS and somewhat NFS), such design
does not scale to multiple parallel IOs.
That from different angle shows benefits of caching and aiming at
getting as much performance as possible from single node connection :)

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: Sage Weil <sage@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 10:31 am

Look up "one-phase commit" or even "zero-phase commit".  (The
terminology is cheating a bit.)  As I've understood it, all commit
protocols have a step where each node guarantees it can commit if
asked and node failure at that point does not invalidate the guarantee
if the node recovers (if it can't maintain the guarantee, the node
doesn't recover in a technical sense and a higher level protocol may
reintegrate the node).  One/zero-phase commit extends that to
guaranteeing a certain amounts and types of data can be written before
it knows what the data is, so write messages within that window are
sufficient for global commits.  Guarantees can be acquired
asynchronously in advance of need, and can have time and other limits.
These guarantees are no different in principle from the 1-bit
guarantee offered by the "can you commit" phase of other commit
protocols, so they aren't as weak as they seem.

Now combine it with a quorum protocol like Paxos, you can commit with
async guarantees from a subset of nodes.  Guarantees can be
piggybacked on earlier requests.  There, single node write
performance with quorum robustness.

-- Jamie
--
To: Evgeniy Polyakov <johnpol@...>, Sage Weil <sage@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 3:05 pm

For several common Paxos usages, you can obtain consensus guarantees 
well in advance of actually needing that guarantee, making the entire 
process quite a bit more async and parallel.

Sort of a "write ahead" for consensus.

	Jeff



--
To: Jeff Garzik <jeff@...>
Cc: Evgeniy Polyakov <johnpol@...>, Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 5:38 pm

That's a lovely concise summary.

It seems all the classical texts on two-phase commit have made it
over-complicated all along.  "write ahead consensus" is both faster
and simpler in many respects.

-- Jamie
--
To: Jamie Lokier <jamie@...>
Cc: Sage Weil <sage@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 11:00 am

If I understood that, client has to connect to all servers and send data
there, so that after single reply things got committed. That is
definitely not the issue, when there are lots of servers.

That can be the case if client connects to some gate server, which in
turn broadcasts data further, that is how I plan to implement things at
first.

Another approach, which seems also intersting is leader election (per
client), so that leader would broadcast all the data.

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: Sage Weil <sage@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 5:32 pm

Look up Bittorrent, and bandwidth diffusion generally.  Also look up
multicast trees.

Sometimes it's faster for a client to send to many servers; sometimes
it's faster to send fewer and have them relayed by intermediaries -
because every packet takes time to transmit, and network topologies
aren't always homogenous or symmetric.


Leader election is part of standard Paxos too :-)

If you have a single data forwarder elected per client, then if one
client generates a lot of traffic, you concentrate a lot of traffic to
one network link and one CPU.  Sometimes it's better to elect several
leaders per client, and hash requests onto them.  You diffuse CPU and
traffic, but reduce opportunities to aggregate transactions into fewer
message.  It's an interesting problem, again probably with different
optimal results for different networks.

-- Jamie
--
To: Jamie Lokier <jamie@...>
Cc: Sage Weil <sage@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 6:02 pm

Yep, having multiple connections is worse for high-performance networks
and is a great win for long latency links. If client to server
connection is slower than server-server one, than having single gate
server, which broadcasts data to others is a win over multiple
connection to different servers. But if communication is roughly the
same over all links, than... I think I agree that client-to-many is a
better approach, since perormance of client-to-many will be the same as
client-to-gate-to-others (since link is the same everywhere), but if
gate server fails, reconnection and other management tasks introduce
huge latency here (new gate server, new connection and so on), while


Probably idea I described in other mail to Jeff, when client just
connects to number of servers and can process command of adding/dropping
server from that group, and balances reading between them and sends
writes/metadata update to all of them, and all logic behind that group
selection is hidded in the servers cloud, is the best choice...

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: Sage Weil <sage@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 6:28 pm

Not just long latency.  If you have a low latency link which is very
busy, perhaps a client doing lots of requests, or doing other things,

I think that's a fine choice, but it doesn't solve difficult
problems.  You still have to implement the server cloud. :-)

It's possible that implementing server cloud protocol _and_ simple
client protocol may be more work than just server cloud protocol.  I'm
not sure.  Thoughts welcome.

-- Jamie
--
To: Jamie Lokier <jamie@...>
Cc: Sage Weil <sage@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 6:45 pm

If that would not be dificult problem, it would not be interesting at

Well, getting that client protocol is mostly ready, and its design
allows infinite (blah!) extensions and extremely (blah!) flexible
processing, we are close to just difficult server one :)

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: Sage Weil <sage@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 9:10 pm

That's what I thought when I had my system working fine with just one
server.  The client was very simple. :-)

Since, I learned that my clients need to have parts of the complex
server protocol for fast, safe transactions (think ACID (or ACI)) over
relatively slow links, especially with multiple servers.

Also, efficiently recovering from a link/server failure, when clients
have large zero-latency caches (using leases), appears similar to the
synchronising protocol between recovering servers.

But, on the bright side, these things are only necessary for
performance in scenarios you might not encounter or care about :-)

I'm finding it's a really interesting but large problem.

-- Jamie
--
To: Jamie Lokier <jamie@...>
Cc: Sage Weil <sage@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Thursday, May 15, 2008 - 3:34 am

That's a part of the 'simple' client protocol already, there are
transactions, which are only committed completed, when server replies
that they are, there is also failover reconnection and timeout detection
features as long as switching to different servers in case of failure.

Yes, it is a bit more than 'simple' protocol, but I think that's what it

Yeah, it is far from 'small' problem :)
Really simple protocol was in the first version, and it was also fast,
but yes, it was rather miserable from failure point of view.

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>, Sage Weil <sage@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 5:37 pm

Definitely.  "several leaders" aka partitioning is also becoming 
increasing paired with efforts at enhancing locality of reference.  Both 
Google and Amazon sort their distributed tables lexographically, which 
[ideally] results in similar data being stored near each other.

A bit of an improvement over partitioning-by-hash, anyway, for some 
workloads.

	Jeff


--
To: Jeff Garzik <jeff@...>
Cc: Evgeniy Polyakov <johnpol@...>, Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 5:43 pm

As with B-trees on disks, and in-memory structures, application
knowledge of locality is very much worth passing to the storage layer.

-- Jamie
--
To: Evgeniy Polyakov <johnpol@...>
Cc: Jamie Lokier <jamie@...>, Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 3:08 pm

That means you are less optimal than the direct-to-storage-server path 
in NFSv4.1, then........

&lt;waves red flag in front of the bull&gt;

If access controls permit, the ideal would be for the client to avoid an 
intermediary when storing data.  The client only _needs_ a consensus 
reply that their transaction was committed.  They don't necessarily need 
an intermediary to do the boring data transfer work.

	Jeff



--
To: Jeff Garzik <jeff@...>
Cc: Jamie Lokier <jamie@...>, Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 3:32 pm

No, server to connect is the server, which stores data. By addition it
will also store it to some other places according to distributed

Sure the less number of machines between client and storage we have, the
faster and more robust we are.

Either client has to write data to all servers, or it has to write it to
one and wait utill that server will broadcast it further (to quorum or any
number of machines it wants). Having pure client to think to what
servers it has to put its data is a bit wrong (if not saying more),
since it has to join not only data network, but also control one, to
check that some servers are alive or not, to be able not to race, when
server is recovering and so on...

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: Jamie Lokier <jamie@...>, Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 4:37 pm

[Empty message]
To: Jeff Garzik <jeff@...>
Cc: Jamie Lokier <jamie@...>, Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 5:19 pm

Well, that's how things exist today - POHMELFS client connects to number
of servers and can send data to all of them (currently it doest that for
only 'active' server, i.e. that which was not failed, but that can be
trivially changed). It should be extended to receive 'add/remove server
to the group' command and liekly that's all (modulo other todo items
which are not yet resolved). Then that group becomes quorum and client
has to get response from them. Kind of that...

What I do not like, is putting lots of logic into client, like following
inner server state changes (sync/not sync, quorum election and so on).
With above dumb scheme it should not, but some other magic in the server
land will tell client with whom to start working.


-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: Jamie Lokier <jamie@...>, Sage Weil <sage@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 5:34 pm

The client need not (and should not) worry about quorum, elections or 
server cloud state management.  The client need only support these 
basics: some method of read balancing, parallel data writes, and a 
method to retrieve a list of active servers.

The server cloud and/or cluster management can handle the rest, 
including telling the client if the transaction failed or succeeded (as 
it must), or if it should store to additional replicas before the 
transaction may proceed.

	Jeff



--
To: Evgeniy Polyakov <johnpol@...>
Cc: Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Tuesday, May 13, 2008 - 8:52 pm

I feel you have glossed over the more difficult parts of transactions

Yours does sound a very interesting project.  Do you know how it
compares with NFSv4 for performance?  I think that has some similar

I think you are right.  I am struggling with the opposite approach
(too big steps, trying to be too clever with algorithms) on a similar
project!  That said, I did try simpler steps earlier, and it worked
but showed a lot of tricky problems.

Fwiw, I've been working on what started as a distributed database that
is coming to be a filesystem too.  It has many qualities of both,
hopefully the best ones.  I'm aiming for high LAN file performance
similar to what you report with POHMELFS and would expect from any
modern fs, while also supporting database style transactions and
coherent queries, in a self-organising distributed system that handles
LAN/WAN/Internet each at their best.  Mention of Paxos stirred me to
reply - a relative of that is in there somewhere.  I have a long way
to go before a release.

If anyone is working on something similar, I would be delighted to
hear from them.

It scares me that I'm actually trying to do this.  But very exciting
it is too.

It seems there's quite a bit of interesting work on Linux in this area
right now, with BTRFS and CRFS too.

-- Jamie
--
To: Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 3:57 am

NFSv4 does not use similar caching scheme, but it has interesting
batching abilities for bulk data transfer. CRFS was originally a source
of inspiration for this project (before it was opened, we had some talk
with Zach Brown, so I decided that it worth deeper investigation and
started this FS). CRFS performance is also very good, but the fact, that

The more we develop, more problems arises, so it is possible (and I had
such situation), when very complex, but solvable problem, during
development translates into multiple problems of the same complexity,
which takes more and more time... Essentially this can be solved, until
something is dropped and added when other things are completed.

For example there is a really interesting lockless algorithm for storing
path cache in the POHMELFS, but implementation is really complex, so I


I belive that (only block?) FS which exports its structure in database
accessible way, i.e. ability to search objects not only by name key and
assign that new keys similar to how it is done in database, but not via
assign_xattr(search(name)), is a very interesting and useful approach.
Also the more I follow general purpose fs developments, the more I
become sure that they (general purpose) will never be the best on any
given workload, and instead special purpose (like databasefs or

Scares problems which we can not solve, this one kind of increases

Yeah, probably this is time to move further in given area, so lots of
interesting new developments.

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>, Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Tuesday, May 13, 2008 - 9:16 pm

Hi,


I am currently working on mysqlfs which is a fuse fs which can be used 
in conjunction with mysql-ndb cluster.

You can find the details here: http://sourceforge.net/projects/mysqlfs/
and a howto (in german, though) here: 
http://www.netz-guru.de/2008/04/03/mysqlfs-mit-mysql-ndb-cluster-als-verteiltes-dateis...

It is working quite well, but still lacks of caching which makes it slow
if your connection between the DB-servers have high latency/many hops.

I don't know BTRFS nor CRFS or POHMELFS but i will take a look at them.

Hope you'll find that usefull.


--
Florian Wiessner

--
To: Florian Wiessner <ich@...>
Cc: Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>, <linux-fsdevel@...>
Date: Wednesday, May 14, 2008 - 4:10 am

Hi.


Did FUSE start to make a fiendship with performance? Last time I saw it,
they hated each other...

Caching actually useful not only on slow, but also very fast links
because of its ability to batch data and greatly reduce latencies of
reply-request protocols, which in turn (for that protocols) greatly
increases performance. If you are using async processing (like POHMELFS,
iirc it is the only such approach in networked fs, cifs/smbfs and others
wait after request is sent and only then proceed with the next one) that
will allow to drain the cache very quickly and proceed with the next
data set.

-- 
	Evgeniy Polyakov
--
Previous thread: [BUG] remount ro on loopback mount leaves unmountable filesystem behind by David Chinner on Tuesday, May 13, 2008 - 4:50 am. (6 messages)

Next thread: dfs path construction fix - send dfs paths on all path based operations on share in dfs by Steve French on Tuesday, May 13, 2008 - 6:45 pm. (3 messages)
speck-geostationary