Hi.
I'm pleased to announce the POHMEL high performance network parallel
distributed filesystem.

POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.
Development status can be tracked in the filesystem section [1].

This is a high performance network filesystem with a local coherent cache of data
and metadata. Its main goal is distributed parallel processing of data.

This release brings the following features:
* Read requests (data read, directory listing, lookup requests) balancing
between multiple servers.
* Write requests are sent to multiple servers and completed only
when all of them sent an ack.
* Ability to add and/or remove servers from the working set at run-time from
userspace (via netlink; the same command could also be processed from the
real network, but since the server does not support that yet,
I dropped the network part).
* Documentation (overall view and protocol commands)!
* Rename command (oops, forgot it in previous releases :)
* Several new mount options to control client behaviour instead of
hardcoded numbers.
* Bug fixes.

Very likely this is one of the last non-bug-fixing releases of the kernel
client side; the next release will incorporate features needed for distributed
parallel data processing (like the ability to add new servers via a network
command from other servers), so most of the work will be devoted to server
code.

Basic POHMELFS features:
* Local coherent (notes [2]) cache for data and metadata.
* Completely async processing of all events (hard and symlinks are the only
exceptions) including object creation and data reading/writing.
* Flexible object architecture optimized for network processing. Ability to
create long paths to objects and remove arbitrarily huge directories in a
single network command.
* High performance is one of the main design goals.
* Very fast and scalable multithreaded userspace server. Being in userspace
it works with any underlying filesystem and still is much faster than
asyn...
Neat :) Thanks for protocol documentation, too. Do you plan to add
write-pages in addition to write-page? Also, write-page does not appear
to be documented.

Is rename-across-directories race-free? That is a sticky area, see
Documentation/filesystems/directory-locking in particular.

With the exception of encryption, do you think the POHMELFS client is
mostly complete, at this point?

Jeff
--
->writepage() is not needed at all (it does not even exist anymore :),
POHMELFS relies on VFS to handle those cases, it does not invent its own stuff here.
I think I will extend its command structure to support checksum (i.e.
add a 64bit field unused for now), all other protocol changes are supposed
to be on the highest level (like new commands), so it should not hurt others.

I have to think about locking (file locks on server, not POHMELFS internal
locking :) some more, but so far I do not see how it can change the picture.

Another task is to move from slab allocation (kmalloc and friends) to
memory pools, like it was done for transaction destinations.

I do not plan serious changes in client (I frankly do not know what
else I want there :), so, yes, I think that most of the client side is ready.

--
Evgeniy Polyakov
--
POHMELFS client code.
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
diff --git a/fs/Kconfig b/fs/Kconfig
index c509123..59935cd 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1566,6 +1566,8 @@ menuconfig NETWORK_FILESYSTEMS

if NETWORK_FILESYSTEMS
+source "fs/pohmelfs/Kconfig"
+
config NFS_FS
tristate "NFS file system support"
depends on INET
diff --git a/fs/Makefile b/fs/Makefile
index 1e7a11b..6ce6a35 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -119,3 +119,4 @@ obj-$(CONFIG_HPPFS) += hppfs/
obj-$(CONFIG_DEBUG_FS) += debugfs/
obj-$(CONFIG_OCFS2_FS) += ocfs2/
obj-$(CONFIG_GFS2_FS) += gfs2/
+obj-$(CONFIG_POHMELFS) += pohmelfs/
diff --git a/fs/pohmelfs/Kconfig b/fs/pohmelfs/Kconfig
new file mode 100644
index 0000000..5178514
--- /dev/null
+++ b/fs/pohmelfs/Kconfig
@@ -0,0 +1,26 @@
+config POHMELFS
+ tristate "POHMELFS filesystem support"
+ select CONNECTOR
+ help
+ POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.
+ This is a network filesystem which supports coherent caching of data and metadata
+ on clients.
+
+config POHMELFS_DEBUG
+ bool "POHMELFS debugging"
+ depends on POHMELFS
+ default n
+ help
+ Turns on excessive POHMELFS debugging facilities.
+ You usually do not want to slow things down noticebly and get really lots of kernel
+ messages in syslog.
+
+config POHMELFS_CC_GROUP
+ bool "POHMELFS cache coherency protocol"
+ depends on POHMELFS
+ default y
+ help
+ This allows to broadcast data and metadata cache coherency messages between clients.
+ Usually you want this facility, although without locking you can get different from
+ POSIX expectation behaviour. For more details check POHMELFS homepage and development
+ section.
diff --git a/fs/pohmelfs/Makefile b/fs/pohmelfs/Makefile
new file mode 100644
index 0000000..aa415a3
--- /dev/null
+++ b/fs/pohmelfs/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_POHMELFS) += pohmelfs.o
+
+pohmelfs-y := inode.o confi...
Hi,
I have just one question yet :-)
I'm having a hard time convincing myself that the error handling here
is correct. You have this kind of setup:

1. for each config in config list {
2.     for each config in superblock state list {
           pohmelfs_config_eql();
           ...
       }
   }

And according to your code, if pohmelfs_config_eql returns non-zero in
the last iteration of #1, then -EEXIST will be the return value of
the whole function (but the config _will_ be copied; it is not undone
in this case). But if pohmelfs_config_eql returns non-zero in any but
the last iteration of #1, then 0 will be the return value. Is this
your intention?

Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
Hi Vegard.
Task of this function is to copy as many new configs added by user (or
by remote server) as we can.
If config already exists (was copied in previous iterations) we skip it.
If it does not, we allocate new structure and initialize it. If
allocation fails, it is a serious error and it is unlikely we want to proceed,
so we jump out of the loop, drop all states and return error. If we
just failed to initialize new state (like connection was refused by
remote server), we simply drop that failed case and proceed further. In
theory we still can leave those half-initialized states in the list, and
any attempt to send request via them will try to initialize its network
part, but thread creation and allocation itself will be tried to
recover, so I just drop such state here. I think initialization function
should not return error if it failed to connect or create a socket,
since it can/will be recovered later if needed.

We should not return -EEXIST from the non-error label, but at the only
place where this return value is checked (mount time initialization),
superblock list is empty and thus this error can not happen.

--
Evgeniy Polyakov
--
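The copy policy described in the reply above can be modelled in a few lines
of userspace C. This is only a sketch under stated assumptions, not the
POHMELFS code: struct config, struct state_list, config_eql() and
copy_configs() are hypothetical names, a full fixed-size array stands in for
a failed allocation, and the reachable flag stands in for a refused
connection. It shows the intended outcome: duplicates are skipped without an
error, a failed initialization drops only that entry, and an allocation
failure drops everything.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_STATES 8

/* Hypothetical stand-in for one server config entry. */
struct config {
    int addr;       /* pretend server address */
    int reachable;  /* 0 simulates a refused connection at init time */
};

/* Hypothetical stand-in for the per-superblock state list. */
struct state_list {
    struct config states[MAX_STATES];
    size_t count;
};

static int config_eql(const struct config *a, const struct config *b)
{
    return a->addr == b->addr;
}

/*
 * Copy every config that is not yet in the state list.
 * Returns 0 on success and -ENOMEM when no slot is left (this models a
 * failed allocation, after which all states are dropped). Duplicates
 * are skipped and never turn into an error code.
 */
static int copy_configs(struct state_list *sl, const struct config *cfg,
                        size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int exists = 0;

        for (size_t j = 0; j < sl->count; j++) {
            if (config_eql(&cfg[i], &sl->states[j])) {
                exists = 1;
                break;
            }
        }
        if (exists)
            continue;           /* already copied earlier: skip */

        if (sl->count == MAX_STATES) {
            sl->count = 0;      /* serious error: drop all states */
            return -ENOMEM;
        }

        if (!cfg[i].reachable)
            continue;           /* init failed: drop only this entry */

        sl->states[sl->count++] = cfg[i];
    }
    return 0;
}
```

Keeping the "already exists" result in a local flag that is reset on every
outer iteration is what prevents the return value from depending on which
iteration happened to be last.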
Design notes, usage cases and protocol description.
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt
new file mode 100644
index 0000000..c9a9379
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/design_notes.txt
@@ -0,0 +1,61 @@
+POHMELFS: Parallel Optimized Host Message Exchange Layered File System.
+
+ Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+
+Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=pohmelfs
+
+It was first started as network filesystem with coherent local data and metadata caches,
+but it is being evolved into parallel distibuted filesystem now.
+
+Main features of this FS include:
+ * Local coherent (notes cache for data and metadata:
+ http://tservice.net.ru/~s0mbre/blog/devel/fs/2008_05_17.html)
+ * Completely async processing of all events (hard, symlinks and rename are the
+ only exceptions) including object creation and data reading and writing.
+ * Flexible object architecture optimized for network processing.
+ Ability to create long pathes to object and remove arbitrary huge
+ directoris in single network command.
+ (like removing the whole kernel tree via single network command).
+ * Very high performance.
+ * Fast and scalable multithreaded userspace server. Being in userspace it works
+ with any underlying filesystem and still is much faster than async in-kernel NFS one.
+ * Client is able to switch between different servers (if one goes down, client
+ automatically reconnects to second and so on).
+ * Transactions support. Full failover for all operations.
+ Resending transactions to different servers on timeout or error.
+ * Read requests (data read, directory listing, lookup requests) balancing between multiple servers.
+ * Write requests are sent to multiple servers and completed only when all of them sent an ack.
+ * Ability to add and/or remove servers from working set at run-t...
That sounds great, but what do you mean by 'novel'? Don't other
modern network filesystems use asynchronous requests and replies in

By transactions, do you mean an atomic set of writes/changes?
This is extremely cool, and obviously the right thing to do. No sane
network filesystem would be without it, one naively hopes :-)

How is it different from NFSv4 leases and SMB oplocks? Or are they
the same basic idea?

With all those asynchronous requests, are your writeback caches fully
coherent? Example. Client A reads file X (data: x0), then writes X
(new data: x1), then reads Y (data: y0), then writes Y (data: y1).
Client B reads Y then reads X. Is it guaranteed that client B cannot
ever get data y1 and x0? A fully coherent system (meaning behaves
like a local filesystem) does guarantee that. If cache requests for
file X and file Y are independent, this is not guaranteed.

-- Jamie
--
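Jamie's X/Y example can be made concrete with a tiny model. This is a
sketch under assumptions, not a description of how any real client behaves:
struct server_files, read_between_flushes() and violates_write_order() are
hypothetical names, and the model assumes client A's two dirty files are
written back independently, with client B reading Y and then X after the
first writeback and before the second.

```c
#include <stdbool.h>

/* Version numbers on the server: 0 means x0/y0, 1 means x1/y1. */
struct server_files { int x, y; };

/* What client B saw, in the order it read: Y first, then X. */
struct observation { int y, x; };

enum flush_order { FLUSH_X_FIRST, FLUSH_Y_FIRST };

static struct observation read_between_flushes(enum flush_order order)
{
    struct server_files srv = { 0, 0 };
    struct observation seen;

    /* The first of A's two independent writebacks reaches the server. */
    if (order == FLUSH_X_FIRST)
        srv.x = 1;
    else
        srv.y = 1;

    /* Client B reads Y, then X, before A's second writeback lands. */
    seen.y = srv.y;
    seen.x = srv.x;
    return seen;
}

/* True when B saw the later write (y1) without the earlier one (x1). */
static bool violates_write_order(struct observation o)
{
    return o.y == 1 && o.x == 0;
}
```

In this model the forbidden pair (y1, x0) is observable only when Y is
written back before X, which is exactly the case that independent per-file
writeback allows.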
Moreover, that's true :)
I regularly run and post various benchmarks comparing POHMELFS, NFS,
XFS and Ext4, main goal of POHMELFS at this stage is to be
essentially as fast as underlying local filesystem. And it is...
Though there is a single place (random reading, all others reached
FS speed, so it is from 10 to 300% faster than NFS in various loads :),

Maybe it was a bit naive though :)
But I checked lots of implementations, all of them use send()/recv()
approach. NFSv4 uses a bit different one, but it is cryptic, and at least
from its names it is not clear:
like nfs_pagein_multi() -> nfs_pageio_complete() -> add_stats. Presumably
we add stats when we have data handy...
CIFS/SMB use synchronous approach.

From those projects, which are not in kernel, like CRFS and CEPH, the
former uses async receiving thread, while the latter is synchronous,

It covers all operations, including reading, directory listing, lookups,
attribute changes and so on. Its main goal is to allow transparent

Oplocks and leases are essentially a lock on a given file, which allows one
client to operate on it. POHMELFS does not have locks now, and they will
be created depending on how distributed server will require them. In the
simplest case it can just lock file for writing and not allow its
updates from other clients. Lock acquisition can be done at write_begin
time. Without lock and writeback cache in your case writeback for file Y
can happen before writeback for file X, but if client does not only
write, but also sync after its write, then yes, client will see later
updates after earlier ones. POHMELFS does not broadcast its interest in
the file content until real writing happens, i.e. at writeback time.
Although I can add a mode, when the same will be done during

--
Evgeniy Polyakov
--
Hi Evgeniy,
By synchronous/asynchronous, are you talking about whether writepages()
blocks until the write is acked by the server? (Really, any FS that does

Well... Ceph writes synchronously (i.e. waits for ack in write()) only
when write-sharing on a single file between multiple clients, when it is
needed to preserve proper write ordering semantics. The rest of the time,
it generates nice big writes via writepages(). The main performance issue
is with small files... the fact that writepages() waits for an ack and is
usually called from only a handful of threads limits overall throughput.
If the writeback path was asynchronous as well that would definitely help
(provided writeback is still appropriately throttled). Is that what

Your meaning of "transaction" confused me as well. It sounds like you
just mean that the read/write operation is retried (asynchronously), and
may be redirected at another server if need be. And that writes can be
directed at multiple servers, waiting for an ack from both. Is that
right?

In my view the writeback metadata cache is definitely the most exciting
part about this project. Is there a document that describes where the
design ended up? I seem to remember a string of posts describing your
experiments with client-side inode number assignment and how that is
reconciled with the server. Keeping things consistent between clients is
definitely the tricky part, although I suspect that even something with
very coarse granularity (e.g., directory/subtree-based locking/leasing)
will capture most of the performance benefits for most workloads.

Cheers-
sage
--
Hi Sage.
Yes, not only writepage, but any request - if it sends request and then
receives reply (i.e. doing send/recv sequence without ability to do
something else in between or allow other users to do sends or receives
into the same socket), then it is synchronous. If it only sends, and
someone else receives, it is possible to send multiple requests from
different users who do reads or writes or lookups or whatever and
asynchronously in different thread receive replies not in particular

Not exactly. Transaction in a nutshell is a wrapper on top of command
(or multiple commands if needed like in writing), which contains all
information needed to perform appropriate action. When user calls read()
or 'ls' or write() or whatever, POHMELFS creates transaction for that
operation and tries to perform it (if operation is not cached, in that
case nothing actually happens). When transaction is submitted, it
becomes part of the failover state machine which will check if data has
to be read from different server or written to new one or dropped.
Original caller may not even know from which server its data will be
received. If request sending failed in the middle, the whole transaction
will be redirected to new one. It is also possible to redo transaction
against different server, if server sent us error (like I'm busy), but
this functionality was dropped in previous release iirc, this can be
resurrected though. Having a generic transaction tree, callers do not
bother about how to store their requests, how to wait for results and
how to complete them - transactions do it for them. It is not rocket
science, but extremely effective and simple way to help rule out

That was somewhat old approach, currently inode numbers and things like
open-by-inode or NFS style open-by-cookie are not used. I tried to
describe caching bits in documentation I sent, although it's a bit rough
and likely incomplete :) Feel free to ask if there are some white areas
there.

--
Evgeniy Polyakov
--
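As a rough illustration of the transaction idea discussed above, here is a
minimal userspace sketch. Everything in it is an assumption made for the
example: struct server, struct transaction, send_to(), submit_read() and
submit_write() are hypothetical names, the alive flag replaces real network
errors and timeouts, and the real client is asynchronous while this model
is a plain loop. It only shows the two policies mentioned in the thread:
a read is redirected to another server when sending fails, and a write
counts as complete only when every server has acked it.

```c
#include <stddef.h>

/* Hypothetical model of a server: alive == 0 simulates a dead link. */
struct server {
    const char *name;
    int alive;
};

/* Hypothetical transaction: wraps one command plus failover state. */
struct transaction {
    int id;
    size_t server;   /* index of the server tried first */
    int attempts;    /* how many sends were attempted */
};

/* Pretend to send the transaction; 0 means the server acked it. */
static int send_to(const struct server *s, const struct transaction *t)
{
    (void)t;
    return s->alive ? 0 : -1;
}

/*
 * Read-style submission: try servers in turn, starting at t->server.
 * Returns the index of the server that acked, or -1 when every server
 * failed and the transaction has to be dropped.
 */
static int submit_read(struct transaction *t, const struct server *srv,
                       size_t n)
{
    for (size_t tried = 0; tried < n; tried++) {
        size_t idx = (t->server + tried) % n;

        t->attempts++;
        if (send_to(&srv[idx], t) == 0) {
            t->server = idx;
            return (int)idx;
        }
    }
    return -1;
}

/*
 * Write-style submission: the command goes to every server and counts
 * as complete only when all of them acked. Returns 0 when complete.
 */
static int submit_write(struct transaction *t, const struct server *srv,
                        size_t n)
{
    size_t acks = 0;

    for (size_t i = 0; i < n; i++) {
        t->attempts++;
        if (send_to(&srv[i], t) == 0)
            acks++;
    }
    return (n > 0 && acks == n) ? 0 : -1;
}
```

The caller of submit_read() never names a server; it only learns afterwards
which one answered, which mirrors the point that the original caller may not
know where its data came from.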
Oh, so you just mean that the caller doesn't, say, hold a mutex for the
socket for the duration of the send _and_ recv? I'm kind of shocked that
anyone does that, although I suppose in some cases the protocol

Got it. Tracking pending requests in some generic way is definitely key

So what happens if the user creates a new file, and then does a stat() to
expose i_ino. Does that value change later? It's not just
open-by-inode/cookie that make ino important.

It looks like the client/server protocol is primarily path-based. What
happens if you do something like

hosta$ cd foo
hosta$ touch foo.txt
hostb$ mv foo bar
hosta$ rm foo.txt

Will hosta realize it really needs to do "unlink /bar/foo.txt"?
sage
--
First, socket has own internal lock, which protects against simultaneous
access to its structures, but POHMELFS has its own mutex, which guards
network operations for given network state, so if server disconnected,
socket can be released and zeroed if needed, so that subsequent access
could detect it and make appropriate decision like try to reconnect.

I really do not understand your surprise :)
But it is possible to create a scheme, when you do not need to hold a
lock between commands for successful completion. It is even possible not
to _expect_ that something will be received from given socket or
received at all. Courtesy of transactions: system locks only data, which
has to be processed, it does not lock sequence of commands which are

Local inode number is returned. Inode number does not change during
lifetime of the inode, so while it is alive always the same number will

No, since it got a reference to object in local cache. But it will fail
to do something interesting with it, since it does not really exist on
server anymore.
When 'hosta' rereads the higher directory (it will when needed, since
server will send it cache coherency message, but thanks to your example,
rename really does not send it, only remove :), so I will update server),
it will detect that directory changed its name and later will use it.
After reread system actually can not know if directory was renamed or it
is a completely new one with the same files.

You pointed to a very interesting behaviour of the path based approach,
which has bothered me for quite a while:
since cache coherency messages have own round-trip time, there is always
a window when one client does not know that another one updated object
or removed it and created new one with the same name.
It is trivially possible to extend path cache with storing remote ids,
so that attempt to access old object would not harm new one with the
same name, but I want to think about it some more.
Correct solution is to use locks of course, and I'm not 100% it ...
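
The "path cache with remote ids" idea from this message could look roughly
like the sketch below. It is an illustration under assumptions only:
struct object, server_unlink() and the id values are hypothetical, and
nothing here is taken from the POHMELFS protocol. The client sends the path
together with the id it cached, and the server refuses to touch an object
that has the same name but a different id.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical server-side object: a path plus a unique remote id. */
struct object {
    const char *path;
    unsigned long id;
    int present;
};

/*
 * Unlink by path, guarded by the id the client has in its path cache.
 * Returns 0 on success, -ESTALE when the name now belongs to a different
 * object, and -ENOENT when no such path exists.
 */
static int server_unlink(struct object *objs, size_t n, const char *path,
                         unsigned long expected_id)
{
    for (size_t i = 0; i < n; i++) {
        if (!objs[i].present || strcmp(objs[i].path, path) != 0)
            continue;
        if (objs[i].id != expected_id)
            return -ESTALE;   /* same name, but not the cached object */
        objs[i].present = 0;
        return 0;
    }
    return -ENOENT;
}
```

This only covers the "do not harm the new object" half of the problem; it
does not help the client find the original object under its new path.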
Well, I must still be misunderstanding you :(. It sounded like you were
saying other network filesystems take the socket exclusively for the
duration of an entire operation (i.e., only a single RPC call outstanding
with the server at a time). And I'm pretty sure that isn't the case...

Which means I'm still confused as to how POHMELFS's transactions are
fundamentally different here from, say, NFS's use of RPC. In both cases,
multiple requests can be in flight, and the server is free to reply to
requests in any order. And in the case of a timeout, RPC requests are
resent (to the same server.. let's ignore failover for the moment). Am I

I see. And if the inode drops out of the client cache, and is later
reopened, the st_ino seen by an application may change? st_ino isn't used
for much, but I wonder if that would impact a large cp or rsync's ability

Not if the server waits for the cache invalidation to be acked before
applying the update. That is, treat the client's cached copy as a lease
or read lock. I believe this is how NFSv4 delegations behave, and it's
how Ceph metadata leases (dentries, inode contents) and file access
capabilities (which control sync vs async file access) behave. I'm not
all that familiar with samba, but my guess is that its leases are broken

That's half of it... ideally, though, the client would have a reference to
the real object as well, so that the original foo.txt would be removed.
I.e. not only avoid doing the wrong thing, but also do the right thing.

I have yet to come up with a satisfying solution there. Doing a d_drop on
dentry lease revocation gets me most of the way there (Ceph's path
generation could stop when it hits an unhashed dentry and make the request
path relative to an inode), but the problem I'm coming up against is that
there is no explicit communication of the CWD between the VFS and fs
(well, that I know of), so the client doesn't know when it needs a real
reference to the directory (and I'm no...
Hi.
Well, RPC is quite similar to what transaction is, at least its approach

There is a number of cases when inode number will be preserved, like
parent inode holds its number in own subcache, so when it will lookup
object it will give it the same inode number, but generally if inode was

That's why I still did not implement locking in POHMELFS - I do not want
to drop to sync case for essentially all operations, which will end up
broadcasting cache coherency messages. But this may be unavoidable case,
so I will have to implement it that way.

NFS-like delegation is really the simplest and not interesting case,
since it drops parallelism for multiple clients accessing the same data,

Well, the same code was in previous POHMELFS releases and I dropped it.
I'm not sure yet what the exact requirements are for locking and cache
coherency expected from such kind of distributed filesystem, so there is
no locking yet.

There will always be some kind of tradeoffs between parallel access and
caching, so drawing that line closer to or farther from what we have in local
filesystem will anyway have some drawbacks.

--
Evgeniy Polyakov
--
You're confusing write gathering with asynchronous I/O...
NFS attempts to send multiple contiguous pages in one I/O request, and
so it has a mechanism for collecting them and dispatching the I/O as
soon as we have enough pages for an RPC call.

The actual RPC call is then handled by the sunrpc layer and is done
fully asynchronously using non-blocking I/O.

Trond
--
Well, yes, I did not dig deep enough into rpc itself, but checked how
callbacks are prepared in respect to ioflags namely RPC_FLAGS_ASYNC. I
was confused by the fact that the system did not yet process the request, but
accounted stats for it, but likely those stats are just intended to
show exactly what was queued for (later) processing.

--
Evgeniy Polyakov
--
For /locking/, life is easy, you don't have to worry about disallowing
client updates, because locking is advisory. However, there are some
guarantees you need for locking WRT write commit, and of course leases
are a totally different animal where you do block client updates.

Jeff
--
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
diff --git a/mm/filemap.c b/mm/filemap.c
index 07e9d92..5e5ad6b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -495,6 +495,7 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
lru_cache_add(page);
return ret;
}
+EXPORT_SYMBOL_GPL(add_to_page_cache_lru);

#ifdef CONFIG_NUMA
struct page *__page_cache_alloc(gfp_t gfp)
diff --git a/mm/filemap.c b/mm/filemap.c
index 07e9d92..1e7ef37 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -610,6 +610,7 @@ int __lock_page_killable(struct page *page)
return __wait_on_bit_lock(page_waitqueue(page), &wait,
sync_page_killable, TASK_KILLABLE);
}
+EXPORT_SYMBOL_GPL(__lock_page_killable);

/**
* __lock_page_nosync - get a lock on the page, without calling sync_page()

--
Evgeniy Polyakov
--
