Re: [patch] epoll use a single inode ...

Previous thread: [patch 3/3] epoll optimizations and cleanups ... by Davide Libenzi on Tuesday, March 6, 2007 - 5:34 pm. (3 messages)

Next thread: RE: [PATCH] s2io: add PCI error recovery support by Ramkrishna Vepa on Tuesday, March 6, 2007 - 5:42 pm. (1 message)
From: Davide Libenzi
Date: Tuesday, March 6, 2007 - 5:37 pm

I currently have the dentry to carry the name of the file* class it is 
linked to. I'd prefer to keep it that way, unless there are huge factors 
against.



- Davide


-

From: Linus Torvalds
Date: Tuesday, March 6, 2007 - 5:51 pm

[ Al Viro added to Cc: as the arbiter of good taste in the VFS layer. He 
  has veto powers even over my proposals ;^]


I assume that the *only* reason for having multiple dentries is really 
just the output in /proc/<pid>/fd/, right? Or is there any other reason to 
have separate dentries for these pseudo-files?

It's a bit sad to waste that much memory (and time) on something like 
that. I bet that the dentry setup is a noticeable part of the whole 
sigfd()/timerfd() setup. It's likely also a big part of any memory 
footprint if you have lots of them.

So how about just doing:
 - do a single dentry
 - make a "struct file_operations" member function that prints out the 
   name of the thing in /proc/<pid>/fd/, and which *defaults* to just 
   doing the d_path() on the dentry, but special filesystems like this 
   could do something else (like print out a fake inode number from the 
   "file->f_private_data" information)

There seems to really be no downsides to that approach. No existing 
filesystem will even notice (they'll all have NULL in the new f_op 
member), and it would allow pipes etc to be sped up and use less memory.

			Linus
-

From: Davide Libenzi
Date: Tuesday, March 6, 2007 - 6:01 pm

I'm OK with everything that avoid code duplication due to those fake 
inodes. The change can be localized inside the existing API, so it doesn't 
really affect me externally.



- Davide


-

From: Linus Torvalds
Date: Tuesday, March 6, 2007 - 6:27 pm

Can you try with the first patch version that doesn't do anything special 
at all, and just uses a single dentry.

Yeah, the dentry name will be identical, and so you would see something 
like "7 -> signalfd:signalfd" when you do "ls -l /proc/<pid>/fd/" on a 
task that has such a special file descriptor (with no way to tell 
different timerfd's and signalfd's apart), but I think it's better to 
start off simple than to overdesign things.

And trying to tell them apart sounds a bit like overdesign, if only 
because I really don't see why anybody would really *care*. So it's a 
timer for poll/select/epoll - why care about anything else?

If it really really turns out that people care, we know how we can do it. 
We'd hook into "proc_fd_link()" and we'd allow a per-file mntget/dget that 
we could use to let special filesystems create fake entries on demand. So 
it's not impossible, it's just likely simply not needed.

			Linus
-

From: Davide Libenzi
Date: Tuesday, March 6, 2007 - 6:47 pm

The code would not change/shrink much with the single dentry. We'd save 
memory, and we'd lose the capability of seeing aino:[CLASS].
Both ways are fine with me.



- Davide


-

From: Eric Dumazet
Date: Tuesday, March 6, 2007 - 11:52 pm

I would definitly *love* saving dentries for pipes (and sockets too), but how 
are you going to get the inode ?

pipes()/sockets() can use read()/write()/rw_verify_area() and thus need 
file->f_path.dentry->d_inode (so each pipe needs a separate dentry)

Are you suggesting adding a new "struct file_operations" member to get the inode ?
Or re-intoducing an inode pointer in struct file ?

-

From: Davide Libenzi
Date: Wednesday, March 7, 2007 - 12:15 am

I was not planning to touch anything but epoll, signalfd and timerfd 

Currently, they use a single inode, and multiple dentries (to give the 
name of the class). But this could be changed to a single dentry like 
Linus was suggesting. I'll wait for Al's reply before doing anything.
Memory saving can be something, on top of the already big one of avoiding 
code duplication.



- Davide


-

From: Eric Dumazet
Date: Wednesday, March 7, 2007 - 12:16 am

Crazy ideas : (some readers are going to kill me)

1) Use the low order bit of f_path.dentry to say : this pointer is not a 
pointer to a dentry but the inode pointer (with the low order bit set to 1)

OR

2) file->f_path.dentry set to NULL for this special files (so that we dont 
need to dput() and cache line ping pong the common dentry each time we 
__fput() a pipe/socket.

Same trick could be used for file->f_path.mnt, because there is a big SMP 
cache line ping/pong to maintain a mnt_count on pipe/sockets mountpoint while 
these file systems cannot be un-mounted)

If dentry is NULL, we get the inode pointer from an overlay of struct 
file_ra_state    f_ra; (because for this special files readahead is unused)

This adds some conditional branches of course, but being able to save ram and 
better use cpu caches might be worth them.


-

From: Christoph Hellwig
Date: Wednesday, March 7, 2007 - 1:56 am

No way on either one.  f_path.dentry always beeing there is an assumption
we make all over the place, and changing that would be a big regression
for code qualityand reliability all over the place.

Face it folks, memory is generally cheap, and we're not going to uglify
huge amounts of code to shave of a little bit.

[and that is only in reply to this one, the single dentry optimizations

Same thing as above.  We might do a hack to not refcount these vfsmounts,
but we definitively want to keep the invariant of f_path.mnt never
beeing NULL.

-

From: Linus Torvalds
Date: Wednesday, March 7, 2007 - 10:21 am

First off, as noted earlier, you don't need crazy ideas. 


No, we don't want to do that. There's a lot of code that wants to just 
follow the dentry/inode chain unconditionally, and we want to have people 
able to just do

	i_size_read(file->f_path.dentry->d_inode);

kind of thing, without having to check anything in between. It's costly 
enough that we have a chain of pointers, it's much *worse* if we then have 

I agree abot the ping pong, but I suspect it's a lot less than the current 
pingpong on dcache_lock, which is a lot hotter than a pipe-dentry counter 
would be.

If it becomes a big issue, we could just have a fixed number of dentries, 
and hash them out some way (to dilute the ping-ponging). So making things 
NULL would just be horrible for everybody, since we have lots of generic 
code that just looks up the dentry and inode, and we don't want to make 
that worse.

(I'm sure using "filp->f_private_data" has some downsides too, but it 
seems fairly simple at least for pipes. The biggest problem is likely that 
we currently use the mutex in the inode in "filp->f_dentry->d_inode" as a 
way to protect the inode->i_pipe pointer itself, and there is no 
equivalent locking at the "struct file *" level at all. We might have to 
make the "f_ep_lock" spinlock be unconditional)

		Linus
-

From: Linus Torvalds
Date: Wednesday, March 7, 2007 - 10:02 am

No, at least pipes could easily just use "file->f_private_data" instead.

Now, sockets really do want the inode (or it would be really really big 
changes), but pipes really just want a "struct pipe_inode_info" pointer, 
which we could hide away directly in the file descriptor itself.

That's what Davide already did (on my suggestion) for signalfd - there's a 
*single* inode, and the real data is in the per-fd f_private_data.

		Linus
-

From: Eric Dumazet
Date: Wednesday, March 7, 2007 - 10:31 am

sockets already uses file->private_data.

But calls to read()/write() (not send()/recv()) still need to go through the 


Please find enclosed the following patch, to prepare this path.

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to be able to provide a dentry name 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on sock_mnt

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on pipe_mnt

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

 fs/dcache.c            |    3 +++
 fs/pipe.c              |   16 +++++++++++-----
 include/linux/dcache.h |    1 +
 net/socket.c           |   15 +++++++++++----
 4 files changed, 26 insertions(+), 9 deletions(-)

From: Eric Dumazet
Date: Wednesday, March 7, 2007 - 10:36 am

(resending with a more convenient attachment)

Please find enclosed the following patch, to prepare this path.

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method calle=
d=20
d_dname() might be called from d_path() to be able to provide a dentry name=
=20
for special filesystems. It is called without locks.

=46uture patches (if we succeed in having one common dentry for all pipes) =
may=20
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation.=
=20
This is delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on sock_mnt

3) Use this new method for pipes : No more sprintf() at pipe creation. This=
 is=20
delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on pipe_mnt

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a=
=20
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

=C2=A0fs/dcache.c =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0| =C2=A0 =C2=A03=
 +++
=C2=A0fs/pipe.c =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0| =C2=A0 16=
 +++++++++++-----
=C2=A0include/linux/dcache.h | =C2=A0 =C2=A01 +
=C2=A0net/socket.c =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 | =C2=A0 15 +++++++++=
++----
=C2=A04 files changed, 26 insertions(+), 9 deletions(-)
From: Linus Torvalds
Date: Wednesday, March 7, 2007 - 10:45 am

Sure. The dentry and the inode need to *exist*, but they can be one single 
static dentry/inode per "file descriptor type".

We always pass in the "struct file *" to read/write too, since we need it 
anyway for things like file control information (eg "is it a nonblocking 
read or write" kinds of things).

So I'm not suggesting a NULL dentry/inode, I'm suggesting a single static 
one per type.

And yeah, it may be harder than it looks. Some things "know" that all the 
relevant info is in the inode, so they just pass in the inode. In the pipe 
layer, for example, you'd need to change free_pipe_info() and 
alloc_pipe_info() to pass in the file descriptor instead, same goes for 
pipe_release(). But the "struct file *" is always available, it's just 
that since the code was originally written to have all the info in the 
inode, some of the code isn't set up to use it or pass it on..

But your patch is independent of that, and looks fine. Except I don't like 
this part:

-       file->f_path.mnt = mntget(sock_mnt);
+       file->f_path.mnt = NULL;

since I'd be much happer with always having f_path.mnt available, the same 
way we should always have f_path.dentry there.

(Btw, your patch is *not* going to work with the file->f_private_data 
approach, because d_path() is not passed down the "file *" thing. So we'd 
need to do that, and that's more intrusive (it can be NULL, since for 
things like cwd/pwd we don't have a "struct file").

But I like your patch as a totally independent thing. "It just makes 
sense". 

(Apart from the f_path.mnt thing, which I think was something else ;)

			Linus
-

From: Eric Dumazet
Date: Wednesday, March 7, 2007 - 11:06 am

Yes, but mntget()/mntput() are protected against NULL.
I was quite happy to remove two locked operations :)

I tried this path today and failed...
Too many changes to do (nameidata) to propagate a 'struct file *' 

OK no problem here is the patch without messing f_path.mnt 

(benchmark results not really different on my little machine, SMP kernel but 
one CPU only... maybe because lock suffix is changed by a nop)


[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to be able to provide a dentry name 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
 fs/dcache.c            |    3 +++
 fs/pipe.c              |   12 +++++++++---
 include/linux/dcache.h |    1 +
 net/socket.c           |   13 ++++++++++---
 4 files changed, 23 insertions(+), 6 deletions(-)
From: Linus Torvalds
Date: Wednesday, March 7, 2007 - 11:30 am

All right. I like it. Definitely worth putting into -mm, or just 
re-sending to me after 2.6.21 is out (I'll forget all about it otherwise).

I have one more worry, namely this::

	-       char name[32];
	-       this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
	-       this.name = name;
	+       this.len = 0;
	+       this.name = NULL;

I think that's fine, and it *happens* to work, but it happens to work just 
because then d_alloc() will do:

	memcpy(dname, name->name, name->len);
	dname[name->len] = 0;

and passing in NULL to memcpy() is generally ok when len is 0.

HOWEVER. 

Not only might memcpy() do a "prefetch for read" on the source for some 
architectures (which in turn may end up being slow for an address that 
isn't in the TLB, like NULL), but you depend on a very much internal 
detail, since it *could* have been using something like

	memcpy(dname, name->name, name->len+1);

instead, and expected to get the final '\0' character from the source 
string.

So I would actually much prefer it to be written as

	this.len = 0;
	this.name = "";

just because it's safer.

But other than that small detail, I think this is not only an 
optimization, it's an actual cleanup, and we migth some day want to use 
something like this for some other things too (eg maybe this kind of 
approach is usable for /proc/<pid> entries too, to avoid instantiating 
them).

As to avoiding the mntget(), I'm not violently opposed to it, but I do 
think that it's a totally unrelated matter, so even if it's decided it's 
worth it, it should be done as a separate patch regardless.

			Linus
-

From: Eric Dumazet
Date: Wednesday, March 7, 2007 - 11:52 am

Well, I hope a prefetch(NULL) is OK because we are doing millions of them (see 

Yes very true, I will change that and push to Andrew for mm testing

I was thinking about being able to cache the name into the dentry, do you 
think it's worth the pain ? (its not SMP safe for example...)

static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
{
        if (!dentry->d_iname[0])
                snprintf(dentry->d_iname,
			DNAME_INLINE_LEN_MIN,
			"pipe:[%lu]",
			dentry->d_inode->i_ino);
        strlcpy(buffer, dentry->d_iname, buflen);
        return buffer;
}
-

From: Linus Torvalds
Date: Wednesday, March 7, 2007 - 12:07 pm

Actually, it *can* be SMP-safe, if you do it right. Something like

	len = dentry->d_name.len;

	if (!len) {
		len = snprintf(dentry->d_iname,
			DNAME_INLINE_LEN_MIN,
			"pipe:[%lu]",
			dentry->d_inode->i_ino);
		smp_wmb();
		dentry->d_name.len = len;
	}
	if (len >= buflen)
		len = buflen-1;
	memcpy(buffer, dentry->d_iname, len);
	buffer[len] = 0;
	return buffer;

should work, although it depends on the fact that our snprintf() 
implementation should be "stable" (ie if snprintf() modifies the buffer 
temporarily as it goes along, that would break, but I think our 
vsnprintf is good in that respect).

So you could have two different CPU's doing the snprintf() on the same 
buffer at the same time (and assigning the length at the same time), but 
since they'll write the same thing, you don't really care.

It's a bit subtle, though. And probably not really worth it.

		Linus
-

From: Anton Blanchard
Date: Wednesday, March 7, 2007 - 3:14 pm

Funny you mention this. We found some noticeable ppc64 regressions when
moving the dcache to standard list macros and had to do this to fix it
up:

static inline void prefetch(const void *x)
{
        if (unlikely(!x))
                return;

        __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x));
}

Urgh :)

Anton
-

From: Linus Torvalds
Date: Wednesday, March 7, 2007 - 3:57 pm

Yeah, I'm not at all surprised. Any implementation of "prefetch" that 
doesn't just turn into a no-op if the TLB entry doesn't exist (which makes 
them weaker for *actual* prefetching) will generally have a hard time with 
a NULL pointer. Exactly because it will try to do a totally unnecessary 
TLB fill - and since most CPU's will not cache negative TLB entries, that 
unnecessary TLB fill will be done over and over and over again..

In general, using software prefetching is just a stupid idea, unless

 - the prefetch really is very strict (ie for a linked list you do exactly 
   the above kinds of things to make sure that you don't try to prefetch 
   the non-existent end entry)
AND
 - the CPU is stupid (in-order in particular).

I think Intel even suggests in their optimization manuals to *not* do 
software prefetching, because hw can usually simply do better without it.

		Linus
-

From: Michael K. Edwards
Date: Wednesday, March 7, 2007 - 6:25 pm

Data prefetch instructions should indeed avoid page table walks.
(Instruction prefetch mechanisms often do induce table walks on ITLB
miss.)  Not just because of the null pointer case, but because it's
quite normal to run off the end of an array in a loop with an embedded
prefetch instruction.  If you have an extra instruction issue unit
that shares the same DTLB, and you know you will really want that
data, you can sometimes use it to force DTLB preloads by doing an
actual data fetch from the foreseeable page.  This is potentially one
of the best uses of chip multi-threading on an architecture like Sun's
Niagara.

(I don't think Intel's hyper-threading works for this purpose; the
DTLB is shared but the entries are marked as owned by one thread or
the other.  HT can be used for L2 cache prefetching, although the
results so far seem to be mixed:

Not the XScale -- it performs quite poorly without prefetch, as people
who have run ARMv5-optimized binaries on it can testify.  From the
XScale Core Developer's Manual:

<quote>
The Intel XScale(r) core has a true prefetch load instruction (PLD).
The purpose of this instruction is to preload data into the data and
mini-data caches. Data prefetching allows hiding of memory transfer
latency while the processor continues to execute instructions. The
prefetch is important to compiler and assembly code because judicious
use of the prefetch instruction can enormously improve throughput
performance of the core. Data prefetch can be applied not only to
loops but also to any data references within a block of code. Prefetch
also applies to data writing when the memory type is enabled as write
allocate

The Intel XScale(r) core prefetch load instruction is a true prefetch
instruction because the load destination is the data or mini-data
cache and not a register. Compilers for processors which have data
caches, but do not support prefetch, sometimes use a load instruction
to preload the data cache. This technique has the disadvantages ...
From: Kyle Moffett
Date: Wednesday, March 7, 2007 - 7:48 pm

Prefetching is also fairly critical on a Power4 or G5 PowerPC system  
as they have a long memory latency; an L2-cache miss can cost 200+  
cycles.  On such systems the "dcbt" prefetch instruction brings in a  
single 128-byte cacheline and has no serializing effects whatsoever,  
making it ideal for use in a linked-list-traversal inner loop.

Cheers,
Kyle Moffett

-

From: Eric Dumazet
Date: Thursday, March 8, 2007 - 12:24 am

OK, 200 cycles...

But what is the cost of the conditional branch you added in prefetch(x) ?

if (!x) return;

(correctly predicted or not, but do powerPC have a BTB ?)

About the NULL 'potential problem', maybe we could use a dummy nil (but 
mapped) object, and use its address in lists, ie compare for &nil instead of 
NULL. This would avoid :

- The conditional test in some prefetch() implementations
- The potential TLB problem with the NULL value.




-

From: Valdis.Kletnieks
Date: Thursday, March 8, 2007 - 9:57 am

You avoid those two, but add the very high likelyhood that a statement of=
 the
form 'if (=21x) =7B....=7D' will creep back in and bug out.  I doubt that=
 changing
the form of a basic C idiom to save a few cycles is worth it, especially =
when
the (=21x) form can be tested without a memory fetch, but (x =21=3D nil) =
may require
fetching 'nil' (remember - the x86 is a very popular chipset, but registe=
r
starved - if the loop is any size at all, 'nil' may require a reload each=

time around, costing you the win you thought you had....
From: Kyle Moffett
Date: Thursday, March 8, 2007 - 6:34 pm

Well, PowerPC "dcbt" does prefetch() correctly, it doesn't ever raise  
exceptions, doesn't have any side effects, takes only enough CPU to  
decode the address, and is ignored if it would have to do anything  
other than load the cacheline from RAM.  Prefetch streams are halted  
when they reach the end of a page boundary (no trapping to the MMU)  
and if the TLB entry isn't present then they would asynchronously  
abort.  This is a suitable prefetch implementation for G5:

   #define prefetch(x) __asm__ __volatile__ ("dcbt 0, %0"::"r"(x));

On GCC4+ the "__builtin_prefetch" built-in functions is probably  
mildly better tuned for most architectures:

   #define PREFETCH_RD 0
   #define PREFETCH_WR 1
   #define PREFETCH_LOCALITY_NONE 0
   #define PREFETCH_LOCALITY_LOW  1
   #define PREFETCH_LOCALITY_MED  2
   #define PREFETCH_LOCALITY_HIGH 3
   #define prefetch(x, args...) __builtin_prefetch(x ,##args)


Cheers,
Kyle Moffett

-

From: Anton Blanchard
Date: Thursday, March 8, 2007 - 7:52 pm

It depends on the implementation and the HID bit settings. Some do walk
the MMU hashtable if it isnt in the TLB.

Anton
-

From: Anton Blanchard
Date: Thursday, March 8, 2007 - 7:51 pm

Much less than the tablewalk. On ppc64 a tablewalk of an address that is
not populated in the hashtable will incur 2 cacheline lookups (primary
and secondary buckets). This plus the MMU state machine overhead adds up.

Cue Linus rant about PowerPC MMU :)

Anton
-

From: Linus Torvalds
Date: Wednesday, March 7, 2007 - 8:20 pm

No, I just checked, and Intel's own optimization manual makes it clear 
that you should be careful. They talk about performance penalties due to 
resource constraints - which makes tons of sense with a core that is good 
at handling its own resources and could quite possibly use those resources 
better to actually execute the loads and stores deeper down the 
instruction pipeline.

So it's not just 3DNow! making AMD look bad, or Intel would obviously 

Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching. 
Calling that "getting it right" is ludicrous. If anything, it gets things 
so wrong that prefetching is *required* for good performance.

I'm talking about real CPU's with real memory pipelines that already do 
prefetching in hardware. The better the core is, the less the prefetch 
helps (and often the more it hurts in comparison to how much it helps).

But if you mean "doesn't try to fill the TLB on data prefetches", then 


I just suspect that the upside for Core 2 Due is likely fairly low. The L2 
cache is good, the memory re-ordering is working.. I doubt "prefetch" 
helps in generic code that much for things like linked list following, you 
should probably limit it to code that has *known* access patterns and you 
know it's not going to be in the cache.

(In other words, I bet prefetching can help a lot with MMX/media kind of 
code, I doubt it's a huge win for "for_each_entry()")

		Linus
-

From: Michael K. Edwards
Date: Thursday, March 8, 2007 - 1:37 am

Certainly you should be careful -- and that usually means leaving it
up to the compiler.  But hinting to the compiler can help; there may
be an analogue of the (un)likely macros waiting to be implemented for
loop prefetch.  And while out-of-order execution and fancy hardware
prefetch streams greatly reduce the need for explicit prefetch in
general code, there's no substitute for the cache _bypassing_
instructions when trying to avoid excessive cache eviction in DDoS
situations.

For instance, if I wind up working on a splay-tree variant of Robert
Olsson's trie/hash work, I'll try to measure the effect of using SSE2
non-temporal stores to write half-open connections to the leaves of
the tree.  That may give some additional improvement in the ability to

Intel puts a lot of effort into educating compiler writers about the
optimal prefetch insertion strategies for particular cache
architectures.  At the same time, they put out the orange cones to
warn people off of hand-tuning prefetch placement using
micro-benchmarks.  People did that when 3DNow! first came out, with

That's not an accident.  Hardware prefetch units cost a lot in power
consumption.  Omitting the hardware prefetch unit and drastically
simplifying the pipeline is how they got a design whose clock they
could crank into the stratosphere and still run on battery power.  And
in the network processor space, they can bolt a myriad of on-chip
microengines and still have some prayer of accurately simulating the
patterns of internal bus cycles.  Errors in simulation can still be
fixed up with prefetch instruction placement to put memory accesses
from the XScale core into phases where the data path processors aren't
working so hard.

Moreover, because they're embedded targets and rarely have to run
third-party binaries originally compiled for older cores, it didn't
really cost them anything to say, "Sorry, this chip's performance is
going to suck if your compiler's prefetch insertion isn't properly
tuned."  The _only_ ...
From: Anton Blanchard
Date: Thursday, March 8, 2007 - 7:46 pm

Yeah this is exactly what we were seeing :)

Anton
-

From: Christoph Hellwig
Date: Thursday, March 8, 2007 - 1:56 am

This patch needs a lot more documentation.  It needs some really big
comments on why this should never ever be used for a real filesystem
(real as in user mountable), and probably add an assert for that
invariant somewhere.  Please also update Documentation/filesystems/Locking
and Documentation/filesystems/vfs.txt for it.
-

From: Eric Dumazet
Date: Thursday, March 8, 2007 - 2:42 am

Thank you for your feedback Christoph

Here is the version I was about to push to -mm, so if nobody complains, I ask 
Andrew to push it to mm so that it can reach 2.6.22 target

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Davide Libenzi <davidel@xmailserver.org>
CC: Al Viro <viro@zeniv.linux.org.uk>

 Documentation/filesystems/Locking |    2 ++
 Documentation/filesystems/vfs.txt |   12 +++++++++++-
 fs/dcache.c                       |    3 +++
 fs/pipe.c                         |   12 +++++++++---
 include/linux/dcache.h            |    1 +
 net/socket.c                      |   13 ++++++++++---
 6 files changed, 36 insertions(+), 7 deletions(-)
From: Christoph Hellwig
Date: Thursday, March 8, 2007 - 3:21 am

Please don't put braces around the function pointer call.  Also I

normally this would be:

	


Can you add a helper to init this one?

-

From: Eric Dumazet
Date: Thursday, March 8, 2007 - 4:11 am

Thanks again for your feedback Christoph :)

I think I addressed all your remarks.

Thank you

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Davide Libenzi <davidel@xmailserver.org>
CC: Al Viro <viro@zeniv.linux.org.uk>

 Documentation/filesystems/Locking |    2 ++
 Documentation/filesystems/vfs.txt |   12 +++++++++++-
 fs/dcache.c                       |    7 +++++++
 fs/pipe.c                         |   15 +++++++++------
 include/linux/dcache.h            |    1 +
 net/socket.c                      |   17 ++++++++++-------
 6 files changed, 40 insertions(+), 14 deletions(-)

From: Christoph Hellwig
Date: Thursday, March 8, 2007 - 4:18 am

It's not _some_ filesystems.  If real filesystem did this we'd be in
horrible trouble.  It's "synthetic, non-mountable" filesystem.

Except for this the patch looks fine to me, thanks Eric!
-

From: Linus Torvalds
Date: Thursday, March 8, 2007 - 8:52 am

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

Except you should fix the subject line when you send it out to Andrew ;)

		Linus
-

From: Eric Dumazet
Date: Thursday, March 8, 2007 - 9:58 am

Hi Andrew

Could you please put this final version in mm for testing ?

Thank's to all contributors.

[PATCH] VFS : Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all 
pipes/sockets) may need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen);


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

 Documentation/filesystems/Locking |    2 ++
 Documentation/filesystems/vfs.txt |   12 +++++++++++-
 fs/dcache.c                       |   10 ++++++++++
 fs/pipe.c                         |   15 +++++++++------
 include/linux/dcache.h            |    1 +
 net/socket.c                      |   17 ++++++++++-------
 6 files changed, 43 insertions(+), 14 deletions(-)

From: Eric Dumazet
Date: Thursday, March 8, 2007 - 11:29 am

Hi Andrew

I am sorry, my previous patch had a /proc/*/fd/ in a comment, so the */ clo=
sed=20
the comment and fs/dcache.c could not compile.

Could you please put this 'final-final' version in mm for testing ?

Thank's to all contributors, sorry for the noise.

[PATCH] VFS : Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method calle=
d=20
d_dname() might be called from d_path() to build a pathname=20
for special filesystems. It is called without locks.

=46uture patches (if we succeed in having one common dentry for all=20
pipes/sockets) may need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen);


2) Use this new method for sockets : No more sprintf() at socket creation.=
=20
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This=
 is=20
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a=
=20
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

=C2=A0Documentation/filesystems/Locking | =C2=A0 =C2=A02 ++
=C2=A0Documentation/filesystems/vfs.txt | =C2=A0 12 +++++++++++-
=C2=A0fs/dcache.c =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 | =C2=A0 10 ++++++++++
=C2=A0fs/pipe.c =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 | =C2=A0 15 +++++++++------
=C2=A0include/linux/dcache.h =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0| =C2=
=A0 =C2=A01 +
=C2=A0net/socket.c =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0| =C2=A0 17 ++++++++++-------
=C2=A06 files changed, 43 ...
From: Eric Dumazet
Date: Friday, March 9, 2007 - 3:18 am

Hi Andrew

Please find a new version of this patch : I realized d_path() has very 
uncommon semantic (it seems nobody caught the point in previous patches), and 
had to change the documentation and pipefs_dname() / sockfs_dname() 
accordingly.

Now, readlink("/proc/pid/fd/xx", buffer, 4096) returns the exact number of 
bytes, not the whole 4095 bytes :)

Extract of new Documentation:

	CAUTION : d_path() logic is quite  tricky. 
	The correct way to return for example "Hello" is to put it
	at the end of the buffer, and returns a pointer to the first char.

       Example :

static char *somefs_dname(struct dentry *dent, char *buffer, int buflen)
{
       char *string = "Hello";
       int sz = strlen(string) + 1;
       if (sz > buflen)
               return ERR_PTR(-ENAMETOOLONG);
       buffer += (buflen - sz);
       return memcpy(buffer, string, sz);
}



Thank you

[PATCH] VFS : Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all 
pipes/sockets) may need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen);


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

 ...
From: Linus Torvalds
Date: Friday, March 9, 2007 - 11:22 am

Yeah, it's subtle, since it wants to use a single buffer, and not copy 
things around too much.

But can I ask you to do a take3, and simply have a helper function like

	char *dynamic_dname(struct dentry *dentry, char *buffer, int len,
		const char *fmt, ...)
	{
		va_list args;
		char temp[64];
		int i;

		va_start(args, fmt);
		i = vsnprintf(tmp,sizeof(tmp),fmt,args) + 1;
		va_end(args);

		if (i > len)
			return ERR_PTR(-ENAMETOOLONG);

		buffer += len - i;
		memcpy(buffer, tmp, i);
		return buffer;
	}

and just require that everybody use that function.

Then the pipe code would just become

	static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
	{
		return dynamic_dname(dentry, buffer, buflen, ""pipe:[%lu]", dentry->d_inode->i_ino);
	}

and you're done, and you have only *one* place in the VFS layer 
(preferably right next to d_path() itself) that cares about the
subtle issues that we have.

		Linus
-

From: Eric Dumazet
Date: Friday, March 9, 2007 - 6:21 pm

Hi Andrew

Please find take3 of this patch : Linus suggested to introduce a helper 
function to factorize work done by most d_dname() implementations.

Thank you

[PATCH] VFS : Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all 
pipes/sockets) may need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen);

2) Adds a dynamic_dname() helper function that eases d_dname() implementations

3) Defines d_dname method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

4) Defines d_dname method for pipes : No more sprintf() at pipe creation.
This is delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
From: Bob Copeland
Date: Thursday, March 8, 2007 - 1:00 pm

Pedantry: it's and don't?

-Bob
-

Previous thread: [patch 3/3] epoll optimizations and cleanups ... by Davide Libenzi on Tuesday, March 6, 2007 - 5:34 pm. (3 messages)

Next thread: RE: [PATCH] s2io: add PCI error recovery support by Ramkrishna Vepa on Tuesday, March 6, 2007 - 5:42 pm. (1 message)