[PATCHv3 0/4] sys_indirect system call

Previous thread: Linux 2.6.24-rc3 by Linus Torvalds on Saturday, November 17, 2007 - 1:33 am. (5 messages)

Next thread: [PATCH v3 00/17] New RT Task Balancing -v3 by Steven Rostedt on Saturday, November 17, 2007 - 2:21 am. (2 messages)
To: <linux-kernel@...>
Cc: <akpm@...>, <mingo@...>, <tglx@...>, <torvalds@...>
Date: Saturday, November 17, 2007 - 1:31 am

wing patches provide an alternative implementation of the
sys_indirect system call which has been discussed a few times.
This no system call allows us to extend existing system call
interfaces with adding more system calls.

Davide's previous implementation is IMO far more complex than
warranted. This code here is trivial, as you can see. I've
discussed this approach with Linus last week and for a brief moment
we actually agreed on something.

We pass an additional block of data to the kernel, it is copied into
the task_struct, and then it is up to the function implementing the system
call to interpret the data. Each system call, which is meant to be
extended this way, has to be white-listed in sys_indirect. The
alternative is to filter out those system calls which absolutely cannot
be handled using sys_indirect (like clone, execve) since they require
the stack layout of an ordinary system call. This is more dangerous
since it is too easy to miss a call.

The code for x86 and x86-64 gets by without a single line of assembly
code. This is likely to be true for most/all the other archs as well.
There is architecture-dependent code, though. For x86 and x86-64 I've
also fixed up UML (although only x86-64 is tested, that's my setup).

The last patch shows the first application of the functionality. It also
shows a complication: we need the test for valid sub-syscalls in the
main implementation and in the compatibility code. And more: the actual
sources and generated binary are very different (the numbers differ).
Duplicating the information is a big problem, though. I've used some macro
tricks to avoid this. All the information about the flags and the system
calls using them is concentrated in one header. This should maintenance
bearable.

This patch to use sys_indirect is just the beginngin. More will follow,
but I want to see how these patches are received before I spend more time
on it. This code is enough to test the implementation with the following
test program...

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mingo@...>, <tglx@...>, <torvalds@...>
Date: Monday, November 19, 2007 - 9:52 am

On Sat, 17 Nov 2007 00:31:36 -0500

This setup forbids future addons to file_flags

In three years, when we want to add a new indirect feature to socket()
call, do we need a new indirect2() syscall ?

So I suggest using :

union indirect_params i = {
.file_flags.flags = O_CLOEXEC,
};
fd = syscall (__NR_indirect, &r, &i, sizeof (i));

Or better, you could avoid using 'union indirect_params' in user code, and
only use the substructs for each function.
/* define this in some include file of course */
struct file_flags {
int flags;
};
/* use it */
struct file_flags ff = { .flags = O_CLOEXEC};
fd = syscall (__NR_indirect, &r, &ff, sizeof(ff));

This to avoid to copy to kernel un-necessary data, and permitting futures changes.

-

To: Eric Dumazet <dada1@...>
Cc: <linux-kernel@...>, <akpm@...>, <mingo@...>, <tglx@...>, <torvalds@...>
Date: Monday, November 19, 2007 - 11:12 am

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

No, it doesn't. The setup is indefinitely expandable.

All you need to do, if it becomes necessary to have more than an int, is
to define a little structure for the system call and then use it. The
only requirement is that the code has to assume a value of zero is what
is used today. That's the whole point.

union indirect_params {
struct {
int flags;
} file_flags;
struct {
int flags;
int new_syscall_data1;
sigset_t and_a_sigmask;
} new_data;
};

Old programs will set only the 'flags' member of 'new_data' while new
once can also set the new elements. New programs on old kernels will
eithe have failing calls since the structure is too big or the call will

There is no overhead introduced through the union. The only reason the
union is there in the first place is to allocate sufficient data in
task_struct to cover all cases.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHQafd2ijCOnn/RHQRAlSFAJ99lahwCDZGRSlIHCov5bWowrpoiQCgwvW4
LDSEusNUpMfIE1ywBCRDBfc=
=ChVT
-----END PGP SIGNATURE-----
-

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mingo@...>, <tglx@...>, <torvalds@...>
Date: Monday, November 19, 2007 - 11:43 am

On Mon, 19 Nov 2007 07:12:29 -0800

Yes, so now, if you take this new definition of indirect_params,
its size is 12 bytes at least.

So when you recompile your old program (as you post it and as I commented on),
it will pass a >= 12 bytes data to kernel, with only first 4 bytes set to O_CLOEXEC.

Other bytes will contain junk (automatic variables are not set to 0 by compiler),
and your program possibly break - if kernel socket() call was
extended to use new_data struct. (ie wanting O_CLOEXEC features +
new ones with new_syscall_data1)

You have to take care of binary compatibility, but also that an old program

Yes, sure, but it should be a kernel issue, not a user space one at all,
as long as user provides the size of the data he want to pass to kernel (
and your patch does this already)

-

To: Eric Dumazet <dada1@...>
Cc: <linux-kernel@...>, <akpm@...>, <mingo@...>, <tglx@...>, <torvalds@...>
Date: Monday, November 19, 2007 - 11:48 am

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

If you don't initialize the entire structure and you use it all, of
course you get undefined behavior. That's nothing new. The program I
attached is not an example, it's a test for the functionality in this patch.

Like with every kernel interface, you have to use it correctly. The
good news is that user programs should never use this syscall directly
(just like don't for existing ones).

I see no problem at all here.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHQbBH2ijCOnn/RHQRAkc3AKCxVTWQ3BiQnCBwdbAsT122QWWaiwCggKXN
Z5Sz9/NFojMHZXXTzIMoxX4=
=slte
-----END PGP SIGNATURE-----
-

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mingo@...>, <tglx@...>, <torvalds@...>
Date: Monday, November 19, 2007 - 12:14 pm

On Mon, 19 Nov 2007 07:48:23 -0800

I do see a problem, because some readers will take your example as a reference,
as it will probably sit in a page that google^Wsearch_engines will bring at the
top of search results for next ten years or so.

(I bet for "sys_indirect syscall" -> http://lwn.net/Articles/258708/ )

Next time you post it, please warn users that it will break in some years, or
state clearly this should only be used internally by glibc.

Thank you
-

To: Eric Dumazet <dada1@...>
Cc: Ulrich Drepper <drepper@...>, <linux-kernel@...>, <akpm@...>, <tglx@...>, <torvalds@...>
Date: Monday, November 19, 2007 - 3:08 pm

dont be silly, next time Ulrich should also warn everyone that running
attachments and applying patches from untrusted sources is dangerous?

any code that includes:

fd = syscall (__NR_indirect, &r, &i, sizeof (i));

is by definition broken and unportable in every sense of the word. Apps
will use the proper glibc interfaces (if it's exposed).

really, lets discuss _real_ issues. If you cannot think of any, please
stay silent.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Eric Dumazet <dada1@...>, Ulrich Drepper <drepper@...>, <linux-kernel@...>, <akpm@...>, <tglx@...>, <torvalds@...>
Date: Tuesday, November 20, 2007 - 10:11 am

as an application writer how do i access accept(2) with FD_CLOEXEC
functionality? will glibc expose an accept2() with a flags param? if
so... why don't we just have an accept2() syscall?

-dean
-

To: dean gaudet <dean@...>
Cc: Ingo Molnar <mingo@...>, Eric Dumazet <dada1@...>, <linux-kernel@...>, <akpm@...>, <tglx@...>, <torvalds@...>
Date: Tuesday, November 20, 2007 - 12:16 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Not yet decided. There is the alternative to extend the accept()
interface to have both interfaces:

int accept(int, struct sockaddr *, socklen_t *);
and
int accept(int, struct sockaddr *, socklen_t *, int);

If you read the mails of my first submission you'll find that I
explained this. I talked to Andrew and he favored new syscalls. But
then I talked to Linus and he favored this approach. Probably
especially because it can be used for syslets as well. And it is less
code and data than introducing new syscalls.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHQwhx2ijCOnn/RHQRAnezAKCkFmGwlwDZjpfKTRSUN4yLIeGTkACgtMK/
OcHdIaR8wbp848D3GU2iNYQ=
=nTu9
-----END PGP SIGNATURE-----
-

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mingo@...>, <tglx@...>, <torvalds@...>
Date: Sunday, November 18, 2007 - 3:44 am

I stared at this a bit, and it took me some time to try to grok what it
is trying to do. Eventually I figured it out, and I wonder if there
isn't an easier -- or at least more efficient -- way to accomplish this
goal.

It seems to me that we could accomplish the same thing by passing the
number of parameters in the upper bits of the system call number
register (%eax in the case of x86.) If set to zero, we'd fill in the
legacy number of registers (for backwards compatibility.) Unspecified
arguments are then forced to zero before invoking the target function;
we could also make a register count available if need be.

Alternatively, the same thing can be done with a dense system call
number space by adding a number of parameters field to the system call
table, however, that is more invasive in that one has to poke something
into each architecture (unfortunately -- it would be so much nicer if
there was a central metafile which one could process into the various
architecture system call tables.)

-hpa
-

To: H. Peter Anvin <hpa@...>
Cc: <linux-kernel@...>, <akpm@...>, <mingo@...>, <tglx@...>, <torvalds@...>
Date: Sunday, November 18, 2007 - 5:36 am

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This isn't really a generic solution. The number of parameters is
limited to six. There are syscalls with six parameters already. There
are many more with five which could only handle one more parameter.

Also, is it really simpler? You'd need to have another table which
contains the default number of parameters a system call takes so that
you can fill in the default value of zero. This extra memory access has
to be performed for every system call.

I think it is unlikely that this approach is faster. To the contrary,
I'd guess.

I don't have much invested into this but it seems the sys_indirect
approach is so much simpler. Overhead is only paid if you really need
it which is rarely the case. Plus, you might have heard Linus and Zack
talk about syslets again. Starting syslets can be done using the same
interface, I guess.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFHQAez2ijCOnn/RHQRAjoHAJ4/Qq4ygaZ4uq6uCIVNq4hfN1m2pACgpJFi
Z/vBsGFpUc/EUz+VW66jEIY=
=B19x
-----END PGP SIGNATURE-----
-

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mingo@...>, <tglx@...>, <torvalds@...>
Date: Sunday, November 18, 2007 - 3:37 pm

What bothers me about the sys_indirect approach is that it will get
increasingly expensive as time goes on, and in doing so it does a
user-space memory reference, which are extra expensive. The extra table
can be colocated with the main table (a structure, in effect) so they'll
share the same cache line.

-hpa
-

To: H. Peter Anvin <hpa@...>
Cc: <linux-kernel@...>, <akpm@...>, <mingo@...>, <tglx@...>, <torvalds@...>
Date: Monday, November 19, 2007 - 11:28 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

You assume that using sys_indirect will be the norm. It won't. We
mustn't design system calls deliberately wrong so that they require the
indirection.

Beside, if the number of syscalls which has to be handled this way grows
we can use something more efficient for large numbers of test than a
switch statement. It could even be a word next to the system call table.

But I still don't see that the magic encoding is a valid solution, it
doesn't address the limited parameter number. Plus, using sys_indirect
could in future be used to transport entire parameters (like a sigset_t)
along with other information, thereby saving individual copy operations.

I think the sys_indirect approach is the way forward. I'll submit a
last version of the patch in a bit.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHQlRw2ijCOnn/RHQRApifAKDE1nZqRbm4cJxbhobBb7jCx1T00QCgiSa0
EXKjL2Gwu3atSLSD+Rb4yO4=
=6ZGt
-----END PGP SIGNATURE-----
-

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mingo@...>, <tglx@...>, <torvalds@...>
Date: Tuesday, November 20, 2007 - 12:27 am

The limited number of parameters is a non-issue, we already have the
convention for that: for more than 6 parameters, pass a parameter to
arguments 6 and higher in the register normally used for parameter 6.

Now, for the specific case of x86-64 (as well as some of the RISC
architectures), this meshes poorly with the C calling convention, which
is that parameter 7+ are passed on the stack. We would obtain a more
efficient calling convention by allocating an additional register for
such an indirect pointer and/or adopt the convention that the additional
parameters are simply stored on the user stack starting at a specific
offset, presumably +8.

I would really like to see a systematic calling convention that doesn't
have limits that we have already broken several times. In particular, I
would like to see a convention that can be mapped 1:1 onto the platform
C calling convention in a syscall-independent way. We *almost* have
this today, but sys_indirect would blow that out of the water in a

I think it is a horrible kluge. It's yet another multiplexer, which we
are trying desperately to avoid in the kernel. Just to make things more
painful, it is a multiplexer which creates yet another ad hoc calling
convention, whereas we should strive to make the kernel calling
convention as uniform as possible.

If is is NOT going to be used in the common case, then it really doesn't
matter -- we can just use manually or automatically generated thunks.
It's not like we have thousands of system calls, and it's not like "a
system call" is a precious thing.

If it IS going to be used in the common case, then we should use
something more streamlined.

-hpa

-

Previous thread: Linux 2.6.24-rc3 by Linus Torvalds on Saturday, November 17, 2007 - 1:33 am. (5 messages)

Next thread: [PATCH v3 00/17] New RT Task Balancing -v3 by Steven Rostedt on Saturday, November 17, 2007 - 2:21 am. (2 messages)