Re: 64-syscall args on 32-bit vs syscall()

Previous thread: [PATCH -rc] memcg: disable move charge in no mmu case by Daisuke Nishimura on Sunday, March 14, 2010 - 9:35 pm. (1 message)

Next thread: linux-next: Tree for March 15 by Stephen Rothwell on Sunday, March 14, 2010 - 9:54 pm. (1 message)
From: Benjamin Herrenschmidt
Date: Sunday, March 14, 2010 - 9:48 pm

Hoy there !

This may have been discussed earlier (I have some vague memories...) but
I just hit a problem with that again (Mark: hint, it's in hdparm's
fallocate) so I'd like a bit of a refresh here on what is the "right
thing" to do...

So some syscalls want a 64-bit argument. Let's take fallocate() as our
example. So we already know that we have to be extra careful since some
32-bit arch will pass this into 2 registers (or stack slots) which need
to be aligned, and so we tend to already take care of making sure that
the said 64-bit argument is either defined as 2x32-bit arguments, or
defined as 1x64 bit argument aligned to 2x32-bit in the argument list.

So far so good...

The problem is when user space tries to use the same trick for calling
those functions using glibc-provided syscall() function. In this
example, hdparm does:

err = syscall(SYS_fallocate, fd, mode, offset, len);

With "offset" being a 64-bit argument.

This will break because the first argument to syscall now shifts
everything by one register, which breaks the register pair alignment
(and I suppose archs with stack based calling convention can have
similar alignment issues even if x86 doesn't).

Ulrich, Steven, shouldn't we have glibc's syscall() take a long long as
it's first argument to correct that ? Either that or making it some kind
of macro wrapper around a __syscall(int dummy, int sysno, ...) ?

As it is, any 32-bit app using syscall() on any of the syscalls that
takes 64-bit arguments will be broken, unless the app itself breaks up
the argument, but the the order of the hi and lo part is different
between BE and LE architectures ;-)

So is there a more "correct" solution than another here ? Should powerpc
glibc be fixed at least so that syscall() keeps the alignment ?

Cheers,
Ben.


--

From: David Miller
Date: Sunday, March 14, 2010 - 10:06 pm

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>

I think it is even different on the same endian architectures,
f.e. mips I think.

There is no way to do this without some arch specific code
to handle things properly, really.
--

From: Benjamin Herrenschmidt
Date: Sunday, March 14, 2010 - 10:18 pm

Right, but to what extent ? IE. do we always need the callers using
syscall() directly to know it all, or can we to some extent handle some
of it inside glibc ?

For example, if powerpc glibc is fixed so that syscall() takes a 64-bit
first argument (or calls via some macro to add a dummy 32-bit argument),
the register alignment will be preserved, and things will work just
fine.

IE. It may not fix all problems with all archs, but in this case, it
will fix the common cases for powerpc at least :-) And any other arch
that has the exact same alignment problem.

Or is there any good reason -not- to do that in glibc ?

Cheers,
Ben.


--

From: David Miller
Date: Sunday, March 14, 2010 - 10:54 pm

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>

The whole point of syscall() is to handle cases where the C library
doesn't know about the system call yet.

I think it's therefore very much "buyer beware".

On sparc it'll never work to use the workaround you're proposing since
we pass everything in via registers.

So arch knowledge will always need to be present in these situations.
--

From: Benjamin Herrenschmidt
Date: Monday, March 15, 2010 - 1:22 pm

I'm not sure I follow. We also pass via register on powerpc, but the
offset introduced by the sysno argument breaks register pair alignment
which cannot be fixed up inside syscall().

However, if I change glibc's syscall to be something like

#define syscall(sysno, args...)	__syscall(0 /* dummy */, sysno, args)

And make __syscall then do something like:

	mr	r0, r4
	mr	r3, r5
	mr	r4, r6
	mr	r5, r7
	mr	r6, r8
	.../...
	sc
	blr

Then at least all that class of syscalls will be fixed. Of course this
has to be in glibc arch code. I was merely asking if that was something
our glibc folks would consider and whether somebody could think of a
better solution :-)

Cheers
,Ben.


--

From: Ralf Baechle
Date: Monday, March 15, 2010 - 6:44 am

MIPS passes arguments in the endian order that is low/high for little

Syscall is most often used for new syscalls that have no syscall stub in
glibc yet, so the user of syscall() encodes this ABI knowledge.  If at a
later stage syscall() is changed to have this sort of knowledge we break
the API.  This is something only the kernel can get right.

  Ralf
--

From: H. Peter Anvin
Date: Monday, March 15, 2010 - 8:13 am

One option would be to do a libkernel.so, with auto-generated stubs out
of the kernel build tree.  As already discussed in #kernel this morning,
there are a number of sticky points with types and namespaces for this
this, but those aren't any worse than the equivalent problems for
syscall(3).

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--

From: Ulrich Drepper
Date: Monday, March 15, 2010 - 9:00 am

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


No need.  Put it in the vdso.  And name it something other than syscall.
 The syscall() API is fixed, you cannot change it.

All this only if it makes sense for ALL archs.  If it cannot work for
just one arch then it's not worth it at all.

- -- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iEYEARECAAYFAkueWbcACgkQ2ijCOnn/RHRtBQCeP88S/0xei7CAt65AGboqsrC8
N7wAoK7Qbi+OZuQrgHTCgTA27TgY+gQU
=4tJ6
-----END PGP SIGNATURE-----
--

From: David Miller
Date: Monday, March 15, 2010 - 12:00 pm

From: Ulrich Drepper <drepper@redhat.com>

There are many archs that still lack VDSO.
--

From: H. Peter Anvin
Date: Monday, March 15, 2010 - 12:41 pm

Putting it into the vdso is also rather annoyingly heavyweight for what
is nothing other than an ordinary shared library.  Just making it an
ordinary shared library seems a lot saner.

I don't see why syscall() can't change the type for its first argument
-- it seems to be exactly what symbol versioning is for.

Doesn't change the fact that it is fundamentally broken, of course.

	-hpa

--

From: Benjamin Herrenschmidt
Date: Monday, March 15, 2010 - 1:35 pm

No need to change the type of the first arg and go for symbol
versionning if you do something like I proposed earlier, there will be
no conflict between syscall() and __syscall() and both variants can
exist.

Cheers,
Ben.


--

From: H. Peter Anvin
Date: Monday, March 15, 2010 - 1:41 pm

Basically symbol versioning done "by hand", actually using symbol
versioning is better, IMNSHO.

	-hpa
--

From: Steven Munroe
Date: Tuesday, March 16, 2010 - 2:56 pm

One concern is the new syscall and the kernel have to match and mixing
will not work. your proposal seems to impact all syscalls not just the
one called via syscall API. These syscalls get generated inline which
makes static linking very dangerous ...

So I think you do need both symbol versioning and kernel feature stubs

--

From: Benjamin Herrenschmidt
Date: Tuesday, March 16, 2010 - 5:31 pm

What do you mean ? My proposal is purely a change to the syscall()
function, nothing else. No kernel change, no ABI change, no change to
the way glibc normally calls syscalls internally, etc... just the
exported syscall() function to shift its arguments in order to avoid
losing register pair alignment.

And the change would still be compatible with existing userland code who
manually splits the 64-bit arguments to avoid the problem on power.

IE. Unless I've missed something, this would be a 100% backward
compatible change that simply make a whole class of syscall() use work
that didn't before on power (but did on x86), such as the one I hit in
hdparm for example.

Cheers,
Ben.


--

From: Ulrich Drepper
Date: Tuesday, March 16, 2010 - 10:52 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


How can this be?  People are today actively working around the problem
of 64-bit arguments.  You have to break something since you cannot
recognize these situations.  And since it became meanwhile clear that
there is no way to "fix" all archs magically I really don't want to
introduce anything.  There are mechanisms in place to abstract out some
of the issues.  And for the rest, well, if you're using syscalls
directly you already have to encoded lowlevel knowledge.  One more bit
doesn't hurt.  It's not as if this happens every day.

- -- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iEYEARECAAYFAkugbhsACgkQ2ijCOnn/RHQzlACeMp0UK2jZuZOgXhJjB8Z9p4kh
rCoAn0zaJqFYV9tQ0Ct49Mprfa0O5iKh
=71la
-----END PGP SIGNATURE-----
--

From: Benjamin Herrenschmidt
Date: Wednesday, March 17, 2010 - 1:56 am

Ok, so I -may- be missing something, but I believe this won't break
anything:

 - You keep the existing syscall() exported by glibc for binary
compatibility

 - You add a new __syscall() (or whatever you want to name it) that adds
a dummy argument at the beginning, and whose implementation shifts
everything by 2 instead of 1 argument before calling into the kernel

 - You define in unistd.h or whatever is relevant, a macro that does:

#define syscall(__sysno, __args..) __syscall(0, _sysno, __args)

I believe that should cover it, at least for powerpc, possibly for other
archs too though as I said, I may have missed something there.

IE. Whether your app writes:

	syscall(SYS_foo, my_64bit_arg);

Or

	syscall(SYS_foo, (u32)(my_64bit_arg >> 32), (u32)(my_64bit_arg));

Both should still work with the new approach and end up doing the right
thing.

Hence, apps that use the first form today because it works on x86 would
end up working at least on powerpc where they would have been otherwise

It doesn't happen everyday. However, if my proposal ends up fixing a
bunch of cases where it does without breaking anything, then I suppose
it's worth considering, though as I said, it's possible that I miss some
subtlety here in which case I'd be glad to stand corrected :-)

Cheers,


--

From: Ulrich Drepper
Date: Wednesday, March 17, 2010 - 2:14 am

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


How can this possibly be the case?  This will screw people who currently
work around the ppc limitations of the existing syscall.

Just leave it alone.

- -- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iEYEARECAAYFAkugnV0ACgkQ2ijCOnn/RHRL4gCeIY0SLDCgLqtVvuMw+pvCzkwE
3MIAoJQRK5Mc+WtC/Wz9tPFPy4X+EALe
=lexw
-----END PGP SIGNATURE-----
--

From: Benjamin Herrenschmidt
Date: Wednesday, March 17, 2010 - 3:13 am

No it won't. As I said, it will work for both cases. The problem is a
register pair alignment problem. If the alignment is corrected with the
trick I proposed, 64-bit values will end up in the right pair, but
manually worked-around cases where the value is already broken up will
-also- end up in the right pair.

The problem with syscall() as it is is that it skews the arguments by 1
register, which causes the compiler to skip a register when generating
the call for a 64-bit value. By doing the trick I propose, that skew
will be gone, both 32 and 64 bit arguments will end up where expected.

Cheers,
Ben.


--

From: Jamie Lokier
Date: Wednesday, March 17, 2010 - 2:18 am

I think what Ulrich is getting at is your change will break existing
code which already does:

#ifdef __powerpc__
	syscall(SYS_foo, 0, my_64bit_arg);
#else
	syscall(SYS_foo, my_64bit_arg);
#endif

I don't know of any such code, but it might be out there.

-- Jamie
--

From: Benjamin Herrenschmidt
Date: Wednesday, March 17, 2010 - 3:18 am

No, the above "workaround" doesn't work. With the existing syscall()
definition, there is no difference between your two examples. In the
first case, you force a proper 64-bit aligment, but you are already off
by one register pair from the kernel expectation. In the second case,
gcc will imply one, which means that both your examples above will
result in my_64bit_arg in the -same- place, which is off by a register
pair from what the kernel expect.

IE. In the first case gcc will put SYS_foo in r3, 0 in r4, and
my_64bit_arg in r5 and r6. In the second case, gcc will put SYS_foo in
r3, won't care about r4, and will put the 64-bit arg in r5 and r6. Then,
glibc syscall() will shift r3 to r0, r3 to r4 etc... causing
my_64bit_arg to land in r4 and r5. But the kernel expects it in r3 and
r4.

The workaround that apps should use today is:

#if defined(__powerpc__) && WORDSIZE == 32
	syscall(SYS_foo, (u32)(my_64bit_arg >> 32), (u32)my_64bit_arg);
#else
	syscall(SYS_foo, my_64bit_arg);
#endif

And with my proposed change, both of the above will work. IE. gcc will
put the argument always in r5,r6 and the syscall() implementation will
always shift r5 to r3 and t6 to r4. 

Cheers,
Ben.


--

From: H. Peter Anvin
Date: Wednesday, March 17, 2010 - 11:30 am

Again, this is *exactly* symbol versioning done by hand... we have 
proper symbol versioning, let's use it.

	-hpa
--

From: Benjamin Herrenschmidt
Date: Wednesday, March 17, 2010 - 1:35 pm

Yeah, whatever, I don't mind what technique you use for the versionning,
ultimately, if the approach works, we can look at those details :-) We
-do- need the macro to strip the dummy argument though, unless we use
a slightly different technique which is to make the __sysno argument
itself 64-bit, which works as well I believe.

Cheers,
Ben.


--

From: H. Peter Anvin
Date: Wednesday, March 17, 2010 - 1:53 pm

It seems cleaner to do it that way (with a 64-bit sysno arg.)

	-hpa
--

From: Benjamin Herrenschmidt
Date: Wednesday, March 17, 2010 - 3:58 pm

Right. Now if we can get Ulrich to actually put 2 and 2 together and
admit that it actually works without breaking anything existing (at
least for my arch but I wouldn't be surprised if that was the case for
others), I would be even happier :-)

Steve, any chance you can cook up a glibc patch to test with ? Maybe
making it powerpc specific for now ?

Cheers,
Ben.


--

From: Steven Munroe
Date: Thursday, March 18, 2010 - 9:08 am

Do what declare __sysno as long long? The current prototype is in
unistd.h:

#ifdef __USE_MISC
/* Invoke `system call' number SYSNO, passing it the remaining
arguments.
   This is completely system-dependent, and not often useful.

   In Unix, `syscall' sets `errno' for all errors and most calls return
-1
   for errors; in many systems you cannot pass arguments or get return
   values for all system calls (`pipe', `fork', and `getppid' typically
   among them).

   In Mach, all system calls take normal arguments and always return an
   error code (zero for success).  */
extern long int syscall (long int __sysno, ...) __THROW;

#endif	/* Use misc.  */

Changing this would be an ABI change and would have to be versioned. It
would effect any one using syscall not just SYS_fallocate.

the question is do programmers in practice include unistd.h when they
use syscall.

If the changed prototype is not in scope then the 1st parm (__sysno)
defaults to int and is passed in on r3 which gets moved to r0.

If the changed syscall prototype is in scope then then _sysno would be
passed in r3/r4 (r3 would be 0 would be passed to r0 and the actual
system number would be in r4 and passed to the kernel in r3)

which behavior do you want? which (incorrect behavior compiled into
existing codes do you want to support?

Do you want syscall.S for PPC32 to change to match the changed
prototype? It will have to be be versioned and the new prototype will
only be available in future releases of GLIBC. Existing applications
will bind to the old ABI and get the old behavior.

--

From: Steven Munroe
Date: Thursday, March 18, 2010 - 10:03 am

Sorry in and long are compatible in 32-bit but not long long.

int and long are not compatible in 64-bit

It is hard the keep all the nodes and arguments straight.

But the concern about changing the prototype and are people actually

--

From: Benjamin Herrenschmidt
Date: Thursday, March 18, 2010 - 2:18 pm

Well, using the macro trick instead would fix that problem, code
wouldn't build if it doesn't include unistd.h :-)

Cheers,
Ben.


--

From: Jamie Lokier
Date: Thursday, March 18, 2010 - 6:22 pm

Or it will build, but call the old ABI version - no change to those programs.

-- Jamie
--

From: Andreas Schwab
Date: Thursday, March 18, 2010 - 9:21 am

int is incompatible with long, so you already get undefined behaviour
anyway.

Andreas.

-- 
Andreas Schwab, schwab@redhat.com
GPG Key fingerprint = D4E8 DBE3 3813 BB5D FA84  5EC7 45C6 250E 6F00 984E
"And now for something completely different."
--

From: Benjamin Herrenschmidt
Date: Monday, March 15, 2010 - 1:27 pm

Well, no. The change I propose would not break the ABI on powerpc and
would auto-magically fix thoses cases :-) But again, you don't have to
do the same thing on MIPS or sparc, it's definitely arch specific.

IE. What you are saying is that a syscall defined in the kernel as:

	sys_foo(u64 arg);

To be called from userspace would require something like:

	u64 arg = 0x123456789abcdef01;

#if defined(__powerpc__) && WORDSIZE == 32
	syscall(SYS_foo, (u32)(arg >> 32), (u32)arg);
#ese
	syscall(SYS_foo, arg);

While with the trick of making syscall a macro wrapping an underlying
__syscall that has an added dummy argument, the register alignment is
"corrected" and thus -both- forms above suddenly work for me. That might
actually work for you too.

Cheers,
Ben.


--

From: Steven Munroe
Date: Monday, March 15, 2010 - 8:03 am

The powerpc implementation of syscall is:


ENTRY (syscall)
	mr   r0,r3
	mr   r3,r4
	mr   r4,r5
	mr   r5,r6
	mr   r6,r7
	mr   r7,r8
	mr   r8,r9
	sc
	PSEUDO_RET
PSEUDO_END (syscall)

The ABI says:

"Long long arguments are considered to have 8-byte size and alignment.
The same 8-byte arguments that must go in aligned pairs or registers are
8-byte aligned on the stack."

This implies that the SYS_fallocate call will skip a register to get the
required alignment in the parameter save area.

for ppc32 on entry

r3 == SYS_fallocate
r4 == fd
r5 == mode
r6 == not used
r7, r8 == offset
r9 == len

This gets shifted to:

r0 == SYS_fallocate
r3 == fd
r4 == mode
r5 == not used
r6, r7 == offset
r8 == len

For syscall the vararg parms will be mirrored to the parameter save area
but will not be used. The ABI does not talk to LE for this case.


--

From: Benjamin Herrenschmidt
Date: Monday, March 15, 2010 - 1:32 pm

And my proposal is to make it instead:

#define syscall(__sysno, __args...)	__syscall(0,__sysno,__args)

ENTRY (__syscall)
	mr   r0,r4
 	mr   r3,r5
 	mr   r4,r6
 	mr   r5,r7
 	mr   r6,r8
 	mr   r7,r9
 	mr   r8,r10
 	sc
 	PSEUDO_RET



Which is not correct, as the kernel expects:

r0 == SYS_fallocate
r3 == fd
r4 == mode
r5, r6 == offset

Right, but the fact that we shift all args by -1- register means that we
break the 64-bit register pair alignment compared to the real syscall
which uses r0 instead for the syscall number. Hence my proposal to add
a dummy argument to restore that alignment.

As it is there is userspace code that does:

syscall(SYS_fallocate, fd, mode, offset, len);

Which works on x86 but is broken on ppc32 unless we do that change.

Cheers,


--

From: Jamie Lokier
Date: Monday, March 15, 2010 - 8:04 am

There are several problems with syscall(), not just this - because a
number of system calls in section 2 of the manual don't map directly
to kernel syscalls with the same function prototype.

Even fork() has become something complicated in Glibc that doesn't use
the fork syscall :-(

So anything using syscall() has to be careful on Linux already.
Changing the 64-bit alignment won't fix the other differences.

-- Jamie
--

From: Benjamin Herrenschmidt
Date: Monday, March 15, 2010 - 1:33 pm

It won't fix -all- the problems with syscall(), but it will fix a wagon
of them without breaking existing code that already does the arch
specific breakup on the call site...

Cheers,
Ben.



--

Previous thread: [PATCH -rc] memcg: disable move charge in no mmu case by Daisuke Nishimura on Sunday, March 14, 2010 - 9:35 pm. (1 message)

Next thread: linux-next: Tree for March 15 by Stephen Rothwell on Sunday, March 14, 2010 - 9:54 pm. (1 message)