Hoy there ! This may have been discussed earlier (I have some vague memories...) but I just hit a problem with that again (Mark: hint, it's in hdparm's fallocate) so I'd like a bit of a refresh here on what is the "right thing" to do... So some syscalls want a 64-bit argument. Let's take fallocate() as our example. So we already know that we have to be extra careful since some 32-bit arch will pass this into 2 registers (or stack slots) which need to be aligned, and so we tend to already take care of making sure that the said 64-bit argument is either defined as 2x32-bit arguments, or defined as 1x64 bit argument aligned to 2x32-bit in the argument list. So far so good... The problem is when user space tries to use the same trick for calling those functions using glibc-provided syscall() function. In this example, hdparm does: err = syscall(SYS_fallocate, fd, mode, offset, len); With "offset" being a 64-bit argument. This will break because the first argument to syscall now shifts everything by one register, which breaks the register pair alignment (and I suppose archs with stack based calling convention can have similar alignment issues even if x86 doesn't). Ulrich, Steven, shouldn't we have glibc's syscall() take a long long as it's first argument to correct that ? Either that or making it some kind of macro wrapper around a __syscall(int dummy, int sysno, ...) ? As it is, any 32-bit app using syscall() on any of the syscalls that takes 64-bit arguments will be broken, unless the app itself breaks up the argument, but the the order of the hi and lo part is different between BE and LE architectures ;-) So is there a more "correct" solution than another here ? Should powerpc glibc be fixed at least so that syscall() keeps the alignment ? Cheers, Ben. --
From: Benjamin Herrenschmidt <benh@kernel.crashing.org> I think it is even different on the same endian architectures, f.e. mips I think. There is no way to do this without some arch specific code to handle things properly, really. --
Right, but to what extent ? IE. do we always need the callers using syscall() directly to know it all, or can we to some extent handle some of it inside glibc ? For example, if powerpc glibc is fixed so that syscall() takes a 64-bit first argument (or calls via some macro to add a dummy 32-bit argument), the register alignment will be preserved, and things will work just fine. IE. It may not fix all problems with all archs, but in this case, it will fix the common cases for powerpc at least :-) And any other arch that has the exact same alignment problem. Or is there any good reason -not- to do that in glibc ? Cheers, Ben. --
From: Benjamin Herrenschmidt <benh@kernel.crashing.org> The whole point of syscall() is to handle cases where the C library doesn't know about the system call yet. I think it's therefore very much "buyer beware". On sparc it'll never work to use the workaround you're proposing since we pass everything in via registers. So arch knowledge will always need to be present in these situations. --
I'm not sure I follow. We also pass via register on powerpc, but the offset introduced by the sysno argument breaks register pair alignment which cannot be fixed up inside syscall(). However, if I change glibc's syscall to be something like #define syscall(sysno, args...) __syscall(0 /* dummy */, sysno, args) And make __syscall then do something like: mr r0, r4 mr r3, r5 mr r4, r6 mr r5, r7 mr r6, r8 .../... sc blr Then at least all that class of syscalls will be fixed. Of course this has to be in glibc arch code. I was merely asking if that was something our glibc folks would consider and whether somebody could think of a better solution :-) Cheers ,Ben. --
MIPS passes arguments in the endian order that is low/high for little Syscall is most often used for new syscalls that have no syscall stub in glibc yet, so the user of syscall() encodes this ABI knowledge. If at a later stage syscall() is changed to have this sort of knowledge we break the API. This is something only the kernel can get right. Ralf --
One option would be to do a libkernel.so, with auto-generated stubs out of the kernel build tree. As already discussed in #kernel this morning, there are a number of sticky points with types and namespaces for this this, but those aren't any worse than the equivalent problems for syscall(3). -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 No need. Put it in the vdso. And name it something other than syscall. The syscall() API is fixed, you cannot change it. All this only if it makes sense for ALL archs. If it cannot work for just one arch then it's not worth it at all. - -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ iEYEARECAAYFAkueWbcACgkQ2ijCOnn/RHRtBQCeP88S/0xei7CAt65AGboqsrC8 N7wAoK7Qbi+OZuQrgHTCgTA27TgY+gQU =4tJ6 -----END PGP SIGNATURE----- --
From: Ulrich Drepper <drepper@redhat.com> There are many archs that still lack VDSO. --
Putting it into the vdso is also rather annoyingly heavyweight for what is nothing other than an ordinary shared library. Just making it an ordinary shared library seems a lot saner. I don't see why syscall() can't change the type for its first argument -- it seems to be exactly what symbol versioning is for. Doesn't change the fact that it is fundamentally broken, of course. -hpa --
No need to change the type of the first arg and go for symbol versionning if you do something like I proposed earlier, there will be no conflict between syscall() and __syscall() and both variants can exist. Cheers, Ben. --
Basically symbol versioning done "by hand", actually using symbol versioning is better, IMNSHO. -hpa --
One concern is the new syscall and the kernel have to match and mixing will not work. your proposal seems to impact all syscalls not just the one called via syscall API. These syscalls get generated inline which makes static linking very dangerous ... So I think you do need both symbol versioning and kernel feature stubs --
What do you mean ? My proposal is purely a change to the syscall() function, nothing else. No kernel change, no ABI change, no change to the way glibc normally calls syscalls internally, etc... just the exported syscall() function to shift its arguments in order to avoid losing register pair alignment. And the change would still be compatible with existing userland code who manually splits the 64-bit arguments to avoid the problem on power. IE. Unless I've missed something, this would be a 100% backward compatible change that simply make a whole class of syscall() use work that didn't before on power (but did on x86), such as the one I hit in hdparm for example. Cheers, Ben. --
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 How can this be? People are today actively working around the problem of 64-bit arguments. You have to break something since you cannot recognize these situations. And since it became meanwhile clear that there is no way to "fix" all archs magically I really don't want to introduce anything. There are mechanisms in place to abstract out some of the issues. And for the rest, well, if you're using syscalls directly you already have to encoded lowlevel knowledge. One more bit doesn't hurt. It's not as if this happens every day. - -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ iEYEARECAAYFAkugbhsACgkQ2ijCOnn/RHQzlACeMp0UK2jZuZOgXhJjB8Z9p4kh rCoAn0zaJqFYV9tQ0Ct49Mprfa0O5iKh =71la -----END PGP SIGNATURE----- --
Ok, so I -may- be missing something, but I believe this won't break anything: - You keep the existing syscall() exported by glibc for binary compatibility - You add a new __syscall() (or whatever you want to name it) that adds a dummy argument at the beginning, and whose implementation shifts everything by 2 instead of 1 argument before calling into the kernel - You define in unistd.h or whatever is relevant, a macro that does: #define syscall(__sysno, __args..) __syscall(0, _sysno, __args) I believe that should cover it, at least for powerpc, possibly for other archs too though as I said, I may have missed something there. IE. Whether your app writes: syscall(SYS_foo, my_64bit_arg); Or syscall(SYS_foo, (u32)(my_64bit_arg >> 32), (u32)(my_64bit_arg)); Both should still work with the new approach and end up doing the right thing. Hence, apps that use the first form today because it works on x86 would end up working at least on powerpc where they would have been otherwise It doesn't happen everyday. However, if my proposal ends up fixing a bunch of cases where it does without breaking anything, then I suppose it's worth considering, though as I said, it's possible that I miss some subtlety here in which case I'd be glad to stand corrected :-) Cheers, --
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 How can this possibly be the case? This will screw people who currently work around the ppc limitations of the existing syscall. Just leave it alone. - -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ iEYEARECAAYFAkugnV0ACgkQ2ijCOnn/RHRL4gCeIY0SLDCgLqtVvuMw+pvCzkwE 3MIAoJQRK5Mc+WtC/Wz9tPFPy4X+EALe =lexw -----END PGP SIGNATURE----- --
No it won't. As I said, it will work for both cases. The problem is a register pair alignment problem. If the alignment is corrected with the trick I proposed, 64-bit values will end up in the right pair, but manually worked-around cases where the value is already broken up will -also- end up in the right pair. The problem with syscall() as it is is that it skews the arguments by 1 register, which causes the compiler to skip a register when generating the call for a 64-bit value. By doing the trick I propose, that skew will be gone, both 32 and 64 bit arguments will end up where expected. Cheers, Ben. --
I think what Ulrich is getting at is your change will break existing code which already does: #ifdef __powerpc__ syscall(SYS_foo, 0, my_64bit_arg); #else syscall(SYS_foo, my_64bit_arg); #endif I don't know of any such code, but it might be out there. -- Jamie --
No, the above "workaround" doesn't work. With the existing syscall() definition, there is no difference between your two examples. In the first case, you force a proper 64-bit aligment, but you are already off by one register pair from the kernel expectation. In the second case, gcc will imply one, which means that both your examples above will result in my_64bit_arg in the -same- place, which is off by a register pair from what the kernel expect. IE. In the first case gcc will put SYS_foo in r3, 0 in r4, and my_64bit_arg in r5 and r6. In the second case, gcc will put SYS_foo in r3, won't care about r4, and will put the 64-bit arg in r5 and r6. Then, glibc syscall() will shift r3 to r0, r3 to r4 etc... causing my_64bit_arg to land in r4 and r5. But the kernel expects it in r3 and r4. The workaround that apps should use today is: #if defined(__powerpc__) && WORDSIZE == 32 syscall(SYS_foo, (u32)(my_64bit_arg >> 32), (u32)my_64bit_arg); #else syscall(SYS_foo, my_64bit_arg); #endif And with my proposed change, both of the above will work. IE. gcc will put the argument always in r5,r6 and the syscall() implementation will always shift r5 to r3 and t6 to r4. Cheers, Ben. --
Again, this is *exactly* symbol versioning done by hand... we have proper symbol versioning, let's use it. -hpa --
Yeah, whatever, I don't mind what technique you use for the versionning, ultimately, if the approach works, we can look at those details :-) We -do- need the macro to strip the dummy argument though, unless we use a slightly different technique which is to make the __sysno argument itself 64-bit, which works as well I believe. Cheers, Ben. --
It seems cleaner to do it that way (with a 64-bit sysno arg.) -hpa --
Right. Now if we can get Ulrich to actually put 2 and 2 together and admit that it actually works without breaking anything existing (at least for my arch but I wouldn't be surprised if that was the case for others), I would be even happier :-) Steve, any chance you can cook up a glibc patch to test with ? Maybe making it powerpc specific for now ? Cheers, Ben. --
Do what declare __sysno as long long? The current prototype is in unistd.h: #ifdef __USE_MISC /* Invoke `system call' number SYSNO, passing it the remaining arguments. This is completely system-dependent, and not often useful. In Unix, `syscall' sets `errno' for all errors and most calls return -1 for errors; in many systems you cannot pass arguments or get return values for all system calls (`pipe', `fork', and `getppid' typically among them). In Mach, all system calls take normal arguments and always return an error code (zero for success). */ extern long int syscall (long int __sysno, ...) __THROW; #endif /* Use misc. */ Changing this would be an ABI change and would have to be versioned. It would effect any one using syscall not just SYS_fallocate. the question is do programmers in practice include unistd.h when they use syscall. If the changed prototype is not in scope then the 1st parm (__sysno) defaults to int and is passed in on r3 which gets moved to r0. If the changed syscall prototype is in scope then then _sysno would be passed in r3/r4 (r3 would be 0 would be passed to r0 and the actual system number would be in r4 and passed to the kernel in r3) which behavior do you want? which (incorrect behavior compiled into existing codes do you want to support? Do you want syscall.S for PPC32 to change to match the changed prototype? It will have to be be versioned and the new prototype will only be available in future releases of GLIBC. Existing applications will bind to the old ABI and get the old behavior. --
Sorry in and long are compatible in 32-bit but not long long. int and long are not compatible in 64-bit It is hard the keep all the nodes and arguments straight. But the concern about changing the prototype and are people actually --
Well, using the macro trick instead would fix that problem, code wouldn't build if it doesn't include unistd.h :-) Cheers, Ben. --
Or it will build, but call the old ABI version - no change to those programs. -- Jamie --
int is incompatible with long, so you already get undefined behaviour anyway. Andreas. -- Andreas Schwab, schwab@redhat.com GPG Key fingerprint = D4E8 DBE3 3813 BB5D FA84 5EC7 45C6 250E 6F00 984E "And now for something completely different." --
Well, no. The change I propose would not break the ABI on powerpc and would auto-magically fix thoses cases :-) But again, you don't have to do the same thing on MIPS or sparc, it's definitely arch specific. IE. What you are saying is that a syscall defined in the kernel as: sys_foo(u64 arg); To be called from userspace would require something like: u64 arg = 0x123456789abcdef01; #if defined(__powerpc__) && WORDSIZE == 32 syscall(SYS_foo, (u32)(arg >> 32), (u32)arg); #ese syscall(SYS_foo, arg); While with the trick of making syscall a macro wrapping an underlying __syscall that has an added dummy argument, the register alignment is "corrected" and thus -both- forms above suddenly work for me. That might actually work for you too. Cheers, Ben. --
The powerpc implementation of syscall is: ENTRY (syscall) mr r0,r3 mr r3,r4 mr r4,r5 mr r5,r6 mr r6,r7 mr r7,r8 mr r8,r9 sc PSEUDO_RET PSEUDO_END (syscall) The ABI says: "Long long arguments are considered to have 8-byte size and alignment. The same 8-byte arguments that must go in aligned pairs or registers are 8-byte aligned on the stack." This implies that the SYS_fallocate call will skip a register to get the required alignment in the parameter save area. for ppc32 on entry r3 == SYS_fallocate r4 == fd r5 == mode r6 == not used r7, r8 == offset r9 == len This gets shifted to: r0 == SYS_fallocate r3 == fd r4 == mode r5 == not used r6, r7 == offset r8 == len For syscall the vararg parms will be mirrored to the parameter save area but will not be used. The ABI does not talk to LE for this case. --
And my proposal is to make it instead: #define syscall(__sysno, __args...) __syscall(0,__sysno,__args) ENTRY (__syscall) mr r0,r4 mr r3,r5 mr r4,r6 mr r5,r7 mr r6,r8 mr r7,r9 mr r8,r10 sc PSEUDO_RET Which is not correct, as the kernel expects: r0 == SYS_fallocate r3 == fd r4 == mode r5, r6 == offset Right, but the fact that we shift all args by -1- register means that we break the 64-bit register pair alignment compared to the real syscall which uses r0 instead for the syscall number. Hence my proposal to add a dummy argument to restore that alignment. As it is there is userspace code that does: syscall(SYS_fallocate, fd, mode, offset, len); Which works on x86 but is broken on ppc32 unless we do that change. Cheers, --
There are several problems with syscall(), not just this - because a number of system calls in section 2 of the manual don't map directly to kernel syscalls with the same function prototype. Even fork() has become something complicated in Glibc that doesn't use the fork syscall :-( So anything using syscall() has to be careful on Linux already. Changing the 64-bit alignment won't fix the other differences. -- Jamie --
It won't fix -all- the problems with syscall(), but it will fix a wagon of them without breaking existing code that already does the arch specific breakup on the call site... Cheers, Ben. --
