Re: [PATCH v3] elf loader support for auxvec base platform string

Previous thread: madvise(2) MADV_SEQUENTIAL behavior by Eric Rannaud on Tuesday, July 15, 2008 - 4:03 pm. (10 messages)

Next thread: [PATCH 00/13] kmemcheck (review) by Vegard Nossum on Tuesday, July 15, 2008 - 5:21 pm. (1 message)
From: Nathan Lynch
Date: Tuesday, July 15, 2008 - 4:58 pm

Background:
Some IBM POWER-based systems have the ability to run in a
"compatibility" mode which mostly appears to the OS as a different
processor from the actual hardware.  This feature of the platform is
useful for live partition migration and for backwards compatibility
with old kernels on new hardware.  For example, a Power6 system may
appear to be a Power5+, which makes the AT_PLATFORM value "power5+".

Problem:
Booting a system in a compatibility mode means that ld.so may load
libraries that are inappropriately tuned for the real
microarchitecture, and apps that use JIT techniques do not have the
right information for generating tuned code.  While the AT_PLATFORM
auxiliary vector entry correctly indicates the ISA supported, it does
not accurately reflect the underlying microarchitecture in this case,
and there is no good way for userspace to get this information.

Proposed solution:
Add an AT_BASE_PLATFORM auxiliary vector entry which indicates the
microarchitecture.  This entry uses the same string format as
AT_PLATFORM, and is readily usable by ld.so and other applications.

Other solutions that have been suggested but found wanting:

- Use a bit in AT_HWCAP to indicate compat mode -- this is not
expressive enough.  It's not possible to derive the microarchitecture
from the combination of AT_PLATFORM's value and a single bit.

- Use dsocaps -- this seems to be a ld.so-specific interface and not
easily usable by other programs.  ld.so/glibc is not the only program
that can use knowledge of the microarchitecture.

The following two patches:
- add the base support to binfmt_elf.c for AT_BASE_PLATFORM
- implement AT_BASE_PLATFORM for powerpc

Changes since v1:
- increment AT_VECTOR_SIZE_BASE
- define AT_BASE_PLATFORM in generic code instead of powerpc
--

From: Nathan Lynch
Date: Tuesday, July 15, 2008 - 4:58 pm

Some IBM POWER-based platforms have the ability to run in a
mode which mostly appears to the OS as a different processor from the
actual hardware.  For example, a Power6 system may appear to be a
Power5+, which makes the AT_PLATFORM value "power5+".  This means that
programs are restricted to the ISA supported by Power5+;
Power6-specific instructions are treated as illegal.

However, some applications (virtual machines, optimized libraries) can
benefit from knowledge of the underlying CPU model.  A new aux vector
entry, AT_BASE_PLATFORM, will denote the actual hardware.  For
example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM
will be "power5+" and AT_BASE_PLATFORM will be "power6".  The idea is
that AT_PLATFORM indicates the instruction set supported, while
AT_BASE_PLATFORM indicates the underlying microarchitecture.

If the architecture has defined ELF_BASE_PLATFORM, copy that value to
the user stack in the same manner as ELF_PLATFORM.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 fs/binfmt_elf.c        |   23 +++++++++++++++++++++++
 include/linux/auxvec.h |    5 ++++-
 2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index d48ff5f..834c2c4 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -131,6 +131,10 @@ static int padzero(unsigned long elf_bss)
 #define STACK_ALLOC(sp, len) ({ sp -= len ; sp; })
 #endif
 
+#ifndef ELF_BASE_PLATFORM
+#define ELF_BASE_PLATFORM NULL
+#endif
+
 static int
 create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
 		unsigned long load_addr, unsigned long interp_load_addr)
@@ -142,7 +146,9 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
 	elf_addr_t __user *envp;
 	elf_addr_t __user *sp;
 	elf_addr_t __user *u_platform;
+	elf_addr_t __user *u_base_platform;
 	const char *k_platform = ELF_PLATFORM;
+	const char *k_base_platform = ELF_BASE_PLATFORM;
 	int items;
 	elf_addr_t *elf_info;
 	int ei_index = 0;
@@ -172,6 +178,19 @@ ...
From: Benjamin Herrenschmidt
Date: Wednesday, July 16, 2008 - 11:35 pm

Hi Linus, Andrew !

Should I seek somebody's ack before merging a patch like the one below ?

I'm a bit reluctant to merge via the powerpc.git tree some changes to
generic files without at least an ack from somebody else :-)

There have been some debate on whether this AT_BASE_PLATFORM is the
right approach, though I haven't seen them reach any useful conclusion
and our toolchain people internally are screaming for it...

Cheers,
Ben.


--

From: Andrew Morton
Date: Thursday, July 17, 2008 - 12:09 am

It tends to happen.  People often don't notice unless it a) crashes or
b) spits warnings or c) screws up my tree or d) all the above plus

Please add a comment which explains what this is.

Please also add a comment telling the world in which header file the
architecture *must* define this macro and then ensure that that header is

From my reading, this change will result in no additional code
generation on non-powerpc architectures.  This is good.  If poss, could
you please verify that theory and perhaps drop a note in the changelog
about that?


Apart from that - acked-by-me
--

From: Nathan Lynch
Date: Thursday, July 17, 2008 - 10:39 am

That was the intent, yes.  However:

$ scripts/bloat-o-meter vmlinux-x86-{before,after}
add/remove: 0/0 grow/shrink: 2/0 up/down: 18/0 (18)
function                                     old     new   delta
init_mm                                      784     800     +16
load_elf_binary                            11946   11948      +2

(x86_64_defconfig, gcc 4.2.3)

The init_mm/mm_struct bloat is expected (although I'd like to avoid
that), but evidently it has some small effect on load_elf_binary.
--

From: Nathan Lynch
Date: Thursday, July 17, 2008 - 3:19 pm

Some IBM POWER-based platforms have the ability to run in a
mode which mostly appears to the OS as a different processor from the
actual hardware.  For example, a Power6 system may appear to be a
Power5+, which makes the AT_PLATFORM value "power5+".  This means that
programs are restricted to the ISA supported by Power5+;
Power6-specific instructions are treated as illegal.

However, some applications (virtual machines, optimized libraries) can
benefit from knowledge of the underlying CPU model.  A new aux vector
entry, AT_BASE_PLATFORM, will denote the actual hardware.  For
example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM
will be "power5+" and AT_BASE_PLATFORM will be "power6".  The idea is
that AT_PLATFORM indicates the instruction set supported, while
AT_BASE_PLATFORM indicates the underlying microarchitecture.

If the architecture has defined ELF_BASE_PLATFORM, copy that value to
the user stack in the same manner as ELF_PLATFORM.

Signed-off-by: Nathan Lynch <ntl@pobox.com>

---

Added comment explaining ELF_BASE_PLATFORM.

 fs/binfmt_elf.c        |   28 ++++++++++++++++++++++++++++
 include/linux/auxvec.h |    5 ++++-
 2 files changed, 32 insertions(+), 1 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index d48ff5f..d8a7cc0 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -131,6 +131,15 @@ static int padzero(unsigned long elf_bss)
 #define STACK_ALLOC(sp, len) ({ sp -= len ; sp; })
 #endif
 
+#ifndef ELF_BASE_PLATFORM
+/*
+ * AT_BASE_PLATFORM indicates the "real" hardware/microarchitecture.
+ * If the arch defines ELF_BASE_PLATFORM (in asm/elf.h), the value
+ * will be copied to the user stack in the same manner as AT_PLATFORM.
+ */
+#define ELF_BASE_PLATFORM NULL
+#endif
+
 static int
 create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
 		unsigned long load_addr, unsigned long interp_load_addr)
@@ -142,7 +151,9 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
 	elf_addr_t __user ...
From: Andrew Morton
Date: Thursday, July 17, 2008 - 3:42 pm

On Thu, 17 Jul 2008 17:19:32 -0500

OK.

But it conflicts directly with the already-queued
execve-filename-document-and-export-via-auxiliary-vector.patch (which
various potential reviewers blithely deleted - don't come complaining
to me!):


From: John Reiser <jreiser@BitWagon.com>

The Linux kernel puts the filename argument of execve() into the
new address space.  Many developers are surprised to learn this.
Those who know and could use it, object "But it's not documented."
Those who want to use it dislike the expression
  (char *)(1+ strlen(env[-1+ n_env]) + env[-1+ n_env])
because it requires locating the last original environment variable,
and assumes that the filename follows the characters.

This patch documents the insertion of the filename, and makes it easier to
find by adding a new tag AT_EXECFN in the ElfXX_auxv_t; see <elf.h>.

In many cases readlink("/proc/self/exe",) gives the same answer.  But if all
the original pages get unmapped, then the kernel erases the symlink for
/proc/self/exe.  This can happen when a program decompressor does a good job
of cleaning up after uncompressing directly to memory, so that the address
space of the target program looks the same as if compression had never
happened.  One example is http://upx.sourceforge.net .

One notable use of the underlying concept (what path containED the executable)
is glibc expanding $ORIGIN in DT_RUNPATH.  In practice for the near term, it
may be a good idea for user-mode code to use both /proc/self/exe and AT_EXECFN
as fall-back methods for each other.  /proc/self/exe can fail due to
unmapping, AT_EXECFN can fail because it won't be present on non-new systems. 
The auxvec or {AT_EXECFN}.d_val also can get overwritten, although in nearly
all cases this would be the result of a bug.

The runtime cost is one NEW_AUX_ENT using two words of stack space.  The
underlying value is maintained already as bprm->exec; setup_arg_pages() in
fs/exec.c slides it for stack_shift, etc.

Signed-off-by: John ...
From: John Reiser
Date: Thursday, July 17, 2008 - 4:35 pm

[snip]

It seems to me that most of the patch conflicts are mechanical
and could be merged mechanically.

However I believe that the documentation change to this comment is important:
-----
I scratched my head for a while to figure out that AT_NOTELF also was
a subtraction as far as AT_VECTOR_SIZE_BASE was concerned.


-- 
John Reiser, jreiser@BitWagon.com
--

From: Nathan Lynch
Date: Friday, July 18, 2008 - 11:28 am

John, from your patch:

+#define AT_EXECFN  31	/* filename of program */

How did you arrive at 31 for the value of AT_EXECFN?  I haven't been
able to find out how AT_* values are "allocated", or what the reason
is for the gap between AT_SECURE and AT_SYSINFO.
--

From: John Reiser
Date: Friday, July 18, 2008 - 1:31 pm

The numbers are chosen by experience, taste, and fiat.
Hopefully new choices do not conflict with existing ones,
but there is no formal "issuing authority."
In history the auxiliary vector has been not well standardized
and many times has been hidden from view of applications, although some
SysV-based systems have made it visible as a fourth argument to main().
Linux has /proc/pid/auxv, although the implementation suffers
from being exposed to overwriting by the user.  From long experience
at virtualization in user mode, I favor better access, more use,
and better understanding of what the auxiliary vector provides.

AT_SYSINFO at 32 was chosen to avoid conflicts with [0,31]
partly on the theory that the first 32 might be considered
to be reserved for use across all UNIX-like systems,
while AT_SYSINFO definitely was Linux-specific.
I chose AT_EXECFN at 31 because I considered the concept
to be applicable to any system having execve(), even if AT_EXECFN
is not universally implemented.  I had not seen any new tags
below 32 in a long time.  The concept of AT_EXECFN allows
a nice interface for a virtualizer. The somewhat-related
AT_EXECFD already exists below 32.

Elsewhere, I've staked out use of a new AT_WINE_PRELOAD_INFO
at 30.  Avoid that one, please. :-)

-- 
John Reiser, jreiser@BitWagon.com
--

From: Andrew Morton
Date: Friday, July 18, 2008 - 1:52 pm

On Fri, 18 Jul 2008 13:31:29 -0700

The reliable way in which to reserve these numbers is to patch the
header file.
--

From: Benjamin Herrenschmidt
Date: Sunday, July 20, 2008 - 8:24 pm

Andrew, we have one other patch (the powerpc bits) on top of that one.
Do you want to carry both in -mm on top of John's patch ? We would like
that in .27 though, I don't know what your merge plans are for John's
patch.

Cheers,
Ben.


--

From: Andrew Morton
Date: Sunday, July 20, 2008 - 8:40 pm

How about I send John's patch Linuswards right now?
--

From: Benjamin Herrenschmidt
Date: Monday, July 21, 2008 - 2:33 am

No objection.

Cheers,
Ben.


--

From: Nathan Lynch
Date: Monday, July 21, 2008 - 11:48 am

Some IBM POWER-based platforms have the ability to run in a
mode which mostly appears to the OS as a different processor from the
actual hardware.  For example, a Power6 system may appear to be a
Power5+, which makes the AT_PLATFORM value "power5+".  This means that
programs are restricted to the ISA supported by Power5+;
Power6-specific instructions are treated as illegal.

However, some applications (virtual machines, optimized libraries) can
benefit from knowledge of the underlying CPU model.  A new aux vector
entry, AT_BASE_PLATFORM, will denote the actual hardware.  For
example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM
will be "power5+" and AT_BASE_PLATFORM will be "power6".  The idea is
that AT_PLATFORM indicates the instruction set supported, while
AT_BASE_PLATFORM indicates the underlying microarchitecture.

If the architecture has defined ELF_BASE_PLATFORM, copy that value to
the user stack in the same manner as ELF_PLATFORM.

Signed-off-by: Nathan Lynch <ntl@pobox.com>

---


Rebased to -mm to resolve conflicts with
execve-filename-document-and-export-via-auxiliary-vector.patch, and
changed AT_BASE_PLATFORM to lowest unclaimed value (24).


 fs/binfmt_elf.c        |   28 ++++++++++++++++++++++++++++
 include/linux/auxvec.h |    6 +++++-
 2 files changed, 33 insertions(+), 1 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index bad7d87..180f92b 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -131,6 +131,15 @@ static int padzero(unsigned long elf_bss)
 #define STACK_ALLOC(sp, len) ({ sp -= len ; sp; })
 #endif
 
+#ifndef ELF_BASE_PLATFORM
+/*
+ * AT_BASE_PLATFORM indicates the "real" hardware/microarchitecture.
+ * If the arch defines ELF_BASE_PLATFORM (in asm/elf.h), the value
+ * will be copied to the user stack in the same manner as AT_PLATFORM.
+ */
+#define ELF_BASE_PLATFORM NULL
+#endif
+
 static int
 create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
 		unsigned long load_addr, unsigned ...
From: Benjamin Herrenschmidt
Date: Monday, July 21, 2008 - 7:03 pm

Andrew, I'll merge that after John patch shows up if you give me your
Ack :-)

Cheers,

--

From: Linus Torvalds
Date: Thursday, July 17, 2008 - 9:10 am

Gaah. Generally we don't, but you're right to ask. The ELF loading keeps 
on accumulating these things, and I'm not sure we have the right process 
for things like this. They're all individually small and insignificant, 
and they are all individually never going away and making the ELF loader 
messier and messier.

I dunno. 

		Linus
--

From: Nathan Lynch
Date: Thursday, July 17, 2008 - 12:35 pm

FWIW, I was initially reluctant to do it this way, but the ELF aux
vector seems to be the only reasonable mechanism for getting this
information to all interested users.  And the only reason the patch
touches the generic code is because we have to copy a string to
userspace (just like AT_PLATFORM).  Otherwise it could be done in
powerpc's ARCH_DLINFO.
--

From: Benjamin Herrenschmidt
Date: Sunday, July 20, 2008 - 8:19 pm

Yeah, annoying. Oh well, this one at least is now getting Andrew's
scrutinity...

Cheers,
Ben.


--

From: Nathan Lynch
Date: Tuesday, July 15, 2008 - 4:58 pm

Stash the first platform string matched by identify_cpu() in
powerpc_base_platform, and supply that to the ELF loader for the value
of AT_BASE_PLATFORM.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 arch/powerpc/kernel/cputable.c |   11 +++++++++++
 include/asm-powerpc/cputable.h |    2 ++
 include/asm-powerpc/elf.h      |    8 ++++++++
 3 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index f7f3c21..89d8731 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -23,6 +23,9 @@
 struct cpu_spec* cur_cpu_spec = NULL;
 EXPORT_SYMBOL(cur_cpu_spec);
 
+/* The platform string corresponding to the real PVR */
+const char *powerpc_base_platform;
+
 /* NOTE:
  * Unlike ppc32, ppc64 will only call this once for the boot CPU, it's
  * the responsibility of the appropriate CPU save/restore functions to
@@ -1632,6 +1635,14 @@ struct cpu_spec * __init identify_cpu(unsigned long offset, unsigned int pvr)
 			} else
 				*t = *s;
 			*PTRRELOC(&cur_cpu_spec) = &the_cpu_spec;
+
+			/*
+			 * Set the base platform string once; assumes
+			 * we're called with real pvr first.
+			 */
+			if (powerpc_base_platform == NULL)
+				powerpc_base_platform = t->platform;
+
 #if defined(CONFIG_PPC64) || defined(CONFIG_BOOKE)
 			/* ppc64 and booke expect identify_cpu to also call
 			 * setup_cpu for that processor. I will consolidate
diff --git a/include/asm-powerpc/cputable.h b/include/asm-powerpc/cputable.h
index 2a3e907..ef8a248 100644
--- a/include/asm-powerpc/cputable.h
+++ b/include/asm-powerpc/cputable.h
@@ -127,6 +127,8 @@ extern struct cpu_spec *identify_cpu(unsigned long offset, unsigned int pvr);
 extern void do_feature_fixups(unsigned long value, void *fixup_start,
 			      void *fixup_end);
 
+extern const char *powerpc_base_platform;
+
 #endif /* __ASSEMBLY__ */
 
 /* CPU kernel features */
diff --git a/include/asm-powerpc/elf.h ...
Previous thread: madvise(2) MADV_SEQUENTIAL behavior by Eric Rannaud on Tuesday, July 15, 2008 - 4:03 pm. (10 messages)

Next thread: [PATCH 00/13] kmemcheck (review) by Vegard Nossum on Tuesday, July 15, 2008 - 5:21 pm. (1 message)