Luca Barbieri posted an interesting patch, which "implements a system that modifies the kernel code at runtime depending on CPU features and SMPness".
What this patch does (as far as I can figure out) is detect whether your machine actually has multiple CPUs (you must have SMP support compiled into the kernel) and modify the kernel accordingly.
It doesn't seem likely that it will be merged into mainline anytime soon (if at all), but it's still pretty interesting in concept.
Will be used by another patch that I'll post in the very near future.
For i386 cpucount is used rather than computing the hweight of
cpu_online_map since it's faster and already there.
Ideally the same should be done for all architectures in both these
macros and num_online_cpus and num_possible_cpus.
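The trade-off described above, reading an already maintained counter instead of recomputing the Hamming weight of the online-CPU bitmap, can be sketched as follows. This is only an illustration: `online_map`, `cached_cpu_count`, `cpu_set_online` and the two `is_smp_by_*` helpers are hypothetical stand-ins, not the kernel's real symbols or the patch's actual macros.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel state; not the real kernel symbols. */
static uint32_t online_map = 0x1;  /* one bit per online CPU, CPU 0 is up */
static int cached_cpu_count = 1;   /* kept in step with online_map */

/* Population count ("Hamming weight") of a 32-bit mask. */
static unsigned hweight32_sketch(uint32_t w)
{
    unsigned n = 0;

    while (w) {
        w &= w - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Mark a CPU online and update the cached count at the same time. */
static void cpu_set_online(unsigned cpu)
{
    assert(cpu < 32);
    if (!(online_map & (UINT32_C(1) << cpu))) {
        online_map |= UINT32_C(1) << cpu;
        cached_cpu_count++;
    }
}

/* Slow path: recompute the answer from the bitmap on every call. */
static int is_smp_by_hweight(void)
{
    return hweight32_sketch(online_map) > 1;
}

/* Fast path: a single load and compare of the cached counter. */
static int is_smp_by_count(void)
{
    return cached_cpu_count > 1;
}
```

Both helpers give the same answer as long as the counter is updated wherever the bitmap changes; the counter version simply avoids the bit-counting loop.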
This patch implements a system that modifies the kernel code at runtime
depending on CPU features and SMPness.
In fact, I'm not really sure whether it's a good idea to do something
like this. When I started implementing this it seemed to be possible to
have a much simpler and cleaner implementation but unfortunately this
isn't the case especially due to the need of SMP correctness.
This patch requires the is_smp() patch I posted earlier and also
requires the new CPU selection code and the code that actually uses
both.
This code already exists, but needs a few adjustments so it may not
arrive immediately.
The code is invoked in the following ways:
Note that the int3 and int 0xfa handler are actually invoked by the code
to be modified, so the handlers replace calls to themselves.
For example for APIC writes we have <int3> 0x40 mem32 -> <xchgl/movl>
<%eax>, mem32.
No fixups are performed until SMP boot is completed. Instructions are
instead emulated in the interrupt/exception handlers.
Unfortunately with this patch executing invalid code will cause the
processor to enter an infinite exception loop rather than panic. Fixing
this is not trivial for SMP+preempt so it's not done at the moment.
This code needs special care for SMP safety. Here is a copy of the
comment explaining this:
/* If we are running on SMP any other processor might be executing the
code that we are modifying. We must make sure that the other
processor will either fault or will execute the complete
replacement instruction.
This is accomplished by using instructions that fault only
depending on up to 4 bytes. When fixing up something we first write
bytes after the first 4 and we then use a locked write to set the
first 4.
We depend on the processor execute unit to never see our locked
write before it sees the other modifications.
According to page 7-5 of the Intel Pentium 4 System Programming
Manual, this is safe on 486 and Pentium
[link to patch]
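The write ordering described in the comment above (fill in the bytes after the first four, then publish the first four with one locked write) can be sketched in portable C11. This is a minimal sketch on an ordinary data buffer, not real kernel text: `struct patch_site` and `patch_tail_then_head` are hypothetical names, the site is assumed to be at most 8 bytes, and the sequentially consistent atomic store merely stands in for the locked write. It shows the ordering only and does not address the CPU errata raised later in the thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical patch site: a 4-byte head that is published atomically,
 * followed by up to 4 tail bytes. */
struct patch_site {
    _Atomic uint32_t head;  /* first four bytes of the instruction */
    uint8_t tail[4];        /* remaining bytes, written first */
};

/*
 * Replace the site contents with `repl` (4 to 8 bytes).
 * Step 1 writes the bytes after the first four with plain stores.
 * Step 2 publishes the first four bytes with a single atomic store, so a
 * reader that observes the new head also observes the new tail.
 */
static void patch_tail_then_head(struct patch_site *site,
                                 const uint8_t *repl, size_t len)
{
    uint32_t head;

    assert(len >= 4 && len <= 4 + sizeof site->tail);

    memcpy(site->tail, repl + 4, len - 4);      /* step 1: tail bytes */
    memcpy(&head, repl, sizeof head);
    atomic_store_explicit(&site->head, head,    /* step 2: publish head */
                          memory_order_seq_cst);
}
```

A reader that loads `head` with acquire semantics and sees the new value is then guaranteed to see the new tail as well, which is the property the comment relies on.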
From: Pavel Machek
To: linux-kernel
Subject: Re: [PATCH 1 / ...] i386 dynamic fixup/self modifying code
Date: 2002-08-28 12:11:30
Hi!
> This patch implements a system that modifies the kernel code at runtime
> depending on CPU features and SMPness.
Nice!
> This patch requires the is_smp() patch I posted earlier and also
> requires the new CPU selection code and the code that actually uses
> both.
> This code already exists, but needs a few adjustments so it may not
> arrive immediately.
>
> The code is invoked in the following ways:
> * Undefined exception handler: this is used to replace
> unsupported instructions with supported ones. Used for invlpg
> -> flushall, prefetchnta -> prefetch -> nop, *fence -> lock
> addl 0, (%esp), movntq -> movq
> * Int3 handler: this is used when a 1 byte opcode is desired.
> This is controlled by a config option so that debuggers and
> kprobe won't break. Used for lock/nop and APIC write
Why not do *everything* using int3 handler? It should simplify your code.
Hooking on 'unknown instruction' should not be really neccessary if you
replace all invlpgs (etc) with 0xcc...
Pavel
> Unfortunately with this patch executing invalid code will cause the
> processor to enter an infinite exception loop rather than panic. Fixing
> this is not trivial for SMP+preempt so it's not done at the moment.
Using 0xcc for everything should fix that, right?
From: Luca Barbieri
To: linux-kernel
Subject: Re: [PATCH 1 / ...] i386 dynamic fixup/self modifying code
Date: 30 Aug 2002 00:57:35 +0200
> > This patch implements a system that modifies the kernel code at runtime
> > depending on CPU features and SMPness.
>
> Nice!
> > This patch requires the is_smp() patch I posted earlier and also
> > requires the new CPU selection code and the code that actually uses
> > both.
> > This code already exists, but needs a few adjustments so it may not
> > arrive immediately.
> >
> > The code is invoked in the following ways:
> > * Undefined exception handler: this is used to replace
> > unsupported instructions with supported ones. Used for invlpg
> > -> flushall, prefetchnta -> prefetch -> nop, *fence -> lock
> > addl 0, (%esp), movntq -> movq
> > * Int3 handler: this is used when a 1 byte opcode is desired.
> > This is controlled by a config option so that debuggers and
> > kprobe won't break. Used for lock/nop and APIC write
>
> Why not do *everything* using int3 handler? It should simplify your code.
Because kdb, kgdb and kprobe want to use int3 so it must be used only if
the config option is enabled.
> Hooking on 'unknown instruction' should not be really neccessary if you
> replace all invlpgs (etc) with 0xcc...
It's better: first, if it is supported we have no faults.
Second, in some cases (e.g. movntq) it's not possible without adding
extra padding when using int xx (and int3 can't be used for the reason
above).
> > Unfortunately with this patch executing invalid code will cause the
> > processor to enter an infinite exception loop rather than panic. Fixing
> > this is not trivial for SMP+preempt so it's not done at the moment.
>
> Using 0xcc for everything should fix that, right?
Not possible. See above.
This could be fixed by checking whether the opcode is not one generated
or recognized by the dynamic fixup code, assuming that nothing else can
change the code.
However, this fails to detect "impossible" cases like cpus that should
understand prefetches but that don't.
From: Alan Cox
To: linux-kernel
Subject: Re: [PATCH 1 / ...] i386 dynamic fixup/self modifying code
Date: 30 Aug 2002 00:19:52 +0100
On Wed, 2002-08-28 at 13:11, Pavel Machek wrote:
> > Unfortunately with this patch executing invalid code will cause the
> > processor to enter an infinite exception loop rather than panic. Fixing
> > this is not trivial for SMP+preempt so it's not done at the moment.
>
> Using 0xcc for everything should fix that, right?
Except you can't do the fixup on SMP without risking hitting the CPU
errata. You also break debugging tools that map kernel code pages r/o
and people who ROM it.
The latter aren't a big problem (they can compile without runtime
fixups). For the other fixups though you -have- to do them before you
run the code. That isnt hard (eg sparc btfixup). You generate a list of
the addresses in a segment, patch them all and let the init freeup blow
the table away
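The boot-time approach Alan describes (build a table of patch addresses, apply them all before the code runs, then discard the table) might look roughly like the sketch below. The `struct fixup_entry` layout and `apply_fixups` are assumptions made for illustration; they are not taken from sparc btfixup or from Luca's patch, and the "text" here is an ordinary byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FIXUP_MAX 5  /* longest replacement in this sketch, in bytes */

/* Hypothetical table entry: where to patch and what to write there. */
struct fixup_entry {
    size_t  offset;            /* offset of the site inside the text image */
    uint8_t len;               /* number of bytes to replace */
    uint8_t bytes[FIXUP_MAX];  /* replacement bytes */
};

/*
 * Apply every entry in `table` to `text` and return how many were applied.
 * Meant to run once, early, before any other CPU can execute the code;
 * afterwards the table itself is no longer needed and can be freed.
 */
static size_t apply_fixups(uint8_t *text, size_t text_len,
                           const struct fixup_entry *table, size_t n)
{
    size_t applied = 0;

    for (size_t i = 0; i < n; i++) {
        const struct fixup_entry *f = &table[i];

        assert(f->len <= FIXUP_MAX);
        if (f->offset > text_len || f->len > text_len - f->offset)
            continue;  /* skip entries that fall outside the image */
        memcpy(text + f->offset, f->bytes, f->len);
        applied++;
    }
    return applied;
}
```

Because everything is patched before the code is ever executed, no fault handlers or cross-CPU synchronisation are involved, which is the "saner and more debuggable" property argued for in the thread.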
From: Luca Barbieri
To: linux-kernel
Subject: Re: [PATCH 1 / ...] i386 dynamic fixup/self modifying code
Date: 30 Aug 2002 01:29:32 +0200
On Fri, 2002-08-30 at 01:19, Alan Cox wrote:
> On Wed, 2002-08-28 at 13:11, Pavel Machek wrote:
> > > Unfortunately with this patch executing invalid code will cause the
> > > processor to enter an infinite exception loop rather than panic. Fixing
> > > this is not trivial for SMP+preempt so it's not done at the moment.
> >
> > Using 0xcc for everything should fix that, right?
>
> Except you can't do the fixup on SMP without risking hitting the CPU
> errata.
Worked around by making sure all other processors are stopped (iret is
serializing) sending IPIs if they are not already spinning on the fixup
lock. See patch #2.
> You also break debugging tools that map kernel code pages r/o
> and people who ROM it.
>
> The latter aren't a big problem (they can compile without runtime
> fixups).
OK, I'll add a config option for this.
> For the other fixups though you -have- to do them before you
> run the code. That isnt hard (eg sparc btfixup). You generate a list of
> the addresses in a segment, patch them all and let the init freeup blow
> the table away
Is doing them at runtime with the aforementioned workaround fine?
From: Alan Cox
To: linux-kernel
Subject: Re: [PATCH 1 / ...] i386 dynamic fixup/self modifying code
Date: 30 Aug 2002 00:32:35 +0100
On Fri, 2002-08-30 at 00:29, Luca Barbieri wrote:
> Worked around by making sure all other processors are stopped (iret is
> serializing) sending IPIs if they are not already spinning on the fixup
> lock. See patch #2.
what happens we you do a fixup and the fixup occurs in an IPI handler
(eg a cross CPU tlb flush).
> > For the other fixups though you -have- to do them before you
> > run the code. That isnt hard (eg sparc btfixup). You generate a list of
> > the addresses in a segment, patch them all and let the init freeup blow
> > the table away
> Is doing them at runtime with the aforementioned workaround fine?
Is doing them all in the beginning not somewhat saner and more
debuggable. The only reason to do it at runtime is hotplugging a less
capable CPU. I have a suggestion for that case which is that we don't
bother about it 8)
From: Luca Barbieri
To: linux-kernel
Subject: Re: [PATCH 1 / ...] i386 dynamic fixup/self modifying code
Date: 30 Aug 2002 02:10:56 +0200
On Fri, 2002-08-30 at 01:32, Alan Cox wrote:
> On Fri, 2002-08-30 at 00:29, Luca Barbieri wrote:
> > Worked around by making sure all other processors are stopped (iret is
> > serializing) sending IPIs if they are not already spinning on the fixup
> > lock. See patch #2.
>
> what happens we you do a fixup and the fixup occurs in an IPI handler
> (eg a cross CPU tlb flush).
Why should something bad happen in this case? (unless it happens in the
IPI handler for the SMP lock vector, but I've duplicated the spinlock
and apic-ack code to avoid using the fixups).
I've just noticed another problem instead: we might have a CPU waiting
with interrupts disabled for the CPU executing the fixup.
This could be fixed by waiting for a limited amount of iterations and
then emulating the instruction, but it makes the code even uglier.
We might get deadlocks on NMIs but that would also happen if we e.g. get
a memory parity NMI inside printk (deadlock on logbuf_lock). Should both
bugs be fixed?
> > > For the other fixups though you -have- to do them before you
> > > run the code. That isnt hard (eg sparc btfixup). You generate a list of
> > > the addresses in a segment, patch them all and let the init freeup blow
> > > the table away
> > Is doing them at runtime with the aforementioned workaround fine?
>
> Is doing them all in the beginning not somewhat saner and more
> debuggable.
That wouldn't work for compiler-generated prefetches (unless you
preprocess the compiler output) and would enlarge the kernel.
However, it would be significantly cleaner.
> The only reason to do it at runtime is hotplugging a less
> capable CPU. I have a suggestion for that case which is that we don't
> bother about it 8)
Even if we fixup at runtime this won't work since we the fixed up
instructions won't fault.
To handle this, we would need to keep the table around, stop CPUs and
fixup.
Anyway this scenario is quite unlikely.
From: Alan Cox
To: linux-kernel
Subject: Re: [PATCH 1 / ...] i386 dynamic fixup/self modifying code
Date: 30 Aug 2002 12:17:59 +0100
On Fri, 2002-08-30 at 01:10, Luca Barbieri wrote:
> That wouldn't work for compiler-generated prefetches (unless you
> preprocess the compiler output) and would enlarge the kernel.
> However, it would be significantly cleaner.
My general experience with compiler generated prefetches right now is
pretty poor for kernel type code. Its hard to do it right in the
compiler for complex stuff rather than 'fortran in C' type jobs
We certainly could perl the asm to drop in the right directives if it
became an issue, but there are children on the list so lets worry about
it if it becomes a problem
Luca Barbieri writes:
> This patch implements a system that modifies the kernel code at runtime
> depending on CPU features and SMPness.
>...
> /* If we are running on SMP any other processor might be executing the
> code that we are modifying. We must make sure that the other
> processor will either fault or will execute the complete
> replacement instruction.
>
> This is accomplished by using instructions that fault only
> depending on up to 4 bytes. When fixing up something we first write
> bytes after the first 4 and we then use a locked write to set the
> first 4.
>
> We depend on the processor execute unit to never see our locked
> write before it sees the other modifications.
>
> According to page 7-5 of the Intel Pentium 4 System Programming
> Manual, this is safe on 486 and Pentium
I've tried this sort of thing before (unsynchronised cross-modifying code),
but I had to abandon it due to Pentium III Erratum E49 and similar errata
for all Intel P6 CPUs. Have you verified that you're not hitting this erratum?
/Mikael
From: Luca Barbieri
To: linux-kernel
Subject: Re: [PATCH 1 / ...] i386 dynamic fixup/self modifying code
Date: 2002-08-28 16:16:30
> I've tried this sort of thing before (unsynchronised cross-modifying code),
> but I had to abandon it due to Pentium III Erratum E49 and similar errata
> for all Intel P6 CPUs. Have you verified that you're not hitting this erratum?
It is indeed completely hitting it.
However, we can work around this by simply stopping all other CPUs in
interrupt context with an IPI (while this may sound horrible, it
shouldn't significantly impact performance unless the response time is
excessively long).
I'll write some code to this. However I don't have the hardware to test
it, so it might require multiple iterations to get it right.
As for the "all Intel P6 CPUs" are really _all_ Intel P6 CPU broken?
Do you know of any other CPU that would need the workaround?
From: Mikael Pettersson
To: linux-kernel
Subject: Re: [PATCH 1 / ...] i386 dynamic fixup/self modifying code
Date: 2002-08-28 19:48:46
Luca Barbieri writes:
> > I've tried this sort of thing before (unsynchronised cross-modifying code),
> > but I had to abandon it due to Pentium III Erratum E49 and similar errata
> > for all Intel P6 CPUs. Have you verified that you're not hitting this erratum?
> It is indeed completely hitting it.
> However, we can work around this by simply stopping all other CPUs in
> interrupt context with an IPI (while this may sound horrible, it
> shouldn't significantly impact performance unless the response time is
> excessively long).
That was my thought too. IPI to bring the others to a barrier, do the
modification, release the barrier.
In my case (patching CALL instructions to call the correct targets
after HW detection) I was fortunately able to fix up the code before
it was seen by other CPUs, but this relied on the fact that I knew
the locations of all CALL sites needing fix up.
> I'll write some code to this. However I don't have the hardware to test
> it, so it might require multiple iterations to get it right.
>
> As for the "all Intel P6 CPUs" are really _all_ Intel P6 CPU broken?
Yes, last time I checked the erratum existed for all members of
Intel's P6 family.
> Do you know of any other CPU that would need the workaround?
No. The P5 is ok, and I believe the P4 is also. The K7s didn't have
this listed as an erratum last time I checked.
/Mikael
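The workaround both posters converge on (IPI the other CPUs to a barrier, do the modification, release the barrier) can be walked through with a small single-threaded simulation. This is only a model of the sequence of states, not real SMP code: `struct sim`, `cpu_step` and `patch_with_barrier` are hypothetical, the "CPUs" are array slots stepped in a loop, and setting `ipi_pending` stands in for sending an actual IPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NCPUS 4

enum cpu_state { CPU_RUNNING, CPU_PARKED };

/* Hypothetical simulated machine: per-CPU state plus a tiny text image. */
struct sim {
    enum cpu_state state[NCPUS];
    bool ipi_pending[NCPUS];
    bool barrier_closed;
    uint8_t text[8];
};

static void sim_init(struct sim *s)
{
    memset(s, 0, sizeof *s);               /* all CPUs running, no IPIs */
    memset(s->text, 0xcc, sizeof s->text);
}

/* One step of a CPU: take a pending IPI and park, or leave an open barrier. */
static void cpu_step(struct sim *s, int cpu)
{
    if (s->ipi_pending[cpu]) {
        s->ipi_pending[cpu] = false;
        s->state[cpu] = CPU_PARKED;
    } else if (s->state[cpu] == CPU_PARKED && !s->barrier_closed) {
        s->state[cpu] = CPU_RUNNING;
    }
}

static bool others_parked(const struct sim *s, int self)
{
    for (int c = 0; c < NCPUS; c++)
        if (c != self && s->state[c] != CPU_PARKED)
            return false;
    return true;
}

/* Patch from CPU `self`: stop the others, modify, then release them. */
static void patch_with_barrier(struct sim *s, int self, size_t off,
                               const uint8_t *repl, size_t len)
{
    assert(off <= sizeof s->text && len <= sizeof s->text - off);

    s->barrier_closed = true;
    for (int c = 0; c < NCPUS; c++)        /* "send IPIs" */
        if (c != self)
            s->ipi_pending[c] = true;

    while (!others_parked(s, self))        /* wait until everyone is parked */
        for (int c = 0; c < NCPUS; c++)
            if (c != self)
                cpu_step(s, c);

    memcpy(s->text + off, repl, len);      /* no other CPU is executing now */

    s->barrier_closed = false;             /* release the barrier */
    for (int c = 0; c < NCPUS; c++)
        if (c != self)
            cpu_step(s, c);
}
```

The key invariant is that the `memcpy` only happens once every other simulated CPU is parked; the complications raised elsewhere in the thread (CPUs waiting with interrupts disabled, NMIs) are exactly the cases this simple model leaves out.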
Why?
Can someone tell me exactly what problem this solves? Why in the world do we need a self-modifying kernel?
BTW, the second [link to patch] link is broken.
Link fixed
Oops. Sorry for that broken link. It's fixed now.
re: Why?
Er, from what I can tell, this is trying to make SMPness dynamic.. but then, I'm probably wrong. Code is not literally /modified/ on the run.. I'm not sure why that term was used. :-)
[correct me if I'm wrong, higher being.]
Looking at the code.. It appears it does modify the kernel code.. here, it is writing opcodes.. apparently 0x0faef889f6 is "sfence; movl %esi, %esi" .. I'm not sure what you mean by "modified on the run" .. I would imagine one would only need to modify it once after the kernel starts.
if (cpu_needs_sfence) {
	if (cpu_has_mmxext)
		set5(pass_atomic instr, 0x0f, 0xae, 0xf8, 0x89, 0xf6); /* sfence; movl %esi, %esi */
	else
Optimization
Not being a kernel developer, this may be blatantly wrong. From reading the thread posted above, it seems these patches allow on-the-fly or dynamic kernel optimization, meaning the kernel attunes itself to the CPU(s) in the system. Apparently this is more advanced than what you get when compiling for a specific arch or machine type with gcc. Machine instructions for specific features supported by the SMP system are substituted for generic calls. Confusing; maybe someone with more insight could clarify this for all of us learning?
Why not at compile-time?
Those kinds of decisions should be available at compile time. We know if we are compiling for SMP or not, and we know what family of CPU it is. If there is a deficiency in gcc's output, code it up in assembly. There must be something more that we're missing..
Because run-time is better...
Aunt Tilly doesn't want to recompile her kernel. She just wants to boot and have it work. (Only if you ask Aunt Tilly how often she boots her computer she'll say "Never -- it's on the table, so kicking it is much harder than hitting it.") She certainly doesn't know details about the CPU in her computer.
I've got a diverse network of various Intel and AMD machines, some single-CPU, some dual-CPU. I want to run make-kpkg _once_, ship the resulting kernel.deb off to each machine in the cluster, and reboot. Having a custom kernel on each box is a maintenance nightmare.
The distros (rightly) refuse to ship with fifty-bazillion kernels. A patch like this would let them ship one kernel that still worked efficiently on fifty-bazillion sub-architectures.
maybe
it's for weeding out 'optimization' commands (like prefetch) on a processor that doesn't support them really well. AFAIK sfence; mov %esi,%esi does 'noop'.
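The substitution discussed in these comments, picking one of two equally long byte sequences depending on what the CPU supports, can be sketched like this. The sfence bytes are the ones quoted from the patch above and the fallback is the "lock addl 0, (%esp)" mentioned in the thread; `struct cpu_features`, `has_sfence` and `fill_store_fence_slot` are hypothetical names, and the slot is a plain byte buffer, not live kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_LEN 5  /* both alternatives occupy exactly five bytes */

/* Hypothetical feature flags, standing in for the kernel's CPU detection. */
struct cpu_features {
    bool has_sfence;
};

/* sfence; movl %esi, %esi  (the 2-byte movl pads the slot to five bytes) */
static const uint8_t SFENCE_PADDED[SLOT_LEN] = {0x0f, 0xae, 0xf8, 0x89, 0xf6};

/* lock addl $0, (%esp)  (fallback for CPUs without sfence) */
static const uint8_t LOCK_ADDL[SLOT_LEN] = {0xf0, 0x83, 0x04, 0x24, 0x00};

/* Fill a five-byte slot with the variant this CPU can execute. */
static void fill_store_fence_slot(uint8_t *slot, const struct cpu_features *cpu)
{
    assert(slot != NULL && cpu != NULL);
    memcpy(slot, cpu->has_sfence ? SFENCE_PADDED : LOCK_ADDL, SLOT_LEN);
}
```

Keeping both variants the same length is what makes in-place replacement possible: the surrounding code does not move, so no jump targets or offsets need to be recomputed.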