Hello Herbert,
I think I finally found the problem.
Here a short description again: all our routers with a via C3 using padlock for AES-encryption are
crashing with 2.6.26 while they work fine with 2.6.25. Not using padlock
(i.e. using the i386 assembler version of AES) they just work fine.After a kernel bisection I found that
fa5c4639419668cbb18ca3d20c1253559a3b43ae works fine,
2adee9b30d1382fba97825b9c50e4f50a0117c36 crashes.I did not try to bisect the last 4 commits as they seem to be related.
Instead I took a 2.6.26 and reverted the following commits:
e8a496ac8cd00cabbdaa373db4818a9ad19a1c5a
fd3c3ed5d1e3ceb37635cbe6d220ab94aae0781d
2adee9b30d1382fba97825b9c50e4f50a0117c36
1679f2710ac58df580d3716fab1f42ae50a226eb
aa283f49276e7d840a40fb01eee6de97eaa7e012
61c4628b538608c1a85211ed8438136adfeb9a95and changed
arch/x86/kernel/process.c
by hand.With such a modified 2.6.26 the problem disappears.
Here is the difference between 2.6.26 and the modified version:
=============================================
diff -ru kernels/linux-2.6.26/arch/x86/kernel/i387.c tests/linux-2.6ww/arch/x86/kernel/i387.c
--- kernels/linux-2.6.26/arch/x86/kernel/i387.c 2008-07-15 11:29:31.000000000 +0200
+++ tests/linux-2.6ww/arch/x86/kernel/i387.c 2008-08-06 18:40:54.000000000 +0200
@@ -35,18 +35,17 @@
#endifstatic unsigned int mxcsr_feature_mask __read_mostly = 0xffffffffu;
-unsigned int xstate_size;
-static struct i387_fxsave_struct fx_scratch __cpuinitdata;-void __cpuinit mxcsr_feature_mask_init(void)
+void mxcsr_feature_mask_init(void)
{
unsigned long mask = 0;clts();
if (cpu_has_fxsr) {
- memset(&fx_scratch, 0, sizeof(struct i387_fxsave_struct));
- asm volatile("fxsave %0" : : "m" (fx_scratch));
- mask = fx_scratch.mxcsr_mask;
+ memset(¤t->thread.i387.fxsave, 0,
+ sizeof(struct i387_fxsave_struct));
+ asm volatile("fxsave %0" : : "m" (current->thread.i387.fxsave));
+ mask = current->thread.i387.fxsave.mxcsr_m...
Both the padlock version or asm version don't use FP/math registers, right?
It is interesting that you don't see the problem with asm version
but see the problem with padlock version.Does disabling CONFIG_PREEMPT in 2.6.26 change anything? And also,
can you provide the complete kernel log till the point of failure(oops
that you sent doesn't have the call trace info)thanks,
suresh
--
Didn't check that yet as I'm still running my modified 2.6.26. It now runs
almost one day flawlessly.But yesterday I tried the following patch on top of a vanilla 2.6.26:
=======================================================
diff -ur ../linux-2.6.26/drivers/crypto/padlock-aes.c ./drivers/crypto/padlock-aes.c
--- ../linux-2.6.26/drivers/crypto/padlock-aes.c 2008-07-15 11:29:32.000000000 +0200
+++ ./drivers/crypto/padlock-aes.c 2008-08-07 17:46:55.000000000 +0200
@@ -16,6 +16,7 @@
#include <linux/interrupt.h>
#include <linux/kernel.h>
#include <asm/byteorder.h>
+#include <asm/i387.h>
#include "padlock.h"/* Control word. */
@@ -144,9 +145,11 @@
static inline void padlock_xcrypt(const u8 *input, u8 *output, void *key,
void *control_word)
{
+ kernel_fpu_begin();
asm volatile (".byte 0xf3,0x0f,0xa7,0xc8" /* rep xcryptecb */
: "+S"(input), "+D"(output)
: "d"(control_word), "b"(key), "c"(1));
+ kernel_fpu_end();
}static void aes_crypt_copy(const u8 *in, u8 *out, u32 *key, struct cword *cword)
@@ -179,6 +182,7 @@
return;
}+ kernel_fpu_begin();
asm volatile ("test $1, %%cl;"
"je 1f;"
"lea -1(%%ecx), %%eax;"
@@ -190,15 +194,18 @@
: "+S"(input), "+D"(output)
: "d"(control_word), "b"(key), "c"(count)
: "ax");
+ kernel_fpu_end();
}static inline u8 *padlock_xcrypt_cbc(const u8 *input, u8 *output, void *key,
u8 *iv, void *control_word, u32 count)
{
/* rep xcryptcbc */
+ kernel_fpu_begin();
asm volatile (".byte 0xf3,0x0f,0xa7,0xd0"
: "+S" (input), "+D" (output), "+a" (iv)
: "d" (control_word), "b" (key), "c" (count));
+ kernel_fpu_end();
return iv;
}=============================================================
I found that kernel_fpu_begin(); kernel_fpu_begin(); is used with
MMX and/or SSE:include/asm/xor_32.h
drivers/md/raid6mmx.c
drivers/md/raid6sse1.c
drivers/md/raid6sse2.cWith this cha...
Forget that - I booted the wrong kernel.
I now really test this modification and it seems to be stable. The router now
runs for 45 minutes without oops.Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Leiter EDV
Leopoldstraße 15
80802 München
Tel: +49 89 38196 276
Fax: +49 89 38196 150
Email: wolfgang.walter@stwm.de
http://www.studentenwerk-muenchen.de/
--
It now runs for 8 hours.
So how to proceed?
Here a summary:
1) 2.6.25.x works
2) 2.6.26 without padlock works
3a) 2.6.26 using padlock and ipsec chrashes.
Always the same reason:
__switch_to() wants to fxsave but there is no memory allocated
which means that TS_USEDFPU has been set.
3b) 2.6.26 with padlock code from 2.6.25: using padlock and ipsec chrashes.
4) Reverting the fpu patches (from 2.6.25 to 2.6.26) fixes the problem.
5) Protecting the padlock cmds with kernel_fpu_begin(); kernel_fpu_begin();
fixes the problem.Some thoughts:
4) probably fixes the problem because memory for fxsave is always allocated.
Maybe this is also the reason why 2.6.25 and earlier work.
5) is interesting:
kernel_fpu_begin() will save the fpu state if TS_USEDFPU is set - exactly as
__unlazy_fpu in __switch_to would do. So when the crypto code calls padlock
the world is ok: if TS_USEDFPU is set then memory has been allocated.This makes me rather sure that there is no memory corruption by the network
code and/or crypto code itself overwriting the task_info stucture and setting
TS_USEDFPU.I don't know though why kernel_fpu_begin exactly fixes the problem:
Maybe preemption must be disabled when padlocks RNG, ACE, Hash-engine etc. is
used. Maybe just some barrier is needed which preempt_disable provides.The padlock programming guide states that padlock's RNG, ACE, ... internally
use SSE. Especially "it temporary enables SSE until no further SSE
instruction or XSTORE, ... is seen". Further it may overlap with execution of
non-SSE instructions.Probably it is the best to treat the padlock opcodes as SSE instructions.
What should I do now?
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
--
because it prevents taking a DNA fault inside the kernel (resulting
I will post your fix(including the other usages) with the appropriate changelog
shortly.thanks,
suresh
--
Fine.
By the way, if padlock is used from userspace, could this be aproblem, too? I
mean not via the kernel crypto layer but directly. I think these extensions
are not restricted to a privileged user.Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Leiter EDV
Leopoldstraße 15
80802 München
Tel: +49 89 38196-276
Fax: +49 89 38196-144
wolfgang.walter@stwm.de
http://www.studentenwerk-muenchen.de/
--
No. Userspace will not have any such issue. Kernel is special, as it
has to preserve the user-space context and also use these instructions
for its own kernel purpose.It is ok to get a spurious DNA from userland.
thanks,
suresh
--
Ok. Thanks. I think I can explain the failure:
These padlock instructions though don't use/touch SSE registers, but it behaves
similar to other SSE instructions. For example, it might cause DNA faults
when cr0.ts is set. While this is a spurious DNA trap, it might cause
problems with the recent fpu code changes.This is the code sequence that is probably causing this problem:
a) new app is getting exec'd and it is somewhere in between
start_thread() and flush_old_exec() in the load_xyz_binary()b) At pont "a", task's fpu state (like TS_USEDFPU, used_math() etc) is
cleared.c) Now we get an interrupt/softirq which starts using these encrypt/decrypt
routines in the network stack. This generates a math fault (as
cr0.ts is '1') which sets TS_USEDFPU and restores the math that is
in the task's xstate.d) Return to exec code path, which does start_thread() which does
free_thread_xstate() and sets xstate pointer to NULL while
the TS_USEDFPU is still set.e) At the next context switch from the new exec'd task to another task,
we have a scenarios where TS_USEDFPU is set but xstate pointer is null.
This can cause an oops during unlazy_fpu() in __switch_to()Now:
1) This should happen with or with out pre-emption. Viro also encountered
similar problem with out CONFIG_PREEMPT.2) kernel_fpu_begin() and kernel_fpu_end() will fix this problem, because
kernel_fpu_begin() will manually do a clts() and won't run in to the
situation of setting TS_USEDFPU in step "c" above.3) This was working before the fpu changes, because its a spurious
math fault which doesn't corrupt any fpu/sse registers and the task's
math state was always in an allocated state.I propose to go with wolf's patch which add's kernel_fpu_begin() and
kernel_fpu_end() around these instructions. This is the correct fix
(unless there is something wrong in my above understanding).thanks,
suresh
--
It's technically overkill, if (and only if!) these instructions don't
actually touch the SSE state (most likely they're using the SSE
pipeline, and need this stuff to deal with power management issues.)However, overkill is a good way to make sure something is dead.
Applying the patch will make sure we fix the regression, and we can
worry about optimizing this further post-2.6.27.-hpa
--
Yes the PadLock uses the SSE pipeline, but doesn't touch any
Do we really need the FPU changes right now? I'd prefer for that
to be backed out until a proper solution is found. Disabling
preemption around crypto is really bad for scheduling latency.Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
These FPU changes are already in 2.6.26. Undoing them, would that be accepted
for 2.6.26 stable?Maybe the following solution would be possible: if a processor with padlock is
detected the memory for xstate is always allocated when the thread is created
instead "lazy"?Or would it be possible to change __switch_to(): bevor calling __unlazy_fpu()
check if xstate is NULL, if yes, clear TS_USEDFPU and math-state?As I wrote in my other mail: 2.6.26 with the patch seems to perform rather
well with ipsec. Network latency and bandwidth seem completely unchanged
compared to 2.6.15.13 as far as I can see yet.But this is of course not a good test for kernel latency itself. As we don't
have any destops based on VIA C3 (and these using padlock) I can't really
test this (even if I set one up I don't know how it used to feel with
2.6.25 :-) ).But I doubt that a lot of people use it as a desktop together with a VIA C3
and ipsec. If they would they must see this crash.Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Leiter EDV
Leopoldstraße 15
80802 München
Tel: +49 89 38196-276
Fax: +49 89 38196-144
wolfgang.walter@stwm.de
http://www.studentenwerk-muenchen.de/
--
That will effectively happen, so it doesn't really matter.
The true optimization would be to recognize that the state doesn't have
to be saved, and track when we did so, and so on and so forth.VIA really did their customers a disservice tying this to CR0.TS.
-hpa
--
With the patch? I don't think so:
With the patch the following changes compared with 2.6.25:
* processes which do not use FPU/SSE will not start saving and loading there
xstate any more (and will not allocate xstate memory) if kernel is using
padlock with the crypto-framework. Its similar to other places where the
kernel uses MMX/SSE and protects them with kernel_fpu_begin();
kernel_fpu_end(); isn't it?* if user space uses FPU/SSE its xstate will be saved though it is not really
necessary.* the crypto-framework using padlock can not be preempted any more while
executing padlock-opcodes.This is what Hebert fears.
What I'm not really understand - and Suresh probably could tell us - if with
2.6.25 and earlier the following could happen:1) a process is using FPU/SSE so it has a saved xstate BUT its not loaded
because of lazy loading of FPU state (didn't use FPU/SSE for some time).2) padlock usage by kernel without kernel_fpu_begin() kernel_fpu_end() sets
TS_USEDFPU of this task but its not loaded3) in __switch_to() the process then saves its FPU/SSE state overwriting its
real xstate.If I understand Suresh explanations correctly this could not happen.
The problem only exists at task creation time: a race between initialising
TS_USEDFPU, used_math()) and setting the xstate pointer to NULL.So wouldn't this help:
* a flag TS_INITIALISING is set for this task bevor used_math and TS_USEDFPU
is set
* then math-state of the new task is initialisied including setting xstate to
NULL
* then the flag is cleared
* math_state_restore() in trap_32.c does nothing if this flag is set.Another possibility would be to use xstate for tracking if math has ever be
used by this task and made TS_USEDFPU only valid if xtate != NULLBoth would only reestablish 2.6.25 behaviour.
But even if one or both would work its no solution for 2.6.26 (and probably
2.6.27). My trust in 2.6.26 increased enough to use it on desktops and IYes that would be p...
And yet it requires all the settings that goes along with holding the
I hate to say it, but given the relative marketshare, we should disable
Padlock instead.This is part of the pain of being a minority architecture.
-hpa
--
Walter, Viro,
As I can't test, can you please test this and Ack.
thanks,
suresh
---
[patch] fix via padlock instruction usage with kernel_fpu_begin/end()Wolfgang Walter reported this oops on his via C3 using padlock for
AES-encryption:##################################################################
BUG: unable to handle kernel NULL pointer dereference at 000001f0
IP: [<c01028c5>] __switch_to+0x30/0x117
*pde = 00000000
Oops: 0002 [#1] PREEMPT
Modules linked in:Pid: 2071, comm: sleep Not tainted (2.6.26 #11)
EIP: 0060:[<c01028c5>] EFLAGS: 00010002 CPU: 0
EIP is at __switch_to+0x30/0x117
EAX: 00000000 EBX: c0493300 ECX: dc48dd00 EDX: c0493300
ESI: dc48dd00 EDI: c0493530 EBP: c04cff8c ESP: c04cff7c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Process sleep (pid: 2071, ti=c04ce000 task=dc48dd00 task.ti=d2fe6000)
Stack: dc48df30 c0493300 00000000 00000000 d2fe7f44 c03b5b43 c04cffc8 00000046
c0131856 0000005a dc472d3c c0493300 c0493470 d983ae00 00002696 00000000
c0239f54 00000000 c04c4000 c04cffd8 c01025fe c04f3740 00049800 c04cffe0
Call Trace:
[<c03b5b43>] ? schedule+0x285/0x2ff
[<c0131856>] ? pm_qos_requirement+0x3c/0x53
[<c0239f54>] ? acpi_processor_idle+0x0/0x434
[<c01025fe>] ? cpu_idle+0x73/0x7f
[<c03a4dcd>] ? rest_init+0x61/0x63
=======================Wolfgang also found out that adding kernel_fpu_begin() and kernel_fpu_end()
around the padlock instructions fix the oops.Suresh wrote:
These padlock instructions though don't use/touch SSE registers, but it behaves
similar to other SSE instructions. For example, it might cause DNA faults
when cr0.ts is set. While this is a spurious DNA trap, it might cause
oops with the recent fpu code changes.This is the code sequence that is probably causing this problem:
a) new app is getting exec'd and it is somewhere in between
start_thread() and flush_old_exec() in the load_xyz_binary()b) At pont "a", task's fpu state (like TS_USEDFPU, use...
No this wasn't a problem because kernel_fpu_begin clears TS and
therefore we don't get faults on SSE instructions.However, with your patch it will become a problem due to the
fact that it wasn't designed to be nested.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
No. Here is the case that can fail on 2.6.25 aswell.
0. CPU's TS flag is set
1. kernel using FPU in some optimized copy routine and while doing
kernel_fpu_begin() takes an interrupt just before doing clts()2. Takes an interrupt and ipsec uses padlock instruction. And we
take a DNA fault as TS flag is still set.3. We handle the DNA fault and set TS_USEDFPU and clear cr0.ts
4. We complete the padlock routine
5. Go back to step-1, which resumes clts() in kernel_fpu_begin(), finishes
the optimized copy routine and does kernel_fpu_end(). At this point,
we have cr0.ts again set to '1' but the task's TS_USEFPU is stilll
set and not cleared.6. Now kernel resumes its user operation. And at the next context
switch, kernel sees it has do a FP save as TS_USEDFPU is still set
and then will do a unlazy_fpu() in __switch_to(). unlazy_fpu()
will take a DNA fault, as cr0.ts is '1' and now, because we are
in __switch_to(), math_state_restore() will get confused and will
restore the next task's FP state and will save it in prev tasks's FP state.
Remember, in __switch_to() we are already on the stack of the next task
but take a DNA fault for the prev task.This causes the fpu leakage. We didn't encounter this so far on via
platforms because we don't have any optimized routines that use FP/SSE
in the kernel?thanks,
suresh
--
Right, we could've fixed this by doing the clts before checking
TS_USEDFPU in kernel_fpu_begin. However, I admit that taking a
fault in the general case for the PadLock is stupid anyway so we
should definitely fix this in the ways that you suggested.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
We're invoking this code from interrupt handlers?
-hpa
--
In softirq context, yes.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Here are the results:
===================================================
*** 2.6.25.13: modprobe tcrypt mode=200testing speed of ecb(aes) encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 608 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 582 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 702 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1182 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 6421 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 1340 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 1411 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 1747 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 3091 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 18018 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 601 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 628 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 796 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1468 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 8466 cycles (8192 bytes)testing speed of ecb(aes) decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 565 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 581 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 702 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1179 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 6409 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 1335 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 1402 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 1738 cycles (256 bytes)
test 8 (192 bit key, 1024 byte bl...
Thanks, so for IPsec it's about a 10% increase in CPU utilisation.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Yes. Padlock/ipsec usage seems to be happening in the interrupt
context and this triggered the bug that started this thread.thanks,
suresh
--
I don't exactly understand this. You think that
kernel_fpu_begin();
XCRYPT....
kernel_fpu_end();is a problem and wasn't before?
Say we have a software crypt-alg which uses optimized memcpy implemented with
SSE instructions. These are protected with kernel_fpu_begin();
kernel_fpu_end();So we have also code
kernel_fpu_begin();
SSE....
kernel_fpu_end();in crypto called under same circumstances.
If XCRYPT may be interrupted and the interrupt code again uses this optimized
memcpy implementation and so nesting kernel_fpu_begin then why should this
not happen with the other alg.How could any kernel code use MMX/SSE/FPU when the interrupt case isn't
handled?Or is your argument that its lazy allocation itself is the problem: this
nesting could always happen and was a bug but only with lazy allocation it is
dangerous (as it may cause a spurious math fault in the race window).If this were right than any kernel code executing SSE may trigger now a oops
in __switch_to() under some special circumstances.Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Leiter EDV
Leopoldstraße 15
80802 München
Tel: +49 89 38196-276
Fax: +49 89 38196-144
wolfgang.walter@stwm.de
http://www.studentenwerk-muenchen.de/
--
Wolf, kernel_fpu_begin() and kernel_fpu_end() only saves the user
state from getting corrupted with the kernel state. But it doesn't
help if kernel has nesting fpu usage.In this padlock case with the patch, we may encounter a nested
kernel_fpu_begin() and end()
but this is ok, as the padlock is not actually touching fpu/sse registers.Yes, we do have a problem when the interrupt handlers also use SSE
registers and if there is a nesting inside the kernel. Today we don'tToday we don't have any nesting, for example, fast memcpy, if in interrupt
will use the slow version and doesnt use the optimized version.thanks,
suresh
--
I take this back. kernel_fpu_end() is unconditionally doing stts(). So,
nesting of kernel_fpu_begin/end() is not ok.
--
I don't think we have ever allowed MMX/SSE/FPU code in interrupt
handlers. kernel_fpu_begin()..end() lock out preemption, and so couldIf lazy allocation can cause the RAID code, for example (which executes
SSE instructions in the kernel, but not at interrupt time) to start
randomly oopsing, then lazy allocations have to be pulled.-hpa
--
Yes, fast handlers fall back to slow handlers in the interrupt context
and don't touch FP/SSE and thus avoid the kernel nesting.hmm, in the padlock interrupt usage scenario(even though it doesn't touch FP/SSE
registers), kernel_fpu_begin/end() will not solve the problem,
as nesting of kernel_fpu_begin() is not ok, as we unconditionally
do stts() in kernel_fpu_end(). So the proposed patch is not ok,While the lazy allocation is not a big thing and can be pulled(with a
very small patch), this has brought two existing security issues to light
so far. one in lguest code(fixed now) and now in padlock usage. I think even
in 2.6.25, padlock usage can easily can cause the FPU leakage as I mentioned
in another response.Backing out lazy allocation is not just enough here. Let me think a little
thanks,
suresh
--
Can we have something like irq_ts_save() and irq_ts_restore(), which will
do something like:int irq_ts_save()
{
if (!in_interrupt())
return 0;if (read_cr0() & X86_CR0_TS) {
clts();
return 1;
}
return 0;
}void irq_ts_restore(int TS_state)
{
if (!in_interrupt())
return 0;if (TS_state)
stts();
}and use this around padlock usage. Taking a spurious DNA fault in the process
context(even inside the kernel) should be ok. Main issue is with the interrupt
context and we can prevent the DNA fault in the irq context using above.Either above, or we have to remove the lazy fpu allocation and make the
below code in kernel_fpu_begin() atomic by disabling interrupts(to fix
the security hole with padlock usage)kernel_fpu_begin:
...local_irq_disable();
if (me->status & TS_USEDFPU)
__save_init_fpu(me->task);
else
clts();local_irq_enable();
...
--
Couldn't we just move clts before the USEDFPU check? That huld
close the window.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Appended the complete patch. Wolf, can you please help test this again
you are correct. as pre-emption is already disabled, we should be ok. But
given that we are taking another(clean) route to fix this issue, can leave the
current code as it is(and not do an unconditional clts()).
---[patch] fix via padlock instruction usage with irq_ts_save/restore()
Wolfgang Walter reported this oops on his via C3 using padlock for
AES-encryption:##################################################################
BUG: unable to handle kernel NULL pointer dereference at 000001f0
IP: [<c01028c5>] __switch_to+0x30/0x117
*pde = 00000000
Oops: 0002 [#1] PREEMPT
Modules linked in:Pid: 2071, comm: sleep Not tainted (2.6.26 #11)
EIP: 0060:[<c01028c5>] EFLAGS: 00010002 CPU: 0
EIP is at __switch_to+0x30/0x117
EAX: 00000000 EBX: c0493300 ECX: dc48dd00 EDX: c0493300
ESI: dc48dd00 EDI: c0493530 EBP: c04cff8c ESP: c04cff7c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Process sleep (pid: 2071, ti=c04ce000 task=dc48dd00 task.ti=d2fe6000)
Stack: dc48df30 c0493300 00000000 00000000 d2fe7f44 c03b5b43 c04cffc8 00000046
c0131856 0000005a dc472d3c c0493300 c0493470 d983ae00 00002696 00000000
c0239f54 00000000 c04c4000 c04cffd8 c01025fe c04f3740 00049800 c04cffe0
Call Trace:
[<c03b5b43>] ? schedule+0x285/0x2ff
[<c0131856>] ? pm_qos_requirement+0x3c/0x53
[<c0239f54>] ? acpi_processor_idle+0x0/0x434
[<c01025fe>] ? cpu_idle+0x73/0x7f
[<c03a4dcd>] ? rest_init+0x61/0x63
=======================Wolfgang also found out that adding kernel_fpu_begin() and kernel_fpu_end()
around the padlock instructions fix the oops.Suresh wrote:
These padlock instructions though don't use/touch SSE registers, but it behaves
similar to other SSE instructions. For example, it might cause DNA faults
when cr0.ts is set. While this is a spurious DNA trap, it might cause
oops with the recent fpu code changes.This is the code sequence that is probably cau...
* Works fine, machine is up since 61 minutes.
* Performance:
Routing performance over esp-tunnels seems unchanged here compared to 2.6.25
(this was also the case with the "kernel_fpu_begin" patch).tcrypt mode=200 shows exactly the same performance penalty compared to 2.6.25
as the "kernel_fpu_begin" patch.But I think this the right way to go with 2.6.26 und probably 2.6.27. And I'm
not sure if tcyrpt really shows the whole story for 2.6.25:a) does it measure the costs of the unecessary FXSAVE and FXRSTOR ?
b) does it measure the clts() and stts() which will happen any way though not
in padlock-*.c itself but in __switch_to() and math_state_restore() ?So shouldn'tthis patch make - in the whole - performance better compared to
2.6.25 (because its avoids FXSAVE and FXRSTOR for tasks which do not use
FPU/SSE/... in userspace)?Here the results for tcrypt mode=200:
===============================
testing speed of ecb(aes) encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 763 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 740 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 860 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1340 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 6583 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 1542 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 1614 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 1950 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 3294 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 18214 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 753 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 781 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 949 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 ...
That's not surprising since tcrypt runs with BH off so it'll
do pretty much the same thing as before. This also shows that
reading CR0 doesn't impose any extra overhead compared to what
was done in kernel_fpu_begin.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Ok. In the real world usage, where we use these instructions both from
process and softirq context, we will probably not see much penality,
as the process context's first access will always endup doing full fp restore
(and also we kick in the context switch FP optimization, which willAs there are no further objections, Herbert/Ingo not sure who among you
will push this patch to Linus tree and 2.6.26 -stable tree aswell.thanks,
suresh
--
I'll push this one along.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Yes, I'll test that tomorrow.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Leiter EDV
Leopoldstraße 15
80802 München
Tel: +49 89 38196-276
Fax: +49 89 38196-144
wolfgang.walter@stwm.de
http://www.studentenwerk-muenchen.de/
--
no fundamental objection to the x86 bits.
shouldnt this:
+ if (!in_interrupt())
+ return 0;just be eliminated and the cr0/TS save/restore be made unconditional?
irq-assymetric APIs are not nice in general.Reading/setting cr0 isnt _that_ slow. (or if it is, by how much does it
slow things down, exactly?)Ingo
--
Setting it is relatively slow. I think that's part of the reason for
special instructions to muck with the TS flag.Reading it might be slow on obsolete processors.
-hpa
--
In addition to the slowness(Wolf has collected some data with earlier
patch and I think it showed a double digit increase in cpu utilization with
some crypt tests):we can't unconditionally do clts() in the process context. We have
to disable pre-emption to avoid interactions with context switch and
lazy restore. So there will be RT latency issues aswell.thanks,
suresh
--
On Mon, Aug 11, 2008 at 01:19:01PM -0700, Suresh Siddha wrote:
Yes disabling preemption is the real killer.
This is just a quick band-aid. Longer term we should add a task
flag that indicates the task is currently doing kernel FPU which
will tell the scheduler to clear TS the next time it's run. That
way we won't need to disable preemtion or pollute the user task's
FPU used state.Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
That's not sufficient, though, because you have to track all the state
and how it relates to everything. You now have to track both the
userspace FPU state and the potential kernel FPU state. The VIA
instructions are special (in the short bus to school sense) in that they
use a mechanism intended to protect specific state to protect -- exactly
nothing.-hpa
--
Sorry, the kernel TS state is what I meant. I'm definitely not
advocating the saving of the kernel FPU state. This is only for
things like the VIA (which also exists for other processors, see
the xor SSE stuff in include/asm-x86).Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
No, there you are actually using the FPU state (which includes the SSE
state.)-hpa
--
Right, well at least VIA could still use this and it wouldn't
hurt others that much.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
The first solution - if it works and padlock is the only which has problem
with it - seems to be a good fix for 2.6.26. If it works I can't say as I'm
not familiar enough with these things. But I'll happily test it :-).The second would be a little bit intrusive, wouldn't it? Most machines don't
have padlock, and therfore don't need this change but nevertheless may be
affected (i.e. they use MMX for memcpy or MMX/SSE with raid6) and now get a
different behaviour. Don't know how expensive such a local_irq_enable/disable
would be.Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
--
What about
arch/x86/lib/mmx_32.c
and then
include/asm-x86/string_32.h
couldn't memcpy end as beeing implemented with MMX? And interrupt handlers may
use memcpy? Or did I miss something?Similar: couldn't memcpy be called in asynchronous way, say a page fault or
someting like that?Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Leiter EDV
Leopoldstraße 15
80802 München
Tel: +49 89 38196 276
Fax: +49 89 38196 150
Email: wolfgang.walter@stwm.de
http://www.studentenwerk-muenchen.de/
--
You mean the dynamic fpu allocation patch or the proposed patch for padlock
code?Thanks,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Leiter EDV
Leopoldstraße 15
80802 München
Tel: +49 89 38196-276
Fax: +49 89 38196-144
wolfgang.walter@stwm.de
http://www.studentenwerk-muenchen.de/
--
Even if there wasn't one before, there is going to be one now
because as you pointed out yourself, if we get an inbound IPsec
packet between any kernel_fpu_begin/kernel_fpu_end area, we could
get a nested kernel_fpu_end which puts us back to square one wrt
to the original race.So clearly we need to think more about this issue.
Unless we can come up with a new solution quickly, I recommend
that we back out the FPU changes.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
* AES-CBC
Seems to work here: running since 73 minutes my test which usually crashes the
machine after 1-2 minutes :-).Enclosing the whole encryption/decryption is a good idea, by the way. With my
modification I observed a large latency (up to 500ms) for some packets over
the esp-tunnel (every 5 to 10 seconds there were some of them). This is not
the case with your patch.I did some simple performance tests. As far as I can see throughput and
latency (i.e. routing packets as ipsec-gateway) are at least as good as with
2.6.25.13.* RNG
did several time
dd if=/dev/hwrng bs=1024 count=$((1024*100)) of=/dev/zero
Works fine. Observed no performance degradation compared to 2.6.25.
* Hash-Engine
Can't test that as I don't have an VIA Ester or newer.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
--
Someone please test this with tcrypt mode=200 with and without
the patch.If there is a significant degradation I suggest that we back out
the FPU changes. Making your competitor's processor go slow is
not a nice thing to do, especially when you've just released a
processor in the same space.If the degradation is small then I'll push it to Linus.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Please be assured that I would like to close/adddress all the perf and
stability issues and it doesn't matter whose processor it is ;)
--
OK I take that back, Intel (and AMD) processors are equally affected
by this. For a start, the newly added Intel crc32c driver suffers
from exactly the same problem since the instruction used is also
SSE.According to
http://softwarecommunity.intel.com/articles/eng/3788.htm
the Intel AES instructions are also SSE so they too will suffer
from this.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Actually I was mistaken, Intel has done the right thing by not
generating a fault on crc32c with TS set.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Yes, the documentation seems to says so. But I am double checking
with experts inside, to make sure we got the SDM right.
--
Thanks for checking this!
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Indeed :)
-hpa
--
I got confirmation that crc32 instructions don't generate DNA when cr0.TS bit
is set. So no changes required for Austin's patches in this context.thanks,
suresh
--
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Leiter EDV
Leopoldstraße 15
80802 München
Tel: +49 89 38196-276
Fax: +49 89 38196-144
wolfgang.walter@stwm.de
http://www.studentenwerk-muenchen.de/
--
It's part of the kernel. Just make sure you've got CRYPTO_TEST
enabled as a module.Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Ok, that is easy :-). I'll make that test tomorrow. Bed time now :-).
I'll run that test on a unpatched 2.6.25 instead a unpatched 2.6.26 as latter
will crash. If you want I can use my 2.6.26 where I reverted the FPU state
changes (see one of my earlier mails for the patch), of course.Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Leiter EDV
Leopoldstraße 15
80802 München
Tel: +49 89 38196-276
Fax: +49 89 38196-144
wolfgang.walter@stwm.de
http://www.studentenwerk-muenchen.de/
--
2.6.25 should be fine as not much has changed in the padlock.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Actually the patch is totally unacceptable as it is. Disabling
preemption around padlock crypto operations is going to make the
latency for RT go through the roof. Please find a solution that
doesn't involve that or just back out the FPU changes.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Probably
drivers/crypto/padlock-sha.c
drivers/char/hw_random/via-rng.cneeds a fix, too.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
Leiter EDV
Leopoldstraße 15
80802 München
Tel: +49 89 38196 276
Fax: +49 89 38196 150
Email: wolfgang.walter@stwm.de
http://www.studentenwerk-muenchen.de/
--
BTW, in one of your oops, I see:
note: cron[1207] exited with preempt_count 268435459
I smell some kind of stack corruption here which is corrupting
thread_info (in the above case preempt_count in the thread_info).Similarly, if the status field(in thread_info) gets corrupted(setting
TS_USEDFPU) without proper math state allocated(present in thread_struct),
we can end up oops in __switch_to.But you seem to say, reverting recent fpu patches make the problem go away.
hmm, just wondering if your test kernel (with fpu patches reverted) is stable
enough and don't see other oops/issues?Recently Vegard also noticed some stack corruptions (in network stack) leading
to similar problems. Not sure if Vegard has root caused his issue. copying him
for his comments.thanks,
suresh
--
On Wed, Aug 6, 2008 at 11:21 PM, Suresh Siddha
I don't think this is the same problem. What I see is almost certainly
a problem with netpoll, netconsole, or the 8139too driver. I see a UDP
packet in a task_struct.There is also the fact that reverting fpu patches makes it go away
(for Wolfgang), while for the issue I am seeing, oops in FP code is
just one out of several different corruptions (sometimes it happens in
other slabs).(Sorry for the little late reply.)
Thanks.
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
I don't think 268435459 is a strong indication of stack corruption:
268435459 == 0x10000003 == PREEMPT_ACTIVE|0x03
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
--
Thanks for your answer.
I don't know how padlock exactly works and I don't know anything of i386's
architecture on hardware and assembler level. So I can only speculate:Maybe padlock aes does influence FP/math.
http://linux.via.com.tw/support/beginDownload.action?eleid=181&fid=261
states:
3. SSE instructions must be enabled via the standard x86 method of enabling
the FXSAVE/FXRSTOR instructions using CR4[9] This enables the full set of SSE
instructions. If CR4[9] is not set, PadLock behaves as if it were disabled viaNo oops yet.
2.6.26 crashes here within 1 or 2 minutes if (and only) if there is ipsec
traffic using padlock aes and there are actually processes running (i.e.
ssh).The modified 2.6.26 did not crash yet (now running hours).
Unmodified 2.6.26 where we use i386 assembler aes instead of padlock runs
since 2 weeks.We further use almost same kernels (only compiled for K7 or Intel Core Duo,
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
--
| Hiten Pandya | Re: up? (emacs docbook xml ide) |
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| Roland Dreier | Re: Integration of SCST in the mainstream Linux kernel |
| Florian Schmidt | blacklist kernel boot option |
git: | |
| Linus Torvalds | Re: iptables very slow after commit 784544739a25c30637397ace5489eeb6e15d7d49 |
| Arjan van de Ven | Re: [GIT]: Networking |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
