Re: [PATCH 3/4] Revert "x86: disable IOMMUs on kernel crash"

Previous thread: Re: [PATCH 0/6] tagged sysfs support by Ben Hutchings on Friday, April 2, 2010 - 5:58 pm. (3 messages)

Next thread: none
From: Chris Wright
Date: Friday, April 2, 2010 - 6:27 pm

Series includes the following patches:

  x86/amd-iommu: enable iommu before attaching devices
  x86/amd-iommu: warn when issuing command to uninitialized cmd buffer
  Revert "x86: disable IOMMUs on kernel crash"
  x86/amd-iommu: use for_each_pci_dev

First one is the primary bug fix.

v2
- add disable_iommus on failure path
- move removal of CMD_BUFFER_UNINITIALIZED to iommu_enable_command_buffer()
- fix ";;" typo in patch 2
- add Revert "x86: disable IOMMUs on kernel crash"
- add x86/amd-iommu: use for_each_pci_dev

thanks,
-chris
--

From: Chris Wright
Date: Friday, April 2, 2010 - 6:27 pm

This effectively reverts commit 61d047be99757fd9b0af900d7abce9a13a337488.

Disabling the IOMMU can potetially allow DMA transactions to
complete without being translated.  Leave it enabled, and allow
crash kernel to do the IOMMU reinitialization properly.

Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
 arch/x86/kernel/crash.c |    6 ------
 1 file changed, 6 deletions(-)

--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -27,7 +27,6 @@
 #include <asm/cpu.h>
 #include <asm/reboot.h>
 #include <asm/virtext.h>
-#include <asm/x86_init.h>
 
 #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)
 
@@ -103,10 +102,5 @@ void native_machine_crash_shutdown(struc
 #ifdef CONFIG_HPET_TIMER
 	hpet_disable();
 #endif
-
-#ifdef CONFIG_X86_64
-	x86_platform.iommu_shutdown();
-#endif
-
 	crash_save_cpu(regs, safe_smp_processor_id());
 }

--

From: Joerg Roedel
Date: Saturday, April 3, 2010 - 10:22 am

Hmm, I think for this we need to change the gart code too and disable
the gart before its initialization runs to not re-introduce issues fixed
in commit bc2cea6a34fdb30f118ec75db39a46a191870607, no?

	Joerg

--

From: Eric W. Biederman
Date: Saturday, April 3, 2010 - 10:44 am

That is a different code path with a different set of assumptions and
restrictions.  On a normal kexec of course we want to do an orderly shutdown.

For the gart with a little luck we can just ignore it on kexec on
panic.  Unlike a virtualization capable iommu it doesn't prevent access
to devices, when it is enabled.   Worst case is that we have to start
including iommu=off for gart systems. The best case is that we can
figure out how to have the gart code reinitialize itself sanely,
starting from some arbitrary point.

machine_crash_shutdown is about doing those things that we can not do
in any other way.

Eric
--

From: Joerg Roedel
Date: Sunday, April 4, 2010 - 1:44 am

Thats another problem with this patch. It introduces a difference


No no no. This is a maintenance nightmare for almost everybody. Where do
you want to Document this special cases that 'if kernel uses gart then
and only then boot the kexec kernel with iommu=off'.
Always passing iommu=off to the kexec kernel doesn't work too for

Yes, that is missing in this patch. But to keep changes small and don't
bother with the gart code at all I suggest to remove the shutdown
routine from the amd-iommu code only and not the whole shutdown call in
the machine_crash_shutdown path.

	Joerg

--

From: Eric W. Biederman
Date: Sunday, April 4, 2010 - 2:16 am

Of course there is.  kexec on panic does no shutdown, and should not.
That is fundamental.  That is documented.

This make them symmetric crap is a bad idea, because we can not reliably
do it.  999 times out of 1000 we can actually handle everything we
need to in the kdump kernel in the initialization path and when

Now that I look at that commit again I am stunned.  That commit
already says it is doing things wrong.  


I agree.  I would like to see something better, but the situation
with the current set of patches is workable.  Getting autodetection
in there an automatically doing the right thing would be much better.


Which will only encourage further abuse of that code path.  The reliability
has been continually decreasing lately and I believe this is one of those
reasons.  The hpet junk in there appears to be an even bigger reason.

I have machines right now that can not reliably crash dump because
someone tried papering over problems by putting code in the
machine_crash_shutdown path which must have worked for their test cast
but does not work for real world failures, and right now I am very
grumpy about it all.

I guess what I am saying is that I do not believe shutting down the
gart in machine_crash_shutdown actually works.  It is definitely not
the right place to do that work.  So I don't see why we should keep
any of that code there.

Eric
--

From: Eric W. Biederman
Date: Sunday, April 4, 2010 - 2:19 am

I guess I should add the place where I have serious test results for with
a gart is on 2.6.29.  Where there isn't that iommu shutdown and I don't have
problems with it.

So my experience is that touching the gart in machine_crash_shutdown is
unnecessary.

Eric
--

From: Joerg Roedel
Date: Saturday, April 3, 2010 - 10:41 am

Another problem: This also breaks if the kdump kernel has no
iommu-support.

	Joerg

--

From: Eric W. Biederman
Date: Saturday, April 3, 2010 - 10:49 am

Not a problem.  We require a lot of things of the kdump kernel,
and it is immediately apparent in a basic sanity test.

Eric
--

From: Joerg Roedel
Date: Saturday, April 3, 2010 - 12:13 pm

Only if the sanity test is done on an iommu machine which I don't want
to rely on.

	Joerg

--

From: Eric W. Biederman
Date: Saturday, April 3, 2010 - 12:41 pm

That makes no sense.

The requirements on the kdump kernel has always been that it somehow
figure out to recover a machine that is in a random hardware state.
That requires drivers for the hardware, that is critical to the
machines operation.

The easy test for sysadmins is to do:
echo > /proc/sysrq-trigger

Anyone who thinks the result from one piece of hardware applies to
another is deluded.

We have been down the path of doing lots of things in the crashing
kernel with lkcd, in practice it was worthless in the event of real
world crashes.

kexec on panic isn't perfect but it at least is an architecture that
works often enough to be usable.  It does require testing to make certain
the basic code paths don't regress, but even so it is a lot easier
to maintain and keep useful than any alternative I know of.

Eric
--

From: Bernhard Walle
Date: Sunday, April 4, 2010 - 12:24 am

Also, in most cases (for example: distribution kernels), the kdump
kernel nowadays is identical to the running kernel. So, if the running
kernel has IOMMU support, the kdump kernel also has.


Regards,
Bernhard
--

From: Eric W. Biederman
Date: Sunday, April 4, 2010 - 12:51 am

I have been rather pleasantly surprised at how well that works.  Although
I am still glad I insisted that using the same kernel version not be a
hard requirement.  It allowed us to specify a good solid interface.

How to cope with GART in that situation without having to manually specify
iommu=off in the configuration I find a more compelling question.

I expect I will have a look at that in the coming months if someone doesn't get
there before I do.

Eric
--

From: Joerg Roedel
Date: Sunday, April 4, 2010 - 1:53 am

Yes, I know. But is that a requirement for kexec? If not we still
potentially have this problem. It is a smaller problem than
data-corruption caused by in-flight dma because most
people^Wdistributions indeed use the same kernel for normal boot and
kexec, thats true.

	Joerg

--

From: Eric W. Biederman
Date: Sunday, April 4, 2010 - 2:44 am

For normal kexec no.  That path is expected to do a clean hardware
shutdown.

For kexec on panic aka kdump the requirement is that your your crash
kernel be able to initialize your hardware from any state it can be
put in.

Currently we have no kernels that properly recover and amd-iommu and

There are exceptions made for practical reality.  The fact that
typically the kdump kernel is the same as the running kernel mean that
we don't even need to consider one of those exceptions.

Eric
--

From: Joerg Roedel
Date: Sunday, April 4, 2010 - 3:01 am

Ok, if you show me where this is documented for everybody then I am
probably convinced :-)
We should fixup the gart initialization anyway.

	Joerg

--

From: Chris Wright
Subject:
Date: Tuesday, April 6, 2010 - 10:42 am

So, you planning to pull in all 4 patches then?

thanks,
-chris
--

From: Joerg Roedel
Subject:
Date: Tuesday, April 6, 2010 - 10:51 am

Yes, I will apply them tomorrow and write a fix for the GART issue this
may introduce.

	Joerg


--

From: Vivek Goyal
Date: Tuesday, April 6, 2010 - 1:39 pm

Hi Joerg,

Going through the old mail thread, I think the commit you pointed to was
primarily introduced to solve kexec + GART issue and not necessarily kdump
issue.

In fact disabling IOMMU patch was introduced by you.

Author: Joerg Roedel <joerg.roedel@amd.com>
Date:   Tue Jun 9 17:56:09 2009 +0200

    x86: disable IOMMUs on kernel crash
    
    If the IOMMUs are still enabled when the kexec kernel boots access to
    the disk is not possible. This is bad for tools like kdump or anything
    else which wants to use PCI devices.
    
    Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

I am assuming you introduced this patch because you faced issues with
amd-iommu and not GART.

So basically GART should have been working with kdump even before you
introduced disabling iommu patch in kdump path.

Thanks
Vivek
--

From: Vivek Goyal
Date: Tuesday, April 6, 2010 - 2:13 pm

Looking at following commit, we were still not shutting down GART and
fixing issues like second kernel accessing the GART aperture set by first
kernel.

commit aaf230424204864e2833dcc1da23e2cb0b9f39cd
Author: Yinghai Lu <Yinghai.Lu@Sun.COM>
Date:   Wed Jan 30 13:33:09 2008 +0100

    x86: disable the GART early, 64-bit
    
    For K8 system: 4G RAM with memory hole remapping enabled, or more than
    4G RAM installed.

So I guess it should be fine to not shutdown GART in crashing kernel and
then look at the fresh issues which crop up and figure out how to fix
those.

Vivek
--

From: Yinghai Lu
Date: Tuesday, April 6, 2010 - 2:45 pm

not sure if it is related:

for crashing kernel, it could do early_memtest to check if some device
are still do dma operation.

When I use kexec to start second kernel, if enable the early_memtest
in second kernel, it will find some pages RAM are BAD,
and it will mark them and not use them.  memtest=1 should be good enough.
Fresh restart will not report there is any BAD ram in the same system.

YH
--

From: Eric W. Biederman
Date: Tuesday, April 6, 2010 - 3:10 pm

Devices doing DMA in general are not a problem in the kdump kernel
because we are using an area of memory that has been reserved since
the beginning of time and no DMA's should be targeting it.  The challenge

I assume you are not talking kdump here.

On-going DMA in the case of kexec indicates some device driver isn't shutting
itself down when it's shutdown method is called.

Odds are it is a network controller that doesn't stop DMA when it is brought
down or it is, possibly a really weird disk driver.

If you are seeing this with the kdump kernel this may indeed indicate an
IOMMU reinitialization problem.

Eric
--

From: David Woodhouse
Date: Sunday, April 4, 2010 - 4:54 am

The kdump kernel has to support the hardware it's running on.
Film at 11.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

--

From: Chris Wright
Date: Friday, April 2, 2010 - 6:27 pm

Replace open coded version with for_each_pci_dev

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
 arch/x86/kernel/amd_iommu.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/kernel/amd_iommu.c
+++ b/arch/x86/kernel/amd_iommu.c
@@ -2187,7 +2187,7 @@ static void prealloc_protection_domains(
 	struct dma_ops_domain *dma_dom;
 	u16 devid;
 
-	while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) {
+	for_each_pci_dev(dev) {
 
 		/* Do we handle this device? */
 		if (!check_device(&dev->dev))

--

From: Chris Wright
Date: Friday, April 2, 2010 - 6:27 pm

To catch future potential issues we can add a warning whenever we issue
a command before the command buffer is fully initialized.

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
Or not...this is just if you think it's useful ;-)

 arch/x86/include/asm/amd_iommu_types.h |    1 +
 arch/x86/kernel/amd_iommu.c            |    1 +
 arch/x86/kernel/amd_iommu_init.c       |    5 +++--
 3 files changed, 5 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/amd_iommu_types.h
+++ b/arch/x86/include/asm/amd_iommu_types.h
@@ -140,6 +140,7 @@
 
 /* constants to configure the command buffer */
 #define CMD_BUFFER_SIZE    8192
+#define CMD_BUFFER_UNINITIALIZED 1
 #define CMD_BUFFER_ENTRIES 512
 #define MMIO_CMD_SIZE_SHIFT 56
 #define MMIO_CMD_SIZE_512 (0x9ULL << MMIO_CMD_SIZE_SHIFT)
--- a/arch/x86/kernel/amd_iommu.c
+++ b/arch/x86/kernel/amd_iommu.c
@@ -392,6 +392,7 @@ static int __iommu_queue_command(struct 
 	u32 tail, head;
 	u8 *target;
 
+	WARN_ON(iommu->cmd_buf_size & CMD_BUFFER_UNINITIALIZED);
 	tail = readl(iommu->mmio_base + MMIO_CMD_TAIL_OFFSET);
 	target = iommu->cmd_buf + tail;
 	memcpy_toio(target, cmd, sizeof(*cmd));
--- a/arch/x86/kernel/amd_iommu_init.c
+++ b/arch/x86/kernel/amd_iommu_init.c
@@ -436,7 +436,7 @@ static u8 * __init alloc_command_buffer(
 	if (cmd_buf == NULL)
 		return NULL;
 
-	iommu->cmd_buf_size = CMD_BUFFER_SIZE;
+	iommu->cmd_buf_size = CMD_BUFFER_SIZE | CMD_BUFFER_UNINITIALIZED;
 
 	return cmd_buf;
 }
@@ -472,12 +472,13 @@ static void iommu_enable_command_buffer(
 		    &entry, sizeof(entry));
 
 	amd_iommu_reset_cmd_buffer(iommu);
+	iommu->cmd_buf_size &= ~(CMD_BUFFER_UNINITIALIZED);
 }
 
 static void __init free_command_buffer(struct amd_iommu *iommu)
 {
 	free_pages((unsigned long)iommu->cmd_buf,
-		   get_order(iommu->cmd_buf_size));
+		   get_order(iommu->cmd_buf_size & ~(CMD_BUFFER_UNINITIALIZED)));
 }
 
 /* allocates the memory where the IOMMU will log its events to */

--

From: Chris Wright
Date: Friday, April 2, 2010 - 6:27 pm

Hit another kdump problem as reported by Neil Horman.  When initializaing
the IOMMU, we attach devices to their domains before the IOMMU is
fully (re)initialized.  Attaching a device will issue some important
invalidations.  In the context of the newly kexec'd kdump kernel, the
IOMMU may have stale cached data from the original kernel.  Because we
do the attach too early, the invalidation commands are placed in the new
command buffer before the IOMMU is updated w/ that buffer.  This leaves
the stale entries in the kdump context and can renders device unusable.
Simply enable the IOMMU before we do the attach.

Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
 arch/x86/kernel/amd_iommu_init.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/amd_iommu_init.c
+++ b/arch/x86/kernel/amd_iommu_init.c
@@ -1288,6 +1288,8 @@ static int __init amd_iommu_init(void)
 	if (ret)
 		goto free;
 
+	enable_iommus();
+
 	if (iommu_pass_through)
 		ret = amd_iommu_init_passthrough();
 	else
@@ -1300,8 +1302,6 @@ static int __init amd_iommu_init(void)
 
 	amd_iommu_init_notifier();
 
-	enable_iommus();
-
 	if (iommu_pass_through)
 		goto out;
 
@@ -1315,6 +1315,7 @@ out:
 	return ret;
 
 free:
+	disable_iommus();
 
 	amd_iommu_uninit_devices();
 

--

From: Joerg Roedel
Date: Wednesday, April 7, 2010 - 3:05 am

Applied all, thanks.

--

Previous thread: Re: [PATCH 0/6] tagged sysfs support by Ben Hutchings on Friday, April 2, 2010 - 5:58 pm. (3 messages)

Next thread: none