Re: [PATCH] PCI Hotplug: fakephp: fix deadlock... again

Previous thread: [PATCH] x86: fix probe_nr_irqs for xen by Yinghai Lu on Thursday, August 21, 2008 - 1:10 pm. (13 messages)

Next thread: [PATCH 0/2] pciehp/shpchp prevent duplicate slot names by Alex Chiang on Thursday, August 21, 2008 - 2:11 pm. (4 messages)
From: Alex Chiang
Date: Thursday, August 21, 2008 - 1:19 pm

Hi Greg,

While playing around with my slot symlink stuff, I noticed that
the following sequence is problematic:

	1. clean boot
	2. modprobe acpiphp
	3. echo 0 > /sys/bus/pci/slots/N/power
	4. ???

After step 3, we *should* be seeing pci_release_dev() getting
called, but we never do because the refcount on the device is
still quite high (5 or 6, on my ia64 system).

I'm still trying to track this down, but I did notice, via code
inspection, at least one suspicious area:

#define for_each_pci_dev(d) while ((d = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, d)) != NULL)

That eventually calls pci_get_dev_by_id(), which increases the
refcount on the device, but never decrements it.

Looks like that change in behavior happened here:

	PCI: clean up search.c a lot
	95247b57ed844511a212265b45cf9a919753aea1

pci_get_device() used to decrement the refcount, but no longer
does.

Thanks to Matthew Wilcox for helping me get this far...

Like I said, I'm still trying to track down my particular issue,
but I'd like to get your opinion on this.

Thanks!

/ac

--

From: Matthew Wilcox
Date: Thursday, August 21, 2008 - 1:25 pm

In particular, I'd like to know whether this should be fixed by
pci_get_dev_by_id() decrementing the refcount of from/dev_start,
pci_get_subsys() decrementing 'from', or by bus_find_device()
decrementing 'start'.  It looks like bus_find_device() is the place
where this should logically happen, but the kerneldoc doesn't document
the intended behaviour.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Greg KH
Date: Thursday, August 21, 2008 - 1:47 pm

Ah, no the driver core isn't supposed to do this, it's something the pci
functions do out of "niceness" as that's how we can use them in an
iterator properly.

Does the following (untested) patch fix the issue for you all?

thanks,

greg k-h

--------------
Subject: PCI: fix reference leak in pci_get_dev_by_id()

From: Greg Kroah-Hartman <gregkh@suse.de>

Alex Chiang and Matthew Wilcox pointed out that pci_get_dev_by_id() does
not properly decrement the reference on the from pointer if it is
present, like the documentation for the function states it will.

Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alex Chiang <achiang@hp.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index 217814f..3b3b5f1 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -280,6 +280,8 @@ static struct pci_dev *pci_get_dev_by_id(const struct pci_device_id *id,
 			      match_pci_dev_by_id);
 	if (dev)
 		pdev = to_pci_dev(dev);
+	if (from)
+		pci_dev_put(from);
 	return pdev;
 }
 
--

From: Alex Chiang
Date: Thursday, August 21, 2008 - 3:14 pm

Perfect, yes.

I applied it, and observed the refcount on the device I was using
to debug this problem.

I was able to modprobe acpiphp successfully, and echo 0 into the
device's 'power' file.

I then watched the rest of the hotplug core do its thing,
decrementing the refcount properly along the way, and at the end,
we did call pci_release_dev(), as I originally expected to. :)

Just to verify, I toggled the device's power a few times (echoing
a 1 and then a 0, etc.) and watched the refcounts. After each
offline, we properly called pci_release_dev().

Thanks for fixing this. It fixes a pretty bad leak in the hotplug
core (we were leaking an entire struct pci_dev * # of functions
for each offlined card, the first time around; subsequent
onlines/offlines were ok).

Tested-by: Alex Chiang <achiang@hp.com>

--

From: Zhao, Yu
Date: Friday, August 29, 2008 - 9:23 pm

And the pci_get_dev_by_id() is not safe again the PCI device removal. It might fire a warning in bus_find_device() when reference count of the knode_bus is decreased to 0 by pci_remove_bus().

Need to document this or a fix?

Thanks,
Yu

--

From: Greg KH
Date: Friday, August 29, 2008 - 10:37 pm

Is this something new?  Hasn't this always been that way?  Why would you
be wanting to call this function anyway?

thanks,

greg k-h
--

From: Zhao, Yu
Date: Friday, August 29, 2008 - 11:20 pm

It's been so for a while, guess since 2.5. I meant to say the pci_remove_bus_device(), not pci_remove_bus(), is used by pci hotplug code. If a device is being plugged off, and some drivers are calling pci_get_dev_by_id() to search something, then the warning might be fired through bus_find_device -> klist_iter_init_node ->kref_get.

Thanks,
--

From: Zhao, Yu
Date: Saturday, August 30, 2008 - 8:14 pm

Just happened to see a change in fakephp when searching usage of the pci_remove_bus_device(). I couldn't find the original patches, but this http://git.kernel.org/?p=linux/kernel/git/jbarnes/pci-2.6.git;a=commitdiff;h=fe99740ca... shows, the pci_remove_bus_device() was removed while the remove_slot() was added in disable_slot(), which means: 1) fakephp can't remove pci_dev anymore; 2) deadlock happens because remove_slot() tries to remove the sysfs entry calling itself.

--

From: Alex Chiang
Date: Monday, September 1, 2008 - 11:40 am

Sorry about introducing this regression. Does the following patch
fix it for you?

I've tested it on my system, and I can again use fakephp to
remove a device:

sapphire:/sys/bus/pci/slots # echo 0 > fake4/power 
fakephp: disable_slot - physical_slot = fake4
e1000 0000:01:02.1: PCI INT B disabled
GSI 32 (level, low) -> CPU 2 (0x0200) vector 54 unregistered
fakephp: Slot already scheduled for removal
fakephp: removing slot fake4

We get the "slot already scheduled for removal" because that
particular device has 2 functions, and we're creating slots on a
per-slot basis now, not a per-function basis.

Although, I wonder, Willy -- is that really the right thing to
do? Seems like fakephp would be more useful if we did operate on
a per-function basis, and not per-slot. Especially given Yu's
work with SR-IOV, where we can apparently have lots of functions
per a physical device.

Hm?

Thanks.

/ac

From: Alex Chiang <achiang@hp.com>

PCI Hotplug: fakephp: fix deadlock... again

Commit fe99740cac117f208707488c03f3789cf4904957 (construct one
fakephp slot per PCI slot) introduced a regression, causing a
deadlock when removing a PCI device.

We also never actually removed the device from the PCI core.

So we:

	- remove the device from the PCI core
	- do not directly call remove_slot() to prevent deadlock

Yu Zhao reported and diagnosed this defect.

Signed-off-by: Alex Chiang <achiang@hp.com>
---
diff --git a/drivers/pci/hotplug/fakephp.c b/drivers/pci/hotplug/fakephp.c
index 40337a0..146ca9c 100644
--- a/drivers/pci/hotplug/fakephp.c
+++ b/drivers/pci/hotplug/fakephp.c
@@ -320,15 +320,15 @@ static int disable_slot(struct hotplug_slot *slot)
 			return -ENODEV;
 		}
 
+		/* remove the device from the pci core */
+		pci_remove_bus_device(dev);
+
 		/* queue work item to blow away this sysfs entry and other
 		 * parts.
 		 */
 		INIT_WORK(&dslot->remove_work, remove_slot_worker);
 		queue_work(dummyphp_wq, &dslot->remove_work);
 
-		/* blow away this ...
From: Matthew Wilcox
Date: Monday, September 1, 2008 - 5:10 pm

I suspect it depends on what you believe the point of fakephp is.
My assumption was that it was a way to fake what would happen if you had
a hotplug controller for a particular slot.  In that context, the change
I made was clearly correct.  If you want to use it for hot-removing
individual functions from a Linux guest running under a hypervisor
(for example), that's much less useful.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Alex Chiang
Date: Monday, September 1, 2008 - 5:19 pm

Ok, this was all developed before I started working in this area,

Sure, that sounds reasonable.

In that case, my patch should fix the stupid regression I
introduced, but if others think that fakephp would be more useful
for hot-removing functions, I wouldn't object to reverting the
original patch.

Thanks.

/ac

--

From: Jesse Barnes
Date: Monday, September 8, 2008 - 9:12 pm

I don't have a preference here; whatever is most useful for people is probably 
what we should go for.  Zhao?

Thanks,
Jesse

--

From: Matthew Wilcox
Date: Monday, September 8, 2008 - 9:27 pm

I think we have two distinct groups of users; those who find emulating
slot hotplug useful, and those who want to hot-remove pci functions.
Maybe hot-removing a pci function should not be part of the pci hotplug
core per se -- obviously it would share much code -- but being able
to hot-remove a function is a fundamentally different thing from being
able to hot-remove a slot.  For example, hot-removing a function from a
device that is in a real hotplug slot should be possible, but it wouldn't
involve the driver for that hotplug slot.

So maybe what we want is a /sys/bus/pci/devices/dddd:bb:dd.f/remove file
that does just that.

Oh, and we also want a way to hot-add functions, not necessarily even
ones that have been removed from the machine after it was booted, but
those that show up after the machine has booted.  For example, one of
my former colleagues had a laptop which would remove the wireless pci
device from the bus if the rfkill switch was enabled.  I remember there
was a hack to load a module that called some pci bus rescan functionality.
I didn't look into it in much detail.

I don't have a firm idea about an interface for this.  SCSI handles it by
writing scsi-add-single-device H C T L to /proc/scsi/scsi.  Maybe we want
a /sys/bus/pci/scan or /sys/bus/pci/devices/scan file that we can echo
"0000:01:02.3" to scan just that function, or "0000:01:02" to scan the
device.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Andrew Patterson
Date: Monday, September 8, 2008 - 10:32 pm

I like this idea. Even perhaps add a recursive scan to rescan behind a
bridge?

-- 
Andrew Patterson
Hewlett-Packard

--

From: Zhao, Yu
Date: Thursday, September 4, 2008 - 8:03 am

Thanks for the fix.

Acked-by: Yu Zhao <yu.zhao@intel.com>

--

From: Jesse Barnes
Date: Thursday, August 21, 2008 - 3:23 pm

Wow, thanks Greg, that was fast.  I applied this to my for-linus branch; I'll 
ask Linus to pull it for 2.6.27.

Thanks,
--
Jesse Barnes, Intel Open Source Technology Center
--

From: Henrique de Moraes Holschuh
Date: Thursday, August 21, 2008 - 6:04 pm

Does 2.6.26 need it as well?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
--

From: Henrique de Moraes Holschuh
Date: Thursday, August 21, 2008 - 6:09 pm

Never mind, I just read Greg's reply.  It arrived a few seconds after I sent
this.  Sorry about it.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
--

From: Greg KH
Date: Thursday, August 21, 2008 - 1:40 pm

No, pci_get_device() never decremented the refcount, and that didn't
change in the above git commit.

The description of pci_get_device() says that a reference is grabbed:
	Iterates through the list of known PCI devices.  If a PCI device
	is found with a matching @vendor and @device, the reference
	count to the device is incremented and a pointer to its device
	structure is returned.  Otherwise, %NULL is returned.  A new
	search is initiated by passing %NULL as the @from argument.
	Otherwise if @from is not %NULL, searches continue from next
	device on the global list.  The reference count for @from is
	always decremented if it is not %NULL.


All of the pci_find* functions should not have grabbed a reference to
the device, as that was the "old" behavior.  All of the pci_get*
functions do grab a reference.

Did I somehow mess up and one of the pci_find* functions now improperly
increment a reference?  Hopefully we shouldn't be using those functions
anymore as they aren't hotplug safe...

thanks,

greg k-h
--

Previous thread: [PATCH] x86: fix probe_nr_irqs for xen by Yinghai Lu on Thursday, August 21, 2008 - 1:10 pm. (13 messages)

Next thread: [PATCH 0/2] pciehp/shpchp prevent duplicate slot names by Alex Chiang on Thursday, August 21, 2008 - 2:11 pm. (4 messages)