Hello, [this patch series touches a few subsystems; hopefully I got all the right maintainers] Recently, Matthew Wilcox sent out the following mail about PCI slots: http://marc.info/?l=linux-pci&m=119432330418980&w=2 The following patch series is a rough first cut at implementing the ideas he outlined, namely, that PCI slots are physical objects that we care about, independent of their hotplug capabilities. We introduce a struct pci_slot as a first-class citizen, and turn hotplug_slot into a subsidiary structure. A brief glimpse at the potential for cleanup is given, as we modify the existing hotplug drivers and remove the multiple get_address() methods. Certainly more cleanup can be done with this new structure. Additionally, we introduce an ACPI PCI slot driver, which will detect all physical PCI slots in the system (as described by ACPI). De-coupling the notion of a physical slot from its hotplug capability allows users to understand the physical geography of their machines, whether their slots are hotpluggable or not. This patch series builds heavily on Willy's prior work (with a bit of the typical refresh needed when aiming at the moving target of the kernel). The ACPI PCI slot driver is new. I have tested this series on both ia64 (hp rx6600) and x86 (hp dl380g). On ia64, system firmware provides _SUN methods for the PCI slots and we get entries in /sys/bus/pci/slots/. On my x86 machine, firmware didn't seem to provide a _SUN, so we don't get any entries in /sys/bus/pci/slots/, but nothing really bad happens either. ;) Comments/feedback are appreciated. Thanks. /ac -
[Please note, I got Rick Jones' email screwed up in the 0/5
email; it's corrected above.]
From: Alex Chiang <achiang@hp.com>
Rename the slot to be the contents of the 'path' sysfs attribute, and
delete the attribute. The mapping from pci address to slot name is
supposed to be done through the 'address' file, which will be provided
automatically later in this series of patches.
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Matthew Wilcox <matthew@wil.cx>
---
drivers/pci/hotplug/sgi_hotplug.c | 27 +--------------------------
1 files changed, 1 insertions(+), 26 deletions(-)
diff --git a/drivers/pci/hotplug/sgi_hotplug.c b/drivers/pci/hotplug/sgi_hotplug.c
index ef07c36..e1e7c9d 100644
--- a/drivers/pci/hotplug/sgi_hotplug.c
+++ b/drivers/pci/hotplug/sgi_hotplug.c
@@ -91,21 +91,6 @@ static struct hotplug_slot_ops sn_hotplug_slot_ops = {
static DEFINE_MUTEX(sn_hotplug_mutex);
-static ssize_t path_show (struct hotplug_slot *bss_hotplug_slot,
- char *buf)
-{
- int retval = -ENOENT;
- struct slot *slot = bss_hotplug_slot->private;
-
- if (!slot)
- return retval;
-
- retval = sprintf (buf, "%s\n", slot->physical_path);
- return retval;
-}
-
-static struct hotplug_slot_attribute sn_slot_path_attr = __ATTR_RO(path);
-
static int sn_pci_slot_valid(struct pci_bus *pci_bus, int device)
{
struct pcibus_info *pcibus_info;
@@ -173,18 +158,10 @@ static int sn_hp_slot_private_alloc(struct hotplug_slot *bss_hotplug_slot,
return -ENOMEM;
bss_hotplug_slot->private = slot;
- bss_hotplug_slot->name = kmalloc(SN_SLOT_NAME_SIZE, GFP_KERNEL);
- if (!bss_hotplug_slot->name) {
- kfree(bss_hotplug_slot->private);
- return -ENOMEM;
- }
+ bss_hotplug_slot->name = slot->physical_path;
slot->device_num = device;
slot->pci_bus = pci_bus;
- sprintf(bss_hotplug_slot->name, "%04x:%02x:%02x",
- pci_domain_nr(pci_bus),
- ((u16)pcibus_info->pbi_buscommon.bs_persist_busnum),
- device + 1);
sn_generate_path(pci_bus, ...Register one slot per slot, rather than one slot per function.
Change the name of the slot to fake%d instead of the pci address.
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Matthew Wilcox <matthew@wil.cx>
---
drivers/pci/hotplug/fakephp.c | 75 +++++++++++++++--------------------------
1 files changed, 27 insertions(+), 48 deletions(-)
diff --git a/drivers/pci/hotplug/fakephp.c b/drivers/pci/hotplug/fakephp.c
index 027f686..828335e 100644
--- a/drivers/pci/hotplug/fakephp.c
+++ b/drivers/pci/hotplug/fakephp.c
@@ -93,6 +93,7 @@ static int add_slot(struct pci_dev *dev)
struct dummy_slot *dslot;
struct hotplug_slot *slot;
int retval = -ENOMEM;
+ static int count = 1;
slot = kzalloc(sizeof(struct hotplug_slot), GFP_KERNEL);
if (!slot)
@@ -106,7 +107,8 @@ static int add_slot(struct pci_dev *dev)
slot->info->max_bus_speed = PCI_SPEED_UNKNOWN;
slot->info->cur_bus_speed = PCI_SPEED_UNKNOWN;
- slot->name = &dev->dev.bus_id[0];
+ slot->name = kmalloc(8, GFP_KERNEL);
+ sprintf(slot->name, "fake%d", count++);
dbg("slot->name = %s\n", slot->name);
dslot = kmalloc(sizeof(struct dummy_slot), GFP_KERNEL);
@@ -141,17 +143,17 @@ error:
static int __init pci_scan_buses(void)
{
struct pci_dev *dev = NULL;
- int retval = 0;
+ int lastslot = 0;
while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) {
- retval = add_slot(dev);
- if (retval) {
- pci_dev_put(dev);
- break;
- }
+ if (PCI_FUNC(dev->devfn) > 0 &&
+ lastslot == PCI_SLOT(dev->devfn))
+ continue;
+ lastslot = PCI_SLOT(dev->devfn);
+ add_slot(dev);
}
- return retval;
+ return 0;
}
static void remove_slot(struct dummy_slot *dslot)
@@ -275,23 +277,9 @@ static int enable_slot(struct hotplug_slot *hotplug_slot)
return -ENODEV;
}
-/* find the hotplug_slot for the pci_dev */
-static struct hotplug_slot *get_slot_from_dev(struct pci_dev *dev)
-{
- struct dummy_slot *dslot;
-
- list_for_each_entry(dslot, &slot_list, node) ...Please use snprintf to avoid buffer overruns! --linas -
Or, since kmalloc can return a 32-byte object at smallest, just allocate 32 bytes and continue using sprintf. fake%d would overflow after 999,999,999,999,999,999,999,999,999 pci devices, by which time we've run the machine out of memory anyway. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -
This is ugly. Please do it the way we already do e.g. for acpiphp: add a=20 char[8] to "struct dummy_slot" and just reference that here. Or better do=20 what this name suggests: kill fakephp completely and use dummyphp[1] instea= d. Eike 1) http://opensource.sf-tec.de/kernel/
Hi Eike, I took at look at the code in acpiphp you're talking about, and I'm not sure why you think the above is "ugly". We're talking about a runtime vs compile time storage for the name, and doing a kmalloc/sprintf is a pretty standard idiom. BTW, I did incorporate both Linas' and Willy's comments, by changing the kmalloc size to an explicit 32, and using snprintf instead. In any case, for your particular comment, I think I'm going to leave it alone for now, and let Greg weigh in with the final recommendation. Thanks for the review. /ac -
Because we have another allocation of very small size for every slot here. struct dummy_slot has a size of 4 pointers, that's 16 byte for 32 bit=20 architectures. Putting 8 byte of additional storage here would save a=20 complete allocation on 32 bit platforms. For 64 bit platforms the memory=20 usage is the same but we do less allocations (i.e. less points to fail) and= =20 less memory fragmentation. Btw: your code lacks a check if kmalloc() returns NULL. Eike
Hi Eike, Good points. I'll make your suggested change for v2. Thanks. /ac -
Introduce struct pci_slot
- Make pci_slot the primary sysfs entity. hotplug_slot becomes a
subsidiary structure.
- Change the prototype of pci_hp_register() to take the bus and
slot number (on parent bus) as parameters.
- Remove all the ->get_address methods since this functionality is
now handled by pci_slot directly.
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Matthew Wilcox <matthew@wil.cx>
---
drivers/pci/Makefile | 2 +-
drivers/pci/hotplug/acpiphp.h | 1 -
drivers/pci/hotplug/acpiphp_core.c | 23 +---
drivers/pci/hotplug/acpiphp_glue.c | 16 --
drivers/pci/hotplug/acpiphp_ibm.c | 5 +-
drivers/pci/hotplug/cpci_hotplug_core.c | 2 +-
drivers/pci/hotplug/cpqphp_core.c | 4 +-
drivers/pci/hotplug/fakephp.c | 2 +-
drivers/pci/hotplug/ibmphp_ebda.c | 3 +-
drivers/pci/hotplug/pci_hotplug_core.c | 237 +++++++++++--------------------
drivers/pci/hotplug/pciehp_core.c | 22 +---
drivers/pci/hotplug/rpadlpar_sysfs.c | 4 +-
drivers/pci/hotplug/sgi_hotplug.c | 7 +-
drivers/pci/hotplug/shpchp_core.c | 17 +--
drivers/pci/pci.h | 13 ++
drivers/pci/slot.c | 157 ++++++++++++++++++++
include/linux/pci.h | 14 ++
include/linux/pci_hotplug.h | 12 +--
18 files changed, 293 insertions(+), 248 deletions(-)
create mode 100644 drivers/pci/slot.c
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 5550556..12f0b2d 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -2,7 +2,7 @@
# Makefile for the PCI bus specific drivers.
#
-obj-y += access.o bus.o probe.o remove.o pci.o quirks.o \
+obj-y += access.o bus.o probe.o remove.o pci.o quirks.o slot.o \
pci-driver.o search.o pci-sysfs.o rom.o setup-res.o
obj-$(CONFIG_PROC_FS) += proc.o
diff --git a/drivers/pci/hotplug/acpiphp.h ...How much migration do you expect? Some of it? All of it? If the answer is "all of it", wouldn't it just be easier to rename struct hotplug_slot, and go from there? --linas -
Only the bits that make sense for non-hotpluggable slots. I tried your suggestion first. It wasn't much fun. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -
Detect all physical PCI slots as described by ACPI, and create entries in /sys/bus/pci/slots/. Not all physical slots are hotpluggable, and the acpiphp module does not detect them. Now we know the physical PCI geography of our system, without caring about hotplug. Signed-off-by: Alex Chiang <achiang@hp.com> --- Looking to get comments on this patch, as it could use some work. I borrowed a lot of code from acpiphp_glue.c. If your work is identifiable and feel that pci_slot.c should have your (c), please let me know. More importantly, in register_slot(), we are seeing a lot of noisy output during boot, as acpi_get_pci_id() complains quite often about not finding an ACPI-PCI binding on devices we pass in. I'm not quite sure what is the best way to detect whether we've already added a slot, so that we can avoid calling acpi_get_pci_id() so often. drivers/acpi/Makefile | 3 +- drivers/acpi/pci_slot.c | 168 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 170 insertions(+), 1 deletions(-) create mode 100644 drivers/acpi/pci_slot.c diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index 54e3ab0..f6caec8 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -47,7 +47,8 @@ obj-$(CONFIG_ACPI_FAN) += fan.o obj-$(CONFIG_ACPI_DOCK) += dock.o obj-$(CONFIG_ACPI_BAY) += bay.o obj-$(CONFIG_ACPI_VIDEO) += video.o -obj-y += pci_root.o pci_link.o pci_irq.o pci_bind.o +obj-y += pci_root.o pci_link.o pci_irq.o pci_bind.o \ + pci_slot.o obj-$(CONFIG_ACPI_POWER) += power.o obj-$(CONFIG_ACPI_PROCESSOR) += processor.o obj-$(CONFIG_ACPI_CONTAINER) += container.o diff --git a/drivers/acpi/pci_slot.c b/drivers/acpi/pci_slot.c new file mode 100644 index 0000000..a49a14e --- /dev/null +++ b/drivers/acpi/pci_slot.c @@ -0,0 +1,168 @@ +/* + * pci_slot.c - ACPI PCI Slot Driver + * + * The code here is heavily leveraged from the acpiphp module. + * Thanks to Matthew Wilcox <matthew@wil.cx> for much guidance. + ...
Change the semantics of pci_create_slot() such that it does not
require a hotplug release() method when creating a slot. Now, we
can use this interface to create a pci_slot for any physical PCI
slots, not just hotpluggable ones.
Add new pci_slot_add_hotplug() interface so that various PCI hotplug
drivers can add hotplug release() methods to pre-existing physical
slots.
Signed-off-by: Alex Chiang <achiang@hp.com>
---
drivers/pci/hotplug/pci_hotplug_core.c | 2 +-
drivers/pci/slot.c | 39 +++++++++++++++++++++++++++-----
include/linux/pci.h | 4 ++-
3 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/drivers/pci/hotplug/pci_hotplug_core.c b/drivers/pci/hotplug/pci_hotplug_core.c
index 96c21e2..91afa8e 100644
--- a/drivers/pci/hotplug/pci_hotplug_core.c
+++ b/drivers/pci/hotplug/pci_hotplug_core.c
@@ -568,7 +568,7 @@ int pci_hp_register(struct hotplug_slot *slot, struct pci_bus *bus, int slot_nr)
return -EINVAL;
}
- pci_slot = pci_create_slot(bus, slot_nr, slot->name, hotplug_release);
+ pci_slot = pci_slot_add_hotplug(bus, slot_nr, hotplug_release);
if (IS_ERR(pci_slot))
return PTR_ERR(pci_slot);
slot->pci_slot = pci_slot;
diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
index 5d830d4..3e91491 100644
--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -70,20 +70,48 @@ static int create_sysfs_files(struct pci_slot *slot)
return result;
}
+struct pci_slot *pci_slot_add_hotplug(struct pci_bus *parent, int slot_nr,
+ void (*release)(struct pci_slot *))
+{
+ struct pci_slot *slot;
+ int found = 0;
+
+ down_write(&pci_bus_sem);
+
+ /* This slot should have already been created, so look for it. If
+ * we can't find it, return -EEXIST.
+ */
+ for (slot = parent->slot; slot; slot = slot->next)
+ if (slot->number == slot_nr) {
+ found = 1;
+ break;
+ }
+
+ if (!found) {
+ slot = ERR_PTR(-EEXIST);
+ goto out;
+ }
+
+ slot->release = release;
+ ...I'm still not sold on this idea at all. I'm really betting that there is a lot of incorrect acpi slot information floating around in machines and odd things will show up in these slot entries. I say this because for a long time there was no "standard" acpi entries for hotplug slots and different companies did different things. Hence the "odd" IBM acpi hotplug implementation as one example. If this is going to go anywhere, you need to get IBM to agree that it works properly with all their machines... Also, some companies already provide userspace tools to get all of this information about the different slots in a system and what is where, from userspace, no kernel changes are needed. So, why add all this extra complexity to the kernel if it is not needed? thanks, greg k-h -
Is that the end of the world? Instead of having no information, we'll end up with odd information. If people complain, we can always blacklist (indeed, won't the ACPI rolling blacklist catch the majority Not in terms of slot names. There were various things that ACPI didn't specify, like attention and latches, but the description of _SUN hasn't Do you have any examples of this? We should, IMO, be improving the way we tell users which device has a problem. 'tulip7', 'eth4', 'pci 0000:04:1e.3' ... all the user wants to know is which damn card it is so they can replace it. And slot name tells them that. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -
I don't think the rolling blacklist will catch this, as the rest of the ACPI code is "correct". And yes, incorrect information exported by the kernel is not good at Ok, again, I want to see the IBM people sign off on this, after testing on all of their machines, before I'll consider this, as I know the IBM acpi tables are "odd". Also, how about Dell machines? I know they are probably not expecting this information to show up and who knows if the numbering of their slots match up with their physical diagrams (I say this based on all of IBM sells a program that does this for server rooms. It's probably part of some Tivoli package somewhere, sorry I don't remember the name. I did see it working many years ago and it required no kernel changes at Not if the slot information is incorrect :) That's my main concern. thanks, greg k-h -
That seems a little higher standard than patches are normally held to. How about the patches get sent to the appropriate people at IBM (who are they?) and if we haven't heard a NAK from them after a month, then they I'm pretty sure that Windows uses the _SUN information, so I think this Well, yes, it doesn't take kernel effort if you're willing to build proprietary knowledge into a userspace program. This seems similar to the argument that filesystems don't need to be in the kernel since they can be implemented in user space. They work better in the kernel, just like this does. By the way, the end result of this should be a pci-hotplug system that has much less code in the drivers, more uniformity between the presentation to userspace, more functionality and is easier to understand. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -
When you are touching code that I know is going to have problems on a whole range of different manufacturer's machines, do you expect me to just ignore that and let you add a feature that is going to be Again, based on the past history with such things as mis-labeled ethernet ports, I have doubts. I can hope, but I need proof based on the major problems we have had in the past. It is better to not show information to users at all, then to give them Don't mix up things like filesystems and exposing pci slots. That is just foolish... thanks, greg k-h -
I be one of them. :) I have been involved in many (but not all) of IBM's x86 based (IBM System x) servers with hotplug capable I am not fundamentally opposed to this new capability but share the same concerns that Greg and others have expressed. So far, I have only tried the changes on one single node system (IBM x3850) but the below NAK-worthy result supports the idea that the changes need to be well and widely tested. Have you possibly considered a kernel option as a kinder and gentler way of introducing the changes? Gary -- Gary Hade System x Enablement IBM Linux Technology Center 503-578-4503 IBM T/L: 775-4503 garyhade@us.ibm.com http://www.ibm.com/linux/ltc ==== IBM x3850 Slots 1-2: PCI-X under PCI root bridges Slots 3-6: PCIe under transparent P2P bridges Slot 1: PCI-X - populated Slot 2: PCI-X - !populated Slot 3: PCIe - populated Slot 4: PCIe - !populated Slot 5: PCIe - !populated Slot 6: PCIe - populated result is with 2.6.24-rc2 plus all 4 proposed patches problem: acpiphp failed to register empty PCIe slots 4 and 5 ==== # dmesg acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5 acpiphp_glue: found PCI host-bus bridge with hot-pluggable slots acpiphp_glue: found ACPI PCI Hotplug slot 1 at PCI 0000:02:01 acpiphp: Slot [1] registered acpiphp_glue: found PCI host-bus bridge with hot-pluggable slots acpiphp_glue: found ACPI PCI Hotplug slot 2 at PCI 0000:06:01 acpiphp: Slot [2] registered acpiphp_glue: found PCI-to-PCI bridge at PCI 0000:0a:00.0 acpiphp_glue: found ACPI PCI Hotplug slot 3 at PCI 0000:0b:00 acpiphp: Slot [3] registered acpiphp_glue: found PCI-to-PCI bridge at PCI 0000:0f:00.0 acpiphp_glue: found ACPI PCI Hotplug slot 4 at PCI 0000:10:00 acpiphp: pci_hp_register failed with error -17 acpiphp_glue: acpiphp_register_hotplug_slot failed(err code = 0xffffffef) acpiphp_glue: found ACPI PCI Hotplug slot 4 at PCI 0000:10:00 acpiphp: pci_hp_register failed with error -17 acpiphp_glue: acpiphp_register_hotplug_slot ...
Hi Gary, Silly question, but I have to ask. :) I sent out 5 patches -- is this simply a typo on your part, or Ok, so acpiphp wasn't going to register those slots anyway, since they are empty. It would have bailed out after not seeing _ADR or _EJ0 on those slots. The acpi-pci-slot driver created those slots anyway, which is one of the points of the patch -- to create sysfs entries even for [repeated 7x] We saw this message 8x, once for each SxFy object under your p2p bridge. I actually somewhat did expect to see this error message (hence the RFC part of my patch ;) I currently don't have a good way to determine if we've already seen an empty slot under a p2p bridge, so we try to register every SxFy object. Of course, a /sys/bus/pci/slots/4/ entry Arguably, the right thing happened here. We got entries for empty slots, and we know their addresses. If anyone can clue me in on a better way to implement patch 4/5 in my series so that we're not seeing those multiple attempts to register slots under p2p bridges, I'd love to hear your ideas. Thanks again for testing. /ac -
Thanks. This will allow everyone to focus on the systems where the changes are most beneficial and not waste a bunch of time No, acpiphp should (and did before your changes) register all hotplug capable slots. All 6 slots (2 PCI-X, 4 PCIe) in that system are hotplug capable. Emptyness shouldn't matter. If the empty slots are not registered it is not be possible to successfully hotplug cards to them. Without your changes acpiphp loads with the following output. acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5 acpiphp: Slot [1] registered acpiphp: Slot [2] registered acpiphp: Slot [3] registered acpiphp: Slot [4] registered acpiphp: Slot [5] registered acpiphp: Slot [6] registered With your changes I confirmed that an attempted hotplug to a boot-time vacant PCIe slot failed as expected. The driver saw the insertion event but didn't find anything to enable: acpiphp_glue: handle_hotplug_event_bridge: Bus check notify on \_SB_.VP05.CALG acpiphp_glue: handle_hotplug_event_bridge: re-enumerating slots under \_SB_.VP05.CALG No, the P2P parent bus is 0000:0f and the P2P child bus is 0000:10 so I believe the real address for slot 4 should be 0000:10:00. kernel without your changes after loading acpiphp: # cat /sys/bus/pci/slots/4/address 0000:10:00 kernel with your changes both before and after loading acpiphp: # cat /sys/bus/pci/slots/4/address Of course, this kind of confusing noise would not be acceptable No, the wrong thing happened here. I expect the slot directories for the empty slots to look the same as they did before your changes. This is what the slot directories for empty slots look like without your changes. # find /sys/bus/pci/slots/[45] /sys/bus/pci/slots/4 /sys/bus/pci/slots/4/power /sys/bus/pci/slots/4/attention /sys/bus/pci/slots/4/latch /sys/bus/pci/slots/4/adapter /sys/bus/pci/slots/4/address /sys/bus/pci/slots/5 /sys/bus/pci/slots/5/power /sys/bus/pci/slots/5/attention ...
Hi Gary, Of course, you are right. I am not sure what I was thinking when I wrote that email yesterday, but I pretty much got it exactly wrong. I think I confused myself between empty/populated vs hp/non-hp. In any case, thanks for the patient explanation. I went back and looked at a vanilla kernel on my machine that has the following hardware configuration: HP rx6600 Slot 1-2: PCI-X under PCI root bridges Slot 3-6: PCIe under transparent P2P bridges Slot 7-10: PCI-X under PCI root bridges Slot 1: PCI-X, !populated, !hp Slot 2: PCI-X, populated, !hp Slot 3: PCIe, populated, hp Slot 4: PCIe, populated, hp Slot 5: PCIe, !populated, hp Slot 6: PCIe, populated, hp Slot 7: PCI-X, !populated, hp Slot 8: PCI-X, !populated, hp Slot 9: PCI-X, !populated, hp Slot 10: PCI-X, !populated, hp On a kernel without my changes, we see: [root@canola slots]# modprobe acpiphp [root@canola slots]# ls 10 3 4 5 6 7 8 9 [root@canola slots]# for i in * ; do ls $i ; done adapter address attention latch power adapter address attention latch power adapter address attention latch power adapter address attention latch power adapter address attention latch power adapter address attention latch power adapter address attention latch power adapter address attention latch power [root@canola slots]# for i in * ; do cat $i/address ; done 0000:01:02 0000:8b:00 0000:52:00 0000:c5:00 0000:10:00 0000:0a:01 0000:4a:01 0000:01:01 This confirms that you were right -- we see all the hp slots show up, even though many of them are non-populated. Also, the correct hp entries show up, just like you'd expect. Ok, so all that said, after re-implementing my ACPI-PCI slot driver, we get all the correct answers, but with the additional appearance of slots 1 and 2 (which aren't hotpluggable): [root@canola slots]# ls 1 10 2 3 4 5 6 7 8 9 [root@canola slots]# for i in * ; do ls $i ; done address adapter address attention latch ...
On Thu, Nov 15, 2007 at 10:36:13AM -0700, Alex Chiang wrote: Yea, looks much better. Nice to see that the addresses for slots below the p2p bridges now include the child instead of the parent bus number. :) Will do. It still looks like I will not get the box back until early next week. I will only be working Monday and Tuesday due to holiday/vacation but that will hopefully not I haven't tried it yet but I feel better seeing that you are now testing on a system having a PCI topology that appears similar to some of our systems. Well, I didn't see any obvious warts but I'm pretty skilled at missing problems when looking at that kind of info. Thanks, Gary -- Gary Hade System x Enablement IBM Linux Technology Center 503-578-4503 IBM T/L: 775-4503 garyhade@us.ibm.com http://www.ibm.com/linux/ltc -
Hi Gary, Can you send me the lspci -vv and lspci -vt output from this machine? Thanks. /ac -
Alex, I added some slot number hints to the below `lspci -vt` output. I am still looking at your other message and will have some hopefully helpful comments soon. I did apply all 5 patches, 4 was a typo. Thanks, Gary -- Gary Hade System x Enablement IBM Linux Technology Center 503-578-4503 IBM T/L: 775-4503 garyhade@us.ibm.com http://www.ibm.com/linux/ltc Script started on Wed 14 Nov 2007 03:38:14 AM PST 0;root@elm3a9:~[root@elm3a9 ~]# lspci -vt -+-[0000:19]---00.0-[0000:1a-1d]--+-00.0 Intel Corporation 82571EB Gigabit Ethernet Controller <- slot 6 | \-00.1 Intel Corporation 82571EB Gigabit Ethernet Controller +-[0000:14]---00.0-[0000:15-18]-- <- slot 5 +-[0000:0f]---00.0-[0000:10-13]-- <- slot 4 +-[0000:0a]---00.0-[0000:0b-0e]----00.0-[0000:0c]----04.0 Adaptec ASC-29320ALP U320 <- slot 3 +-[0000:06]---00.0 IBM Calgary PCI-X Host Bridge <- slot 2 +-[0000:02]-+-00.0 IBM Calgary PCI-X Host Bridge <- slot 1 | \-01.0 S2io Inc. Xframe II 10Gbps Ethernet +-[0000:01]-+-00.0 IBM Calgary PCI-X Host Bridge | +-01.0 Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet | +-01.1 Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet | \-02.0 Adaptec AIC-9410W SAS (Razor ASIC non-RAID) \-[0000:00]-+-00.0 IBM Calgary PCI-X Host Bridge +-01.0 ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE] +-03.0 NEC Corporation USB +-03.1 NEC Corporation USB +-03.2 NEC Corporation USB 2.0 +-0f.0 Broadcom CSB6 South Bridge +-0f.1 Broadcom CSB6 RAID/IDE Controller \-0f.3 Broadcom GCLE-2 Host Bridge 0;root@elm3a9:~[root@elm3a9 ~]# lspci -vvv 00:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 04) Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- ...
Hi Gary, Actually, I just reworked my patch this morning, and believe that I have a much cleaner implementation now that should fix a lot of the errors you saw. Incoming! Thanks. /ac -
Who would be a good contact at IBM to get some eyes / machine Who would be a good contact at Dell for the same? I don't have as much experience with oddball firmware from various vendors as others on this list, but given the rather stable definition of _SUN in the ACPI spec, I'd be surprised to see vendors abusing that method. [I fully accept the possibility that I'm just naive ;)] More likely, a vendor will do what the HP Proliant folks did, that being simply omitting a _SUN method altogether. One more thought on that -- at *worst* my patch series will do no worse than the status quo of what the acpiphp module is doing today. That module walks through the namespace looking for _SUN methods, and when it finds them, it creates an entry in exactly the same spot (/sys/bus/pci/slots/N) that my patch series does. What this series adds beyond acpiphp is adding entries for slots Like I said in an earlier email, HP ia64 systems will require a kernel change to get this information. Whether it comes via a generic ACPI access layer like dev_acpi, or something like this patch series, the kernel will still get touched. Thanks. /ac -
The acpiphp module is not loaded on zillions of non-pci-hotplug How are you going to ensure that the userspace programs that use those sysfs files are not going to break into tiny pieces now that you have And like I said, I'm pretty sure you don't need to touch the kernel today as there are people doing this just fine from userspace without any kernel changes needed :) thanks, greg k-h -
Do you know how this Tivoli package works or where it gets the I think you are assuming userspace can just dump the raw ACPI tables and extract this information from them. But I don't think that's possible because _SUN is a method that can contain arbitrary AML. That AML has to be *executed*, and you can't do that safely in userspace. Bjorn -
I was told it was just reading the ACPI tables from userspace. As I saw it running on a machine without a modified kernel, I had no reason to doubt this, but it might be doing something else for all I know. What I do know is that it somehow does work without any kernel changes needed, and is in use today by very large data centers without any Yes, I was assuming that based on the above running code, if this is somehow impossible to do from userspace, I don't know what the code is doing. thanks, greg k-h -
The only reported _SUN problems on Dell systems were on the PE6800 and PE6850 systems, which we've fixed with an updated BIOS several months ago. IIRC the values weren't always unique which kind of defeated the purpose. The eth0/eth1 "issues" are completely separate, with no relation to ACPI. That's purely a PCI tree depth-first discovery vs. breadth-first discovery (where it indeed seems everyone except Linux 2.6 uses and expects breadth-first). We looked at extending ACPI to export device naming preference, but instead extended SMBIOS tables because we could get that change in far easier. The other bit that looks at slot numbers is the old PCI IRQ Routing Again, I've only seen this be incorrect on those two models of Dell servers with outdated BIOS versions. Thanks, Matt -- Matt Domsch Linux Technology Strategist, Dell Office of the CTO linux.dell.com & www.dell.com/linux -
FWIW, the ACPI 2.0 spec did not require uniqueness for _SUN. (although there is a strange table that refers to _SUN as the slot-unique ID (table 6-1 in spec v2.0b), the actual definition of _SUN does not mention uniqueness). Uniqueness was first required in the 3.0 spec. /ac -
Does your code handle if these are not unique? thanks, greg k-h -
Hi Greg, I hate to punt on this, but I'm not doing anything that acpiphp is not doing. If we get back non-unique values for _SUN, I'm pretty sure both acpiphp and my pci_slot driver will both continue along as far as they can, until they get an error back from kobject_register() about trying to register an object with an existing name. I think I may have a machine that has non-unique values for _SUN. If so, I'll try and hunt it down and do some testing. Thanks for the suggestion. /ac -
I know they are different, I was using that as an example of trying to match up physical descriptions on a machine with the internal kernel representation. It isn't always easy and lots of times things go wrong, if for no other reason than no one is checking these things. thanks, greg k-h -
On HP ia64 systems, that information is locked away in ACPI, and there's no easy way to get at it. Alex Williamson tried providing a generic dev_acpi driver, so that userspace could do whatever they wanted to with the information: http://lkml.org/lkml/2004/8/3/106 And that effort kinda died. I'm of mixed feelings on that driver, since it would be really nice to get unfettered access to the ACPI namespace, but it's pretty dangerous, since any interesting thing you might want to do is actually a method (aka, it calls into firmware) and who knows what side effects there might be. So from my point of view, if ia64 customers want to know about the slots they have in their systems, we're gonna have to do something kernel-intrusive anyhow. Thanks. /ac -
Doesn't /sys/firmware/acpi give you raw access to the correct tables already? And isn't there some other tool that dumps the raw ACPI tables? I thought the acpi developers used it all the time when debugging things with users. thanks, greg k-h -
I'm neither an acpi developer (well I don't think that I am :) nor an end-user, but here are the two things for which I was going to use the information being presented by Alex's patch: 1) a not-yet, but on track to be released tool to be used by end-users to diagnose I/O bottlenecks - the information in /sys/bus/pci/slot/<foo>/address would be used to associated interfaces and/or pci busses etc with something the end user would grok - the number next to the slot. 2) I was also going to get the folks doing installers to make use of the "end-user" slot ID. Even without going to the extreme of the aforementioned 192 slot system, an 8 slot system with a bunch of dual-port NICs in it (for example) is going to present this huge list of otherwise identical entries. Even if the installers show the MAC for a NIC (or I guess a WWN for an HBA or whatnot) that still doesn't tell one without prior knowledge of what MACs were installed in which slot, which slot is associated with a given ethN. Having the end-user slot ID visible is then going to be a great help to that poor admin who is doing the install. rick jones -
Why not just use the code in the linux firmware kit that does this already today from userspace (thanks to Kristen for pointing this out to me on irc.)? No kernel changes needed, and you can get started on your userspace application today, will work on all distros, no problems at all :) thanks, greg k-h -
So then we have something that works on ACPI-based machines. Who will add support for all the other kinds of firmware and non-firmware based slots? -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -
Well, it seems that the powerpc people do not want this at all, as they just have "virtual" slots that they don't want people trying to root around and find the real thing. What other types of machines can export this kind of information? How will you discover non-firmware based slots? thanks, greg k-h -
Just to add my 2 cents, I've not seen any CompactPCI gear that had a way to easily map a PCI peripheral slot to the corresponding physical slot in the chassis. In the products I've worked on that use my CPCI hotplug drivers, we did any required mapping in userspace based on knowledge of the chassis layout for the particular product. Scott -- ============================================================================== Scott Murray, scott@spiteful.org "Good, bad ... I'm the guy with the gun." - Ash, "Army of Darkness" -
On Tue, 13 Nov 2007 16:04:00 -0700 As far as being able to retrieve the slot number (which it seemed from the HP manageablity application perspective is the goal here), that information is available from userspace as well for at least standard PCI and pcie based systems for occupied slots. For standard pci, you have to make something up anyway - for shpchp we just use an incremental number and combine it with the bus number to represent the slot. For pcie, you can get this info from the slot capabilities register. -
Ummm ... that's not what the /spec/ says. I've never worked on any shpc machines, but the shpc driver reads the slot values from the SLOT_CONFIG register, just like the spec says to. I don't understand how you're supposed to get the slot number for standard PCI for occupied slots. Could you explain? -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -
The slot number for shpc slot is like 'YYYY_XXXX'. YYYY is the bus number, though I don't know the specific reason why it was added. XXXX is slot number decided according to the shpc specification, as you said. Thanks, Kenji Kaneshige -
On Wed, 14 Nov 2007 18:55:21 +0900 Kenji is correct, you should listen to him, not me :). For standard PCI slots that are not hotplug, I would think you'd need firware support to find the slot numbers -- but I'm not sure on this. btw - the YYYY was added to prevent duplicate slot numbers on buggy machines btw. -
On Tue, 13 Nov 2007 12:26:32 -0800 There are - people should take a look at the Intel Linux Firmware Kit for an example of how to parse ACPI tables in userspace - the code is also GPL'd, so you are free to use it in another GPL application. http://www.linuxfirmwarekit.org/download.php#source In many DSDTs I've seen, _SUN is hardcoded anyway and can be found by reading the DSDT from userspace. This is what the firmwarekit does to check for duplicate _SUN in one of it's tests. Kristen -
I see three relevant things in the firmware kit:
1) ExecuteAmlMethod() in amlpoke/amlpoke.c. This uses
/proc/acpi/hotkey/event_config to cause the kernel to
execute an AML method. This looks similar to what dev_acpi
does and is unsafe for the same reasons (the method may have
side effects that interfere with kernel drivers). The kernel
support for this was removed in 2.6.21:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5ee6ed...
2) execute_aml_method() in acpitable.c. Similar to above.
3) parse_SUN_name() in SUN/SUN.c. This uses acpidump, acpixtract,
and iasl -d to disassemble the DSDT and SSDTs, then looks for
things like "Name (_SUN, 0x0000012C)". That works well if _SUN
merely returns a constant, and many DSDTs do that.
But _SUN can be implemented as a control method, and then we have
a problem because we can't determine the _SUN value by inspecting
the DSDT. We have to evaluate the method, and that may require
operation regions, semaphores, etc., so it can only be done in the
kernel.
So I agree that the firmware kit has a clever hack that works on much
existing x86 firmware, and it sounds like Tivoli might even rely on
it. But I don't feel good about it, and it could easily break when
some BIOS writer needs to make _SUN slightly more complicated.
Bjorn
-
Do you know of such BIOSes out there that do this? Will the above scheme not work for the ia64 boxes that you know of that are out in the world today? thanks, greg k-h -
Hi Greg, Matthew and Bjorn have already been hinting at this, but I'd like to voice my concern as well. Even if no one is implementing _SUN as a complicated AML control method today, that is no guarantee that prevents future firmware vendors from doing so. The spec allows it, so in my opinion, pointing to existing implementations and claiming that it all just works is somewhat unfair (and scary). If/when the first vendor implements an AML version of _SUN, the firmware kit will break, and Tivoli will have to react. I was initially intrigued with you and Kristen pointed out this tool, but after thinking about it some more, I am not convinced it is a viable alternative, due to the unknown nature of future firmware implementations. Thanks. /ac -
Hi Greg, Talked to some of our ia64 firmware folks, and they confirmed that our future platforms will implement _SUN as a control method. Thanks. /ac -
One last mail on this subject -- Bjorn has pointed out to me that the Dell pe6800 and rez1850 both implement _SUN as control methods today. Does Tivoli run on those machines? If so, how is it getting slot information? Thanks. /ac -
Hi Greg, Matt,
I downloaded the firmware kit today and played with it. There is
a test called SUN.exe, which searches through the DSDT, looking
to verify that there are no duplicate _SUN values in the system.
The test breaks on my hp rx6600 (a currently shipping platform),
because _SUN is not a hard-coded value, but implemented as an AML
control method. From the various SSDT dumps:
Here's the _SUN for a slot:
Method (_SUN, 0, Serialized)
{
Store (\SLOT.SUN, ^_UID)
Local0
Return (Local0)
}
Here's the _SUN for a processor:
Method (_SUN, 0, NotSerialized)
{
Store (\CPUL.SUN, ^CPID)
Local0
Return (Local0)
}
So parse_SUN_name() from SUN.c is just plain broken. The test
"passes" because it doesn't find any duplicate values for _SUN,
but what's really going on is that it doesn't find *any* values
of _SUN, so of course there won't be duplicates. ;)
I looked at convincing this test to try and execute the method
using ExecuteAmlMethod() and/or execute_aml_method(), but of
course, that won't work on a modern kernel, as they depend on
/proc/acpi/hotkey/, which was removed (as Bjorn pointed out).
Matt, is there any chance you could see if the firmware kit works
or breaks on your PE6800?
I downloaded the latest tarball:
http://www.linuxfirmwarekit.org/download/firmwarekit-r3.tar.gz
And ran it in stand-alone mode. You do need to build the test
suite, but you don't need to run the entire thing -- just run the
SUN/SUN.exe that is generated. (If you get odd complaints about
not being able to find libstandalone.so, just stick it in
/usr/local/lib, modify /etc/ld.so.conf as necessary and run
ldconfig).
Please apply the following debug patch to see how many _SUN
objects the test is actually finding. On my machine, I don't find
any _SUN objects.
Thanks.
/ac
diff --git a/SUN/SUN.c b/SUN/SUN.c
index 90264a7..3b21b9c 100644
--- a/SUN/SUN.c
+++ b/SUN/SUN.c
@@ -260,6 +260,7 @@ static void parse_SUN_name(gpointer data, gpointer ...Dumping raw ACPI tables isn't adequate - _SUN might be a complex ACPI method with multiple reads and writes to raw hardware, and we really don't want to do that in userspace. The only way to do this reliably is in the kernel. -- Matthew Garrett | mjg59@srcf.ucam.org -
But it really isn't, as the firmware kit has proven... thanks, greg k-h -
The firmware toolkit will only work if the _SUN method merely returns a hardcoded value. The spec doesn't require that that be the case, and it's easy to imagine situations where it won't. -- Matthew Garrett | mjg59@srcf.ucam.org -
Second that motion.... I don't get it. What are the goals of this patch, really? Just to get a "slot geographical location" from the kernel? I'm balancing the intellectual appeal of having a kernel struct for representing physical objects, against the headache of reading (debugging, modifying) code that has yet another struct doing yet another thing. So far, the dread of future headaches is winning. On pseries systems, I deal with something called the "partitionable endpoint", which I think probably usually corresponds to physical slots, but I don't really know. The hardware guys pitch changeups all the time. Some of these are soldered onto the planar, so the are not physical slots. But they are, um, "hotpluggable" in that you can dynamically add & remove them from the system (even though you cannot physically unplug the ones that are soldered on -- you can only assign them to other hardware partitions (think hardware VM or hardware Xen)). So, naively, the physical slot concept doesn't really map to what I have to work with; it just adds one more appendix to it all, one more thing to get confused about. To be clear: above remarks are for the PowerPC boxes. I have no clue about how things work on the IBM Intel-based boxes. And Greg's original "get IBM to agree" remark is about the Intel-based boxes. --linas -
~ # find /sys/bus/pci/slots -print /sys/bus/pci/slots /sys/bus/pci/slots/control /sys/bus/pci/slots/control/remove_slot /sys/bus/pci/slots/control/add_slot /sys/bus/pci/slots/0001:00:02.0 /sys/bus/pci/slots/0001:00:02.0/phy_location /sys/bus/pci/slots/0001:00:02.0/max_bus_speed /sys/bus/pci/slots/0001:00:02.0/adapter /sys/bus/pci/slots/0001:00:02.0/attention /sys/bus/pci/slots/0001:00:02.0/power /sys/bus/pci/slots/0001:00:02.6 etc. ~ # cat /sys/bus/pci/slots/0000:00:02.2/phy_location U787A.001.DNZ00Z5-P1-C2 phy_location corresponds to the printed labels on the box, and is also shared with the various management consoles and serivce processors and what not that get involved with stuff like that. Although, please note: on these sysmes, 99% of the "useful" info about hardware is in /proc/device-tree, and not in /sys I doubt anyone looks at /sys/bus/pci/slots, I think they mostly look at the open firmware device tree. --linas -
Ugh. Almost two years ago, paulus promised me he was going to fix the slot name for rpaphp. Guess he didn't. This is one of the hateful things about the current design -- hotplug drivers do too much. Instead of being just the interface between the Linux PCI code and the hardware, they create sysfs directories, add files, and generally have far too much freedom. We have four different schemes currently for naming in slots/, 1. slot number. Used by cpqphp, ibmphp, acpiphp, pciehp, shpc. 2. domain:bus:dev:fn. Used by fakephp. 3a. domain:bus:dev. Used by rpaphp and sgihp. 3b. Except that rpaphp uses phy_location to present the information that should be in the name and sgihp uses path. ... I've forgotten what cpci uses. And yenta doesn't use it. How is anyone supposed to write sane managability tools in the presence Right. This should look like: # cat /sys/bus/pci/slots/U787A.001.DNZ00Z5-P1-C2/address 0000:00:02 -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -
It's not only complexity. Each new sysfs entry costs memory. Memory is not free. There should be always a good reason for those. -Andi -
It's not a lot of memory; it's one directory and a couple of files for each PCI slot in the system. Even huge systems have maybe 200 slots. In order for this to take up as much as one page of ram on a typical PC with six slots, this would have to consume 680 bytes per directory. I don't think sysfs is that inefficient (and if it is, maybe this feature is not where the problem is, given the 'power' directory per device, the 19 files per scsi device, the huge numbers of symlinks, etc). -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -
It becomes much more when someone does a find /sys. dentries are expensive. They eventually can get pruned again, but it's still costly to do that. -Andi -
Again, if this is a big concern for you, there are better places to look at for savings. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -
Whoever is proposing a feature has the burden to justify that its usefulness is larger than the overhead/cost it adds. Doesn't seem to be the case with this one so far. And in general ignoring overhead in new features is a pretty sad approach. Big bloat does come in small steps with each new feature. -Andi -
Huh? There are half a dozen people who think it does, and half a dozen people who think it doesn't. Fine, you're on the side which sees no use A very generic argument which could be used to shoot down any new feature. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -
I'm a bit confused where you're going with this. I thought the main point of contention wasn't around the utility of the patchset, it was around getting the actual information correct. We've already mentioned several uses that this patchset adds: 1. allowing managability folks to determine which physical slot their failing card might be sitting in (independent of hotplug capability). 2. giving installers a method of presenting the physical slot to the user so that the user can choose to, say, do a network install off of the card in slot N, vs having to guess based on MAC address or something else unfriendly 3. performance monitoring tools based on slot addresses/numbers I believe that userspace cares more about those sorts of things I'm kinda new to the Linux kernel, so I don't really get what you're saying. Are you saying that the approved method of exposing kernel information to the user via sysfs is actually bloat? Even if that information has utility? Feel free to educate me; I'm here to learn. Thanks. /ac -
Hi,
I tried the series of patches and I encountered some problems.
Since I don't have enough time to investigate them, I can only
report them at present. My environment was ia64 machine with 64
hotplug slots (shpc & pcie). Those slots can be handled by
acpiphp too, though I have not tried it yet.
- Problem(1)
Two kind of Call Traces were displayed so many times at the
boot time. Those are attached below. I guess it relate to
the ACPI Namespace definition.
- Problem(2)
Many of the shpc slot initializations failed with the
following error messages. I guess the cause is Problem(1).
shpchp: pci_hp_register failed with error -17
shpchp: shpchp: slot initialization failed
- Problem(3)
All of the pcie slot initializations failed with the
following error messages. I guess the cause is Problem(1).
pciehp: pci_hp_register failed with error -17
pciehp: pciehp: slot initialization failed
BTW, I have some comments about the patches.
- I think your patches assume that all the hotplug slots have
ACPI _SUN method regardless of the hotplug controller type,
because pci_slot.c is the only place to call pci_slot_create().
But I think there are hotplug slots without _SUN. If the
hotplug slot doesn't have _SUN, slot initialization by the
hotplug controller driver doesn't work (does it?), because
there are no slot directory.
- I think pci_slot.o should be kernel module so that users can
select to enable or disable this functionality.
Here is the Call Trace mentioned above:
Unable to register kobject 1024<3>pci_slot: pci_create_slot returned -22
ACPI: Invalid ACPI Bus context for device <NULL>
sysfs: duplicate filename '1024' can not be created
WARNING: at fs/sysfs/dir.c:424 sysfs_add_one()
Call Trace:
[<a000000100014b60>] show_stack+0x40/0xa0
sp=e000014101befaf0 bsp=e000014101be0f60
[<a000000100014bf0>] dump_stack+0x30/0x60
sp=e000014101befcc0 bsp=e000014101be0f48
...