Short:
1. Calling pci_remove_bus_device() on some of the devices on a bus will
remove them. (ok)
2. Calling pci_scan_child_bus() on that bus will scan the bus for
devices and add them all. even the ones already in the bus device list,
which will now be duplicated in the bus device list! (no so ok)
3. Calling pci_bus_add_devices() Will try to add the device, assign
resource, and create procfs files. The devices which were not originally
removed are were being assigned resource they don't need and procfs
files already exist and will collide (Bug)
Background:
I have a programmable component which implements a pci device. Upon
warm update i need to remove my devices from the bus, upgrade, and add
them back.
I have bashed my head against pci.h API for the last week trying to make
this work with no success. Right now I have devised a hideous hack to
circumvent pci.h and directly remap the pci device registers.
demo code:
reload_my_hardware_design() {
struct pci_dev *dev = NULL;
struct pci_bus *bus = NULL;
while ((dev = pci_get_device(PCI_VENDOR_ID_MY_ID,PCI_ANY_ID,NULL))) {
pci_remove_bus_device(dev);
}
...
reload the hardware
...
bus = pci_find_bus(0,0); /* though hard coded in this example this
is the right bus*/
pci_scan_child_bus(bus);
pci_bus_add_devices(bus);
}
Is this a bug or am I doing something wrong?
Liberty
--
Dear Penguins, As a follow up on my own post ( http://lkml.org/lkml/2008/6/18/195 ), I have solved a small bug/misbehavior within the pci_scan_single_device(). The misbehavior manifest itself upon calling pci_scan_child_bus() with a bus which already contain some devices. After scanning is done devices that already existed will be present twice in the PCI bus devise list. Once with is_added indication and once without. Trying to add this bus will cause a resource conflict as the same device is already present and initialized. This patch will simply prevent a device to be added to a bus list if it is already there. Points to consider: 1. I am not a PCI Guru and might have over looked the bigger picture. Is it OK to prevent a device of being re-added? 2. I have decided that two devices are, in fact, the same instance if it has the same: vendor, device, and devfn. Is there a finer test for a device identity? p.s. As per http://www.kernel.org/pub/linux/docs/lkml/#s1-10 it is preferred for Mozilla Firefox clients to attach the patch, hence my patch is attached. Liberty Signed-off-by: Eran Liberty <liberty@extrictom.org> ---
The subject is wrong! it is 2.6.26-rc4. I will re-email with a corrected subject for automated scripts sake. Liberty --
Dear Penguins, As a follow up on my own post ( http://lkml.org/lkml/2008/6/18/195 ), I have solved a small bug/misbehavior within the pci_scan_single_device(). The misbehavior manifest itself upon calling pci_scan_child_bus() with a bus which already contain some devices. After scanning is done devices that already existed will be present twice in the PCI bus devise list. Once with is_added indication and once without. Trying to add this bus will cause a resource conflict as the same device is already present and initialized. This patch will simply prevent a device to be added to a pci bus list if it is already there. Points to consider: 1. I am not a PCI Guru and might have over looked the bigger picture. Is it OK to prevent a device of being re-added? 2. I have decided that two devices are, in fact, the same instance if it has the same: vendor, device, and devfn. Is there a finer test for a device identity? p.s. As per http://www.kernel.org/pub/linux/docs/lkml/#s1-10 it is preferred for Mozilla Firefox clients to attach the patch, hence my patch is attached. Liberty Signed-off-by: Eran Liberty <liberty@extricom.org> ---
I think this is your real problem, that you're rescanning the entire bus. I don't think that's the route we'd recommend taking. Why don't you call pci_scan_slot() instead? You won't get the benefit of pcibios_fixup_bus(), but I'm not convinced that's safe to call on a bus that's already been scanned. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
My stating point was that I have loaded a new design into a
programmable device which sits on the pci device. The new design can
implement numerous pci devices or non at all. I can think of an easy
way (or clean one) to scan only the programmable device. Scanning the
As said its not exactly a slot its more like a regular pci device that
someone suddenly welded into the pci bus. Its not a hotplug as well,
and I do not want to give up on the pcibios_fixup_bus()
As it is, with my patch applied i successfully go over the bus and
remove my own devices before I reprogram the
programmable device.
while ((dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,PCI_DEVICE_ID_MYDEV,NULL))
!= NULL) {
pci_remove_bus_device(dev);
pci_dev_put(dev);
}
Load a new design into it.
Then scan the entire bus and add the newly discovered devices.
bus = null;
while ((bus = pci_find_next_bus(bus)) != NULL) {
pci_scan_child_bus(bus);
pci_bus_assign_resources(bus);
pci_bus_add_devices(bus);
}
As seen here, this sequence of instructions seems very intuitive. It
will fail without the patch upon pci_bus_add_devices().
Liberty
--
That's what pci_scan_slot() is for. It scans the first function at the device number, then (if the header indicates it's a multifunction device) scans the other functions associated with that device. eg you could call pci_scan_slot(bus, 0x30) and it will create function 06.0 (and potentially 06.1, 06.2, ...) You presumably already have the devfn for the existing device since Why not? What architecture are you using? What does pcibios_fixup_bus() do for you? (as a side-note, I'd like to reimplement the pcibios_fixup_*() routines; I think a lot of what they do can be done more generically these days. Seems utterly unintuitive to me. You're doing a lot of unnecessary work here, and if you have two cards in your machine, you'll take away both of them when you reload either of them. What you should do is cache the pci_bus and the devfn at startup: static struct pci_bus *my_bus; static int my_devfn; struct pci_dev *dev = pci_get_device(PCI_VENDOR_ID_MYCOMP, PCI_DEVICE_ID_MYDEV, NULL); if (!dev) return -ENODEV; my_bus = dev->bus; my_devfn = dev->devfn; pci_dev_put(dev); when you want to remove it: for (func = 0; func < 8; func++) struct pci_dev *dev = pci_get_slot(my_bus, my_devfn + func); if (!dev) continue; pci_remove_bus_device(dev); pci_dev_put(dev); } when you want to rescan it: pci_scan_slot(my_bus, my_devfn); (this only handles one programmable card. The basic idea could be extended to handle multiple cards if you need to do that). -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Matthew, You seem to have a finer grasp of the subject then I do, please correct/educate me on any of the points I raise in the following lines. Each slot represent a single device which can have more then one function. pci_scan_slot is aimed for scanning these multiple functions. I, on the other hand, have programmable device on the pci bus which is, for the sake of this discussion, a complete black box. This black box up on loading can implement more then one device, which can have more then one function each. So as far as I see it, now I need to scan all slots on the bus. I work with ARCH=powerpc. pcibios_fixup_bus() will deal with all the resource bars allocation. Hmmm, I do want to remove all the devices that are implemented by the programmable unit which is reloaded. I have not considered the possibility of having more then one programmable unit. I think there is a hidden assumption in this code, again please correct me if I missed the point. This code assumes that the devices which will re-appear after the programmable unit is loaded has the same devfn and bus as the devices which were present before the reload. This assumption might be wrong. For example, I have a basic programmable image which has no pci devices at all. upon unloading I do not remove any device (as non are present) and up on reloading I suddenly have two. What is their bus? their devfn? Ultimately I would have expected to find a "int pci_scan_bus(struct <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=pci_scan_bus>pci_bus <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=pci_bus> *bus <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=bus>);" the "pci_scan_child_bus <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=pci_scan_child_bus>()" was the closest to the mark Liberty --
WOW, sorry for the html tags (though I had this Thunderbird under control). Here is the same mail without html tags (hopeully) Matthew, You seem to have a finer grasp of the subject then I do, please correct/educate me on any of the points I raise in the following lines. Each slot represent a single device which can have more then one function. pci_scan_slot is aimed for scanning these multiple functions. I, on the other hand, have programmable device on the pci bus which is, for the sake of this discussion, a complete black box. This black box up on loading can implement more then one device, which can have more then one function each. So as far as I see it, now I need to scan all slots on the bus. But to be honest, upon looking a way to make my device work I dismissed the "pci_scan_slot()" option as It did not reach the "fixup_resource ()" I work with ARCH=powerpc. pcibios_fixup_bus() will deal with all the resource bars allocation. If I can lend a hand there, let me know and I will try to squeeze it in Hmmm, I do want to remove all the devices that are implemented by the programmable unit which is reloaded. I have not considered the possibility of having more then one programmable unit. I think there is a hidden assumption in this code, again please correct me if I missed the point. This code assumes that the devices which will re-appear after the programmable unit is loaded has the same devfn and bus as the devices which were present before the reload. This assumption might be wrong. For example, I have a basic programmable image which has no pci devices at all. upon unloading I do not remove any device (as non are present) and up on reloading I suddenly have two. What is their bus? their devfn? Ultimately I would have expected to find a "int pci_scan_bus(struct pci_bus *bus );" the "pci_scan_child_bus ()" was the closest to the mark Liberty --
That's quite interesting. A normal PCI device has its IDSEL pin connected to one of the AD lines, so it's not possible for it to respond to more then one of the 32 possible devices. I suppose by decoding the configuration cycles on the bus, it's possible to make a programmable device respond to more than one device. I suppose you have to be careful to program it to not respond to the same device numbers as any other device on the bus. Or are there no other devices on this bus OK. I don't think it's safe to call powerpc's fixup_resource() more What I want to do is get rid of fixup_resource() and its equivalent on other architectures and call pcibios_bus_to_resource() from drivers/pci/probe.c. I've been working in this area recently; here's my current tree: http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=pci-read-base Yes, I did make that assumption ... I'm used to real PCI devices, not If this is all alone on a bus of its own, we can certainly handle that. Maybe we can make it a module parameter or something. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
If my devices where the only one on the bus I would not have stumbled on this bug. I would have removed them all before reloading and would have scanned and empty PCI bus list after the reloading. Alas, the CPU itself is a pci device. Upon scanning the bus it is it who is duplicated. Upon trying to "pci_bus_add_devices()" you get resource collision. Do you perceive any down side to my suggested patch? It seems (to me) really logical that a device can not and should not be added twice to a pci bus device list. Liberty --
That's not true, the code you posted:
bus = null;
while ((bus = pci_find_next_bus(bus)) != NULL) {
pci_scan_child_bus(bus);
pci_bus_assign_resources(bus);
pci_bus_add_devices(bus);
}
Does the CPU have BARs of its own? If you have no other devices with
BARs, it would explain why you have not noticed the problem with
The problem is the other side-effects of what you propose.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
Could you elaborate what ominous side-effects you predict? Liberty --
Look at the consequences of calling fixup_resource() twice on the same
resource:
res->start = (res->start + offset) & mask;
res->end = (res->end + offset) & mask;
You'll add 'offset' to the other resources twice.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
Tring to find heads and tails in the code, I dug in. This is what I understand from the overall flow... and my points of interests:
1. pci_scan_child_bus() starts with iterating over all the devfn and scanning for a device with that devfn
drivers/pci/probe.c
1056 unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus)
1057 {
1063 /* Go find them, Rover! */
1064 for (devfn = 0; devfn < 0x100; devfn += 8)
1065 pci_scan_slot(bus, devfn);
2. pci_scan_slot() will continue to scan all the functions that devfn might have to over
drivers/pci/probe.c
1019 int pci_scan_slot(struct pci_bus *bus, int devfn)
1020 {
1026 for (func = 0; func < 8; func++, devfn++) {
1029 dev = pci_scan_single_device(bus, devfn);
3. pci_scan_single_device() will scan / test if this dev is valid.
drivers/pci/probe.c
996 struct pci_dev *__ref pci_scan_single_device(struct pci_bus *bus, int devfn)
997 {
998 struct pci_dev *dev;
999
1000 dev = pci_scan_device(bus, devfn);
and add it. REGARDLESS if it is already on the pci bus list or not.
1001 if (!dev)
1002 return NULL;
1003
1004 pci_device_add(dev, bus);
1005
1006 return dev;
1007 }
This is the first thing that needs to be fixed IMHO.
4. pci_scan_child_bus() will go on to fixup the bus whether it needs fixing or not.
drivers/pci/probe.c
1072 pcibios_fixup_bus(bus);
This might be the second thing that needs amending.
5. pcibios_fixup_bus(bus) calls _pcibios_fixup_bus which will call fixup_resource() once per a bus resource (in contrast of per device on that bus) and fixes the resources of that bus.
arch/powerpc/kernel/pci-common.c
784 static void __devinit __pcibios_fixup_bus(struct pci_bus *bus)
785 {
787 struct pci_dev *dev = bus->self;
798 for (i = 0; i < PCI_BUS_NUM_RESOURCES; ++i) {
833 fixup_resource(res, dev);
834 }
835 }
850 ...Found some time to test my own idea... it seems to work. :)
Its final version looks like this:
Before reloading the programmable unit, remove all the device implemented by it
struct pci_dev *dev;
while ((dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,PCI_DEVICE_ID_MYCOMP_MYDEV,NULL)) != NULL) {
pci_remove_bus_device(dev);
pci_dev_put(dev);
}
After the programmable unit is loaded, scan and add only newly added device!
struct pci_bus *bus = NULL;
while ((bus = pci_find_next_bus(bus)) != NULL) {
int devfn;
for (devfn = 0; devfn < 0x100; devfn++) {
struct pci_dev *dev=pci_get_slot(bus,devfn);
if (dev)
continue; // <---- do not scan devices already present on the bus. Missing magic line :)
(void)pci_scan_single_device(bus, devfn);
}
pci_bus_assign_resources(bus);
pci_bus_add_devices(bus);
}
So... no need for patch after all.
Unless... Hey whats this... behind it... on the side.... naaa, no need of patch... :(
Liberty
--
-- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
After compiling a kernel with ftrace I started to experience all sorts of crashes. I have a powerpc env which closly resemble mpc8548cds_defconfig. I have produced the crash in seconds by issuing: > while [ 1 ] ; do find / > /dev/null ; echo . ; done Liberty here are some of the kernel crashes: --- ~ # find /proc/ -name kft find: /proc/150/net/softnet_stat: No such file or direOops: Exception in kernel mode, sig: 11 [#1] Exsw1600 Modules linked in: NIP: c00bd6fc LR: c00bd6fc CTR: 00000000 REGS: ddfc3d50 TRAP: 0700 Not tainted (2.6.27-rc2) MSR: 00029000 <EE,ME> CR: 20002284 XER: 20000000 TASK = ddcccde0[1699] 'find' THREAD: ddfc2000 GPR00: 00000000 ddfc3e00 ddcccde0 dd895580 ddfc3e30 c00b5da0 ddfc3e90 00000003 GPR08: c00ece54 00000370 00145505 00000003 80002282 100ad874 10083ca0 10083cd8 GPR16: 10083cd0 00000008 00000000 c00b5da0 ddfc3f18 c00ece54 00000000 df465720 GPR24: d7431900 ddfc3e30 dd895580 c0380000 dd895580 ddfc3e30 00000000 ddfc3e00 NIP [c00bd6fc] d_lookup+0x40/0x90 LR [c00bd6fc] d_lookup+0x40/0x90 Call Trace: [ddfc3e00] [c006ade4] rcu_process_callbacks+0x48/0x60 (unreliable) [ddfc3e20] [c00eb514] proc_fill_cache+0x94/0x1b0 [ddfc3e80] [c00ef720] proc_task_readdir+0x294/0x3c4 [ddfc3ee0] [c00b60fc] vfs_readdir+0xb8/0xd8 [ddfc3f10] [c00b619c] sys_getdents64+0x80/0x104 [ddfc3f40] [c0010554] ret_from_syscall+0x0/0x3c Instruction dump: 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0 ctory ---[ end trace 6b059a7f5ba5a55a ]--- --- Oops: Exception in kernel mode, sig: 11 [#1] Exsw1600 Modules linked in: gplonly_api NIP: c00bd6fc LR: c00bd6fc CTR: 00000000 REGS: df52dc40 TRAP: 0700 Not tainted (2.6.27-rc2) MSR: 00029000 <EE,ME> CR: 20082422 XER: 20000000 TASK = ddc2e060[2294] 'eeprom' THREAD: df52c000 GPR00: 00000000 df52dcf0 ddc2e060 dd82c700 df52dd58 df52dd48 dd82c75e 00000000 GPR08: c0800000 000311f0 0000ffff df52dd10 20002488 ...
Just to make sure... ftrace enables markers too, and RCU has tracing with the markers. This may not be the problem, but I just want to eliminate as many variables as possible. Could you disable ftrace, but keep the markers on too. Also, could you enable ftrace again and turn on the FTRACE_STARTUP_TEST. Thanks, --
Steve ? I just noticed this :
ftrace_modify_code(unsigned long ip, unsigned char *old_code,
unsigned char *new_code)
{
unsigned replaced;
unsigned old = *(unsigned *)old_code;
unsigned new = *(unsigned *)new_code;
int faulted = 0;
/*
* Note: Due to modules and __init, code can
* disappear and change, we need to protect against faulting
* as well as code changing.
*
* No real locking needed, this code is run through
* kstop_machine.
*/
asm volatile (
"1: lwz %1, 0(%2)\n"
" cmpw %1, %5\n"
" bne 2f\n"
" stwu %3, 0(%2)\n"
"2:\n"
".section .fixup, \"ax\"\n"
"3: li %0, 1\n"
" b 2b\n"
".previous\n"
".section __ex_table,\"a\"\n"
_ASM_ALIGN "\n"
_ASM_PTR "1b, 3b\n"
".previous"
: "=r"(faulted), "=r"(replaced)
: "r"(ip), "r"(new),
"0"(faulted), "r"(old)
: "memory");
if (replaced != old && replaced != new)
faulted = 2;
if (!faulted)
flush_icache_range(ip, ip + 8);
return faulted;
}
What happens if you are really unlucky and get the following execution
sequence ?
Load module A at address 0xddfc3e00
Populate ftrace function table while the system runs
Unload module A
Load module B at address 0xddfc3e00
Activate ftrace
-> At that point, since there is another module loaded in the same
address space, it won't fault. If there happens to be an instruction
which looks exactly like the instruction we are replacing at the same
location, we are introducing a bug. True both on x86 ans powerpc...
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C ...Hi Mathieu, Yep I know of this issue, and it is very unlikely it would happen. But that's not good enough, so this is why I did this: http://marc.info/?l=linux-kernel&m=121876928124384&w=2 http://marc.info/?l=linux-kernel&m=121876938524523&w=2 ;-) On powerpc it would be less likely an issue since the code segments are all 4 bytes aligned. And a call being replaced would be a call to mcount (relative jump). A call to mcount would be the same for both. Then again, we could be replacing a nop, but most likely that wouldn't hurt either, since nops are seldom used, and if we did call mcount, it would probably not hurt. But it would be an issue if Module B happen to put a data section that matched the code to jmp to mcount or a nop. -- Steve --
Ah, I did not see this one passing by :) That's the right solution indeed. I guess it's not in 2.6.27-rc2/rc3 though ? I wonder if the bug can be repeated with a module-free kernel ? -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Some (most likely unrelated) nits in the above inline asm: Should use a "b" constraint for %2, or you could get r0. Or, use an "m" constraint with %U2%X2 after the lwz/stw. Why stwu with an offset of zero, BTW? %1 also needs to be an early clobber. -Scott --
Because it worked :-) Really, it has been a while since I did any PPC assembly, and I didn't break out the old reference manuals to do this. I simply looked at other asm code, and followed suit. I compiled and tested it, and it worked. I did make a disclaimer about my rusty PPC Not exactly sure what you mean by the above. -- Steve --
The advantage of the latter is that it allows GCC to choose indexed or update instructions -- but that's merely an optimization. Switching to %1 is written to before some inputs are consumed, so you need to use "=&r" rather than "=r" so that GCC won't use the same register for both. -Scott --
Why not use __get_user/__put_user ? Cheers, Ben. --
Hmm, this was originally copied from x86, where we did a cmpxchg, but that
is probably not needed since all of this is done in kstop_machine. Also,
only the "get" is needed. If we don't fault there, we wont fault on the
put (unless we have permissions wrong, and that would be a bug).
So are you recommending something like
int cmd;
if (__get_user(cmd, ip))
goto fault;
if (cmd != old)
goto not_same;
WARN_ON_ONCE(__put_user(cmd, ip));
If we did this, we could probably put this into the generic code:
if (copy_from_user(cmd, ip, ARCH_CALL_SIZE))
goto fault;
if (memcmp(cmd, old, ARCH_CALL_SIZE) != 0)
goto not_same;
WARN_ON_ONCE(copy_to_user(cmd, ip, ARCH_CALL_SIZE));
-- Steve
--
You need the __ variants or the access_ok() checks will bite you bad. Ben. --
Argh. See text_poke(). It's there exactly for this purpose on x86. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Ouch, I just did. text_poke is quite heavy. It would be interesting to see that performed on 20,000 locations at one time. I could play with it, but I'm a bit nervous. -- Steve --
It's alread used to modify the LOCK prefixes in alternative.c and did not seem to be too slow for that.. it should therefore be ok. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
There's a lot more functions than locks ;-) Anyway, this is only needed after system boot up. But then we do it ever time we start or stop tracing. I'll try it out and see if it is noticable. -- Steve --
OK, I just tried text_poke and it unfortunately fails. The problem is that it requires that the text you are changing is aligned and fits on one page. We have no control over that. -- Steve --
Ok, there are two cases where it's ok : 1 - in stop_machine, considering we are not touching code executed in NMI handlers. 2 - when using my replace_instruction_safe() which uses a temporary breakpoint when doing the instruction replacement. In those cases you could use text_poke_early(). See http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x8... For a use example. Basically it looks like : 360 pages[0] = virt_to_page((void *)bypass_eip); 361 vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL); 362 BUG_ON(!vaddr); 363 text_poke_early(&vaddr[bypass_eip & ~PAGE_MASK], 364 (void *)addr, size); 365 /* 366 * Fill the rest with nops. 367 */ 368 len = NR_NOPS - size; 369 add_nops((void *) 370 &vaddr[(bypass_eip & ~PAGE_MASK) + size], 371 len); 372 print_dbg_bytes("inserted nops", 373 &vaddr[(bypass_eip & ~PAGE_MASK) + size], len); 374 vunmap(vaddr); Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
vunmap can not be called with interrupts disabled, and this is exactly what my code does. -- Steve --
It could be after the vmap rewrite (with a few other small tweaks). But a) it would be less robust when called from interrupt context and this code looks broken as it is WRT error handling; and b) it still costs several thousand cycles to vmap+touch+vunmap... --
Note that vmap/vunmap will be very slow. Ben. --
Don't we have Nick's speedups now?
J
--
Not sure what speedups this is, but it's stlil not something you want to do a lot ... setting up and tearing down MMU mappings isn't something I'd like to see happening on a per-mcount basis. Ben --
Ok, so i did a patch, but it doesn't fix the problem. So
there's something else whacking on the stack frames.
Still, here it is:
powerpc/ftrace: Fix broken assembly for code replacement
Instead, uses __get_user() and __put_user().
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
Index: linux-work/arch/powerpc/kernel/ftrace.c
===================================================================
--- linux-work.orig/arch/powerpc/kernel/ftrace.c 2008-08-19 12:45:49.000000000 +1000
+++ linux-work/arch/powerpc/kernel/ftrace.c 2008-08-19 12:52:51.000000000 +1000
@@ -16,7 +16,7 @@
#include <asm/cacheflush.h>
#include <asm/ftrace.h>
-
+#include <asm/uaccess.h>
static unsigned int ftrace_nop = 0x60000000;
@@ -72,10 +72,9 @@ notrace int
ftrace_modify_code(unsigned long ip, unsigned char *old_code,
unsigned char *new_code)
{
- unsigned replaced;
- unsigned old = *(unsigned *)old_code;
- unsigned new = *(unsigned *)new_code;
- int faulted = 0;
+ unsigned int old = *(unsigned int *)old_code;
+ unsigned int new = *(unsigned int *)new_code;
+ unsigned int instr;
/*
* Note: Due to modules and __init, code can
@@ -85,32 +84,13 @@ ftrace_modify_code(unsigned long ip, uns
* No real locking needed, this code is run through
* kstop_machine.
*/
- asm volatile (
- "1: lwz %1, 0(%2)\n"
- " cmpw %1, %5\n"
- " bne 2f\n"
- " stwu %3, 0(%2)\n"
- "2:\n"
- ".section .fixup, \"ax\"\n"
- "3: li %0, 1\n"
- " b 2b\n"
- ".previous\n"
- ".section __ex_table,\"a\"\n"
- _ASM_ALIGN "\n"
- _ASM_PTR "1b, 3b\n"
- ".previous"
- : "=r"(faulted), "=r"(replaced)
- : "r"(ip), "r"(new),
- "0"(faulted), "r"(old)
- : "memory");
-
- if (replaced != old && replaced != new)
- faulted = 2;
-
- if (!faulted)
- flush_icache_range(ip, ip + 8);
-
- return faulted;
+ if (__get_user(instr, (unsigned int __user *)ip))
+ return 1;
+ if (instr != old && instr != new)
+ return 2;
+ WARN_ON_ONCE(__put_user(new, ...Is it 32 bit or 64? I've tested the 64 bit quite a bit, but not so much the 32 (my powerbook is not usually that stable). Acked-by: Steven Rostedt <srostedt@redhat.com> -- Steve --
Yeah, I had a look, nothing obviously bad there, but I may just have missed it :-) Ben. --
Found the problem (or at least -a- problem), it's a gcc bug. Well, first I must say the code generated by -pg is just plain horrible :-) Appart from that, look at the exit of, for example, __d_lookup, as generated by gcc when ftrace is enabled: c00c0498: 38 60 00 00 li r3,0 c00c049c: 81 61 00 00 lwz r11,0(r1) c00c04a0: 80 0b 00 04 lwz r0,4(r11) c00c04a4: 7d 61 5b 78 mr r1,r11 c00c04a8: bb 0b ff e0 lmw r24,-32(r11) c00c04ac: 7c 08 03 a6 mtlr r0 c00c04b0: 4e 80 00 20 blr As you can see, it restores r1 -before- it pops r24..r31 off the stack ! I let you imagine what happens if an interrupt happens just in between those two instructions (mr and lmw). We don't do redzones on our ABI, so basically, the registers end up corrupted by the interrupt. Cheers, Ben. --
Ouch! You've disassembled this without -pg too, and it does not have this bug? What version of gcc do you have? -- Steve --
I have: gcc (Debian 4.3.1-2) 4.3.1 c00c64c8: 81 61 00 00 lwz r11,0(r1) c00c64cc: 7f 83 e3 78 mr r3,r28 c00c64d0: 80 0b 00 04 lwz r0,4(r11) c00c64d4: ba eb ff dc lmw r23,-36(r11) c00c64d8: 7d 61 5b 78 mr r1,r11 c00c64dc: 7c 08 03 a6 mtlr r0 c00c64e0: 4e 80 00 20 blr My version looks fine. I'm thinking that this is a separate issue than what Eran is seeing. Eran, can you do an "objdump -dr vmlinux" and search for __d_lookup, and print out the end of the function dump. Thanks, -- Steve --
powerpc-linux-gnu-objdump -dr --start-address=0xc00bb584 vmlinux | head -n 100 vmlinux: file format elf32-powerpc Disassembly of section .text: c00bb584 <__d_lookup>: c00bb584: 7c 08 02 a6 mflr r0 c00bb588: 90 01 00 04 stw r0,4(r1) c00bb58c: 4b f5 5c 51 bl c00111dc <_mcount> c00bb590: 94 21 ff d0 stwu r1,-48(r1) c00bb594: 7c 08 02 a6 mflr r0 c00bb598: 3d 20 9e 37 lis r9,-25033 c00bb59c: bf 01 00 10 stmw r24,16(r1) c00bb5a0: 61 29 00 01 ori r9,r9,1 c00bb5a4: 3d 60 c0 38 lis r11,-16328 c00bb5a8: 90 01 00 34 stw r0,52(r1) c00bb5ac: 7c 60 4a 78 xor r0,r3,r9 c00bb5b0: 54 00 d9 7e rlwinm r0,r0,27,5,31 c00bb5b4: 83 84 00 00 lwz r28,0(r4) c00bb5b8: 7c 3f 0b 78 mr r31,r1 c00bb5bc: 81 0b 1a 2c lwz r8,6700(r11) c00bb5c0: 39 6b 1a 2c addi r11,r11,6700 c00bb5c4: 7c 00 e2 14 add r0,r0,r28 c00bb5c8: 81 4b 00 04 lwz r10,4(r11) c00bb5cc: 7c 09 4a 78 xor r9,r0,r9 c00bb5d0: 83 24 00 04 lwz r25,4(r4) c00bb5d4: 7d 29 44 30 srw r9,r9,r8 c00bb5d8: 81 0b 00 08 lwz r8,8(r11) c00bb5dc: 7d 29 02 78 xor r9,r9,r0 c00bb5e0: 83 04 00 08 lwz r24,8(r4) c00bb5e4: 7d 29 50 38 and r9,r9,r10 c00bb5e8: 55 29 10 3a rlwinm r9,r9,2,0,29 c00bb5ec: 7c 09 40 2e lwzx r0,r9,r8 c00bb5f0: 7c 9a 23 78 mr r26,r4 c00bb5f4: 7c 7b 1b 78 mr r27,r3 c00bb5f8: 2f 80 00 00 cmpwi cr7,r0,0 c00bb5fc: 7c 1e 03 78 mr r30,r0 c00bb600: 40 be 00 14 bne+ cr7,c00bb614 <__d_lookup+0x90> c00bb604: 48 00 00 7c b c00bb680 <__d_lookup+0xfc> c00bb608: 83 de 00 00 lwz r30,0(r30) c00bb60c: 2f 9e 00 00 cmpwi cr7,r30,0 c00bb610: 41 9e 00 70 beq- ...
Yep, you have the same bug in your compiler. -- Steve --
Hmm... so whats now? Is there a way to prove this scenario is indeed the one that caused the opps? -- Liberty --
Manually edit the broken binary to change the order of the restore and see if the problem disappears. That will keep everything else -- Jon Smirl jonsmirl@gmail.com --
Does it only happen on this function? Or is it happening on others as well. So manually fixing it here won't help much unless we fix it everywhere. -- Steve --
checked with objdump where __d_lookup()
> powerpc-linux-gnu-objdump -dr --start-address=0xc0065790 vmlinux |
grep "<__d_lookup>:"
c00b91d8 <__d_lookup>:
> liberty@liberty:~/svn/exsw1600-2.6.27-rc2$ powerpc-linux-gnu-objdump
-dr --start-address=0xc00b91d8 vmlinux| head -n 100
vmlinux: file format elf32-powerpc
Disassembly of section .text:
c00b91d8 <__d_lookup>:
c00b91d8: 7c 08 02 a6 mflr r0
c00b91dc: 90 01 00 04 stw r0,4(r1)
....
c00b92bc: 4e 80 04 21 bctrl
c00b92c0: 2f 83 00 00 cmpwi cr7,r3,0
c00b92c4: 41 9e 00 50 beq- cr7,c00b9314 <__d_lookup+0x13c>
c00b92c8: 83 de 00 00 lwz r30,0(r30)
c00b92cc: 2f 9e 00 00 cmpwi cr7,r30,0
c00b92d0: 40 9e ff 98 bne+ cr7,c00b9268 <__d_lookup+0x90>
c00b92d4: 38 60 00 00 li r3,0
c00b92d8: 81 61 00 00 lwz r11,0(r1)
c00b92dc: 80 0b 00 04 lwz r0,4(r11)
c00b92e0: 7d 61 5b 78 mr r1,r11
<=== As explained by Steve, these two should be replaced ===>
c00b92e4: bb 0b ff e0 lmw r24,-32(r11)
c00b92e8: 7c 08 03 a6 mtlr r0
c00b92ec: 4e 80 00 20 blr
c00b92f0: 80 04 00 04 lwz r0,4(r4)
....
c00b9330: 7f a3 eb 78 mr r3,r29
c00b9334: 4b ff ff a4 b c00b92d8 <__d_lookup+0x100>
c00b9338 <d_lookup>:
c00b9338: 7c 08 02 a6 mflr r0
on the target I fired up xmon and replaced them.
~ # echo x > /proc/sysrq-trigger
SysRq : Entering xmon
Vector: 0 at [df51fdb8]
pc: c0025960: sysrq_handle_xmon+0x60/0x64
lr: c0025960: sysrq_handle_xmon+0x60/0x64
sp: df51fe80
msr: 21000
current = 0xdc22a9a0
pid = 1698, comm = echo
WARNING: exception is not recoverable, can't continue
enter ? for help
[df51fe90] c0193c38 __handle_sysrq+0xa8/0x178
[df51fec0] c00ee818 write_sysrq_trigger+0x78/0x7c
[df51fed0] c00e65e4 proc_reg_write+0x5c/0x84
[df51fef0] c00a299c ...This is a bug in gcc, and should be mentioned to the gcc maintainers. -- Steve --
I'm not convinced that this is the bug you were seeing. Do you still get the crash if you disable dynamic ftrace but leave ftrace on? -- Steve --
I've verified in sim that the crash in do_lookup() I was seeing (similar backtrace) was indeed caused by an interrupt coming between those two instructions and clobbering the saved GPRs. So I think it's very likely our problem. Ben. --
Segher was looking at this a bit this morning. He thinks it's really -fno-omit-frame-pointer that is causing this. That really shouldn't even be set on PowerPC, but FTRACE uses select which overrides the depends on stuff in Kconfig. Segher can probably tell you more once his email is back up. josh --
I can easily make a patch that makes that select an x86 only. -- Steve --
On Wed, 20 Aug 2008 10:22:49 -0400 (EDT) That's probably a first step, but you might want to wait until Segher can fill in more details. I'm sort of just relaying what he and I were talking about on IRC. IIRC, he was testing builds with and without both -pg and -fno-omit-frame-pointer and found the bug only when -fno-omit-frame-pointer was present. josh --
Dear Penguins,
Let me start of by saying my particular hardware must be buggy in some
way. When I issue a sysrq (Ctrl A+ F from minicom) I get a lot of sysrq
triggers.
I have worked around the problem and I think this workaround is a viable
patch even for platforms which do not exhibit this peculiar behavior.
upon getting numerous interrupts which request sysrq the function
uart_handle_break in include/linux/serial_core.h is hit multiple times.
The current code which looks like this:
static inline int uart_handle_break(struct uart_port *port)
{
struct uart_info *info = port->info;
#ifdef SUPPORT_SYSRQ
if (port->cons && port->cons->index == port->line) {
if (!port->sysrq) {
port->sysrq = jiffies + HZ*5;
return 1;
}
port->sysrq = 0;
}
#endif
if (port->flags & UPF_SAK)
do_SAK(info->tty);
return 0;
}
Will basicly toggle port->sysrq between a timeout value and zero. If you
are lucky this penguin rullet will stop on timeout and the next
character hit will trigger the sysrq in the function
"uart_handle_sysrq_char". But if you are not so lucky the last sysrq
interupt will toggle port->sysrq to zero and the next char hit will be
ignored (not trigger sysrq).
The suggested patch will do the next few things:
1. "port->sysrq" is now the time when the last sysrq was triggered and
not the timeout for the the next char
2. Stamped "port->sysrq" every time there is a sysrq rather then toggled
it up and down.
3. Always continue to consider UPF_SAK.
4. "port->sysrq" is toggled back to zero only in uart_handle_break() and
only if the a char has been accepted after the sysrq timeout (5 sec)
5. uart_handle_break() will ignore extra chars received in super human
speed after the last sysrq (0.01 sec)
Liberty
Signed-off-by: Eran Liberty <liberty@extricom.org>
---
On Mon, 15 Sep 2008 19:30:05 +0300
We prefer patches in `patch -p1' form, please.
The 0.01 is a big no-no. Sometimes gcc like to go into stupid mode and
starts doing floating point stuff.
A suitable fix would be to use HZ/100. But that assumes that HZ is
always >= 100. That's a pretty good assumption, and various parts of
the kernel will explode if HZ is set too small. However it's always
good to ensure that someone else's stuff will explode before yours
does, so how about we make it HZ/50? Will that still work OK for you?
--- a/include/linux/serial_core.h~serial-driver-handle-multiple-consecutive-sysrq-from-the-serial-fix
+++ a/include/linux/serial_core.h
@@ -444,8 +444,7 @@ static inline int
uart_handle_sysrq_char(struct uart_port *port, unsigned int ch)
{
#ifdef SUPPORT_SYSRQ
- if (port->sysrq && time_after(jiffies, port->sysrq +
- (unsigned long)(HZ*0.01))) {
+ if (port->sysrq && time_after(jiffies, port->sysrq + HZ / 50)) {
if (ch && time_before(jiffies, port->sysrq + HZ*5)) {
handle_sysrq(ch, port->info ? port->info->port.tty : NULL);
port->sysrq = 0;
_
--
HZ/50 is totally OK by me. -- Liberty --
Oops, should have mentioned it ;-) 4.2.3 (Ubuntu 4.2.3-2ubuntu7) Ben. --
I'm running a 4.3.1 (Debian 4.3.1-2) on my powerbook. Perhaps the issue is already taken care of. -- Steve --
What syntax to do that with? lwz %1,0(%U2) stu %3, 0(%X2) How else to do it? stwu %3, (%2) does not compile for me. -- Steve --
lwz%U2%X2 %1, %2 stw %3, 0(%2) The "u" tells it to write the effective address back to %2 -- but with an offset of zero, the effective address is unchanged. -Scott --
Ah! Right! /me should open up his PowerPC ref books again :-p -- Steve --
for the fun of it I took out all my propriety modules; so now its a non tainted kernel. Here is the matrix: !FTRACE x !MARKERS => stable !FTRACE x MARKERS => stable FTRACE x !MARKERS => n/a (FTRACE forces MARKERS) FTRACE x MARKERS => unstable FTRACE x FTRACE_STARTUP_TEST x MARKERS => unstable + tests passed Testing tracer sched_switch: PASSED Testing tracer ftrace: PASSED Testing dynamic ftrace: PASSED Oops: Exception in kernel mode, sig: 11 [#1] Exsw1600 Modules linked in: NIP: c00bbb20 LR: c00bbb20 CTR: 00000000 REGS: dd5b1c50 TRAP: 0700 Not tainted (2.6.27-rc2) MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000 TASK = ddcce060[1707] 'find' THREAD: dd5b0000 GPR00: 00000000 dd5b1d00 ddcce060 dd801180 dd5b1d68 dd5b1d58 dd80125b 100234ec GPR08: c0800000 00019330 0000ffff dd5b1d20 24000288 100ad874 100936f8 1008a1d0 GPR16: 10083f80 dd5b1e2c dd5b1d68 fffffff4 c0380000 dd5b1d60 dd5b1d58 dd802084 GPR24: dc3d7700 dd802018 dd5b1d68 c0380000 dd801180 dd5b1d68 00000000 dd5b1d00 NIP [c00bbb20] d_lookup+0x40/0x90 LR [c00bbb20] d_lookup+0x40/0x90 Call Trace: [dd5b1d00] [dd5b1d58] 0xdd5b1d58 (unreliable) [dd5b1d20] [c00aebc4] do_lookup+0xe8/0x220 [dd5b1d50] [c00b0a80] __link_path_walk+0x5a4/0xd54 [dd5b1dc0] [c00b1288] path_walk+0x58/0xe0 [dd5b1df0] [c00b13f8] do_path_lookup+0x78/0x13c [dd5b1e20] [c00b20f4] user_path_at+0x64/0xac [dd5b1e90] [c00a9028] vfs_lstat_fd+0x34/0x74 [dd5b1ec0] [c00a90fc] vfs_lstat+0x30/0x48 [dd5b1ed0] [c00a9144] sys_lstat64+0x30/0x5c [dd5b1f40] [c0010554] ret_from_syscall+0x0/0x3c Instruction dump: 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0 ---[ end trace 1eb8fd7adac2bb65 ]--- Liberty --
Can you check if, at some point during the system execution (starting from boot), 0xdd5b1d58 is an address where a module is loaded ? (the module can be later unloaded, what I wonder is if this address would appear to have had a loaded+unloaded module). Actually, could you try to compile your kernel without "MODULE_UNLOAD" ? -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
I've had similar crashes (with similar backtraces in the VFS) on machines with no modules, netbooted zImages. Ben. --
No modules... ~ # uname -a Linux 2.6.27-rc2 #6 Tue Aug 19 12:33:34 IDT 2008 ppc unknown ~ # gunzip /proc/config.gz -c | grep UNLOAD # CONFIG_MODULE_UNLOAD is not set ~ # gunzip /proc/config.gz -c | grep FTRACE CONFIG_HAVE_FTRACE=y CONFIG_HAVE_DYNAMIC_FTRACE=y CONFIG_FTRACE=y CONFIG_DYNAMIC_FTRACE=y CONFIG_FTRACE_SELFTEST=y CONFIG_FTRACE_STARTUP_TEST=y ~ # gunzip /proc/config.gz -c | grep MARKERS CONFIG_MARKERS=y ~ # lsmod Module Size Used by ~ # while [ 1 ] ; do find / > /dev/null ; echo . ; done . Oops: Exception in kernel mode, sig: 11 [#1] Exsw1600 Modules linked in: NIP: c00bb724 LR: c00bb724 CTR: 00000000 REGS: d8b59c50 TRAP: 0700 Not tainted (2.6.27-rc2) MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000 TASK = dbc68de0[1712] 'find' THREAD: d8b58000 GPR00: 00000000 d8b59d00 dbc68de0 dd801180 d8b59d68 d8b59d58 dd8019db 100234ec GPR08: c0800000 00019330 0000ffff d8b59d20 24000288 100ad874 100936f8 1008a1d0 GPR16: 10083f80 d8b59e2c d8b59d68 fffffff4 c0380000 d8b59d60 d8b59d58 dd802084 GPR24: ddc69500 dd802018 d8b59d68 c0380000 dd801180 d8b59d68 00000000 d8b59d00 NIP [c00bb724] d_lookup+0x40/0x90 LR [c00bb724] d_lookup+0x40/0x90 Call Trace: [d8b59d00] [d8b59d58] 0xd8b59d58 (unreliable) [d8b59d20] [c00ae7c8] do_lookup+0xe8/0x220 [d8b59d50] [c00b0684] __link_path_walk+0x5a4/0xd54 [d8b59dc0] [c00b0e8c] path_walk+0x58/0xe0 [d8b59df0] [c00b0ffc] do_path_lookup+0x78/0x13c [d8b59e20] [c00b1cf8] user_path_at+0x64/0xac [d8b59e90] [c00a8c98] vfs_lstat_fd+0x34/0x74 [d8b59ec0] [c00a8d6c] vfs_lstat+0x30/0x48 [d8b59ed0] [c00a8db4] sys_lstat64+0x30/0x5c [d8b59f40] [c0010554] ret_from_syscall+0x0/0x3c Instruction dump: 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0 ---[ end trace 7766edd310cd3442 ]--- --
Just like this instruction... maybe related ? -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
always the same and not always with the same value. The corruption -seems- to happen due to corruption of the location on the stack where it was saved to / restored from in a previous function call or interrupt. I haven't yet tracked down what actually whacks the stack. Ben. --
Could you load this into gdb for me and show me the output of: gdb> li *0xc00bbb20 (Assuming you compiled with debuginfo on) and... Ouch! we have a 00000000 instruction. I'm almost done with the new mcount record for PPC (I have it done for ppc64, I'm just porting it to 32). I'm thinking this new code may solve your issues too. I hate the daemon. -- Steve --
I have attached ddd. Up on crashing it points on this line
struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
{
struct dentry * dentry = NULL;
unsigned long seq;
do {
seq = read_seqbegin(&rename_lock);
dentry = __d_lookup(parent, name); <==== ddd marker
if (dentry)
break;
} while (read_seqretry(&rename_lock, seq));
return dentry;
}
(gdb) bt
#0 d_lookup (parent=0xdd801180, name=0xd99b3d68) at fs/dcache.c:1361
#1 0xc00ae7c8 in real_lookup (nd=<value optimized out>, name=<value
optimized out>, parent=0xdd801180) at fs/namei.c:507
#2 do_lookup (nd=0xd99b3e28, name=0xd99b3d68, path=0xd99b3d58) at
fs/namei.c:825
#3 0xc00b0684 in __link_path_walk (name=0xddc23009 "", nd=0xd99b3e28)
at fs/namei.c:982
#4 0xc00b0e8c in link_path_walk (nd=<value optimized out>, name=<value
optimized out>) at fs/namei.c:570
#5 path_walk (name=0xddc23000 "/proc/mtd", nd=0xd99b3e28) at
fs/namei.c:1041
#6 0xc00b0ffc in do_path_lookup (dfd=<value optimized out>,
name=0xddc23000 "/proc/mtd", flags=<value optimized out>, nd=0xd99b3e28)
at fs/namei.c:1091
#7 0xc00b1cf8 in user_path_at (dfd=-100, name=<value optimized out>,
flags=0, path=0xd99b3e98) at fs/namei.c:1334
#8 0xc00a8c98 in vfs_lstat_fd (dfd=-578809472, name=0xd99b3d68 "",
stat=0xd99b3ed8) at fs/stat.c:83
#9 0xc00a8d6c in vfs_lstat (name=0xd99b3d68 "", stat=0xd99b3d58) at
fs/stat.c:93
#10 0xc00a8db4 in sys_lstat64 (filename=0xdd801180 "",
statbuf=0xbfe285d8) at fs/stat.c:381
#11 0xc0010554 in syscall_dotrace_cont ()
both cp & lr points to 0xC00BB724
(gdb) info registers
r0 0x0 0
r1 0xd99b3d00 3650829568
r2 0xddccf2e0 3721196256
r3 0xdd801180 3716157824
r4 0xd99b3d68 3650829672
r5 0xd99b3d58 3650829656
r6 0xdd822a5b 3716295259
r7 0x100234ec 268580076
r8 ...Can you also give us objdump -S --start-address=0xC00BB724 vmlinux | head 20 ? Then we could compare the result with the OOPS instruction dump : 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0 -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
to give you more context I have run:
> powerpc-linux-gnu-objdump -S --start-address=0xC00BB700 vmlinux |
head -n 60
the discrepancy starts at address:
c00bb720: 40 82 00 30 <=> 40 82 00 2f
c00bb724: 4b ff fe 61 <=> 00 00 00 00
vmlinux: file format elf32-powerpc
Disassembly of section .text:
c00bb700 <d_lookup+0x1c>:
* d_lookup() is protected against the concurrent renames in some unrelated
* directory using the seqlockt_t rename_lock.
*/
struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
{
c00bb700: 7c 3f 0b 78 mr r31,r1
c00bb704: 90 01 00 24 stw r0,36(r1)
c00bb708: 7c 7c 1b 78 mr r28,r3
c00bb70c: 7c 9d 23 78 mr r29,r4
c00bb710: 83 db 32 a0 lwz r30,12960(r27)
{
unsigned ret;
repeat:
ret = sl->sequence;
smp_rmb();
c00bb714: 73 c0 00 01 andi. r0,r30,1
struct dentry * dentry = NULL;
unsigned long seq;
do {
seq = read_seqbegin(&rename_lock);
dentry = __d_lookup(parent, name);
c00bb718: 7f 83 e3 78 mr r3,r28
c00bb71c: 7f a4 eb 78 mr r4,r29
if (unlikely(ret & 1)) {
c00bb720: 40 82 00 30 bne- c00bb750 <d_lookup+0x6c>
c00bb724: 4b ff fe 61 bl c00bb584 <__d_lookup>
if (dentry)
c00bb728: 2f 83 00 00 cmpwi cr7,r3,0
c00bb72c: 40 9e 00 30 bne- cr7,c00bb75c <d_lookup+0x78>
*
* If sequence value changed then writer changed data while in section.
*/
static __always_inline int read_seqretry(const seqlock_t *sl, unsigned
start)
{
smp_rmb();
c00bb730: 80 1b 32 a0 lwz r0,12960(r27)
break;
} while (read_seqretry(&rename_lock, seq));
c00bb734: 7f 80 f0 00 cmpw cr7,r0,r30
c00bb738: 41 9e 00 24 beq- cr7,c00bb75c <d_lookup+0x78>
/* Start of read calculation -- fetch last complete writer token ...Ah !
I think I see what could be wrong :
First we have :
static unsigned int ftrace_nop = 0x60000000;
We probably replace the original function call by this nop.
Then we do :
notrace unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr)
{
static unsigned int op;
/*
* It would be nice to just use create_function_call, but that will
* update the code itself. Here we need to just return the
* instruction that is going to be modified, without modifying the
* code.
*/
addr = GET_ADDR(addr);
/* Set to "bl addr" */
op = 0x48000001 | (ftrace_calc_offset(ip, addr) & 0x03fffffc);
/*
* No locking needed, this must be called via kstop_machine
* which in essence is like running on a uniprocessor machine.
*/
return (unsigned char *)&op;
}
And I guess we must be doing a 0x48000001 | 0x0; or something ?
Also, we have to consider that POWERPC 64 functions are :
/* PowerPC64's functions are data that points to the functions */
And this does not seem to hold for ppc32. Therefore, it is strange to me
that the same code is used for the update.. are we updating the correct
site ?
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Do you have PREEMPT_TRACER enabled, or any other tracer for that matter? --
I can see stack trace & context trace, but they are derived from other choices (I can not un select them) cat .config | grep TRACE CONFIG_STACKTRACE_SUPPORT=y CONFIG_HAVE_ARCH_TRACEHOOK=y # CONFIG_BLK_DEV_IO_TRACE is not set CONFIG_STACKTRACE=y # CONFIG_BACKTRACE_SELF_TEST is not set CONFIG_HAVE_FTRACE=y CONFIG_HAVE_DYNAMIC_FTRACE=y CONFIG_FTRACE=y # CONFIG_SCHED_TRACER is not set CONFIG_CONTEXT_SWITCH_TRACER=y CONFIG_DYNAMIC_FTRACE=y CONFIG_FTRACE_SELFTEST=y --
You must not have PREEMPT on, so the PREEMPT_TRACER will not show up. The reason I asked, is that my PowerBook runs fine without PREEMPT_TRACER but is very unstable when I have PREEMPT_TRACER enabled. --
I spent some time last week tracking one of those crashes and it appears that we are getting corruption of some of the non-volatile registers. So far, I found out that it -seems- to be coming from stack frame corruption during a timer interrupt. I haven't had a chance to dig further yet. Cheers, Ben. --
