[PATCH 2.6.26] SERIAL DRIVER: Handle Multiple consecutive sysrq from the serial

Previous thread: RT-Scheduler/cgroups: Possible overuse of resources assigned via cpu.rt_period_us and cpu.rt_runtime_us by Daniel K. on Wednesday, June 18, 2008 - 7:12 am. (5 messages)

Next thread: [PATCH] add kernel-doc for simple_read_from_buffer and memory_read_from_buffer by Akinobu Mita on Wednesday, June 18, 2008 - 8:04 am. (1 message)
From: Eran Liberty
Date: Wednesday, June 18, 2008 - 7:18 am

Short:
1. Calling pci_remove_bus_device() on some of the devices on a bus will 
remove them. (ok)
2. Calling pci_scan_child_bus() on that bus will scan the bus for 
devices and add them all. even the ones already in the bus device list, 
which will now be duplicated in the bus device list! (no so ok)
3. Calling pci_bus_add_devices() Will try to add the device, assign 
resource, and create procfs files. The devices which were not originally 
removed are were being assigned resource they don't need and procfs 
files already exist and will collide (Bug)

Background:
I have a programmable component  which implements a pci device. Upon 
warm update i need to remove my devices from the bus, upgrade, and add 
them back.
I have bashed my head against pci.h API for the last week trying to make 
this work with no success. Right now I have devised a hideous hack to 
circumvent pci.h and directly remap the pci device registers.

demo code:

reload_my_hardware_design() {
    struct pci_dev *dev = NULL;
    struct pci_bus *bus = NULL;

    while ((dev = pci_get_device(PCI_VENDOR_ID_MY_ID,PCI_ANY_ID,NULL))) {
         pci_remove_bus_device(dev);
    }

    ...
    reload the hardware
    ...

    bus = pci_find_bus(0,0); /* though hard coded in this example this 
is the right bus*/
    pci_scan_child_bus(bus);
    pci_bus_add_devices(bus);
}


Is this a bug or am I doing something wrong?

Liberty
--

From: Eran Liberty
Date: Sunday, July 20, 2008 - 3:31 am

Dear Penguins,

As  a follow up on my own post ( http://lkml.org/lkml/2008/6/18/195 ), I 
have  solved a small bug/misbehavior within the pci_scan_single_device().

The misbehavior manifest itself upon calling pci_scan_child_bus() with a 
bus which already contain some devices. After scanning is done devices 
that already existed will be present twice in the PCI bus devise list. 
Once with is_added indication and once without. Trying to add this bus 
will cause a resource conflict as the same device is already present and 
initialized.

This patch will simply prevent a device to be added to a bus list if it 
is already there.

Points to consider:
1. I am not a PCI Guru and might have over looked the bigger picture. Is 
it OK to prevent a device of being re-added?
2. I have decided that two devices are, in fact, the same instance if it 
has the same: vendor, device, and devfn. Is there a finer test for a 
device identity?

p.s. As per http://www.kernel.org/pub/linux/docs/lkml/#s1-10 it is 
preferred for Mozilla Firefox clients to attach the patch, hence my 
patch is attached.

Liberty

Signed-off-by: Eran Liberty <liberty@extrictom.org>
---
From: Eran Liberty
Date: Sunday, July 20, 2008 - 9:48 am

The subject is wrong! it is 2.6.26-rc4.

I will re-email with a corrected subject for automated scripts sake.

Liberty


--

From: Eran Liberty
Date: Monday, July 21, 2008 - 12:18 pm

Dear Penguins,

As a follow up on my own post ( http://lkml.org/lkml/2008/6/18/195 ), I
have  solved a small bug/misbehavior within the pci_scan_single_device().

The misbehavior manifest itself upon calling pci_scan_child_bus() with a
bus which already contain some devices. After scanning is done devices
that already existed will be present twice in the PCI bus devise list.
Once with is_added indication and once without. Trying to add this bus
will cause a resource conflict as the same device is already present and
initialized.

This patch will simply prevent a device to be added to a pci bus list if 
it is already there.

Points to consider:
1. I am not a PCI Guru and might have over looked the bigger picture. Is
it OK to prevent a device of being re-added?
2. I have decided that two devices are, in fact, the same instance if it
has the same: vendor, device, and devfn. Is there a finer test for a
device identity?

p.s. As per http://www.kernel.org/pub/linux/docs/lkml/#s1-10 it is
preferred for Mozilla Firefox clients to attach the patch, hence my
patch is attached.

Liberty

Signed-off-by: Eran Liberty <liberty@extricom.org>
---

From: Matthew Wilcox
Date: Monday, July 21, 2008 - 12:49 pm

I think this is your real problem, that you're rescanning the entire
bus.  I don't think that's the route we'd recommend taking.

Why don't you call pci_scan_slot() instead?  You won't get the benefit of
pcibios_fixup_bus(), but I'm not convinced that's safe to call on a bus
that's already been scanned.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: eran liberty
Date: Tuesday, July 22, 2008 - 1:21 am

My stating point was that I have loaded a new design into a
programmable device which sits on the pci device. The new design can
implement numerous pci devices or non at all. I can think of an easy
way (or clean one) to scan only the programmable device. Scanning the

As said its not exactly a slot its more like a regular pci device that
someone suddenly welded into the pci bus. Its not a hotplug as well,
and I do not want to give up on the pcibios_fixup_bus()

As it is, with my patch applied i successfully go over the bus and
remove my own devices before I reprogram the
programmable device.

while ((dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,PCI_DEVICE_ID_MYDEV,NULL))
!= NULL) {
	pci_remove_bus_device(dev);
	pci_dev_put(dev);
}

Load a new design into it.

Then scan the entire bus and add the newly discovered devices.

bus = null;
while ((bus = pci_find_next_bus(bus)) != NULL) {
	pci_scan_child_bus(bus);
	pci_bus_assign_resources(bus);
	pci_bus_add_devices(bus);
}

As seen here, this sequence of instructions seems very intuitive. It
will fail without the patch upon pci_bus_add_devices().

Liberty
--

From: Matthew Wilcox
Date: Tuesday, July 22, 2008 - 4:49 am

That's what pci_scan_slot() is for.  It scans the first function at the
device number, then (if the header indicates it's a multifunction
device) scans the other functions associated with that device.  eg you
could call pci_scan_slot(bus, 0x30) and it will create function 06.0
(and potentially 06.1, 06.2, ...)

You presumably already have the devfn for the existing device since

Why not?  What architecture are you using?  What does
pcibios_fixup_bus() do for you?

(as a side-note, I'd like to reimplement the pcibios_fixup_*() routines;
I think a lot of what they do can be done more generically these days.

Seems utterly unintuitive to me.  You're doing a lot of unnecessary work
here, and if you have two cards in your machine, you'll take away both
of them when you reload either of them.

What you should do is cache the pci_bus and the devfn at startup:

static struct pci_bus *my_bus;
static int my_devfn;

	struct pci_dev *dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,
					PCI_DEVICE_ID_MYDEV, NULL);
	if (!dev)
		return -ENODEV;
	my_bus = dev->bus;
	my_devfn = dev->devfn;
	pci_dev_put(dev);

when you want to remove it:

	for (func = 0; func < 8; func++)
		struct pci_dev *dev = pci_get_slot(my_bus, my_devfn + func);
		if (!dev)
			continue;
		pci_remove_bus_device(dev);
		pci_dev_put(dev);
	}

when you want to rescan it:

	pci_scan_slot(my_bus, my_devfn);

(this only handles one programmable card.  The basic idea could be
extended to handle multiple cards if you need to do that).

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Eran Liberty
Date: Tuesday, July 22, 2008 - 6:08 am

Matthew,

You seem to have a finer grasp of the subject then I do, please correct/educate me on any of the points I raise in the following lines.

Each slot represent a single device which can have more then one function.  pci_scan_slot is aimed for scanning these multiple functions. 
I, on the other hand, have programmable device on the pci bus which is, for the sake of this discussion, a complete black box. 
This black box up on loading can implement more then one device, which can have more then one function each. 
So as far as I see it, now I need to scan all slots on the bus.

I work with ARCH=powerpc. pcibios_fixup_bus() will deal with all the resource bars allocation. 
Hmmm, I do want to remove all the devices that are implemented by the programmable unit which is reloaded. 
I have not considered the possibility of having more then one programmable unit. 
I think there is a hidden assumption in this code, again please correct me if I missed the point. 
This code assumes that the devices which will re-appear after the programmable unit is loaded has the same devfn and bus as the devices which were present before the reload. 
This assumption might be wrong. 

For example, I have a basic programmable image which has no pci devices at all. 
upon unloading I do not remove any device (as non are present) and up on reloading I suddenly have two. What is their bus? their devfn?

Ultimately I would have expected to find a "int pci_scan_bus(struct  <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=pci_scan_bus>pci_bus <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=pci_bus> *bus <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=bus>);" the "pci_scan_child_bus <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=pci_scan_child_bus>()" was the closest to the mark

Liberty

--

From: Eran Liberty
Date: Tuesday, July 22, 2008 - 6:14 am

WOW, sorry for the html tags (though I had this Thunderbird under 
control). Here is the same mail without html tags (hopeully)


Matthew,
You seem to have a finer grasp of the subject then I do, please 
correct/educate me on any of the points I raise in the following lines.

Each slot represent a single device which can have more then one 
function.  pci_scan_slot is aimed for scanning these multiple functions.
I, on the other hand, have programmable device on the pci bus which is, 
for the sake of this discussion, a complete black box.
This black box up on loading can implement more then one device, which 
can have more then one function each.
So as far as I see it, now I need to scan all slots on the bus.

But to be honest, upon looking a way to make my device work I dismissed 
the "pci_scan_slot()" option as It did not reach the "fixup_resource ()" 
I work with ARCH=powerpc. pcibios_fixup_bus() will deal with all the 
resource bars allocation.
If I can lend a hand there, let me know and I will try to squeeze it in 
Hmmm, I do want to remove all the devices that are implemented by the 
programmable unit which is reloaded.
I have not considered the possibility of having more then one 
programmable unit.
I think there is a hidden assumption in this code, again please correct 
me if I missed the point.
This code assumes that the devices which will re-appear after the 
programmable unit is loaded has the same devfn and bus as the devices 
which were present before the reload.
This assumption might be wrong.

For example, I have a basic programmable image which has no pci devices 
at all.
upon unloading I do not remove any device (as non are present) and up on 
reloading I suddenly have two. What is their bus? their devfn?

Ultimately I would have expected to find a "int pci_scan_bus(struct 
pci_bus *bus );" the "pci_scan_child_bus ()" was the closest to the mark

Liberty

--

From: Matthew Wilcox
Date: Tuesday, July 22, 2008 - 7:13 am

That's quite interesting.  A normal PCI device has its IDSEL pin
connected to one of the AD lines, so it's not possible for it to respond
to more then one of the 32 possible devices.  I suppose by decoding
the configuration cycles on the bus, it's possible to make a
programmable device respond to more than one device.  I suppose you have
to be careful to program it to not respond to the same device numbers as
any other device on the bus.  Or are there no other devices on this bus

OK.  I don't think it's safe to call powerpc's fixup_resource() more

What I want to do is get rid of fixup_resource() and its equivalent on
other architectures and call pcibios_bus_to_resource() from
drivers/pci/probe.c.

I've been working in this area recently; here's my current tree:

http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=pci-read-base


Yes, I did make that assumption ... I'm used to real PCI devices, not

If this is all alone on a bus of its own, we can certainly handle that.
Maybe we can make it a module parameter or something.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Eran Liberty
Date: Tuesday, July 22, 2008 - 8:25 am

If my devices where the only one on the bus I would not have stumbled on 
this bug. I would have removed them all before reloading and would have 
scanned and empty PCI bus list after the reloading.

Alas, the CPU itself is a pci device. Upon scanning the bus it is it who 
is duplicated. Upon trying to "pci_bus_add_devices()" you get resource 
collision.

Do you perceive any down side to my suggested patch?
It seems (to me) really logical that a device can not and should not be 
added twice to a pci bus device list.

Liberty
--

From: Matthew Wilcox
Date: Tuesday, July 22, 2008 - 9:52 am

That's not true, the code you posted:

bus = null;
while ((bus = pci_find_next_bus(bus)) != NULL) {
        pci_scan_child_bus(bus);
        pci_bus_assign_resources(bus);
        pci_bus_add_devices(bus);
}


Does the CPU have BARs of its own?  If you have no other devices with
BARs, it would explain why you have not noticed the problem with

The problem is the other side-effects of what you propose.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Eran Liberty
Date: Tuesday, July 22, 2008 - 10:41 am

Could you elaborate what ominous side-effects you predict?

Liberty
--

From: Matthew Wilcox
Date: Tuesday, July 22, 2008 - 11:11 am

Look at the consequences of calling fixup_resource() twice on the same
resource:

        res->start = (res->start + offset) & mask;
        res->end = (res->end + offset) & mask;

You'll add 'offset' to the other resources twice.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Eran Liberty
Date: Wednesday, July 23, 2008 - 11:31 am

Tring to find heads and tails in the code, I dug in. This is what I understand from the overall flow... and my points of interests:

1. pci_scan_child_bus() starts with iterating over all the devfn and scanning for a device with that devfn

drivers/pci/probe.c
1056 unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus)
1057 {
1063         /* Go find them, Rover! */
1064         for (devfn = 0; devfn < 0x100; devfn += 8)
1065                 pci_scan_slot(bus, devfn);

2. pci_scan_slot() will continue to scan all the functions that devfn might have to over
drivers/pci/probe.c
1019 int pci_scan_slot(struct pci_bus *bus, int devfn)
1020 {
1026         for (func = 0; func < 8; func++, devfn++) {
1029                 dev = pci_scan_single_device(bus, devfn);

3. pci_scan_single_device() will scan / test if this dev is valid.
drivers/pci/probe.c
996 struct pci_dev *__ref pci_scan_single_device(struct pci_bus *bus, int devfn)
997 {
998         struct pci_dev *dev;
999 
1000         dev = pci_scan_device(bus, devfn);

and add it. REGARDLESS if it is already on the pci bus list or not.
1001         if (!dev)
1002                 return NULL;
1003 
1004         pci_device_add(dev, bus);
1005 
1006         return dev;
1007 }

This is the first thing that needs to be fixed IMHO.

4. pci_scan_child_bus() will go on to fixup the bus whether it needs fixing or not.
drivers/pci/probe.c
1072         pcibios_fixup_bus(bus);

This might be the second thing that needs amending.

5.  pcibios_fixup_bus(bus) calls _pcibios_fixup_bus which will call fixup_resource() once per a bus resource (in contrast of per device on that bus) and fixes the resources of that bus.
arch/powerpc/kernel/pci-common.c	
784 static void __devinit __pcibios_fixup_bus(struct pci_bus *bus)
785 {
787         struct pci_dev *dev = bus->self;
798                 for (i = 0; i < PCI_BUS_NUM_RESOURCES; ++i) {
833                         fixup_resource(res, dev);
834                 }
835         }
850 ...
From: Eran Liberty
Date: Sunday, July 27, 2008 - 4:01 am

Found some time to test my own idea... it seems to work. :)

Its final version looks like this:

Before reloading the programmable unit, remove all the device implemented by it

	struct pci_dev *dev;
	while ((dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,PCI_DEVICE_ID_MYCOMP_MYDEV,NULL)) != NULL) {
		pci_remove_bus_device(dev);
		pci_dev_put(dev);
	}

After the programmable unit is loaded, scan and add only newly added device!

 	struct pci_bus *bus = NULL;
	while ((bus = pci_find_next_bus(bus)) != NULL) {
		int devfn;
		for (devfn = 0; devfn < 0x100; devfn++) {
			struct pci_dev *dev=pci_get_slot(bus,devfn);
			if (dev) 
				continue;  // <---- do not scan devices already present on the bus. Missing magic line :)
			(void)pci_scan_single_device(bus, devfn);
		}
		pci_bus_assign_resources(bus);
		pci_bus_add_devices(bus);
	}

So... no need for patch after all.
Unless... Hey whats this... behind it... on the side.... naaa, no need of patch... :(

Liberty

--

From: Matthew Wilcox
Date: Sunday, July 27, 2008 - 8:08 am

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Eran Liberty
Date: Monday, August 18, 2008 - 1:08 am

After compiling a kernel with ftrace I started to experience all sorts 
of crashes.
I have a powerpc env which closly resemble mpc8548cds_defconfig.

I have produced the crash in seconds by issuing:
 > while [ 1 ] ; do find / > /dev/null ; echo .  ; done

Liberty

here are some of the kernel crashes:
---
~ # find /proc/ -name kft
find: /proc/150/net/softnet_stat: No such file or direOops: Exception in 
kernel mode, sig: 11 [#1]
Exsw1600
Modules linked in:
NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
REGS: ddfc3d50 TRAP: 0700   Not tainted  (2.6.27-rc2)
MSR: 00029000 <EE,ME>  CR: 20002284  XER: 20000000
TASK = ddcccde0[1699] 'find' THREAD: ddfc2000
GPR00: 00000000 ddfc3e00 ddcccde0 dd895580 ddfc3e30 c00b5da0 ddfc3e90 
00000003
GPR08: c00ece54 00000370 00145505 00000003 80002282 100ad874 10083ca0 
10083cd8
GPR16: 10083cd0 00000008 00000000 c00b5da0 ddfc3f18 c00ece54 00000000 
df465720
GPR24: d7431900 ddfc3e30 dd895580 c0380000 dd895580 ddfc3e30 00000000 
ddfc3e00
NIP [c00bd6fc] d_lookup+0x40/0x90
LR [c00bd6fc] d_lookup+0x40/0x90
Call Trace:
[ddfc3e00] [c006ade4] rcu_process_callbacks+0x48/0x60 (unreliable)
[ddfc3e20] [c00eb514] proc_fill_cache+0x94/0x1b0
[ddfc3e80] [c00ef720] proc_task_readdir+0x294/0x3c4
[ddfc3ee0] [c00b60fc] vfs_readdir+0xb8/0xd8
[ddfc3f10] [c00b619c] sys_getdents64+0x80/0x104
[ddfc3f40] [c0010554] ret_from_syscall+0x0/0x3c
Instruction dump:
7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
ctory
---[ end trace 6b059a7f5ba5a55a ]---

---

Oops: Exception in kernel mode, sig: 11 [#1]
Exsw1600
Modules linked in: gplonly_api
NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
REGS: df52dc40 TRAP: 0700   Not tainted  (2.6.27-rc2)
MSR: 00029000 <EE,ME>  CR: 20082422  XER: 20000000
TASK = ddc2e060[2294] 'eeprom' THREAD: df52c000
GPR00: 00000000 df52dcf0 ddc2e060 dd82c700 df52dd58 df52dd48 dd82c75e 
00000000
GPR08: c0800000 000311f0 0000ffff df52dd10 20002488 ...
From: Steven Rostedt
Date: Monday, August 18, 2008 - 8:07 am

Just to make sure...

ftrace enables markers too, and RCU has tracing with the markers. This 
may not be the problem, but I just want to eliminate as many variables 
as possible.
Could you disable ftrace, but keep the markers on too.  Also, could you 
enable ftrace again and turn on the FTRACE_STARTUP_TEST.

Thanks,


--

From: Mathieu Desnoyers
Date: Monday, August 18, 2008 - 8:47 am

Steve ? I just noticed this :


ftrace_modify_code(unsigned long ip, unsigned char *old_code,
                   unsigned char *new_code)
{
        unsigned replaced;
        unsigned old = *(unsigned *)old_code;
        unsigned new = *(unsigned *)new_code;
        int faulted = 0;

        /*
         * Note: Due to modules and __init, code can
         *  disappear and change, we need to protect against faulting
         *  as well as code changing.
         *
         * No real locking needed, this code is run through
         * kstop_machine.
         */
        asm volatile (
                "1: lwz         %1, 0(%2)\n"
                "   cmpw        %1, %5\n"
                "   bne         2f\n"
                "   stwu        %3, 0(%2)\n"
                "2:\n"
                ".section .fixup, \"ax\"\n"
                "3:     li %0, 1\n"
                "       b 2b\n"
                ".previous\n"
                ".section __ex_table,\"a\"\n"
                _ASM_ALIGN "\n"
                _ASM_PTR "1b, 3b\n"
                ".previous"
                : "=r"(faulted), "=r"(replaced)
                : "r"(ip), "r"(new),
                  "0"(faulted), "r"(old)
                : "memory");

        if (replaced != old && replaced != new)
                faulted = 2;

        if (!faulted)
                flush_icache_range(ip, ip + 8);

        return faulted;
}

What happens if you are really unlucky and get the following execution
sequence ?


Load module A at address 0xddfc3e00
Populate ftrace function table while the system runs
Unload module A
Load module B at address 0xddfc3e00
Activate ftrace
-> At that point, since there is another module loaded in the same
address space, it won't fault. If there happens to be an instruction
which looks exactly like the instruction we are replacing at the same
location, we are introducing a bug. True both on x86 ans powerpc...


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C ...
From: Steven Rostedt
Date: Monday, August 18, 2008 - 9:12 am

Hi Mathieu,

Yep I know of this issue, and it is very unlikely it would happen.  But 
that's not good enough, so this is why I did this:

http://marc.info/?l=linux-kernel&m=121876928124384&w=2
http://marc.info/?l=linux-kernel&m=121876938524523&w=2

;-)

On powerpc it would be less likely an issue since the code segments are 
all 4 bytes aligned. And a call being replaced would be a call to mcount 
(relative jump). A call to mcount would be the same for both. Then 
again, we could be replacing a nop, but most likely that wouldn't hurt 
either, since nops are seldom used, and if we did call mcount, it would 
probably not hurt. But it would be an issue if Module B happen to put a 
data section that matched the code to jmp to mcount or a nop.

-- Steve

--

From: Mathieu Desnoyers
Date: Monday, August 18, 2008 - 10:04 am

Ah, I did not see this one passing by :) That's the right solution
indeed. I guess it's not in 2.6.27-rc2/rc3 though ? I wonder if the
bug can be repeated with a module-free kernel ?


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Scott Wood
Date: Monday, August 18, 2008 - 10:21 am

Some (most likely unrelated) nits in the above inline asm:

Should use a "b" constraint for %2, or you could get r0.  Or, use an "m" 
constraint with %U2%X2 after the lwz/stw.  Why stwu with an offset of 
zero, BTW?

%1 also needs to be an early clobber.

-Scott
--

From: Steven Rostedt
Date: Monday, August 18, 2008 - 11:27 am

Because it worked :-)  Really, it has been a while since I did any PPC 
assembly, and I didn't break out the old reference manuals to do this.
I simply looked at other asm code, and followed suit. I compiled and 
tested it, and it worked.  I did make a disclaimer about my rusty PPC

Not exactly sure what you mean by the above.

-- Steve
--

From: Scott Wood
Date: Monday, August 18, 2008 - 11:29 am

The advantage of the latter is that it allows GCC to choose indexed or 
update instructions -- but that's merely an optimization.  Switching to 

%1 is written to before some inputs are consumed, so you need to use 
"=&r" rather than "=r" so that GCC won't use the same register for both.

-Scott
--

From: Benjamin Herrenschmidt
Date: Monday, August 18, 2008 - 6:53 pm

Why not use __get_user/__put_user ?

Cheers,
Ben.


--

From: Steven Rostedt
Date: Monday, August 18, 2008 - 7:28 pm

Hmm, this was originally copied from x86, where we did a cmpxchg, but that 
is probably not needed since all of this is done in kstop_machine. Also, 
only the "get" is needed. If we don't fault there, we wont fault on the 
put (unless we have permissions wrong, and that would be a bug).

So are you recommending something like

        int cmd;

	if (__get_user(cmd, ip))
		goto fault;

	if (cmd != old)
		goto not_same;

	WARN_ON_ONCE(__put_user(cmd, ip));

If we did this, we could probably put this into the generic code:


	if (copy_from_user(cmd, ip, ARCH_CALL_SIZE))
		goto fault;

	if (memcmp(cmd, old, ARCH_CALL_SIZE) != 0)
		goto not_same;

	WARN_ON_ONCE(copy_to_user(cmd, ip, ARCH_CALL_SIZE));


-- Steve

--

From: Benjamin Herrenschmidt
Date: Monday, August 18, 2008 - 7:39 pm

You need the __ variants or the access_ok() checks will bite
you bad.

Ben.


--

From: Steven Rostedt
Date: Monday, August 18, 2008 - 7:41 pm

Ah, good point ;-)

-- Steve

--

From: Mathieu Desnoyers
Date: Monday, August 18, 2008 - 7:47 pm

Argh. See text_poke(). It's there exactly for this purpose on x86.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Monday, August 18, 2008 - 8:32 pm

Ouch, I just did. text_poke is quite heavy. It would be interesting to see 
that performed on 20,000 locations at one time. I could play with it, but 
I'm a bit nervous.

-- Steve
--

From: Mathieu Desnoyers
Date: Monday, August 18, 2008 - 8:36 pm

It's alread used to modify the LOCK prefixes in alternative.c and did
not seem to be too slow for that.. it should therefore be ok.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Monday, August 18, 2008 - 9:00 pm

There's a lot more functions than locks ;-)

Anyway, this is only needed after system boot up. But then we do it ever 
time we start or stop tracing. I'll try it out and see if it is noticable.

-- Steve

--

From: Steven Rostedt
Date: Tuesday, August 19, 2008 - 9:47 am

OK, I just tried text_poke and it unfortunately fails. The problem is that 
it requires that the text you are changing is aligned and fits on one 
page. We have no control over that.

-- Steve

--

From: Mathieu Desnoyers
Date: Tuesday, August 19, 2008 - 10:34 am

Ok, there are two cases where it's ok :

1 - in stop_machine, considering we are not touching code executed in
NMI handlers.
2 - when using my replace_instruction_safe() which uses a temporary
breakpoint when doing the instruction replacement.

In those cases you could use text_poke_early().

See
http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x8...

For a use example. Basically it looks like :


360                 pages[0] = virt_to_page((void *)bypass_eip);
361                 vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
362                 BUG_ON(!vaddr);
363                 text_poke_early(&vaddr[bypass_eip & ~PAGE_MASK],
364                         (void *)addr, size);
365                 /*
366                  * Fill the rest with nops.
367                  */
368                 len = NR_NOPS - size;
369                 add_nops((void *)
370                         &vaddr[(bypass_eip & ~PAGE_MASK) + size],
371                         len);
372                 print_dbg_bytes("inserted nops",
373                         &vaddr[(bypass_eip & ~PAGE_MASK) + size], len);
374                 vunmap(vaddr);

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Tuesday, August 19, 2008 - 2:08 pm

vunmap can not be called with interrupts disabled, and this is exactly 
what my code does.

-- Steve
--

From: Nick Piggin
Date: Wednesday, August 20, 2008 - 2:40 am

It could be after the vmap rewrite (with a few other small tweaks).
But a) it would be less robust when called from interrupt context
and this code looks broken as it is WRT error handling; and b) it
still costs several thousand cycles to vmap+touch+vunmap...
--

From: Benjamin Herrenschmidt
Date: Tuesday, August 19, 2008 - 2:47 pm

Note that vmap/vunmap will be very slow.

Ben.


--

From: Jeremy Fitzhardinge
Date: Tuesday, August 19, 2008 - 4:58 pm

Don't we have Nick's speedups now?

    J
--

From: Benjamin Herrenschmidt
Date: Tuesday, August 19, 2008 - 6:17 pm

Not sure what speedups this is, but it's stlil not something you want to
do a lot ... setting up and tearing down MMU mappings isn't something
I'd like to see happening on a per-mcount basis.

Ben

--

From: Benjamin Herrenschmidt
Date: Monday, August 18, 2008 - 7:56 pm

Ok, so i did a patch, but it doesn't fix the problem. So
there's something else whacking on the stack frames.

Still, here it is:

powerpc/ftrace: Fix broken assembly for code replacement

Instead, uses __get_user() and __put_user().

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

Index: linux-work/arch/powerpc/kernel/ftrace.c
===================================================================
--- linux-work.orig/arch/powerpc/kernel/ftrace.c	2008-08-19 12:45:49.000000000 +1000
+++ linux-work/arch/powerpc/kernel/ftrace.c	2008-08-19 12:52:51.000000000 +1000
@@ -16,7 +16,7 @@
 
 #include <asm/cacheflush.h>
 #include <asm/ftrace.h>
-
+#include <asm/uaccess.h>
 
 static unsigned int ftrace_nop = 0x60000000;
 
@@ -72,10 +72,9 @@ notrace int
 ftrace_modify_code(unsigned long ip, unsigned char *old_code,
 		   unsigned char *new_code)
 {
-	unsigned replaced;
-	unsigned old = *(unsigned *)old_code;
-	unsigned new = *(unsigned *)new_code;
-	int faulted = 0;
+	unsigned int old = *(unsigned int *)old_code;
+	unsigned int new = *(unsigned int *)new_code;
+	unsigned int instr;
 
 	/*
 	 * Note: Due to modules and __init, code can
@@ -85,32 +84,13 @@ ftrace_modify_code(unsigned long ip, uns
 	 * No real locking needed, this code is run through
 	 * kstop_machine.
 	 */
-	asm volatile (
-		"1: lwz		%1, 0(%2)\n"
-		"   cmpw	%1, %5\n"
-		"   bne		2f\n"
-		"   stwu	%3, 0(%2)\n"
-		"2:\n"
-		".section .fixup, \"ax\"\n"
-		"3:	li %0, 1\n"
-		"	b 2b\n"
-		".previous\n"
-		".section __ex_table,\"a\"\n"
-		_ASM_ALIGN "\n"
-		_ASM_PTR "1b, 3b\n"
-		".previous"
-		: "=r"(faulted), "=r"(replaced)
-		: "r"(ip), "r"(new),
-		  "0"(faulted), "r"(old)
-		: "memory");
-
-	if (replaced != old && replaced != new)
-		faulted = 2;
-
-	if (!faulted)
-		flush_icache_range(ip, ip + 8);
-
-	return faulted;
+	if (__get_user(instr, (unsigned int __user *)ip))
+		return 1;
+	if (instr != old && instr != new)
+		return 2;
+	WARN_ON_ONCE(__put_user(new, ...
From: Steven Rostedt
Date: Monday, August 18, 2008 - 8:12 pm

Is it 32 bit or 64?  I've tested the 64 bit quite a bit, but not so much 
the 32 (my powerbook is not usually that stable).



Acked-by: Steven Rostedt <srostedt@redhat.com>

-- Steve

--

From: Benjamin Herrenschmidt
Date: Monday, August 18, 2008 - 9:17 pm

Yeah, I had a look, nothing obviously bad there, but I may just
have missed it :-)

Ben.


--

From: Benjamin Herrenschmidt
Date: Wednesday, August 20, 2008 - 12:18 am

Found the problem (or at least -a- problem), it's a gcc bug.

Well, first I must say the code generated by -pg is just plain
horrible :-)

Appart from that, look at the exit of, for example, __d_lookup, as
generated by gcc when ftrace is enabled:

c00c0498:       38 60 00 00     li      r3,0
c00c049c:       81 61 00 00     lwz     r11,0(r1)
c00c04a0:       80 0b 00 04     lwz     r0,4(r11)
c00c04a4:       7d 61 5b 78     mr      r1,r11
c00c04a8:       bb 0b ff e0     lmw     r24,-32(r11)
c00c04ac:       7c 08 03 a6     mtlr    r0
c00c04b0:       4e 80 00 20     blr

As you can see, it restores r1 -before- it pops r24..r31 off
the stack ! I let you imagine what happens if an interrupt happens
just in between those two instructions (mr and lmw). We don't do
redzones on our ABI, so basically, the registers end up corrupted
by the interrupt.

Cheers,
Ben.


--

From: Steven Rostedt
Date: Wednesday, August 20, 2008 - 6:14 am

Ouch!  You've disassembled this without -pg too, and it does not have this 
bug? What version of gcc do you have?

-- Steve

--

From: Steven Rostedt
Date: Wednesday, August 20, 2008 - 6:19 am

I have:
 gcc (Debian 4.3.1-2) 4.3.1

c00c64c8:       81 61 00 00     lwz     r11,0(r1)
c00c64cc:       7f 83 e3 78     mr      r3,r28
c00c64d0:       80 0b 00 04     lwz     r0,4(r11)
c00c64d4:       ba eb ff dc     lmw     r23,-36(r11)
c00c64d8:       7d 61 5b 78     mr      r1,r11
c00c64dc:       7c 08 03 a6     mtlr    r0
c00c64e0:       4e 80 00 20     blr


My version looks fine.  I'm thinking that this is a separate issue than 
what Eran is seeing.

Eran, can you do an "objdump -dr vmlinux" and search for __d_lookup, and 
print out the end of the function dump.

Thanks,

-- Steve

--

From: Eran Liberty
Date: Wednesday, August 20, 2008 - 6:36 am

powerpc-linux-gnu-objdump -dr --start-address=0xc00bb584 vmlinux | head 
-n 100

vmlinux:     file format elf32-powerpc

Disassembly of section .text:

c00bb584 <__d_lookup>:
c00bb584:       7c 08 02 a6     mflr    r0
c00bb588:       90 01 00 04     stw     r0,4(r1)
c00bb58c:       4b f5 5c 51     bl      c00111dc <_mcount>
c00bb590:       94 21 ff d0     stwu    r1,-48(r1)
c00bb594:       7c 08 02 a6     mflr    r0
c00bb598:       3d 20 9e 37     lis     r9,-25033
c00bb59c:       bf 01 00 10     stmw    r24,16(r1)
c00bb5a0:       61 29 00 01     ori     r9,r9,1
c00bb5a4:       3d 60 c0 38     lis     r11,-16328
c00bb5a8:       90 01 00 34     stw     r0,52(r1)
c00bb5ac:       7c 60 4a 78     xor     r0,r3,r9
c00bb5b0:       54 00 d9 7e     rlwinm  r0,r0,27,5,31
c00bb5b4:       83 84 00 00     lwz     r28,0(r4)
c00bb5b8:       7c 3f 0b 78     mr      r31,r1
c00bb5bc:       81 0b 1a 2c     lwz     r8,6700(r11)
c00bb5c0:       39 6b 1a 2c     addi    r11,r11,6700
c00bb5c4:       7c 00 e2 14     add     r0,r0,r28
c00bb5c8:       81 4b 00 04     lwz     r10,4(r11)
c00bb5cc:       7c 09 4a 78     xor     r9,r0,r9
c00bb5d0:       83 24 00 04     lwz     r25,4(r4)
c00bb5d4:       7d 29 44 30     srw     r9,r9,r8
c00bb5d8:       81 0b 00 08     lwz     r8,8(r11)
c00bb5dc:       7d 29 02 78     xor     r9,r9,r0
c00bb5e0:       83 04 00 08     lwz     r24,8(r4)
c00bb5e4:       7d 29 50 38     and     r9,r9,r10
c00bb5e8:       55 29 10 3a     rlwinm  r9,r9,2,0,29
c00bb5ec:       7c 09 40 2e     lwzx    r0,r9,r8
c00bb5f0:       7c 9a 23 78     mr      r26,r4
c00bb5f4:       7c 7b 1b 78     mr      r27,r3
c00bb5f8:       2f 80 00 00     cmpwi   cr7,r0,0
c00bb5fc:       7c 1e 03 78     mr      r30,r0
c00bb600:       40 be 00 14     bne+    cr7,c00bb614 <__d_lookup+0x90>
c00bb604:       48 00 00 7c     b       c00bb680 <__d_lookup+0xfc>
c00bb608:       83 de 00 00     lwz     r30,0(r30)
c00bb60c:       2f 9e 00 00     cmpwi   cr7,r30,0
c00bb610:       41 9e 00 70     beq-    ...
From: Steven Rostedt
Date: Wednesday, August 20, 2008 - 6:43 am

Yep, you have the same bug in your compiler.

-- Steve


--

From: Eran Liberty
Date: Wednesday, August 20, 2008 - 7:02 am

Hmm... so whats now?

Is there a way to prove this scenario is indeed the one that caused the 
opps?

-- Liberty
--

From: Jon Smirl
Date: Wednesday, August 20, 2008 - 7:55 am

Manually edit the broken binary to change the order of the restore and
see if the problem disappears. That will keep everything else


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Steven Rostedt
Date: Wednesday, August 20, 2008 - 8:23 am

Does it only happen on this function? Or is it happening on others as 
well. So manually fixing it here won't help much unless we fix it 
everywhere.

-- Steve

--

From: Eran Liberty
Date: Wednesday, August 20, 2008 - 11:23 am

checked with objdump where __d_lookup()

 > powerpc-linux-gnu-objdump -dr --start-address=0xc0065790 vmlinux | 
grep "<__d_lookup>:"
c00b91d8 <__d_lookup>:

 > liberty@liberty:~/svn/exsw1600-2.6.27-rc2$ powerpc-linux-gnu-objdump 
-dr --start-address=0xc00b91d8 vmlinux| head -n 100

vmlinux:     file format elf32-powerpc

Disassembly of section .text:

c00b91d8 <__d_lookup>:
c00b91d8:       7c 08 02 a6     mflr    r0
c00b91dc:       90 01 00 04     stw     r0,4(r1)
 ....

c00b92bc:       4e 80 04 21     bctrl
c00b92c0:       2f 83 00 00     cmpwi   cr7,r3,0
c00b92c4:       41 9e 00 50     beq-    cr7,c00b9314 <__d_lookup+0x13c>
c00b92c8:       83 de 00 00     lwz     r30,0(r30)
c00b92cc:       2f 9e 00 00     cmpwi   cr7,r30,0
c00b92d0:       40 9e ff 98     bne+    cr7,c00b9268 <__d_lookup+0x90>
c00b92d4:       38 60 00 00     li      r3,0
c00b92d8:       81 61 00 00     lwz     r11,0(r1)
c00b92dc:       80 0b 00 04     lwz     r0,4(r11)
c00b92e0:       7d 61 5b 78     mr      r1,r11
       <=== As explained by Steve, these two should be replaced ===>
c00b92e4:       bb 0b ff e0     lmw     r24,-32(r11)
c00b92e8:       7c 08 03 a6     mtlr    r0
c00b92ec:       4e 80 00 20     blr
c00b92f0:       80 04 00 04     lwz     r0,4(r4)
  ....
c00b9330:       7f a3 eb 78     mr      r3,r29
c00b9334:       4b ff ff a4     b       c00b92d8 <__d_lookup+0x100>

c00b9338 <d_lookup>:
c00b9338:       7c 08 02 a6     mflr    r0

on the target I fired up xmon and replaced them.

~ # echo x > /proc/sysrq-trigger
SysRq : Entering xmon
Vector: 0  at [df51fdb8]
    pc: c0025960: sysrq_handle_xmon+0x60/0x64
    lr: c0025960: sysrq_handle_xmon+0x60/0x64
    sp: df51fe80
   msr: 21000
  current = 0xdc22a9a0
    pid   = 1698, comm = echo
WARNING: exception is not recoverable, can't continue
enter ? for help
[df51fe90] c0193c38 __handle_sysrq+0xa8/0x178
[df51fec0] c00ee818 write_sysrq_trigger+0x78/0x7c
[df51fed0] c00e65e4 proc_reg_write+0x5c/0x84
[df51fef0] c00a299c ...
From: Steven Rostedt
Date: Wednesday, August 20, 2008 - 11:33 am

This is a bug in gcc, and should be mentioned to the gcc maintainers.

-- Steve

--

From: Steven Rostedt
Date: Wednesday, August 20, 2008 - 8:27 am

I'm not convinced that this is the bug you were seeing. Do you still get 
the crash if you disable dynamic ftrace but leave ftrace on?

-- Steve

--

From: Benjamin Herrenschmidt
Date: Wednesday, August 20, 2008 - 2:37 pm

I've verified in sim that the crash in do_lookup() I was seeing (similar
backtrace) was indeed caused by an interrupt coming between those two
instructions and clobbering the saved GPRs.

So I think it's very likely our problem.

Ben.


--

From: Josh Boyer
Date: Wednesday, August 20, 2008 - 7:16 am

Segher was looking at this a bit this morning.  He thinks it's really
-fno-omit-frame-pointer that is causing this.  That really shouldn't
even be set on PowerPC, but FTRACE uses select which overrides the
depends on stuff in Kconfig.

Segher can probably tell you more once his email is back up.

josh

--

From: Steven Rostedt
Date: Wednesday, August 20, 2008 - 7:22 am

I can easily make a patch that makes that select an x86 only.

-- Steve

--

From: Josh Boyer
Date: Wednesday, August 20, 2008 - 7:50 am

On Wed, 20 Aug 2008 10:22:49 -0400 (EDT)

That's probably a first step, but you might want to wait until Segher
can fill in more details.  I'm sort of just relaying what he and I were
talking about on IRC.  IIRC, he was testing builds with and without
both -pg and -fno-omit-frame-pointer and found the bug only when
-fno-omit-frame-pointer was present.

josh
--

From: Eran Liberty
Date: Monday, September 15, 2008 - 9:30 am

Dear Penguins,

Let me start of by saying my particular hardware must be buggy in some 
way. When I issue a sysrq (Ctrl A+ F from minicom) I get a lot of sysrq 
triggers.

I have worked around the problem and I think this workaround is a viable 
patch even for platforms which do not exhibit this peculiar behavior.

upon getting numerous interrupts which request sysrq the function 
uart_handle_break in include/linux/serial_core.h is hit multiple times.
The current code which looks like this:

static inline int uart_handle_break(struct uart_port *port)
 {
        struct uart_info *info = port->info;
 #ifdef SUPPORT_SYSRQ
        if (port->cons && port->cons->index == port->line) {
               if (!port->sysrq) {
                       port->sysrq = jiffies + HZ*5;
                       return 1;
               }
               port->sysrq = 0;
        }
 #endif
        if (port->flags & UPF_SAK)
                do_SAK(info->tty);
       return 0;
}

Will basicly toggle port->sysrq between a timeout value and zero. If you 
are lucky this penguin rullet will stop on timeout and the next 
character hit will trigger the sysrq in the function 
"uart_handle_sysrq_char". But if you are not so lucky the last sysrq 
interupt will toggle port->sysrq to zero and the next char hit will be 
ignored (not trigger sysrq).

The suggested patch will do the next few things:

1. "port->sysrq" is now the time when the last sysrq was triggered and 
not the timeout for the the next char
2. Stamped "port->sysrq" every time there is a sysrq rather then toggled 
it up and down.
3. Always continue to consider UPF_SAK.
4. "port->sysrq" is toggled back to zero only in uart_handle_break() and 
only if the a char has been accepted after the sysrq timeout (5 sec)
5. uart_handle_break() will ignore extra chars received in super human 
speed after the last sysrq (0.01 sec)

Liberty

Signed-off-by: Eran Liberty <liberty@extricom.org>
---


From: Andrew Morton
Date: Wednesday, September 17, 2008 - 4:46 pm

On Mon, 15 Sep 2008 19:30:05 +0300


We prefer patches in `patch -p1' form, please.


The 0.01 is a big no-no.  Sometimes gcc like to go into stupid mode and
starts doing floating point stuff.

A suitable fix would be to use HZ/100.  But that assumes that HZ is
always >= 100.  That's a pretty good assumption, and various parts of
the kernel will explode if HZ is set too small.  However it's always
good to ensure that someone else's stuff will explode before yours
does, so how about we make it HZ/50?  Will that still work OK for you?

--- a/include/linux/serial_core.h~serial-driver-handle-multiple-consecutive-sysrq-from-the-serial-fix
+++ a/include/linux/serial_core.h
@@ -444,8 +444,7 @@ static inline int
 uart_handle_sysrq_char(struct uart_port *port, unsigned int ch)
 {
 #ifdef SUPPORT_SYSRQ
-	if (port->sysrq && time_after(jiffies, port->sysrq +
-					(unsigned long)(HZ*0.01))) {
+	if (port->sysrq && time_after(jiffies, port->sysrq + HZ / 50)) {
 		if (ch && time_before(jiffies, port->sysrq + HZ*5)) {
 			handle_sysrq(ch, port->info ? port->info->port.tty : NULL);
 			port->sysrq = 0;
_


--

From: Eran Liberty
Date: Wednesday, September 17, 2008 - 11:58 pm

HZ/50 is totally OK by me.


-- Liberty
--

From: Benjamin Herrenschmidt
Date: Wednesday, August 20, 2008 - 2:36 pm

Oops, should have mentioned it ;-)

4.2.3 (Ubuntu 4.2.3-2ubuntu7)

Ben.


--

From: Steven Rostedt
Date: Wednesday, August 20, 2008 - 2:44 pm

I'm running a 4.3.1 (Debian 4.3.1-2) on my powerbook. Perhaps the issue is 
already taken care of.

-- Steve

--

From: Steven Rostedt
Date: Monday, August 18, 2008 - 11:47 am

What syntax to do that with?

  lwz %1,0(%U2)
  stu %3, 0(%X2)


How else to do it?  stwu %3, (%2) does not compile for me.

-- Steve

--

From: Scott Wood
Date: Monday, August 18, 2008 - 11:56 am

lwz%U2%X2 %1, %2

stw %3, 0(%2)

The "u" tells it to write the effective address back to %2 -- but with 
an offset of zero, the effective address is unchanged.

-Scott
--

From: Steven Rostedt
Date: Monday, August 18, 2008 - 12:28 pm

Ah! Right!

/me should open up his PowerPC ref books again :-p

-- Steve

--

From: Eran Liberty
Date: Monday, August 18, 2008 - 11:25 am

for the fun of it I took out all my propriety modules; so now its a non 
tainted kernel.

Here is the matrix:

!FTRACE x !MARKERS => stable
!FTRACE x MARKERS => stable
FTRACE x !MARKERS => n/a (FTRACE forces MARKERS)
FTRACE x MARKERS => unstable
FTRACE x FTRACE_STARTUP_TEST x MARKERS => unstable + tests passed

Testing tracer sched_switch: PASSED
Testing tracer ftrace: PASSED
Testing dynamic ftrace: PASSED

Oops: Exception in kernel mode, sig: 11 [#1]
Exsw1600
Modules linked in:
NIP: c00bbb20 LR: c00bbb20 CTR: 00000000
REGS: dd5b1c50 TRAP: 0700   Not tainted  (2.6.27-rc2)
MSR: 00029000 <EE,ME>  CR: 24082282  XER: 20000000
TASK = ddcce060[1707] 'find' THREAD: dd5b0000
GPR00: 00000000 dd5b1d00 ddcce060 dd801180 dd5b1d68 dd5b1d58 dd80125b 
100234ec
GPR08: c0800000 00019330 0000ffff dd5b1d20 24000288 100ad874 100936f8 
1008a1d0
GPR16: 10083f80 dd5b1e2c dd5b1d68 fffffff4 c0380000 dd5b1d60 dd5b1d58 
dd802084
GPR24: dc3d7700 dd802018 dd5b1d68 c0380000 dd801180 dd5b1d68 00000000 
dd5b1d00
NIP [c00bbb20] d_lookup+0x40/0x90
LR [c00bbb20] d_lookup+0x40/0x90
Call Trace:
[dd5b1d00] [dd5b1d58] 0xdd5b1d58 (unreliable)
[dd5b1d20] [c00aebc4] do_lookup+0xe8/0x220
[dd5b1d50] [c00b0a80] __link_path_walk+0x5a4/0xd54
[dd5b1dc0] [c00b1288] path_walk+0x58/0xe0
[dd5b1df0] [c00b13f8] do_path_lookup+0x78/0x13c
[dd5b1e20] [c00b20f4] user_path_at+0x64/0xac
[dd5b1e90] [c00a9028] vfs_lstat_fd+0x34/0x74
[dd5b1ec0] [c00a90fc] vfs_lstat+0x30/0x48
[dd5b1ed0] [c00a9144] sys_lstat64+0x30/0x5c
[dd5b1f40] [c0010554] ret_from_syscall+0x0/0x3c
Instruction dump:
7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0
---[ end trace 1eb8fd7adac2bb65 ]---

Liberty


--

From: Mathieu Desnoyers
Date: Monday, August 18, 2008 - 11:41 am

Can you check if, at some point during the system execution (starting
from boot), 0xdd5b1d58 is an address where a module is loaded ? (the
module can be later unloaded, what I wonder is if this address would
appear to have had a loaded+unloaded module).

Actually, could you try to compile your kernel without "MODULE_UNLOAD" ?


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Benjamin Herrenschmidt
Date: Monday, August 18, 2008 - 6:54 pm

I've had similar crashes (with similar backtraces in the VFS) on
machines with no modules, netbooted zImages.

Ben.


--

From: Eran Liberty
Date: Tuesday, August 19, 2008 - 2:56 am

No modules...

~ # uname -a
Linux 2.6.27-rc2 #6 Tue Aug 19 12:33:34 IDT 2008 ppc unknown
~ # gunzip /proc/config.gz -c | grep UNLOAD
# CONFIG_MODULE_UNLOAD is not set
~ # gunzip /proc/config.gz -c | grep FTRACE
CONFIG_HAVE_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_FTRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
~ # gunzip /proc/config.gz -c | grep MARKERS
CONFIG_MARKERS=y
~ # lsmod
Module                  Size  Used by
~ # while [ 1 ] ; do find / > /dev/null ; echo .  ; done
.
Oops: Exception in kernel mode, sig: 11 [#1]
Exsw1600
Modules linked in:
NIP: c00bb724 LR: c00bb724 CTR: 00000000
REGS: d8b59c50 TRAP: 0700   Not tainted  (2.6.27-rc2)
MSR: 00029000 <EE,ME>  CR: 24082282  XER: 20000000
TASK = dbc68de0[1712] 'find' THREAD: d8b58000
GPR00: 00000000 d8b59d00 dbc68de0 dd801180 d8b59d68 d8b59d58 dd8019db 
100234ec
GPR08: c0800000 00019330 0000ffff d8b59d20 24000288 100ad874 100936f8 
1008a1d0
GPR16: 10083f80 d8b59e2c d8b59d68 fffffff4 c0380000 d8b59d60 d8b59d58 
dd802084
GPR24: ddc69500 dd802018 d8b59d68 c0380000 dd801180 d8b59d68 00000000 
d8b59d00
NIP [c00bb724] d_lookup+0x40/0x90
LR [c00bb724] d_lookup+0x40/0x90
Call Trace:
[d8b59d00] [d8b59d58] 0xd8b59d58 (unreliable)
[d8b59d20] [c00ae7c8] do_lookup+0xe8/0x220
[d8b59d50] [c00b0684] __link_path_walk+0x5a4/0xd54
[d8b59dc0] [c00b0e8c] path_walk+0x58/0xe0
[d8b59df0] [c00b0ffc] do_path_lookup+0x78/0x13c
[d8b59e20] [c00b1cf8] user_path_at+0x64/0xac
[d8b59e90] [c00a8c98] vfs_lstat_fd+0x34/0x74
[d8b59ec0] [c00a8d6c] vfs_lstat+0x30/0x48
[d8b59ed0] [c00a8db4] sys_lstat64+0x30/0x5c
[d8b59f40] [c0010554] ret_from_syscall+0x0/0x3c
Instruction dump:
7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0
---[ end trace 7766edd310cd3442 ]---

--

From: Mathieu Desnoyers
Date: Tuesday, August 19, 2008 - 6:02 am

Just like this instruction... maybe related ?


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Benjamin Herrenschmidt
Date: Tuesday, August 19, 2008 - 2:46 pm

always the same and not always with the same value.

The corruption -seems- to happen due to corruption of the location on
the stack where it was saved to / restored from in a previous function
call or interrupt.

I haven't yet tracked down what actually whacks the stack.

Ben.


--

From: Steven Rostedt
Date: Monday, August 18, 2008 - 11:50 am

Could you load this into gdb for me and show me the output of:

gdb> li *0xc00bbb20

(Assuming you compiled with debuginfo on)

and...


Ouch! we have a 00000000 instruction. I'm almost done with the new mcount 
record for PPC (I have it done for ppc64, I'm just porting it to 32). I'm 
thinking this new code may solve your issues too. I hate the daemon.

-- Steve
--

From: Eran Liberty
Date: Tuesday, August 19, 2008 - 5:09 am

I have attached ddd. Up on crashing it points on this line

struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
{
        struct dentry * dentry = NULL;
        unsigned long seq;
   
        do {
                seq = read_seqbegin(&rename_lock);
                dentry = __d_lookup(parent, name);     <==== ddd marker
                if (dentry)
                        break;
        } while (read_seqretry(&rename_lock, seq));
        return dentry;
}

(gdb) bt
#0  d_lookup (parent=0xdd801180, name=0xd99b3d68) at fs/dcache.c:1361
#1  0xc00ae7c8 in real_lookup (nd=<value optimized out>, name=<value 
optimized out>, parent=0xdd801180) at fs/namei.c:507
#2  do_lookup (nd=0xd99b3e28, name=0xd99b3d68, path=0xd99b3d58) at 
fs/namei.c:825
#3  0xc00b0684 in __link_path_walk (name=0xddc23009 "", nd=0xd99b3e28) 
at fs/namei.c:982
#4  0xc00b0e8c in link_path_walk (nd=<value optimized out>, name=<value 
optimized out>) at fs/namei.c:570
#5  path_walk (name=0xddc23000 "/proc/mtd", nd=0xd99b3e28) at 
fs/namei.c:1041
#6  0xc00b0ffc in do_path_lookup (dfd=<value optimized out>, 
name=0xddc23000 "/proc/mtd", flags=<value optimized out>, nd=0xd99b3e28) 
at fs/namei.c:1091
#7  0xc00b1cf8 in user_path_at (dfd=-100, name=<value optimized out>, 
flags=0, path=0xd99b3e98) at fs/namei.c:1334
#8  0xc00a8c98 in vfs_lstat_fd (dfd=-578809472, name=0xd99b3d68 "", 
stat=0xd99b3ed8) at fs/stat.c:83
#9  0xc00a8d6c in vfs_lstat (name=0xd99b3d68 "", stat=0xd99b3d58) at 
fs/stat.c:93
#10 0xc00a8db4 in sys_lstat64 (filename=0xdd801180 "", 
statbuf=0xbfe285d8) at fs/stat.c:381
#11 0xc0010554 in syscall_dotrace_cont ()

both cp & lr points to 0xC00BB724
(gdb) info registers
r0             0x0    0
r1             0xd99b3d00    3650829568
r2             0xddccf2e0    3721196256
r3             0xdd801180    3716157824
r4             0xd99b3d68    3650829672
r5             0xd99b3d58    3650829656
r6             0xdd822a5b    3716295259
r7             0x100234ec    268580076
r8     ...
From: Mathieu Desnoyers
Date: Tuesday, August 19, 2008 - 6:05 am

Can you also give us

objdump -S --start-address=0xC00BB724 vmlinux | head 20

?

Then we could compare the result with the OOPS instruction dump :

7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Eran Liberty
Date: Tuesday, August 19, 2008 - 7:21 am

to give you more context I have run:
 > powerpc-linux-gnu-objdump -S --start-address=0xC00BB700 vmlinux | 
head -n 60

the discrepancy starts at address:
c00bb720: 40 82 00 30 <=> 40 82 00 2f
c00bb724: 4b ff fe 61 <=> 00 00 00 00

vmlinux:     file format elf32-powerpc

Disassembly of section .text:

c00bb700 <d_lookup+0x1c>:
 * d_lookup() is protected against the concurrent renames in some unrelated
 * directory using the seqlockt_t rename_lock.
 */

struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
{
c00bb700:       7c 3f 0b 78     mr      r31,r1
c00bb704:       90 01 00 24     stw     r0,36(r1)
c00bb708:       7c 7c 1b 78     mr      r28,r3
c00bb70c:       7c 9d 23 78     mr      r29,r4
c00bb710:       83 db 32 a0     lwz     r30,12960(r27)
{
        unsigned ret;

repeat:
        ret = sl->sequence;
        smp_rmb();
c00bb714:       73 c0 00 01     andi.   r0,r30,1
        struct dentry * dentry = NULL;
        unsigned long seq;

        do {
                seq = read_seqbegin(&rename_lock);
                dentry = __d_lookup(parent, name);
c00bb718:       7f 83 e3 78     mr      r3,r28
c00bb71c:       7f a4 eb 78     mr      r4,r29
        if (unlikely(ret & 1)) {
c00bb720:       40 82 00 30     bne-    c00bb750 <d_lookup+0x6c>
c00bb724:       4b ff fe 61     bl      c00bb584 <__d_lookup>
                if (dentry)
c00bb728:       2f 83 00 00     cmpwi   cr7,r3,0
c00bb72c:       40 9e 00 30     bne-    cr7,c00bb75c <d_lookup+0x78>
 *
 * If sequence value changed then writer changed data while in section.
 */
static __always_inline int read_seqretry(const seqlock_t *sl, unsigned 
start)
{
        smp_rmb();
c00bb730:       80 1b 32 a0     lwz     r0,12960(r27)
                        break;
        } while (read_seqretry(&rename_lock, seq));
c00bb734:       7f 80 f0 00     cmpw    cr7,r0,r30
c00bb738:       41 9e 00 24     beq-    cr7,c00bb75c <d_lookup+0x78>
/* Start of read calculation -- fetch last complete writer token ...
From: Mathieu Desnoyers
Date: Tuesday, August 19, 2008 - 7:42 am

Ah !

I think I see what could be wrong :

First we have :

static unsigned int ftrace_nop = 0x60000000;

We probably replace the original function call by this nop.

Then we do :

notrace unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr)
{
        static unsigned int op;

        /*
         * It would be nice to just use create_function_call, but that will
         * update the code itself. Here we need to just return the
         * instruction that is going to be modified, without modifying the
         * code.
         */
        addr = GET_ADDR(addr);

        /* Set to "bl addr" */
        op = 0x48000001 | (ftrace_calc_offset(ip, addr) & 0x03fffffc);

        /*
         * No locking needed, this must be called via kstop_machine
         * which in essence is like running on a uniprocessor machine.
         */
        return (unsigned char *)&op;
}

And I guess we must be doing a 0x48000001 | 0x0; or something ?

Also, we have to consider that POWERPC 64 functions are :

/* PowerPC64's functions are data that points to the functions */

And this does not seem to hold for ppc32. Therefore, it is strange to me
that the same code is used for the update.. are we updating the correct
site ?


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Tuesday, August 19, 2008 - 1:15 pm

Do you have PREEMPT_TRACER enabled, or any other tracer for that matter?

--

From: Eran Liberty
Date: Wednesday, August 20, 2008 - 4:18 am

I can see stack trace & context trace, but they are derived from other 
choices (I can not un select them)

cat .config | grep TRACE
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_STACKTRACE=y
# CONFIG_BACKTRACE_SELF_TEST is not set
CONFIG_HAVE_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_FTRACE=y
# CONFIG_SCHED_TRACER is not set
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_SELFTEST=y

--

From: Steven Rostedt
Date: Wednesday, August 20, 2008 - 6:12 am

You must not have PREEMPT on, so the PREEMPT_TRACER will not show up.

The reason I asked, is that my PowerBook runs fine without PREEMPT_TRACER 
but is very unstable when I have PREEMPT_TRACER enabled.

--

From: Benjamin Herrenschmidt
Date: Monday, August 18, 2008 - 6:51 pm

I spent some time last week tracking one of those crashes and it appears
that we are getting corruption of some of the non-volatile registers.

So far, I found out that it -seems- to be coming from stack frame
corruption during a timer interrupt.

I haven't had a chance to dig further yet.

Cheers,
Ben.


--

Previous thread: RT-Scheduler/cgroups: Possible overuse of resources assigned via cpu.rt_period_us and cpu.rt_runtime_us by Daniel K. on Wednesday, June 18, 2008 - 7:12 am. (5 messages)

Next thread: [PATCH] add kernel-doc for simple_read_from_buffer and memory_read_from_buffer by Akinobu Mita on Wednesday, June 18, 2008 - 8:04 am. (1 message)