Re: 2.6.25-mm1: Failing to probe IDE interface

To: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>
Cc: <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Thursday, April 17, 2008 - 7:03 pm

I repulled all the trees an hour or two ago, installed everything on an
8-way x86_64 box and:


stack-protector:

Testing -fstack-protector-all feature
No -fstack-protector-stack-frame!
-fstack-protector-all test failed
------------[ cut here ]------------
WARNING: at kernel/panic.c:369 __stack_chk_test+0x4b/0x51()
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.25-mm1 #4

Call Trace:
 [<ffffffff80256692>] ? print_modules+0x88/0x8f
 [<ffffffff80237b70>] warn_on_slowpath+0x58/0x7f
 [<ffffffff802388fe>] ? printk+0x67/0x69
 [<ffffffff8034ec74>] ? debug_write_lock_after+0x18/0x1f
 [<ffffffff8034ed43>] ? _raw_write_unlock+0x29/0x7b
 [<ffffffff804f0254>] ? _write_unlock+0x9/0xb
 [<ffffffff8023d25e>] ? insert_resource+0xe3/0xea
 [<ffffffff80237be2>] __stack_chk_test+0x4b/0x51
 [<ffffffff8092f912>] kernel_init+0x16c/0x29e
 [<ffffffff8020ce58>] child_rip+0xa/0x12
 [<ffffffff8092f7a6>] ? kernel_init+0x0/0x29e
 [<ffffffff8020ce4e>] ? child_rip+0x0/0x12

---[ end trace da2bc9ee81defeda ]---


usb/sysfs:

ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 17 (level, low) -> IRQ 17
uhci_hcd 0000:00:1d.0: UHCI Host Controller
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 1
uhci_hcd 0000:00:1d.0: irq 17, io base 0x00002080
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
sysfs: duplicate filename '189:0' can not be created
------------[ cut here ]------------
WARNING: at fs/sysfs/dir.c:425 sysfs_add_one+0x42/0x7c()
Modules linked in: uhci_hcd(+)
Pid: 600, comm: insmod Tainted: G        W 2.6.25-mm1 #4

Call Trace:
 [<ffffffff80256692>] ? print_modules+0x88/0x8f
 [<ffffffff80237b70>] warn_on_slowpath+0x58/0x7f
 [<ffffffff802388fe>] ? printk+0x67/0x69
 [<ffffffff804f0249>] ? _spin_unlock+0x9/0xb
 [<ffffffff802a932f>] ? ifind+0x72/0x82
 [<ffffffff802e0c49>] ? sysfs_ilookup_test+0x0/0x14
 [...
To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, <linux-mm@...>, <bzolnier@...>
Date: Monday, April 28, 2008 - 12:42 pm

An old T21 is failing to boot and the relevant message appears to be

[    1.929536] Probing IDE interface ide0...
[   36.939317] ide0: Wait for ready failed before probe !
[   37.502676] ide0: DISABLED, NO IRQ
[   37.506356] ide0: failed to initialize IDE interface

The owner of ide-mm-ide-add-struct-ide_io_ports-take-2.patch with the
"DISABLED, NO IRQ" message is cc'd. I've attached the config, full boot log
and lspci -v for the machine in question. I'll start reverting some of
these patches to see if ide-mm-ide-add-struct-ide_io_ports-take-2.patch
is really the culprit.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
To: Mel Gorman <mel@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Monday, April 28, 2008 - 2:44 pm

Hi,


Please try reverting ide-fix-hwif-s-initialization.patch first - it has
already been dropped from IDE tree because people were reporting problems
similar to the one encountered by you.

Thanks,
Bart
--
To: Bartlomiej Zolnierkiewicz <bzolnier@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Tuesday, April 29, 2008 - 5:43 am

Thanks.

I reverted this patch and ide-mm-ide-make-ide_hwifs-static.patch (for compile
breakage reasons). It's better but still fails to find the IDE device.
What is better is that it finds ide0 at:

ide0 at 0x1f0-0x1f7,0x3f6 on irq 14

but does not identify any of the disks nor does it find ide1. For
comparison, a "good" dmesg looks like

[    1.793244] Probing IDE interface ide0...
[    2.235292] hda: IBM-DJSA-220, ATA DISK drive
[    2.915457] Probing IDE interface ide1...
[    3.787516] hdc: CRN-8241U, ATAPI CD/DVD-ROM drive
[    4.475650] ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
[    4.478096] ide1 at 0x170-0x177,0x376 on irq 15
[    4.484547] hda: max request size: 128KiB
[    4.522696] hda: 39070080 sectors (20003 MB) w/1874KiB Cache, CHS=41344/15/63
[    4.530706] hda: cache flushes not supported
[    4.538724]  hda: hda1 hda2 hda3 hda4
[    4.569606] hdc: ATAPI 24X CD-ROM drive, 128kB Cache
[    4.587678] Uniform CD-ROM driver Revision: 3.20
[    4.595690] Driver 'sd' needs updating - please use bus_type methods


Here is the bootlog with the two patches reverted.

root            (hd0,0)
 Filesystem type is ext2fs, partition type 0x83
kernel          /boot/vmlinuz-2.6.25-mm1 root=/dev/hda1 mminit_loglevel=4 loglevel=9 console=tty0 console=ttyS0,9600 ro earlyprintk=serial,ttyS0,9600 kernelcore=384MB movablecore=384MB profile=sleep,2 resume=/dev/hda2
   [Linux-bzImage, setup=0x2c00, size=0x1d9390]
savedefault
boot
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Linux version 2.6.25-mm1 (mel@arnold) (gcc version 4.2.3 (Debian 4.2.3-3)) #1 SMP Tue Apr 29 10:04:35 IST 2008
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
[    0.000000]  BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 000000001fff0000 (usable)
[    0.000000]  BIOS-e820: 000000001fff0000 - ...
To: Bartlomiej Zolnierkiewicz <bzolnier@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Tuesday, April 29, 2008 - 11:49 am

Interestingly, bisection firmly blames this patch and QEMU boots with the two
patches reverted but fails with them applied so that patch does cause problems.
The failure on the laptop must be depending on some follow-on patch. I tried
a hatchet-job revert of the IDE patches between IDE-START and IDE-END in
the series file and it similarly fails to probe the IDE devices. So either
I made a mess of the reverts (strong possibility) or there is more than one
problem patch.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
To: Bartlomiej Zolnierkiewicz <bzolnier@...>, <ink@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <gregkh@...>
Date: Tuesday, April 29, 2008 - 12:58 pm

The third patch that needed reverting was
gregkh-pci-pci-clean-up-resource-alignment-management.patch (owners added
to cc). The relevant hint in a diff between a broken and a working bootlog was:

 system 00:09: ioport range 0x15e0-0x15ef has been reserved
+ PCI: bogus alignment of resource 7 [100:1ff] (flags 100) of 0000:00:02.0
+ PCI: bogus alignment of resource 8 [100:1ff] (flags 100) of 0000:00:02.0
+ PCI: bogus alignment of resource 9 [4000000:7ffffff] (flags 1200) of 0000:00:02.0
+ PCI: bogus alignment of resource 10 [4000000:7ffffff] (flags 200) of 0000:00:02.0
+ PCI: bogus alignment of resource 7 [100:1ff] (flags 100) of 0000:00:02.1
+ PCI: bogus alignment of resource 8 [100:1ff] (flags 100) of 0000:00:02.1
+ PCI: bogus alignment of resource 9 [4000000:7ffffff] (flags 1200) of 0000:00:02.1
+ PCI: bogus alignment of resource 10 [4000000:7ffffff] (flags 200) of 0000:00:02.1

With the resource alignment patch and the two IDE patches reverted, the
laptop is able to boot.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
To: Mel Gorman <mel@...>
Cc: <ink@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <gregkh@...>
Date: Tuesday, April 29, 2008 - 5:37 pm

Thanks for tracking it down.

Hmm, it seems that the above patch was merged a week ago:

commit bda0c0afa7a694bb1459fd023515aca681e4d79a
Merge: 904e0ab... af40b48...
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Mon Apr 21 15:58:35 2008 -0700

    Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/pci-2.6
...
      PCI: clean up resource alignment management
...

but it could be that the issue has been already fixed in git tree
(could you verify it please?).

BTW according to lspci output you should be able to use piix driver
instead of ide_generic on this laptop.

Thanks,
Bart
--
To: Bartlomiej Zolnierkiewicz <bzolnier@...>
Cc: <ink@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <gregkh@...>
Date: Wednesday, April 30, 2008 - 7:16 am

I know but the config is a bit minimal for faster building as it's only
intended for sniff-testing patches.

Thanks for the help.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
To: Mel Gorman <mel@...>
Cc: <linux-kernel@...>, <linux-mm@...>, <bzolnier@...>
Date: Monday, April 28, 2008 - 12:59 pm

ide-mm-ide-add-struct-ide_io_ports-take-2.patch is now in mainline so a
quick confirmation would be to test Linus's tree.

--
To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, <linux-mm@...>, <bzolnier@...>
Date: Tuesday, April 29, 2008 - 5:39 am

2.6.25 and latest git are both booting fine.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
To: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>
Date: Friday, April 18, 2008 - 3:50 am

Another runtime warning on the t61p:


Brought up 2 CPUs
Total of 2 processors activated (9583.80 BogoMIPS).
CPU0 attaching sched-domain:
 domain 0: span 00000000,00000003
  groups: 00000000,00000001 00000000,00000002
  domain 1: span 00000000,00000003
   groups: 00000000,00000003
CPU1 attaching sched-domain:
 domain 0: span 00000000,00000003
  groups: 00000000,00000002 00000000,00000001
  domain 1: span 00000000,00000003
   groups: 00000000,00000003
------------[ cut here ]------------
WARNING: at kernel/lockdep.c:2677 check_flags+0x84/0x11f()
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.25-mm1 #15

Call Trace:
 [<ffffffff8105f7ec>] ? print_modules+0x88/0x8f
 [<ffffffff81037b55>] warn_on_slowpath+0x58/0x7f
 [<ffffffff81056143>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff810560b7>] ? trace_hardirqs_off_caller+0x1d/0x9c
 [<ffffffff81056143>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff810560b7>] ? trace_hardirqs_off_caller+0x1d/0x9c
 [<ffffffff81056143>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff81058576>] ? __lock_acquire+0x809/0x893
 [<ffffffff810560b7>] ? trace_hardirqs_off_caller+0x1d/0x9c
 [<ffffffff81056143>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff812b94d1>] ? __atomic_notifier_call_chain+0x0/0x81
 [<ffffffff8105627e>] check_flags+0x84/0x11f
 [<ffffffff81058914>] lock_acquire+0x54/0xb4
 [<ffffffff812b9515>] __atomic_notifier_call_chain+0x44/0x81
 [<ffffffff8100a2c2>] ? mwait_idle+0x0/0x49
 [<ffffffff812b9561>] atomic_notifier_call_chain+0xf/0x11
 [<ffffffff8100a228>] __exit_idle+0x27/0x29
 [<ffffffff8100b33c>] cpu_idle+0xdf/0xf7
 [<ffffffff812b10da>] start_secondary+0xb2/0xb4

---[ end trace 93d72a36b9146f22 ]---
possible reason: unannotated irqs-on.
irq event stamp: 34
hardirqs last  enabled at (33): [<ffffffff812b63f0>] trace_hardirqs_on_thunk+0x3a/0x3f
hardirqs last disabled at (34): [<ffffffff81056143>] trace_hardirqs_off+0xd/0xf
...
To: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>, Peter Zijlstra <a.p.zijlstra@...>
Cc: <linux-pm@...>, <linux-usb@...>, Greg KH <greg@...>, Rafael J. Wysocki <rjw@...>, Pavel Machek <pavel@...>
Date: Friday, April 18, 2008 - 3:53 am

oop, there's more:


sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
firewire_core: created device fw0: GUID 00016c2000174bad, S400
PM: Device usb4 failed to restore: error -113
eth0: Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
eth0: 10/100 speed: disabling TSO
PM: Device usb5 failed to restore: error -113
PM: Device usb7 failed to restore: error -113
sd 0:0:0:0: [sda] Starting disk
PM: Image restored successfully.
Restarting tasks ... done.
PM: Basic memory bitmaps freed

Those USB restore failures are new.  They're similar to the ones on the
doesnt-resume-properly-any-more Vaio.  They came out from the machine's
second (successful) resume-from-disk.
--
To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-pm@...>, Greg KH <greg@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, April 18, 2008 - 7:07 am

I got USB messages after s2ram + suspend to disk combination, too, but
machine seems to work.

ata1.00: ACPI cmd ef/10:03:00:00:00:a0 succeeded
ata1.00: configured for UDMA/100
ata1.00: configured for UDMA/100
ata1: EH complete
sd 0:0:0:0: [sda] 117210240 512-byte hardware sectors (60012 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 117210240 512-byte hardware sectors (60012 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
PM: Device usb2 failed to restore: error -113
PM: Device usb3 failed to restore: error -113
PM: Device usb4 failed to restore: error -113
PM: Image restored successfully.
Restarting tasks ... done.
PM: Basic memory bitmaps freed
wlan0: RX disassociation from 00:11:2f:0e:95:a0 (reason=7)
wlan0: disassociated

(Apart from some wireless problems, solved by reconnecting...)

(And ipw3945 LED indication now seems to work, good!)
									Pavel 

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-pm@...>, Greg KH <greg@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, April 18, 2008 - 5:42 am

Try rmmod usb / insmod usb around suspend to see if it is
usb-specific, or if something went seriously wrong in core.

Or you might just bisect it ;-).
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To: Pavel Machek <pavel@...>
Cc: Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-pm@...>, Greg KH <greg@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, April 18, 2008 - 11:22 am

There's no need to worry about them.  They merely indicate that the 
root hubs didn't resume along with everything else, because they were 
already suspended when the system went to sleep and so they were left 
suspended.  The return codes in usbcore will be changed soon so that 
this won't appear to be an error.

Alan Stern

--
To: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-pm@...>, Greg KH <greg@...>, Rafael J. Wysocki <rjw@...>, Pavel Machek <pavel@...>
Date: Friday, April 18, 2008 - 3:57 am

I found another machine!  This one's an old 4-way Nocona (x86_64)

http://userweb.kernel.org/~akpm/config-x.txt
http://userweb.kernel.org/~akpm/dmesg-x.txt



CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
CPU0: Thermal monitoring enabled (TM1)
ACPI: Core revision 20080321
Parsing all Control Methods:
Table [DSDT](id 0001) - 461 Objects with 50 Devices 130 Methods 11 Regions
 tbxface-0598 [00] tb_load_namespace     : ACPI Tables successfully acquired
evxfevnt-0091 [00] enable                : Transition to ACPI mode successful
------------[ cut here ]------------
WARNING: at arch/x86/kernel/genapic_64.c:86 read_apic_id+0x31/0x67()
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.25-mm1 #16

Call Trace:
 [<ffffffff8025272f>] ? print_modules+0x88/0x8f
 [<ffffffff80233493>] warn_on_slowpath+0x58/0x81
 [<ffffffff80351ceb>] ? debug_spin_lock_after+0x18/0x1f
 [<ffffffff8035217a>] ? _raw_spin_lock+0x116/0x120
 [<ffffffff80228398>] ? sub_preempt_count+0x6d/0x74
 [<ffffffff804e9ba3>] ? _spin_unlock_irqrestore+0x33/0x40
 [<ffffffff803523e6>] ? debug_smp_processor_id+0x32/0xc4
 [<ffffffff8021ede5>] read_apic_id+0x31/0x67
 [<ffffffff8066f7f2>] verify_local_APIC+0xa7/0x163
 [<ffffffff8066e837>] native_smp_prepare_cpus+0x1ed/0x301
 [<ffffffff80669ab2>] kernel_init+0x5a/0x276
 [<ffffffff804e9a1e>] ? _spin_unlock_irq+0x2a/0x35
 [<ffffffff8022b7c2>] ? finish_task_switch+0x68/0x7f
 [<ffffffff8020c1d8>] child_rip+0xa/0x12
 [<ffffffff80669a58>] ? kernel_init+0x0/0x276
 [<ffffffff8020c1ce>] ? child_rip+0x0/0x12

---[ end trace 4eaa2a86a8e2da22 ]---
------------[ cut here ]------------
WARNING: at arch/x86/kernel/genapic_64.c:86 read_apic_id+0x31/0x67()
Modules linked in:
Pid: 1, comm: swapper Tainted: G        W 2.6.25-mm1 #16

Call Trace:
 [<ffffffff8025272f>] ? print_modules+0x88/0x8f
 [<ffffffff80233493>] warn_on_slowpath+0x58/0x81
 [<ffffffff80351ceb>] ...
To: Andrew Morton <akpm@...>
Cc: Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-pm@...>, Greg KH <greg@...>, Rafael J. Wysocki <rjw@...>, Pavel Machek <pavel@...>, Jack Steiner <steiner@...>, Mike Travis <travis@...>, Alan Mayer <ajm@...>
Date: Friday, April 18, 2008 - 5:22 am

that came in via the UV-APIC patchset but the warning is entirely 
harmless. At that point we've got a single CPU running only so 
preemption of that code to another CPU is not possible.

native_smp_prepare_cpus() should probably just disable preemption, that 
way we could remove all those ugly preempt disable-enable calls from the 
called functions - per the patch below. (not boot tested yet - might 
provoke atomic-scheduling warnings if i forgot about some schedule point 
in this rather large codepath)

	Ingo

------------------->
Subject: x86: disable preemption in native_smp_prepare_cpus
From: Ingo Molnar <mingo@elte.hu>
Date: Fri Apr 18 11:07:10 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/smpboot.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-x86.q/arch/x86/kernel/smpboot.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/smpboot.c
+++ linux-x86.q/arch/x86/kernel/smpboot.c
@@ -1181,6 +1181,7 @@ static void __init smp_cpu_index_default
  */
 void __init native_smp_prepare_cpus(unsigned int max_cpus)
 {
+	preempt_disable();
 	nmi_watchdog_default();
 	smp_cpu_index_default();
 	current_cpu_data = boot_cpu_data;
@@ -1237,6 +1238,7 @@ void __init native_smp_prepare_cpus(unsi
 	printk(KERN_INFO "CPU%d: ", 0);
 	print_cpu_info(&cpu_data(0));
 	setup_boot_clock();
+	preempt_enable();
 }
 /*
  * Early setup to make printk work.
--
To: Andrew Morton <akpm@...>
Cc: Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-pm@...>, Greg KH <greg@...>, Rafael J. Wysocki <rjw@...>, Pavel Machek <pavel@...>, Jack Steiner <steiner@...>, Mike Travis <travis@...>, Alan Mayer <ajm@...>
Date: Friday, April 18, 2008 - 8:18 am

that should be the patch below.

	Ingo

------------>
Subject: x86: disable preemption in native_smp_prepare_cpus
From: Ingo Molnar <mingo@elte.hu>
Date: Fri Apr 18 11:07:10 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/smpboot.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-x86.q/arch/x86/kernel/smpboot.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/smpboot.c
+++ linux-x86.q/arch/x86/kernel/smpboot.c
@@ -1181,6 +1181,7 @@ static void __init smp_cpu_index_default
  */
 void __init native_smp_prepare_cpus(unsigned int max_cpus)
 {
+	preempt_disable();
 	nmi_watchdog_default();
 	smp_cpu_index_default();
 	current_cpu_data = boot_cpu_data;
@@ -1197,7 +1198,7 @@ void __init native_smp_prepare_cpus(unsi
 	if (smp_sanity_check(max_cpus) < 0) {
 		printk(KERN_INFO "SMP disabled\n");
 		disable_smp();
-		return;
+		goto out;
 	}
 
 	preempt_disable();
@@ -1237,6 +1238,8 @@ void __init native_smp_prepare_cpus(unsi
 	printk(KERN_INFO "CPU%d: ", 0);
 	print_cpu_info(&cpu_data(0));
 	setup_boot_clock();
+out:
+	preempt_enable();
 }
 /*
  * Early setup to make printk work.
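
Comparing the two versions of the patch, the second one turns the early `return` on the `smp_sanity_check()` failure path into `goto out`, so that the new `preempt_disable()` at the top is always paired with a `preempt_enable()`. A minimal user-space sketch of why that matters follows. It is not kernel code: `fake_preempt_count`, `fake_preempt_disable()`, `fake_preempt_enable()`, `prepare_cpus_v1()`, `prepare_cpus_v2()` and `imbalance_after()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's preempt counter. */
static int fake_preempt_count;

static void fake_preempt_disable(void)
{
	fake_preempt_count++;
}

static void fake_preempt_enable(void)
{
	assert(fake_preempt_count > 0);	/* an enable must pair with a disable */
	fake_preempt_count--;
}

/* Shape of the first patch: the early return skips the enable. */
static void prepare_cpus_v1(bool sanity_ok)
{
	fake_preempt_disable();
	if (!sanity_ok)
		return;			/* leaves the counter at +1 */
	/* ... remaining setup would go here ... */
	fake_preempt_enable();
}

/* Shape of the second patch: every exit goes through the "out" label. */
static void prepare_cpus_v2(bool sanity_ok)
{
	fake_preempt_disable();
	if (!sanity_ok)
		goto out;
	/* ... remaining setup would go here ... */
out:
	fake_preempt_enable();
}

/* Returns the counter imbalance left behind by a single call. */
static int imbalance_after(void (*fn)(bool), bool sanity_ok)
{
	fake_preempt_count = 0;
	fn(sanity_ok);
	return fake_preempt_count;
}
```

Calling `imbalance_after(prepare_cpus_v1, false)` reports a leftover count, while the `goto out` variant returns to zero on both paths.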
--
To: Andrew Morton <akpm@...>
Cc: Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>, Arjan van de Ven <arjan@...>
Date: Friday, April 18, 2008 - 3:09 am

that's the stackprotector self-test: you probably have a gcc that cannot 
build a proper stackprotector kernel. No damage other than having no 
stackprotector. Arjan Cc:-ed.

	Ingo
--
To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Friday, April 18, 2008 - 2:40 am

On Fri, Apr 18, 2008 at 2:03 AM, Andrew Morton

Andrew, you don't seem to have slab debugging enabled:

# CONFIG_DEBUG_SLAB is not set

And quite frankly, the oops looks unlikely to be a slab bug but rather
a plain old slab corruption caused by the callers...

                                    Pekka
--
To: Pekka Enberg <penberg@...>
Cc: Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Friday, April 18, 2008 - 3:24 am

hm, there's sel_netnode_free() in the stackframe - that's from 
security/selinux/netnode.c. Andrew, any recent changes in that area?

	Ingo
--
To: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Cc: Pekka Enberg <penberg@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, Stephen Smalley <sds@...>, Paul Moore <paul.moore@...>
Date: Friday, April 18, 2008 - 6:32 am

I've reverted the -mm only change to that file in 

git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6.git#for-akpm


commit f777964ad75cf4a119d911d12e81948d2402677f
Author: James Morris <jmorris@namei.org>
Date:   Fri Apr 18 20:27:24 2008 +1000

    Revert "SELinux: Made netnode cache adds faster"
    
    This reverts commit 6bf8f41d4efdf9d4eeb4f7df9c591e281f7da93e.
    
    Possible cause of slab corruption in -mm.



-- 
James Morris
<jmorris@namei.org>
--
To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Friday, April 18, 2008 - 3:25 am

Keep in mind that slab might have been corrupted by someone else much 
earlier but we didn't notice due to the lack of CONFIG_DEBUG_SLAB.
--
To: Pekka Enberg <penberg@...>
Cc: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Friday, April 18, 2008 - 2:56 am

Yes, I'd agree.  All has been peachy since I dropped git-selinux.
--
To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Friday, April 18, 2008 - 1:49 am

On Thu, 17 Apr 2008 16:03:31 -0700

do you have a stack-protector capable GCC? I guess not.

This is a catch-22. You do not have stack-protector. Should we make that 
a silent failure? or do you want to know that you don't have a security
feature you thought you had.... complaining seems to be the right thing to do imo.



-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--
To: Arjan van de Ven <arjan@...>
Cc: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Friday, April 18, 2008 - 2:10 am

A #warning sounds more appropriate.
--
To: Andrew Morton <akpm@...>
Cc: Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Friday, April 18, 2008 - 3:19 am

this warning is telling the user that the security feature that got 
enabled in the .config is completely, 100% not working due to using a
stack-protector-incapable GCC.

it's analogous to a bug in gcc that made SELinux totally 
ineffective in some mitigate-exploit-damage scenarios. No harm done on a 
perfectly bug-free system - but once a bug happens that SELinux should 
have mitigated, the breakage becomes real. Having a prominent warning is 
the _minimum_.

having a build failure would be nice too because this is a build 
environment problem. (not a build warning - warnings can easily be 
missed because on a typical kernel build there's so many false positives 
that get emitted by various other warning mechanisms) Arjan?

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Friday, April 18, 2008 - 3:28 am

Not really.  In the selinux case we don't know that it'll break at compile

Yeah, #error would work too.
--
To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, <sam@...>, <arjan@...>, <linux-kernel@...>
Date: Friday, April 18, 2008 - 9:58 am

On Fri, 18 Apr 2008 00:28:58 -0700

I'm totally fine with that, but I think I need Sam's help on making that happen
the right way; this is going to need makefile fu :(

Sam:
Basically what I need is that if the
scripts/gcc-x86_64-has-stack-protector.sh script fails, the build aborts with
a message/#error that says that the compiler is not capable of supporting this feature.

Right now the script is used like this:

	stackp := $(CONFIG_SHELL) $(srctree)/scripts/gcc-x86_64-has-stack-protector.sh
        stackp-$(CONFIG_CC_STACKPROTECTOR) := $(shell $(stackp) \
                "$(CC)" -fstack-protector )

It's obviously easy to make this script print a warning.. but how do we make it stop the build?

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--
To: Arjan van de Ven <arjan@...>
Cc: Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, <sam@...>, <linux-kernel@...>
Date: Friday, April 18, 2008 - 12:57 pm

ok I found a way that works for me:

From: Arjan van de Ven <arjan@linux.intel.com>
Subject: [PATCH] stackprotector: turn not having the right gcc into an #error

If the user selects the stack-protector config option, but does not have
a gcc that has the right bits enabled (for example because it isn't built
with a glibc that supports TLS, as is common for cross-compilers, but also
because it may be too old), then the runtime test fails right now.

Andrew rightfully points out that this is a condition we can detect at
build time, and we should error out at that point instead.

This patch adds an error message for this scenario. This error accomplishes
two goals:
1) the user is informed that the security option he selected isn't available
2) the user has enough info to turn off the CONFIG option that won't work for him,
    and would make the runtime test fail anyway.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
  arch/x86/Makefile |    2 +-
  kernel/panic.c    |    3 +++
  2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 3cff3c8..c3e0eee 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -73,7 +73,7 @@ else

          stackp := $(CONFIG_SHELL) $(srctree)/scripts/gcc-x86_64-has-stack-protector.sh
          stackp-$(CONFIG_CC_STACKPROTECTOR) := $(shell $(stackp) \
-                "$(CC)" -fstack-protector )
+                "$(CC)" "-fstack-protector -DGCC_HAS_SP" )
          stackp-$(CONFIG_CC_STACKPROTECTOR_ALL) += $(shell $(stackp) \
                  "$(CC)" -fstack-protector-all )

diff --git a/kernel/panic.c b/kernel/panic.c
index c92c1e2..7cbcd8e 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -321,6 +321,9 @@ EXPORT_SYMBOL(warn_on_slowpath);

  #ifdef CONFIG_CC_STACKPROTECTOR

+#ifndef GCC_HAS_SP
+#error You have selected the CONFIG_CC_STACKPROTECTOR option, but the gcc used does not support this.
+#endif
  static unsigned long __stack_check_testing;
  /*
   *...
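
The mechanism in this patch appears to be: the Makefile hands `-DGCC_HAS_SP` to the probe script together with `-fstack-protector`, so the define only reaches the compiler flags when the probe succeeds, and `kernel/panic.c` refuses to build without it. A stand-alone sketch of the same guard is below. `WANT_SP` and `HAVE_SP` are hypothetical stand-ins for `CONFIG_CC_STACKPROTECTOR` and `GCC_HAS_SP`, and `stack_protector_confirmed()` is an illustrative helper that is not part of the patch.

```c
#include <assert.h>

/*
 * Hypothetical macros: WANT_SP plays the role of CONFIG_CC_STACKPROTECTOR,
 * HAVE_SP the role of GCC_HAS_SP (only defined when the compiler probe passed).
 */
#if defined(WANT_SP) && !defined(HAVE_SP)
#error WANT_SP is set, but the compiler probe did not define HAVE_SP
#endif

/* Returns 1 when the feature was both requested and confirmed at build time. */
static int stack_protector_confirmed(void)
{
#if defined(WANT_SP) && defined(HAVE_SP)
	return 1;
#else
	return 0;
#endif
}
```

Under these assumptions, building with `-DWANT_SP` alone stops at the `#error`, while `-DWANT_SP -DHAVE_SP` compiles and the helper returns 1. Building with neither flag compiles and the helper returns 0.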
To: Andrew Morton <akpm@...>
Cc: Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Friday, April 18, 2008 - 5:28 am

you noticed it ;-) Distro maintainers will notice it too if it pops up 
when something breaks StackProtector. Normal user might not notice. (but 
normal user might not notice a few hundred guest roots either)

but ... the real thing that made it slip into your config was that it 
was default-enabled in x86/latest - the patch below should fix that.

we need the warning: it could have caught the toplevel Makefile change 
last October that broke StackProtector completely. So no, we won't be and 
cannot be silent about this anymore - we need and now have an end-to-end 
test about it.

	Ingo

------------------>
Subject: stackprotector: non default
From: Ingo Molnar <mingo@elte.hu>
Date: Fri Apr 18 11:13:17 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/Kconfig |    1 -
 1 file changed, 1 deletion(-)

Index: linux-x86.q/arch/x86/Kconfig
===================================================================
--- linux-x86.q.orig/arch/x86/Kconfig
+++ linux-x86.q/arch/x86/Kconfig
@@ -1146,7 +1146,6 @@ config CC_STACKPROTECTOR
 	bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
 	depends on X86_64
 	select CC_STACKPROTECTOR_ALL
-	default y
 	help
           This option turns on the -fstack-protector GCC feature. This
 	  feature puts, at the beginning of functions, a canary value on
--
To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Thursday, April 17, 2008 - 7:55 pm

For what it's worth I just looked over the changes in netnode.c and 
nothing is jumping out at me.  The changes ran fine for me when tested 
on the later 2.6.25-rcX kernels but I suppose that doesn't mean a whole 
lot.

I've got a 4-way x86_64 box but it needs to be installed (which means 
I'm not going to be able to do anything useful with it until tomorrow 
at the earliest).  I'll try it out and see if I can recreate the 
problem.

-- 
paul moore
linux @ hp
--
To: Paul Moore <paul.moore@...>
Cc: <mingo@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Thursday, April 17, 2008 - 9:35 pm

On Thu, 17 Apr 2008 19:55:46 -0400

I dropped git-selinux and that crash seems to have gone away.  It took about
five minutes before, but would presumably have happened earlier if I'd
reduced the cache size.

btw, wouldn't this

--- a/security/selinux/netnode.c~a
+++ a/security/selinux/netnode.c
@@ -190,7 +190,7 @@ static int sel_netnode_insert(struct sel
 	if (sel_netnode_hash[idx].size == SEL_NETNODE_HASH_BKT_LIMIT) {
 		struct sel_netnode *tail;
 		tail = list_entry(node->list.prev, struct sel_netnode, list);
-		list_del_rcu(node->list.prev);
+		list_del_rcu(&tail->list);
 		call_rcu(&tail->rcu, sel_netnode_free);
 	} else
 		sel_netnode_hash[idx].size++;
_

be a bit clearer?  If it's correct - I didn't try too hard :)
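
The two spellings in the patch above should name the same list node, because list_entry() only subtracts the member offset and &tail->list adds it back. A minimal userspace sketch of that identity follows. It assumes simplified stand-ins for the kernel's list helpers, and struct demo_node and same_deletion_target() are hypothetical names that are not part of netnode.c:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Simplified versions of the kernel macros (no type checking). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* Hypothetical, cut-down stand-in for struct sel_netnode. */
struct demo_node {
	int key;
	struct list_head list;
};

/*
 * Returns 1 when the pointer handed to the deletion call is identical
 * under both spellings, i.e. head->prev == &tail->list.
 */
static int same_deletion_target(struct list_head *head)
{
	struct demo_node *tail = list_entry(head->prev, struct demo_node, list);

	return head->prev == &tail->list;
}
```

Calling same_deletion_target() on any node of a well-formed circular list returns 1, which is why the change is a readability improvement rather than a behavioural fix.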
--
To: Andrew Morton <akpm@...>
Cc: <mingo@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Friday, April 18, 2008 - 10:57 am

Looks good to me, although before I fix this let me try and figure out 
why this code is causing the machine to puke all over itself.  
Priorities you know :)

-- 
paul moore
linux @ hp
--
To: Paul Moore <paul.moore@...>
Cc: <mingo@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Thursday, April 17, 2008 - 8:04 pm

On Thu, 17 Apr 2008 19:55:46 -0400

Perhaps it was tested only against slub?  That config uses slab.
--
To: Andrew Morton <akpm@...>
Cc: <mingo@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Friday, April 18, 2008 - 10:55 am

Yes, I believe it was tested with slub.

-- 
paul moore
linux @ hp
--
To: <mingo@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Thursday, April 17, 2008 - 7:40 pm

On Thu, 17 Apr 2008 16:03:31 -0700

With git-selinux at top of tree it's repeatably hanging in the CPA
self-tests (git-x86 stuff).  Last two lines are:

CPA self-test:
 4k 8704 large 4847 gb 0 x 0[0-0] miss 0

(clear as mud ;))

I will find the config knob to disable that test.  Of course, it could be
telling me that CPA is buggy.
--
To: <mingo@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Thursday, April 17, 2008 - 8:14 pm

On Thu, 17 Apr 2008 16:40:34 -0700

Disabling CPA_DEBUG didn't help.  It's still hanging.  The final initcall
is init_kgdbts() and disabling KGDB prevents the hang.

--
To: Andrew Morton <akpm@...>
Cc: <mingo@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Thursday, April 17, 2008 - 11:05 pm

In this case you do not have to disable kgdb, but just disable the
kgdb test suite.  Certainly I would be interested to know where it is
failing as it would indicate that there is a regression that is caused
by a change that occurred somewhere else in the kernel or a latent
defect in kgdb was triggered.  The kgdb test suite exercises a number
of kernel fault systems as well as arch specific single stepping when
it runs and when it fails it is likely worth it to track down which
test failed and why.

If you are looking to bypass the kgdb test suite you have two options.

The kernel option that runs the tests on boot (which is not on by
default) is CONFIG_KGDB_TESTS_ON_BOOT; make sure this is off.

In an already compiled kernel that had boot-time testing turned on,
you can turn off the tests by passing the kgdbts boot parameter with
nothing after the = sign.  Like:

kgdbts=


In terms of debugging what happened, if you have console output you
can save, please do send me the output of kernel boot with the kernel
boot argument:

kgdbts=V2

That enables verbose logging of exactly what is going on and will show
where wheels fall off the cart.  If the kernel is dying silently it
means the early exception code has completely failed in some way on
the kernel architecture that was selected, and of course the .config
is always useful in this case.

Thanks,
Jason.
--
To: Jason Wessel <jason.wessel@...>
Cc: Andrew Morton <akpm@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Friday, April 18, 2008 - 3:37 am

incidentally, just today, in overnight testing i triggered a similar 
hang in the KGDB self-test:

  http://redhat.com/~mingo/misc/config-Thu_Apr_17_23_46_36_CEST_2008.bad

to get a similar tree to the one i tested, pick up sched-devel/latest 
from:

   http://people.redhat.com/mingo/sched-devel.git/README 

pick up that failing .config, do 'make oldconfig' and accept all the 
defaults to get a comparable kernel to mine. (kgdb is embedded in 
sched-devel.git.)

the hang was at:

[   12.504057] Calling initcall 0xffffffff80b800c1: init_kgdbts+0x0/0x1b()
[   12.511298] kgdb: Registered I/O driver kgdbts.
[   12.515062] kgdbts:RUN plant and detach test
[   12.520283] kgdbts:RUN sw breakpoint test
[   12.524651] kgdbts:RUN bad memory access test
[   12.529052] kgdbts:RUN singlestep breakpoint test

full log:

  http://redhat.com/~mingo/misc/log-Thu_Apr_17_23_46_36_CEST_2008.bad

note that this was a 64-bit config too - our tests do a perfect mix of 
50% 32-bit and 50% 64-bit kernels. So single-stepping of the kernel 
broke in some circumstances.

find the boot log below. (it also includes all command line parameters) 

This is the first time ever i saw the self-test in KGDB hanging, so it's 
some recent non-KGDB change that provoked it or made it more likely. The 
KGDB self-test runs very frequently in my bootup tests:

[   12.508236] kgdb: Registered I/O driver kgdbts.
[   12.511245] kgdbts:RUN plant and detach test
[   12.517418] kgdbts:RUN sw breakpoint test
[   12.521056] kgdbts:RUN bad memory access test
[   12.525515] kgdbts:RUN singlestep breakpoint test
[   12.531483] kgdbts:RUN hw breakpoint test
[   12.536142] kgdbts:RUN hw write breakpoint test
[   12.541007] kgdbts:RUN access write breakpoint test
[   12.546223] kgdbts:RUN do_fork for 100 breakpoints

so the latest kgdb-light tree literally survived thousands of such tests 
since it was changed last.

unfortunately, the condition was not reproducible - i booted it once 
more and then it came up just f...
To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Friday, April 18, 2008 - 5:54 pm

So I pulled your tree and I would agree there was a problem.  But it
seems unrelated to kgdb.  I bisected the tree because it worked starting
with the kgdb-light merge. 

It fails once with the patch below, but it is not clear as to why other
than the lock must have something to do with it.

I'll submit a patch to the kgdb test suite to increase the number of
loops through the single step test; as it is, it can definitely catch
things :-)

Jason.


From 84556fe84dd975161e70b782d7d7cc7bd080c06a Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 28 Feb 2008 21:00:21 +0100
Subject: [PATCH 0883/1078] sched: make cpu_clock() globally synchronous

Alexey Zaytsev reported (and bisected) that the introduction of
cpu_clock() in printk made the timestamps jump back and forth.

Make cpu_clock() more reliable while still keeping it fast when it's
called frequently.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   52 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8dcdec6..7377222 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -632,11 +632,39 @@ int sysctl_sched_rt_runtime = 950000;
  */
 #define RUNTIME_INF    ((u64)~0ULL)
 
+static const unsigned long long time_sync_thresh = 100000;
+
+static DEFINE_PER_CPU(unsigned long long, time_offset);
+static DEFINE_PER_CPU(unsigned long long, prev_cpu_time);
+
 /*
- * For kernel-internal use: high-speed (but slightly incorrect) per-cpu
- * clock constructed from sched_clock():
+ * Global lock which we take every now and then to synchronize
+ * the CPUs time. This method is not warp-safe, but it's good
+ * enough to synchronize slowly diverging time sources and thus
+ * it's good enough for tracing:
  */
-unsigned long long cpu_clock(int cpu)
+static DEFINE_SPINLOCK(time_sync_lock);
+static unsigned long long prev_global_time;
+
+static unsigned long long __sync_cpu_cloc...
To: Ingo Molnar <mingo@...>
Cc: Jason Wessel <jason.wessel@...>, Andrew Morton <akpm@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Friday, April 18, 2008 - 7:46 am

With the patch below, it seems 100% reproducible to me (7 out of 7
bootups hung).

The number of loops it completed before hanging was, in order: 697,
898, 237, 55, 45, 92, 59

It seems timing-related, so I'm guessing it could be some interaction
with interrupts?


Vegard


diff --git a/drivers/misc/kgdbts.c b/drivers/misc/kgdbts.c
index 6d6286c..ee87820 100644
--- a/drivers/misc/kgdbts.c
+++ b/drivers/misc/kgdbts.c
@@ -895,7 +895,13 @@ static void kgdbts_run_tests(void)
        v1printk("kgdbts:RUN bad memory access test\n");
        run_bad_read_test();
        v1printk("kgdbts:RUN singlestep breakpoint test\n");
-       run_singlestep_break_test();
+
+       while(1) {
+               static int i = 0;
+
+               run_singlestep_break_test();
+               printk(KERN_EMERG "test #%d successful\n", i++);
+       }

        /* ===Optional tests=== */
--
To: Vegard Nossum <vegard.nossum@...>
Cc: Jason Wessel <jason.wessel@...>, Andrew Morton <akpm@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Friday, April 18, 2008 - 8:34 am

cool! Jason: i think that particular self-test should be repeated 1000 
times before reporting success ;-)

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Jason Wessel <jason.wessel@...>, Andrew Morton <akpm@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Friday, April 18, 2008 - 8:41 am

[Empty message]
To: Vegard Nossum <vegard.nossum@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Friday, April 18, 2008 - 9:02 am

I assume this was SMP?

While I had not tried it yet, my guess would have been this did not
happen on a UP kernel.  If it does occur on a UP kernel it means the
problem is squarely between the task scheduling after the exception is
handled and the kgdb state logic for re-entering the debug state after a
single step exception occurs.

It seems reasonable to go for 1000 iterations of this particular test to
declare success as pointed out by Ingo.  Previous versions of kgdb
handled some of the irq + single step + cpu sync slightly differently
and it is entirely possible there is a regression there.
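
The "repeat it 1000 times before declaring success" idea can be sketched in plain C as a bounded loop that stops at the first failure. This is only an illustration under assumptions: selftest_fn, run_repeated(), struct flaky and flaky_test() are hypothetical names, not kgdbts code. A real hang would never return from the test, which is why the debugging patch earlier in the thread prints a counter after each pass.

```c
#include <assert.h>

/* Hypothetical test signature: returns 0 on success, non-zero on failure. */
typedef int (*selftest_fn)(void *ctx);

/*
 * Run `test` up to `iterations` times, stopping at the first failure.
 * Returns the number of runs that passed; this equals `iterations`
 * only if every run passed.
 */
static int run_repeated(selftest_fn test, void *ctx, int iterations)
{
	int i;

	for (i = 0; i < iterations; i++) {
		if (test(ctx) != 0)
			break;
	}
	return i;
}

/* Demo test that fails on its Nth call, standing in for a timing bug. */
struct flaky {
	int calls;
	int fail_at;
};

static int flaky_test(void *ctx)
{
	struct flaky *f = ctx;

	return ++f->calls == f->fail_at ? -1 : 0;
}
```

With fail_at set to 56, run_repeated(flaky_test, &f, 1000) returns 55, while a single run would have reported success.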

Jason.
--
To: Jason Wessel <jason.wessel@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Friday, April 18, 2008 - 9:22 am

On Fri, Apr 18, 2008 at 3:02 PM, Jason Wessel

Yes. But now that I realize this, I tried running the same kernel with
qemu, using -smp 16, and it seems to be stuck here:

[   16.562659] kgdb: Registered I/O driver kgdbts.
[   16.565875] kgdbts:RUN plant and detach test

and the code is at kgdb_handle_exception():

        /*
         * Wait for the other CPUs to be notified and be waiting for us:
         */
        for_each_online_cpu(i) {
                while (!atomic_read(&cpu_in_kgdb[i]))
                        cpu_relax();


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--
To: Vegard Nossum <vegard.nossum@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Friday, April 18, 2008 - 9:27 am

Unless you have a qemu with the NMI patches, kgdb does not work on SMP
with qemu.  The very first test is going to fail because the IPI sent by
the kernel is not handled in qemu's hardware emulation.
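
To illustrate why an undelivered IPI turns the wait loop quoted earlier into a hang: a CPU that never receives the notification never sets its flag, so an unbounded spin on that flag cannot finish. The userspace sketch below adds a spin limit purely so the stuck CPU can be reported. The limit, the demo_cpu_in_kgdb array and wait_for_cpus() are assumptions made for this example and are not how kgdb itself behaves.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_NR_CPUS 4

/* Stand-in for the per-CPU "I am in the debugger" flags. */
static atomic_int demo_cpu_in_kgdb[DEMO_NR_CPUS];

/*
 * Wait for each CPU's flag to become non-zero, giving up on a CPU after
 * max_spins polls. Returns -1 if every CPU checked in, otherwise the
 * index of the first CPU that did not.
 */
static int wait_for_cpus(int nr_cpus, long max_spins)
{
	for (int i = 0; i < nr_cpus; i++) {
		long spins = 0;

		while (!atomic_load(&demo_cpu_in_kgdb[i])) {
			if (++spins >= max_spins)
				return i;
		}
	}
	return -1;
}
```

If CPU 2 never sets its flag, wait_for_cpus(DEMO_NR_CPUS, 1000) returns 2 instead of spinning forever, which is the information the printk() bracketing later in the thread recovers by hand.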

Jason.
--
To: Jason Wessel <jason.wessel@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Friday, April 18, 2008 - 10:47 am

Oops, no, and that makes sense.

I now picked up qemu 0.9.1 and applied the three NMI/SMI patches by Jan Kiszka.

So in qemu it seems to run fine now, except that I need to prod it
sometimes (it gets stuck in cpu_clock() and I have to break/continue
from gdb to make it proceed). Oh, there it made it to 1056, and gdb
can't interrupt anymore. Hmm. This is probably not a very good

But booting with nosmp on real hardware gets easily above 100,000
iterations of the loop (before I reboot), so it seems to be related to
that, anyway.

Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--
To: Jason Wessel <jason.wessel@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Friday, April 18, 2008 - 12:02 pm

It gets stuck in kgdb_roundup_cpus(), verified by putting a printk()
before and after this call (in kgdb_handle_exception()). Simple, but
effective :-)


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--
To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Thursday, April 17, 2008 - 7:24 pm

Interesting, that's the new major:minor code.  I'll go poke at it...

thanks,

greg k-h
--
To: Greg KH <greg@...>
Cc: Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Thursday, April 17, 2008 - 8:48 pm

Is this with the deprecated CONFIG_USB_DEVICE_CLASS=y? They have the
same dev_t as usb_device and would be a reason for the duplicates.
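
A small sketch of why a shared dev_t would produce the duplicate: if the entry name is derived only from the device number, two devices with the same major and minor map to the same "<major>:<minor>" string, such as the '189:0' in the warning. dev_link_name() and names_collide() are hypothetical helpers for illustration, and the "%u:%u" format is assumed from the message text rather than taken from the sysfs patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a "<major>:<minor>" entry name. */
static int dev_link_name(unsigned int major, unsigned int minor,
			 char *buf, size_t len)
{
	return snprintf(buf, len, "%u:%u", major, minor);
}

/* Returns 1 if two device numbers would yield the same entry name. */
static int names_collide(unsigned int major_a, unsigned int minor_a,
			 unsigned int major_b, unsigned int minor_b)
{
	char a[32], b[32];

	dev_link_name(major_a, minor_a, a, sizeof(a));
	dev_link_name(major_b, minor_b, b, sizeof(b));
	return strcmp(a, b) == 0;
}
```

names_collide(189, 0, 189, 0) returns 1, so registering both devices under one directory would attempt to create the same filename twice.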

Thanks,
Kay
--
To: Kay Sievers <kay.sievers@...>
Cc: Greg KH <greg@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>, Alexey Dobriyan <adobriyan@...>
Date: Friday, April 18, 2008 - 12:07 am

The mac g5 is warning us about stuff too:

io scheduler deadline registered
io scheduler cfq registered
io scheduler bfq registered
proc_dir_entry '00' already registered
Call Trace:
[c00000017a0fbb80] [c000000000012018] .show_stack+0x58/0x1dc (unreliable)
[c00000017a0fbc30] [c00000000013f68c] .proc_register+0x218/0x260
[c00000017a0fbce0] [c00000000013fab8] .proc_mkdir_mode+0x40/0x74
[c00000017a0fbd60] [c0000000001f49a8] .pci_proc_attach_device+0x90/0x134
[c00000017a0fbe00] [c0000000005f0084] .pci_proc_init+0x68/0xa0
[c00000017a0fbe80] [c0000000005cbc94] .kernel_init+0x1ec/0x430
[c00000017a0fbf90] [c000000000026fc0] .kernel_thread+0x4c/0x68
nvidiafb: Device ID: 10de0141 
nvidiafb: CRTC0 analog not found

http://userweb.kernel.org/~akpm/config-g5.txt
http://userweb.kernel.org/~akpm/dmesg-g5.txt
--
To: Kay Sievers <kay.sievers@...>
Cc: <greg@...>, <mingo@...>, <tglx@...>, <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, <jmorris@...>, <sds@...>
Date: Thursday, April 17, 2008 - 9:12 pm

On Fri, 18 Apr 2008 02:48:19 +0200

--
To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Pekka Enberg <penberg@...>, <linux-usb@...>, <linux-kernel@...>, <linux-mm@...>, James Morris <jmorris@...>, Stephen Smalley <sds@...>
Date: Thursday, April 17, 2008 - 7:24 pm

On Thu, Apr 17, 2008 at 4:03 PM, Andrew Morton

The duplicate filename <major>:<minor> messages are coming from
"sysfs-add-sys-dev-char-block-to-lookup-sysfs-path-by-major-minor.patch"
now in Greg's tree.  I'll take a look.

--
Dan
--