ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.25-rc3/2.6.25-rc3-mm1/

Boilerplate:

- See the `hot-fixes' directory for any important updates to this patchset.

- To fetch an -mm tree using git, use (for example)

  git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git tag v2.6.16-rc2-mm1
  git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1

- -mm kernel commit activity can be reviewed by subscribing to the
  mm-commits mailing list.

  echo "subscribe mm-commits" | mail majordomo@vger.kernel.org

- If you hit a bug in -mm and it is not obvious which patch caused it, it is
  most valuable if you can perform a bisection search to identify which patch
  introduced the bug. Instructions for this process are at

  http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt

  But beware that this process takes some time (around ten rebuilds and
  reboots), so consider reporting the bug first and if we cannot immediately
  identify the faulty patch, then perform the bisection search.

- When reporting bugs, please try to Cc: the relevant maintainer and mailing
  list on any email.

- When reporting bugs in this kernel via email, please also rewrite the
  email Subject: in some manner to reflect the nature of the bug. Some
  developers filter by Subject: when looking for messages to read.

- Occasional snapshots of the -mm lineup are uploaded to
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/ and are announced on
  the mm-commits list. These probably are at least compilable.

- More-than-daily -mm snapshots may be found at
  http://userweb.kernel.org/~akpm/mmotm/. These are almost certainly not
  compilable.

Changes since 2.6.25-rc2-mm1:

origin.patch
git-x86.patch
git-acpi.patch
git-alsa.patch
git-avr32.patch
git-cifs.patch
git-cpufreq.patch
git-powerpc.patch
git-drm.patch
git-dvb.patch
git-hwmon.patch
git-gfs2-nmw.patch
git-dlm.patch
git-hid.patch
...
On Tue, 4 Mar 2008 01:19:28 -0800, This should go into 2.6.25, as it fixes a panic (see http://marc.info/?l=linux-kernel&m=120411157302447&w=2, http://marc.info/?l=linux-kernel&m=120412001416810&w=2). --
Will add it to that queue to send to Linus in a bit, thanks for poking me. Hint: when sending patches, please at least change the Subject so that I don't accidentally pass it by. It was buried in a longer thread that I missed the first time through. thanks, greg k-h --
Hi Andrew,
The 2.6.25-rc3-mm1 kernel panics during bootup on a Power box. The machine booted
without the panic on the third attempt, but badness call traces were seen while running
tests.
1) The kernel panic on first attempt
Unable to handle kernel paging request for data at address 0x00000000
Faulting instruction address: 0xc00000000000cb2c
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=128 NUMA pSeries
Modules linked in:
NIP: c00000000000cb2c LR: c00000000000caf8 CTR: 0000000000000226
REGS: c00000000068f360 TRAP: 0300 Not tainted (2.6.25-rc3-mm1-autotest)
MSR: 8000000000001032 <ME,IR,DR> CR: 28000024 XER: 20000001
DAR: 0000000000000000, DSISR: 0000000040000000
TASK = c0000000005c8590[0] 'swapper' THREAD: c00000000068c000 CPU: 0
GPR00: c00000000068f5e0 c00000000068f5e0 c00000000068e690 0000000000000000
GPR04: 00000000000035e0 000000000087264e c000000008011280 c000000000594000
GPR08: c0000000005c9300 0000000000000000 c000000000591090 c00000000068c000
GPR12: 8000000000009032 c0000000005c9300 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000008000 0000000000000000
GPR20: 0000000000000000 0000000000000000 000000000000007f 0000000000018000
GPR24: 0000000000000001 0000000000000080 0000000000000018 0000000000000000
GPR28: 0000000000000c00 c000000000588988 c000000000639be8 c000000008001c00
NIP [c00000000000cb2c] .do_IRQ+0x74/0x1c4
LR [c00000000000caf8] .do_IRQ+0x40/0x1c4
Call Trace:
[c00000000068f5e0] [c00000000000caf8] .do_IRQ+0x40/0x1c4 (unreliable)
[c00000000068f680] [c000000000004790] hardware_interrupt_entry+0x18/0x1c
--- Exception: 501 at .memset+0x70/0xfc
LR = .__alloc_bootmem_core+0x39c/0x3dc
[c00000000068f970] [c00000000068fa10] init_thread_union+0x3a10/0x4000 (unreliable)
[c00000000068fa30] [c00000000057237c] .__alloc_bootmem_node+0x38/0x8c
[c00000000068fad0] [c0000000003c477c] .zone_wait_table_init+0x74/0x108
[c00000000068fb60] [c0000000003d9058] .init_currently_empty_zone+0x40/0x11c
[c00000000068fc00] ...

I'm not getting a crash but I am getting this:

start_kernel(): bug: interrupts were enabled *very* early, fixing it

...and you're getting a null pointer access here (in do_IRQ):

irq = ppc_md.get_irq();

Are we somehow enabling interrupts before we've setup ppc_md.get_irq? --
Yes, we are - it's the semaphore rewrite which is doing this in start_kernel(). It's being discussed. Enabling interrupts too early was discovered to be fatal on powerpc years ago. It looks like that remains the case. --
Yes, it is and will probably always be. All that semaphore mucking around that hard-enables interrupts is just asking for trouble (and on more than just powerpc... heh, what do you do if your main interrupt controller hasn't even been initialized yet?) Ben. --
Regarding these issues. I could make it non-fatal and just WARN_ON, provided that I have a way to differentiate legal vs. illegal calls to local_irq_enable(). We already have that function mostly out of line in C code due to our lazy irq disabling scheme, so the overhead of testing some global kernel state would be minimal here.

However, I don't see anything around init/main.c:start_kernel() that I can use. What do you reckon we should do here? Add some kind of global we set before calling local_irq_enable()? Or make early_boot_irqs_on() do that generically? It's currently defined as an empty inline without CONFIG_TRACE_IRQFLAGS but we could make it set a flag instead.

I'm pretty sure other archs have similar problems, especially in the embedded world where you are booted with random junk firmwares that may leave devices, interrupt controllers etc... in random state, and enabling incoming IRQs before the arch code properly initializes the main interrupt controller can be fatal. I know of at least one ARM board I worked on a while ago that had a similar issue.

On ppc32, unfortunately, our local_irq_enable/restore are nice inlines that whack the appropriate MSR bits directly, thus adding a test for a global flag would add some bloat/overhead that I'd like to avoid, at least until we decide to also do lazy disabling on those, if ever...

Cheers, Ben. --
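For readers following along, the "global flag" idea can be sketched in plain userspace C. This is only an illustration of the control flow, not kernel code: irqs_may_be_enabled, mark_irqs_allowed() and checked_irq_enable() are hypothetical names invented here, and the real local_irq_enable() manipulates CPU state rather than a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical global, set once the arch code has initialized the
 * main interrupt controller. Not an existing kernel symbol. */
static bool irqs_may_be_enabled;

/* Stand-in for the CPU's interrupt-enable state. */
static bool irqs_on;

/* Counts rejected early attempts, so the misuse is visible. */
static unsigned int early_enable_warnings;

/* Stand-in for the point in boot where enabling becomes legal.
 * The flag is expected to be set exactly once. */
static void mark_irqs_allowed(void)
{
	assert(!irqs_may_be_enabled);
	irqs_may_be_enabled = true;
}

/* Stand-in for an out-of-line local_irq_enable(): warn and refuse
 * rather than take an interrupt with no controller set up. */
static void checked_irq_enable(void)
{
	if (!irqs_may_be_enabled) {
		early_enable_warnings++;
		fprintf(stderr, "WARNING: interrupts enabled too early\n");
		return;
	}
	irqs_on = true;
}
```

The cost is one load and one branch per enable, which is why the message above considers it acceptable where the function is already out of line and undesirable where it is a tiny inline.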
On Thu, 06 Mar 2008 11:03:31 +1100

I'd have thought that the way to do this would be to add it to lockdep - lockdep already has all the infrastructure and code sites to do this. Set some special flag saying its-ok-to-enable-interrupts-now and test that in lockdep.

akpm:/usr/src/25> grep LOCKDEP arch/powerpc/Kconfig
akpm:/usr/src/25>

losers ;)

Still, doing it for

akpm:/usr/src/25> grep -l LOCKDEP arch/*/Kconfig
arch/arm/Kconfig
arch/avr32/Kconfig
arch/mips/Kconfig
arch/s390/Kconfig
arch/sh/Kconfig
arch/sparc64/Kconfig
arch/um/Kconfig
arch/x86/Kconfig

should give pretty good coverage. --
/* Convert GFP flags to their corresponding migrate type */
static inline int allocflags_to_migratetype(gfp_t gfp_flags)
{
WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
Mel, Pekka: would you have some head-scratching time for this one please?
--
On Tue, 04 Mar 2008 18:42:19 +0530 Kamalesh Babulal
On Tue, Mar 4, 2008 at 8:36 PM, Andrew Morton

Sure. Just to double-check, this is with SLAB, right? Do you see this with SLUB? --
What we have is __getblk() -> __getblk_slow() -> grow_buffers() -> grow_dev_page() doing find_or_create_page() with __GFP_MOVABLE set. That path then eventually does radix_tree_preload -> kmem_cache_alloc() to a cache that has SLAB_RECLAIM_ACCOUNT set which implies __GFP_RECLAIMABLE (for both SLAB and SLUB). So we oops there. I suspect the WARN_ON() is bogus although I really don't know that part of the code all too well. Mel? Pekka --
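The flag collision described above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: the SKETCH_* bit values are made up for illustration (the real definitions live in include/linux/gfp.h), and the migratetype encoding shown is only one plausible reading of allocflags_to_migratetype().

```c
#include <stdbool.h>

/* Illustrative bit values only, not the kernel's. */
#define SKETCH_GFP_RECLAIMABLE 0x1u
#define SKETCH_GFP_MOVABLE     0x2u
#define SKETCH_MOBILITY_MASK   (SKETCH_GFP_RECLAIMABLE | SKETCH_GFP_MOVABLE)

enum {
	SKETCH_MIGRATE_UNMOVABLE,   /* 0: neither hint set */
	SKETCH_MIGRATE_RECLAIMABLE, /* 1: reclaimable hint only */
	SKETCH_MIGRATE_MOVABLE      /* 2: movable hint only */
};

/* True when both mobility hints are set at once, which is the
 * condition the WARN_ON() quoted earlier checks for. */
static bool mobility_flags_conflict(unsigned int flags)
{
	return (flags & SKETCH_MOBILITY_MASK) == SKETCH_MOBILITY_MASK;
}

/* Models the slab layer adding the reclaimable hint for caches
 * created with SLAB_RECLAIM_ACCOUNT. */
static unsigned int slab_alloc_flags(unsigned int caller_flags,
				     bool reclaim_account)
{
	if (reclaim_account)
		return caller_flags | SKETCH_GFP_RECLAIMABLE;
	return caller_flags;
}

/* Maps the two hint bits onto a migrate type for non-conflicting input. */
static int sketch_migratetype(unsigned int flags)
{
	return (((flags & SKETCH_GFP_MOVABLE) != 0) << 1) |
	       ((flags & SKETCH_GFP_RECLAIMABLE) != 0);
}
```

In this model a caller passing only the movable hint is fine on its own; the conflict appears only after the accounted cache adds the second hint, which matches the call chain traced in the message.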
We are taking a HW interrupt ... we aren't supposed to take HW interrupts that early during boot afaik. Is it yet another case of somebody hard-enabling interrupts with local_irq_enable() ? --
i386 allmodconfig gives me this: ERROR: "probe_4drives" [drivers/ide/ide-core.ko] undefined! --- ~Randy --
Hi, It was also reported by Andrew & Stephen but the thing is that it doesn't happen here with IDE tree, also it is quite strange that only probe_4drives causes error and other probe_* variables don't. I think that it is caused by something else in -mm / linux-next... Thanks, Bart --
With

CONFIG_BLK_CPQ_DA=m
CONFIG_BLK_CPQ_CISS_DA=m
# CONFIG_CISS_SCSI_TAPE is not set

I'm getting

In file included from drivers/block/cciss.c:230:
drivers/block/cciss_scsi.c:1498:38: error: macro parameters must be comma-separated
drivers/block/cciss.c: In function 'cciss_seq_show_header':
drivers/block/cciss.c:271: error: implicit declaration of function 'cciss_seq_tape_report'
drivers/block/cciss.c: In function 'cciss_proc_write':
drivers/block/cciss.c:392: error: implicit declaration of function 'cciss_engage_scsi'
make[2]: *** [drivers/block/cciss.o] Error 1
make[1]: *** [drivers/block] Error 2
make[1]: *** Waiting for unfinished jobs....

--- ~Randy --
Randy, It looks like you have the original broken patch. I resubmitted and I think Jens picked up the fixed patch but I don't know where it is... :( -- mikem --
s/you/latest -mm/ I thought that this had been fixed, but I can't find it either... :( Jens, did you queue a patch for this? -- ~Randy --
I did, here: http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=89b6e743788516491846724d7ef89bc... -- Jens Axboe --
use-page_cache_xxx-in-ext2.patch gave me lots of EXT2-fs error (device
loop0): ext2_find_entry: dir 52629 size 5120 exceeds block count 2
so I stopped it quickly. Creating a directory entry was muddling up the
directory and the linked inode, writing directory page out to the latter.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
fs/ext2/dir.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- 2.6.25-rc3-mm1/fs/ext2/dir.c 2008-03-04 11:37:47.000000000 +0000
+++ linux/fs/ext2/dir.c 2008-03-04 18:25:24.000000000 +0000
@@ -472,7 +472,7 @@ void ext2_set_link(struct inode *dir, st
int ext2_add_link (struct dentry *dentry, struct inode *inode)
{
struct inode *dir = dentry->d_parent->d_inode;
- struct address_space *mapping = inode->i_mapping;
+ struct address_space *mapping = dir->i_mapping;
const char *name = dentry->d_name.name;
int namelen = dentry->d_name.len;
unsigned chunk_size = ext2_chunk_size(dir);
--
Hi Andrew,
A kernel BUG is triggered while running the libhugetlbfs tests with the 2.6.25-rc3-mm1 kernel
on both the x86 and Power machines.
------------[ cut here ]------------
kernel BUG at mm/hugetlb.c:295!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/node/possible
Modules linked in:
Pid: 5484, comm: counters Not tainted (2.6.25-rc3-mm1-autokern1 #1)
EIP: 0060:[<c10535cf>] EFLAGS: 00010202 CPU: 0
EIP is at alloc_buddy_huge_page+0x7a/0xb0
EAX: c13acd01 EBX: f7d3a000 ECX: 00000000 EDX: 00006363
ESI: 00000001 EDI: 00000000 EBP: 00000000 ESP: f5539ebc
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process counters (pid: 5484, ti=f5538000 task=f60afa20 task.ti=f5538000)
Stack: 00000001 c1053669 fffffff4 00000001 f5539ecc f5539ecc 00000001 fffffff4
f55d0e78 00000001 c105480c 00000001 00200000 c1054875 00000000 f54426c0
00200000 00000000 f54426c0 c10b0fb8 fffffff4 00200000 00000000 f55d0e78
Call Trace:
[<c1053669>] gather_surplus_pages+0x64/0x16d
[<c105480c>] hugetlb_acct_memory+0x1e/0x4a
[<c1054875>] hugetlb_reserve_pages+0x3d/0x6b
[<c10b0fb8>] hugetlbfs_file_mmap+0x9b/0xe1
[<c104bf9f>] mmap_region+0x1dc/0x3ae
[<c104bd42>] do_mmap_pgoff+0x27f/0x28e
[<c1005af2>] sys_mmap2+0x5a/0x78
[<c10029fa>] syscall_call+0x7/0xb
=======================
Code: c1 e8 ed 27 1c 00 85 db 74 41 83 7b 04 00 75 10 68 c0 93 27 c1 e8 02 92 fc ff 58 e8 c1 02 fb ff f0 ff 4b 04 0f 94 c0 84 c0 74 04 <0f> 0b eb fe c7 43 38 3e 33 05 c1 8b 03 c1 e8 1c ff 04 85 60 ce
EIP: [<c10535cf>] alloc_buddy_huge_page+0x7a/0xb0 SS:ESP 0068:f5539ebc
---[ end trace 5a47484f8fe93a33 ]---
------------[ cut here ]------------
cpu 0x3: Vector: 700 (Program Check) at [c0000000fb277740]
pc: c0000000000c6f54: .alloc_buddy_huge_page+0x120/0x1dc
lr: c0000000000c6f20: .alloc_buddy_huge_page+0xec/0x1dc
sp: c0000000fb2779c0
msr: 8000000000029032
current = 0xc0000000fc4cae90
paca = 0xc0000000004fae80
pid = 6828, comm = counters
kernel BUG at ...

On Wed, 05 Mar 2008 00:50:17 +0530
Please send Adam a copy of that libhugetlbfs test ;)
hugetlb-correct-page-count-for-surplus-huge-pages.patch adds:
if (page) {
/*
* This page is now managed by the hugetlb allocator and has
* no users -- drop the buddy allocator's reference.
*/
int page_count = put_page_testzero(page);
BUG_ON(page_count != 0);
--
Ugh, I got bitten by put_page_testzero(). It returns 1 when the page count has dropped to zero; the return value is not the page count itself. My initial version had a BUG_ON() with side-effects. When a reviewer pointed it out, I thought I could fix the patch up on its way out the door. I have self-administered my punishment. This patch will fix it:

Signed-off-by: Adam Litke <agl@us.ibm.com>
--- mm/hugetlb.c.orig 2008-03-04 13:36:30.000000000 -0800
+++ mm/hugetlb.c 2008-03-04 13:39:30.000000000 -0800
@@ -291,8 +291,8 @@ static struct page *alloc_buddy_huge_pag
 * This page is now managed by the hugetlb allocator and has
 * no users -- drop the buddy allocator's reference.
 */
- int page_count = put_page_testzero(page);
- BUG_ON(page_count != 0);
+ put_page_testzero(page);
+ VM_BUG_ON(page_count(page));
 nid = page_to_nid(page);
 set_compound_page_dtor(page, free_huge_page);
 /*

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center --
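The decrement-and-test contract that caused the confusion can be shown with a small userspace model. This is a sketch, not the kernel implementation: struct sketch_page and both helpers are hypothetical stand-ins built on C11 atomics.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page's reference count. */
struct sketch_page {
	atomic_int count;
};

/* Same contract as described for put_page_testzero(): drop one
 * reference and report whether the count reached zero. The return
 * value is a boolean, not the remaining count. */
static bool sketch_put_testzero(struct sketch_page *p)
{
	/* atomic_fetch_sub returns the value held before the subtraction. */
	return atomic_fetch_sub(&p->count, 1) == 1;
}

/* Reads the current count, like page_count() in the fixed patch. */
static int sketch_page_count(struct sketch_page *p)
{
	return atomic_load(&p->count);
}
```

With a starting count of 1 the helper returns true while the count reads 0, so a check written as "returned value != 0 is a bug" fires on exactly the success case. Testing the count separately, as the fix does, avoids that.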
Hi Adam, Thanks the patch fixes the kernel bug while running the libhugetlbfs test. -- Thanks & Regards, Kamalesh Babulal, Linux Technology Center, IBM, ISTL. --
Both x86_64 and i386 builds throw these messages at me:

LD arch/x86/kernel/acpi/realmode/wakeup.elf
ld: warning: dot moved backwards before `.text'
ld: warning: dot moved backwards before `.text'
ld: warning: dot moved backwards before `.text'
OBJCOPY arch/x86/kernel/acpi/realmode/wakeup.bin

--- ~Randy --
I think I saw something like this on a system with an "older" toolchain. I'm not seeing it on openSUSE 10.3, though (using gcc 4.2.1). Added CCs to the experts. Thanks, Rafael --
Google turned up this post: http://sourceware.org/ml/binutils/2006-08/msg00235.html I have no time to dig more into it the next days. Sam --
"make htmldocs" gives me:

HOSTCC scripts/basic/fixdep
HOSTCC scripts/basic/docproc
make[1]: *** No rule to make target `Documentation/DocBook/9p-overview.eps', needed by `Documentation/DocBook/9p.xml'. Stop.
make: *** [htmldocs] Error 2

Are we missing the .eps and .png files?

--- ~Randy --
Actually looks like we are missing a .fig (which generates the .eps or
.png as appropriate) and the template file.
Ugh, sorry, I must have messed up the patch. I'll fix it in my tree tonight.
-eric
--
This probably causes userspace damage:

dbus:
prctl(0x8, 0x1, 0, 0, 0) = -1 EINVAL (Invalid argument)

named:
named: -u with Linux threads not supported: requires kernel support for prctl(PR_SET_KEEPCAPS)
prctl(0x8, 0x1, 0, 0, 0) = -1 EINVAL (Invalid argument)

ntpd:
prctl(0x8, 0x1, 0xffffffffffffffa8, 0x1, 0) = -1 EINVAL (Invalid argument)
prctl(0x8, 0x1, 0, 0, 0) = -1 EINVAL (Invalid argument)

$ grep CONFIG_SECURITY .config
# CONFIG_SECURITY is not set
# CONFIG_SECURITY_FILE_CAPABILITIES is not set
--
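The failing call in those traces is prctl() with option 0x8, which is PR_SET_KEEPCAPS. A minimal userspace probe for it might look like the following sketch; try_set_keepcaps() is a name made up here, and the program only reports what the running kernel answers.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Asks the kernel to keep capabilities across a UID change, the same
 * request the daemons above make. Returns 0 on success, otherwise the
 * errno value reported by prctl() (EINVAL in the traces above). */
static int try_set_keepcaps(void)
{
	if (prctl(PR_SET_KEEPCAPS, 1UL, 0UL, 0UL, 0UL) == -1)
		return errno;
	return 0;
}
```

On a kernel where the capability prctl handling is wired up this is expected to return 0; a return of EINVAL would reproduce the breakage reported here.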
Thanks, Jiri. Does the following patch work for you?
This patch addresses the !CONFIG_SECURITY case, but not the case of
using the dummy LSM. The default these days is to have capabilities
compiled in no matter what, but it is still possible to have
CONFIG_SECURITY=y and CONFIG_SECURITY_CAPABILITIES=n, in which
case prctl(0x8) will return -EINVAL. Do we want dummy to call
cap_prctl() as well, or are we ok with userspace getting -EINVAL
given that there are in fact no capabilities at that point and
the userspace code is clearly expecting them?
thanks,
-serge
From 4a66f19580489a3ac84f0a145e4585c09e65c88e Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <serue@us.ibm.com>
Date: Wed, 5 Mar 2008 06:02:32 -0800
Subject: [PATCH 1/1] capabilities: use cap_task_prctl when !CONFIG_SECURITY
capabilities-implement-per-process-securebits.patch introduced
cap_task_prctl() and moved the handling of capability-related
prctl into it. So when !CONFIG_SECURITY, the default
security_task_prctl() needs to call cap_task_prctl() the way
other default hooks call capability helpers when they exist.
This fixes a slew of userspace breakages when
CONFIG_SECURITY=n.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
include/linux/security.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/include/linux/security.h b/include/linux/security.h
index 83763b0..861d6da 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -2228,7 +2228,7 @@ static inline int security_task_prctl (int option, unsigned long arg2,
unsigned long arg4,
unsigned long arg5, long *rc_p)
{
- return 0;
+ return cap_task_prctl(option, arg2, arg3, arg3, arg5, rc_p);
}
static inline void security_task_reparent_to_init (struct task_struct *p)
--
1.5.1
--
Tested-by: Jiri Slaby <jirislaby@gmail.com> --
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Cheers
Andrew
With CONFIG_SYSFS not set got this on boot:

kobject: '<NULL>' (f88774c8): is not initialized, yet kobject_put() is
------------[ cut here ]------------
WARNING: at lib/kobject.c:652 kobject_put+0x29/0x3c()
Modules linked in: sky2 e1000
Pid: 1303, comm: modprobe Not tainted 2.6.25-rc3-mm1 #79
[<c041855b>] warn_on_slowpath+0x40/0x66
[<c041c687>] irq_exit+0x50/0x67
[<c040cc70>] smp_apic_timer_interrupt+0x6e/0x7a
[<c0403380>] apic_timer_interrupt+0x28/0x30
[<c0418e36>] vprintk+0x2b0/0x2df
[<c04118e8>] __update_rq_clock+0x1d/0x110
[<c0565e43>] schedule_timeout+0x13/0x86
[<c05656c2>] wait_for_common+0xd1/0x123
[<c0418e79>] printk+0x14/0x18
[<c04b34bf>] kobject_put+0x29/0x3c
[<c0431e39>] free_module+0x2f/0x72
[<c04328dd>] sys_init_module+0xa61/0x15d2
[<c04ba863>] pci_bus_read_config_byte+0x0/0x58
[<c0454f87>] vfs_read+0x6c/0x8b
[<c0455323>] sys_read+0x3c/0x63
[<c04028b2>] sysenter_past_esp+0x5f/0x85
=======================
---[ end trace d50646e8e8e48682 ]---
BUG: atomic counter underflow at:
Pid: 1303, comm: modprobe Tainted: G W 2.6.25-rc3-mm1 #79
[<c04b4042>] kref_put+0x3a/0x55
[<c0431e39>] free_module+0x2f/0x72
[<c04328dd>] sys_init_module+0xa61/0x15d2
[<c04ba863>] pci_bus_read_config_byte+0x0/0x58
[<c0454f87>] vfs_read+0x6c/0x8b
[<c0455323>] sys_read+0x3c/0x63
[<c04028b2>] sysenter_past_esp+0x5f/0x85
=======================

And same on any (in this case sky2) module unload (load is OK)

sky2 eth1: disabling interface
kobject: '<NULL>' (f886cb48): is not initialized, yet kobject_put() is being called.
------------[ cut here ]------------
WARNING: at lib/kobject.c:652 kobject_put+0x29/0x3c()
Modules linked in: e1000 [last unloaded: sky2]
Pid: 3216, comm: rmmod Tainted: G W 2.6.25-rc3-mm1 #80
[<c041855b>] warn_on_slowpath+0x40/0x66
[<c041c687>] irq_exit+0x50/0x67
[<c040cc70>] smp_apic_timer_interrupt+0x6e/0x7a
[<c0403380>] apic_timer_interrupt+0x28/0x30
[<c0418e36>] vprintk+0x2b0/0x2df
[<c04118e8>] ...
Sorry, I forgot to change the subject in the previous letter. Better late than never. --
Does this fix it?: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=120fc3... Thanks, Kay --
Ok. Care to enable CONFIG_DEBUG_KOBJECT, and post the part of the log that happens right before the WARN()? We might get a hint where to look for the stuff that goes wrong. Thanks, Kay --
Hm... Not sure how many lines are required, but here are the ones that are related to the sky2 module, which is loaded and then removed:

kobject: 'sky2' (f74de280): kobject_add_internal: parent: 'drivers', set: 'drivers'
PCI: Setting latency timer of device 0000:02:00.0 to 64
sky2 0000:02:00.0: v1.21 addr 0xdeefc000 irq 16 Yukon-EC (0xb6) rev 2
kobject: 'net' (f7512200): kobject_add_internal: parent: '0000:02:00.0', set: '<NULL>'
kobject: 'eth1' (f74ccb64): kobject_add_internal: parent: 'net', set: 'devices'
kobject: 'eth1' (f74ccb64): kobject_uevent_env
kobject: 'eth1' (f74ccb64): fill_kobj_path: path = '/devices/pci0000:00/0000:00:03.0/0000:02:00.0/net/eth1'
sky2 eth1: addr 00:0e:0c:3b:d8:8a
kobject: 'sky2' (f74de280): kobject_uevent_env
kobject: 'sky2' (f74de280): fill_kobj_path: path = '/bus/pci/drivers/sky2'
sky2 eth1: enabling interface
sky2 eth1: disabling interface
kobject: 'eth1' (f74ccb64): kobject_uevent_env
kobject: 'eth1' (f74ccb64): fill_kobj_path: path = '/devices/pci0000:00/0000:00:03.0/0000:02:00.0/net/eth1'
kobject: 'net' (f7512200): kobject_cleanup
kobject: 'net' (f7512200): auto cleanup kobject_del
kobject: 'net' (f7512200): calling ktype release
kobject: (f7512200): dynamic_kobj_release
kobject: 'net': free name
kobject: 'eth1' (f74ccb64): kobject_cleanup
kobject: 'eth1' (f74ccb64): calling ktype release
kobject: 'eth1': free name
kobject: 'sky2' (f74de280): kobject_cleanup
kobject: 'sky2' (f74de280): auto cleanup 'remove' event
kobject: 'sky2' (f74de280): kobject_uevent_env
kobject: 'sky2' (f74de280): fill_kobj_path: path = '/bus/pci/drivers/sky2'
kobject: 'sky2' (f74de280): auto cleanup kobject_del
kobject: 'sky2' (f74de280): calling ktype release
kobject: 'sky2': free name
kobject: '<NULL>' (f886cb48): is not initialized, yet kobject_put() is being called.
------------[ cut here ]------------
WARNING: at lib/kobject.c:652 kobject_put+0x29/0x3c()
Modules linked in: e1000 [last unloaded: sky2]
Pid: 3188, comm: rmmod Tainted: G W ...
Thanks. Odds are we have some sysfs issue in the module core, that code really needs to be refactored, I'll go work on it to see if we can try to isolate all of that code into one file, which should help find these kinds of things easier. thanks, greg k-h --
x86_64, mostly 64-bit userspace, Dell Latitude D820, T7200 Core2 Duo...

So I gave CONFIG_PROFILE_LIKELY another try, and this time the thing actually booted and got into userspace, but stuff started dying in rc.sysinit. According to dmesg, they all died at the same place:

[ 4.841459] rename_device[686]: segfault at ffffffffff7009be ip ffffffffff7009be sp 7fff7ccfb958 error 14
[ 4.842384] rename_device[984]: segfault at ffffffffff7009be ip ffffffffff7009be sp 7fffb6fe9c68 error 14
[ 4.843298] rename_device[981]: segfault at ffffffffff7009be ip ffffffffff7009be sp 7fffc18504c8 error 14
[ 4.844184] rename_device[983]: segfault at ffffffffff7009be ip ffffffffff7009be sp 7fff512c8f48 error 14
[ 6.099486] rename_device[1513]: segfault at ffffffffff7009be ip ffffffffff7009be sp 7fff47e88ad8 error 14
[ 5.769289] rename_device[1516]: segfault at ffffffffff7009be ip ffffffffff7009be sp 7fffa317edd8 error 14
[ 7.457229] fsck.ext3[1576]: segfault at ffffffffff7009be ip ffffffffff7009be sp 7fff3be947f8 error 14

(Note that not everything died - some renames, an fsck, and maybe I missed something - but a lot of other stuff worked (dmesg, grep, cat, uname that I ran, and a lot of things that rc.sysinit invoked) - so that may tell us something...)

/proc/self/maps says that's near:

ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

And my System.map says:

ffffffff80855a0c A __bss_stop
ffffffff80855a0c A _end
ffffffffff600000 T vgettimeofday
ffffffffff600100 t vread_tsc
ffffffffff600122 t vread_hpet
ffffffffff600140 D __vsyscall_gtod_data
ffffffffff600400 T vtime
ffffffffff600800 T vgetcpu
ffffffffff600870 D __vgetcpu_mode
ffffffffff600880 D __jiffies
ffffffffff600c00 T venosys_1
ffffffffff700000 A VDSO64_PRELINK
ffffffffff7005b0 A VDSO64_jiffies
ffffffffff7005b8 A VDSO64_vgetcpu_mode
ffffffffff7005c0 A VDSO64_vsyscall_gtod_data
<file ends there>

So we're in the same 4K as the VDSO64_* values, but some 0x4fe past ...
Try this patch:
Remove unlikelies in vsyscall path
Remove unlikely in vsyscall path that conflict with unlikely profiling.
The unlikelies shouldn't be needed anyways because gcc predicts
condition leading to early return as unlikely by default and
for the loops it shouldn't make much difference
Signed-off-by: Andi Kleen <ak@suse.de>
Index: linux/arch/x86/kernel/vsyscall_64.c
===================================================================
--- linux.orig/arch/x86/kernel/vsyscall_64.c
+++ linux/arch/x86/kernel/vsyscall_64.c
@@ -128,7 +128,7 @@ static __always_inline void do_vgettimeo
seq = read_seqbegin(&__vsyscall_gtod_data.lock);
vread = __vsyscall_gtod_data.clock.vread;
- if (unlikely(!__vsyscall_gtod_data.sysctl_enabled || !vread)) {
+ if (!__vsyscall_gtod_data.sysctl_enabled || !vread) {
gettimeofday(tv,NULL);
return;
}
@@ -169,7 +169,7 @@ time_t __vsyscall(1) vtime(time_t *t)
{
struct timeval tv;
time_t result;
- if (unlikely(!__vsyscall_gtod_data.sysctl_enabled))
+ if (!__vsyscall_gtod_data.sysctl_enabled)
return time_syscall(t);
vgettimeofday(&tv, NULL);
Index: linux/arch/x86/vdso/vclock_gettime.c
===================================================================
--- linux.orig/arch/x86/vdso/vclock_gettime.c
+++ linux/arch/x86/vdso/vclock_gettime.c
@@ -48,7 +48,7 @@ static noinline int do_realtime(struct t
ts->tv_sec = gtod->wall_time_sec;
ts->tv_nsec = gtod->wall_time_nsec;
ns = vgetns();
- } while (unlikely(read_seqretry(&gtod->lock, seq)));
+ } while (read_seqretry(&gtod->lock, seq));
timespec_add_ns(ts, ns);
return 0;
}
@@ -77,7 +77,7 @@ static noinline int do_monotonic(struct
ns = gtod->wall_time_nsec + vgetns();
secs += gtod->wall_to_monotonic.tv_sec;
ns += gtod->wall_to_monotonic.tv_nsec;
- } while (unlikely(read_seqretry(&gtod->lock, seq)));
+ } while (read_seqretry(&gtod->lock, seq));
vset_normalized_timespec(ts, secs, ns);
return 0;
}
@@ -105,7 +105,7 @@ int ...

Yes, but both those files now have:

/*
 * likely and unlikely explode when used in vdso in combination with
 * profile-likely-unlikely-macros.patch
 */
#undef likely
#define likely(x) (x)
#undef unlikely
#define unlikely(x) (x)

at the top, so it'll be something else. Perhaps a `likely' snuck in via an inline in a header file.

It would be better to add a #define DONT_DO_THAT at the top of arch/x86/kernel/vsyscall_64.c and arch/x86/vdso/vclock_gettime.c, then use that to defeat likely-profiling.

 arch/x86/kernel/vsyscall_64.c | 11 ++---------
 arch/x86/vdso/vclock_gettime.c | 11 ++---------
 include/linux/compiler.h | 3 ++-
 3 files changed, 6 insertions(+), 19 deletions(-)

diff -puN arch/x86/kernel/vsyscall_64.c~profile-likely-unlikely-macros-fix arch/x86/kernel/vsyscall_64.c
--- a/arch/x86/kernel/vsyscall_64.c~profile-likely-unlikely-macros-fix
+++ a/arch/x86/kernel/vsyscall_64.c
@@ -17,6 +17,8 @@
 * want per guest time just set the kernel.vsyscall64 sysctl to 0.
 */

+#define SUPPRESS_LIKELY_PROFILING
+
 #include <linux/time.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
@@ -46,15 +48,6 @@
 #define __syscall_clobber "r11","cx","memory"

 /*
- * likely and unlikely explode when used in vdso in combination with
- * profile-likely-unlikely-macros.patch
- */
-#undef likely
-#define likely(x) (x)
-#undef unlikely
-#define unlikely(x) (x)
-
-/*
 * vsyscall_gtod_data contains data that is :
 * - readonly from vsyscalls
 * - written by timer interrupt or systcl (/proc/sys/kernel/vsyscall64)
diff -puN arch/x86/vdso/vclock_gettime.c~profile-likely-unlikely-macros-fix arch/x86/vdso/vclock_gettime.c
--- a/arch/x86/vdso/vclock_gettime.c~profile-likely-unlikely-macros-fix
+++ a/arch/x86/vdso/vclock_gettime.c
@@ -9,6 +9,8 @@
 * Also alternative() doesn't work.
 */

+#define SUPPRESS_LIKELY_PROFILING
+
 #include <linux/kernel.h>
 #include <linux/posix-timers.h>
 #include <linux/time.h>
@@ -23,15 +25,6 @@
 #define gtod ...
I think you need to do it differently. Not undef/define, but set some symbol that is checked by the unlikely profiler and it won't

Possible. The problem is that there are now vsyscall functions in other files too, especially hpet_64.c and tsc_64.c.

Perhaps this is something that should be just checked in modpost instead. Any external references from the vsyscall section to another section should be flagged as an error (cc'ed Sam in case he wants to look at that).

-Andi --
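The "set a symbol that the profiler checks" approach from this exchange can be sketched as below. The include/linux/compiler.h hunk is not shown in the thread, so this is an assumption about its general shape rather than a copy: sketch_record_unlikely() and both counters are hypothetical, and __builtin_expect is the GCC/Clang builtin.

```c
/* Hypothetical counters, loosely matching the "# True" and "# False"
 * columns a branch profiler would report. */
static unsigned long sketch_cond_true;
static unsigned long sketch_cond_false;

/* Hypothetical profiling hook: records the outcome and passes it on. */
static int sketch_record_unlikely(int cond)
{
	if (cond)
		sketch_cond_true++;
	else
		sketch_cond_false++;
	return cond;
}

/* A file that must not call out to the profiler (for example code
 * mapped into the vsyscall page) defines SUPPRESS_LIKELY_PROFILING
 * before its includes and gets the plain compiler hint instead. */
#ifdef SUPPRESS_LIKELY_PROFILING
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) __builtin_expect(sketch_record_unlikely(!!(x)), 0)
#endif
```

The point of the design is that the opt-out is decided once, where the macro is defined, so inline functions pulled in from headers follow the same rule as the file that includes them. Whether that covers every vsyscall helper in other files is the open question raised in the reply above.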
Confirming that this patch works and my system goes multi-user cleanly.

Actual numbers after about 10 minutes of uptime:

% wc -l /proc/likely_prof
2635 /proc/likely_prof
% grep '^[^ ]' /proc/likely_prof
Likely Profiling Results
[+- ] Type | # True | # False | Function:Filename@Line
+unlikely | 1| 0 in_dev_get()@:include/linux/inetdevice.h@185
+unlikely | 513| 0 dst_input()@:include/net/dst.h@254
-likely | 0| 148 ip6_mc_input()@:net/ipv6/ip6_input.c@271
-likely | 0| 1 sock_error()@:include/net/sock.h@1211
-likely | 851| 1219 tcp_transmit_skb()@:net/ipv4/tcp_output.c@493
+unlikely | 1| 0 signal_pending()@:include/linux/sched.h@1927
-likely | 0| 1172946 audit_syscall_entry()@:kernel/auditsc.c@1522
+unlikely | 1172716| 0 syscall_trace_enter()@:arch/x86/kernel/ptrace.c@1556
-likely | 0| 1173020 audit_syscall_exit()@:kernel/auditsc.c@1551
+unlikely | 1172831| 0 syscall_trace_leave()@:arch/x86/kernel/ptrace.c@1573
-likely | 0| 1272 audit_alloc()@:kernel/auditsc.c@841
+unlikely | 3| 0 icmp_unreach()@:net/ipv4/icmp.c@773
+unlikely | 2| 1 nf_ct_attach()@:net/netfilter/core.c@230
-likely | 0| 2 dst_gc_task()@:net/core/dst.c@82
+unlikely | 143| 61 fput_light()@:include/linux/file.h@77
+unlikely | 892| 424 _read_unlock_irqrestore()@:kernel/spinlock.c@375
+unlikely | 28| 0 sched_move_task()@:kernel/sched.c@7835
+unlikely | 28| 0 sched_move_task()@:kernel/sched.c@7828
+unlikely | 108| 0 verify_export_symbols()@:kernel/module.c@1401
+unlikely | 313| 0 verify_export_symbols()@:kernel/module.c@1393
+unlikely | 14| 0 ll_front_merge_fn()@:block/blk-merge.c@347
-likely | 17| 1150 audit_free()@:kernel/auditsc.c@1428
-likely | 17| 1174290 audit_get_context()@:kernel/auditsc.c@711
+unlikely | ...
On Wed, 05 Mar 2008 17:26:25 -0500 These are all the ones which we got wrong on your setup, yes? I wonder if assuming that current->audit_context is NULL is realistic nowadays. --
Nope, sorry... same behavior. Apparently it's a (un)likely someplace else... I'm trying to figure out what's at 0x9be into the vdso, but not having a lot of luck.
You can do objdump -Sr on the vdso/vsyscall object files and see if there are any external references to unlikely related functions. If yes the problem is in that function -Andi --
Hi Andrew,

I am not able to boot 2.6.25-rc3-mm1 on my ppc64 box. 2.6.25-rc2-mm1 and 2.6.25-rc3 boot fine. I applied the slab.c fix also. Any other known issues? My config file is attached. Here are the messages on the console.

Thanks,
Badari

Linux/PowerPC load: root=/dev/sda3 selinux=0 elevator=cfq numa=debug kernelcore=1024M
Finalizing device tree... using OF tree (promptr=00c39a50)
OF stdout device is: /vdevice/vty@30000000
Hypertas detected, assuming LPAR !
command line: root=/dev/sda3 selinux=0 elevator=cfq numa=debug kernelcore=1024M
memory layout at init:
  alloc_bottom : 00000000023d0000
  alloc_top    : 0000000008000000
  alloc_top_hi : 0000000072000000
  rmo_top      : 0000000008000000
  ram_top      : 0000000072000000
Looking for displays
instantiating rtas at 0x00000000077ca000 ... done
0000000000000000 : boot cpu 0000000000000000
0000000000000002 : starting cpu hw idx 0000000000000002... done
copying OF device tree ...
Building dt strings...
Building dt structure...
Device tree strings 0x00000000023d1000 -> 0x00000000023d21cf
Device tree struct 0x00000000023d3000 -> 0x00000000023e0000
Calling quiesce ...
returning from prom_init

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc3-mm1
# Wed Mar 5 10:34:39 2008
#
CONFIG_PPC64=y
#
# Processor support
#
# CONFIG_POWER4_ONLY is not set
CONFIG_POWER3=y
CONFIG_POWER4=y
# CONFIG_TUNE_CELL is not set
CONFIG_PPC_FPU=y
# CONFIG_ALTIVEC is not ...
On Wed, 05 Mar 2008 13:34:14 -0800 The semaphore consolidation code enables interrupts early in boot, when it shouldn't. This tends to make powerpc blow up. Could be that this is what you're hitting. Matthew, is this going to be fixed soon? Thanks. --
Yes. I just backed out git-semaphore.patch and machine booted fine. Thanks, Badari --
Hi Andrew, On Wed, 5 Mar 2008 13:54:25 -0800 Andrew Morton <akpm@linux-foundation.org> There is a new version of these patches in the current linux-next tree ... -- Cheers, Stephen Rothwell sfr@canb.auug.org.au
Dell Latitude D820, x86_64, Core2 Duo T7200

'shutdown -h' blows up at the very end. shutdown -r works OK. I caught this one with netconsole. There's another, different, crash I've been seeing a bit earlier in the shutdown -h as well, but I haven't been able to catch that one yet...

[ 74.254402] CPU 1 is now offline
[ 74.255395] SMP alternatives: switching to UP code
[ 74.256373] BUG: unable to handle kernel paging request at ffffffff8020a023
[ 74.256373] IP: [<ffffffff80211872>] alternatives_smp_unlock+0x66/0x7b
[ 74.256373] PGD 203067 PUD 207063 PMD 7e4cc163 PTE 20a161
[ 74.256373] Oops: 0003 [1] PREEMPT SMP
[ 74.256373] last sysfs file: /sys/devices/virtual/block/dm-14/dev
[ 74.256373] CPU 0
[ 74.256373] Modules linked in: rtc sha256_generic aes_generic acpi_cpufreq tpm_tis arc4 ecb pcmcia iwl3945 iTCO_wdt ohci1394 firmware_class iTCO_vendor_support yenta_socket watchdog_core thermal rsrc_nonstatic mac80211 snd_hda_intel intel_agp watchdog_dev ieee1394 pcmcia_core processor button ac battery cfg80211
[ 74.256373] Pid: 1767, comm: halt Not tainted 2.6.25-rc3-mm1 #8
[ 74.256373] RIP: 0010:[<ffffffff80211872>] [<ffffffff80211872>] alternatives_smp_unlock+0x66/0x7b
[ 74.256373] RSP: 0018:ffff81007ac63d10 EFLAGS: 00010093
[ 74.256373] RAX: ffffffff80573190 RBX: ffff81007f83a8c0 RCX: ffffffff80563cec
[ 74.256373] RDX: ffffffff8020a023 RSI: ffffffff8078a0b8 RDI: ffffffff80783018
[ 74.256373] RBP: ffff81007ac63d28 R08: 0000000000000001 R09: ffffffff80563cec
[ 74.256373] R10: ffffffff80200000 R11: ffff81007ac63d1f R12: 0000000000000000
[ 74.256373] R13: 0000000000000001 R14: 0000000000000246 R15: ffff81007d156340
[ 74.256373] FS: 00007f2d0ab206f0(0000) GS:ffffffff8076e000(0000) knlGS:0000000000000000
[ 74.256373] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 74.256373] CR2: ffffffff8020a023 CR3: 000000007edf3000 CR4: 00000000000006e0
[ 74.256373] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 74.256373] DR3: ...
Yes, I hit a similar one during halt on the t61p. But because of the netconsole bustage I was only able to see (on the screen) oops #2 - oops #1 had scrolled off. oops #2 had a similar trace and the EIP was in text_poke(). I suppose one of us should bisect it. --
OK, I finally managed to catch the *other* failure I was seeing at shutdown, and it appears to be a variant on the same theme, so readers may feel free to ignore the rest of this note unless they care about the gory details...

Apparently, if I booted with 'ignore_loglevel' (which is my default when using netconsole), I hit the above traceback and I'm dead in the water, no alt-sysrq, need to hold down the power button for 5 seconds.

If I boot with 'quiet' instead, I get the below set of tracebacks, which caused the original BUG to go scrolling off-screen and obfuscating that it's the same failure. Adding to the confusion, if it failed in this mode, alt-sysrq still worked just fine, so alt-sysrq-S-S-U-B got me a reboot.

Now that I know that at least *part* of the issue is the same, I can go bisecting. Somebody *else* can ponder why ignore_loglevel/quiet causes the big difference in behavior after the BUG, that part is beyond my ken...

[ 168.036824] BUG: unable to handle kernel paging request at ffffffff8020a023
[ 168.037300] IP: [<ffffffff80211872>] alternatives_smp_unlock+0x66/0x7b
[ 168.037745] PGD 203067 PUD 207063 PMD 7f989163 PTE 20a161
[ 168.037781] Oops: 0003 [1] PREEMPT SMP
[ 168.037781] last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
[ 168.037781] CPU 0
[ 168.037781] Modules linked in: rtc irnet ppp_generic slhc irtty_sir sir_dev ircomm_tty ircomm irda crc_ccitt sha256_generic aes_generic acpi_cpufreq tpm_tis arc4 ecb iwl3945 pcmcia nvidia(P)(U) firmware_class mac80211 ohci1394 snd_hda_intel cfg80211 yenta_socket ieee1394 iTCO_wdt iTCO_vendor_support thermal rsrc_nonstatic ac processor watchdog_core battery watchdog_dev button pcmcia_core intel_agp [last unloaded: x_tables]
[ 168.037781] Pid: 3115, comm: halt Tainted: P 2.6.25-rc3-mm1 #8
[ 168.037781] RIP: 0010:[<ffffffff80211872>] [<ffffffff80211872>] alternatives_smp_unlock+0x66/0x7b
[ 168.037781] RSP: 0000:ffff81007dbebd10 EFLAGS: 00010093
[ 168.037781] RAX: ffffffff80573190 ...
Can you decode ffffffff8020a023 via addr2line please ? Thanks, tglx --
It's been a long day, and I couldn't get addr2line to work, it kept saying '??:0'. However, this is in my System.map: ffffffff8020a000 t poll_idle ffffffff8020a009 t do_nothing ffffffff8020a00f T set_personality_64bit ffffffff8020a041 T release_thread ffffffff8020a07d T arch_randomize_brk so set_personality_64bit+0x14 or so?
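The manual System.map lookup above can be done mechanically: find the last symbol whose address is not above the faulting address and report the offset from it. A small Python sketch, using only the five System.map lines quoted above; load_symbols and resolve are illustrative names, not an existing tool:

```python
import bisect

# The System.map excerpt quoted in the message above.
SYSTEM_MAP = """\
ffffffff8020a000 t poll_idle
ffffffff8020a009 t do_nothing
ffffffff8020a00f T set_personality_64bit
ffffffff8020a041 T release_thread
ffffffff8020a07d T arch_randomize_brk
"""


def load_symbols(text):
    """Return a list of (address, name) tuples sorted by address."""
    symbols = []
    for line in text.splitlines():
        addr, _sym_type, name = line.split()
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Map addr to 'symbol+0xoffset', or None if it lies below every symbol."""
    addresses = [a for a, _ in symbols]
    # Index of the last symbol starting at or below addr.
    i = bisect.bisect_right(addresses, addr) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return f"{name}+{addr - base:#x}"


if __name__ == "__main__":
    syms = load_symbols(SYSTEM_MAP)
    print(resolve(syms, 0xFFFFFFFF8020A023))  # -> set_personality_64bit+0x14
```

With only an excerpt of System.map loaded, an address past the last listed symbol would be attributed to that symbol, so a real lookup wants the whole file.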
----------------------------------------------------------^^^^^^ The PTE has the RW bit cleared, so the fault is not a big surprise. Thanks, tglx --
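For anyone who wants to check the RW remark by hand, the low bits of the PTE and the page-fault error code from the oops (PTE 20a161, Oops: 0003) can be decoded as below. This is only a sketch in Python; the bit positions follow the usual x86 layout for a 4K page table entry and for the page-fault error code, and the helper names are invented for this note.

```python
# Low flag bits of an x86-64 4K page table entry.
PTE_BITS = {
    0: "PRESENT",
    1: "RW",
    2: "USER",
    3: "PWT",
    4: "PCD",
    5: "ACCESSED",
    6: "DIRTY",
    7: "PAT",
    8: "GLOBAL",
}


def decode_pte(pte):
    """Return the names of the low flag bits that are set in pte."""
    return [name for bit, name in PTE_BITS.items() if pte & (1 << bit)]


def decode_fault(code):
    """Describe the three low bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 0x1),  # 1: protection fault, 0: page not present
        "write": bool(code & 0x2),         # 1: write access, 0: read access
        "user_mode": bool(code & 0x4),     # 1: fault came from user mode
    }


if __name__ == "__main__":
    print(decode_pte(0x20A161))   # -> ['PRESENT', 'ACCESSED', 'DIRTY', 'GLOBAL']
    print(decode_fault(0x0003))
```

RW is absent from the PTE flags, and the error code describes a kernel-mode write to a page that is present, which matches a store into a read-only kernel text mapping.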
Probably not surprisingly, the quilt bisect says the problem is git-x86.patch.
Nope, still blows up with exactly the same traceback.

I may have to try again to figure out how to bisect the git-x86 tree - Ingo sent me a pointer to his git-x86 cheat sheet, I looked at it but I couldn't figure out how to tell 'git bisect' that the starting good spot was "whatever corresponded to the git-x86 patch in 24-rc8-mm1" and bad was "25-rc3-mm1". I tried using the first commit ID listed in the patch, but that gave me this:

(looking at first few lines of the git-x86.patch in the 25-rc3-mm1 broken-out):

commit fa70e201463a7f3d86b995249e57a8e27b31b5f8
Author: Paolo Ciarrocchi <paolo.ciarrocchi@gmail.com>
Date: Sun Feb 24 11:57:22 2008 +0100

but then:

% git bisect bad fa70e201463a7f3d86b995249e57a8e27b31b5f8
fatal: Needed a single revision
Bad rev input: fa70e201463a7f3d86b995249e57a8e27b31b5f8

And I didn't see any release tags in the x86 git tree that I could specify either. (Once I get the good and bad markers set, it "should be easy" - I've managed to git-bisect through Linus's git tree before, but that was always easy because "bad" was HEAD and "good" had a nice v2.6.2mumble-rcN tag to specify...)
Yes, it's all a bit mysterious. I just look in the changelog, which was pulled out of the git diff via various means. See how ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.25-rc5/2.6.25-rc5-mm... starts with 5813a19cba5735b629cdeb156863dab814759128 and ends with 816543f9bf2fb77ff52083216a4537eb4e3058ec. Use 5813a19cba5735b629cdeb156863dab814759128 as good and 816543f9bf2fb77ff52083216a4537eb4e3058ec as bad. --
I *hope* I'm mis-reading Ingo's directions when I cut-n-pasted them -
first I pulled down the two trees, tried to bisect, had it give me the
"need a single revision" error, then I checked out a tree - and got a
*different* funky opaque error message when I tried to bisect:
[/usr/src/valdis/x86.git] git-init-db
Initialized empty Git repository in .git/
[/usr/src/valdis/x86.git] git-remote add linus git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
[/usr/src/valdis/x86.git] git-remote add x86 git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
[/usr/src/valdis/x86.git] git-remote update
Updating linus
warning: no common commits
(...)
Resolving deltas: 100% (598008/598008), done.
From git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
* [new branch] master -> linus/master
remote: Counting objects: 105, done.
(...)
From git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
* [new tag] v2.6.12 -> v2.6.12
* [new tag] v2.6.12-rc2 -> v2.6.12-rc2
(...)
* [new tag] v2.6.25-rc4 -> v2.6.25-rc4
* [new tag] v2.6.25-rc5 -> v2.6.25-rc5
Updating x86
remote: Counting objects: 2651, done.
(...)
Resolving deltas: 100% (1979/1979), completed with 310 local objects.
From git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86
* [new branch] base -> x86/base
* [new branch] for-akpm -> x86/for-akpm
* [new branch] for-linus -> x86/for-linus
* [new branch] latest -> x86/latest
* [new branch] master -> x86/master
* [new branch] origin -> x86/origin
* [new branch] testing -> x86/testing
[/usr/src/valdis/x86.git] git bisect start
[/usr/src/valdis/x86.git] git bisect good 5813a19cba5735b629cdeb156863dab814759128
fatal: Needed a single revision
Bad rev commit: ^{commit}
[/usr/src/valdis/x86.git] git branch list
fatal: Not a valid object name: 'master'.
[/usr/src/valdis/x86.git] git checkout -b ...

Try this:

echo "git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git#for-akpm" > .git/branches/git-foo
git-fetch git-foo
git-checkout git-foo
git-bisect start
git-bisect good 968f7910e8d10e5273977248f3d89193b32e8c20
git-bisect bad c28550f4f68a894a3c05141762f388b5a14f33e3
--
Trying it against what I already pulled down:

[/usr/src/valdis/x86.git] echo "git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git#for-akpm" > .git/branches/git-foo
[/usr/src/valdis/x86.git] git-fetch git-foo
remote: Counting objects: 1642, done.
remote: Compressing objects: 100% (261/261), done.
remote: Total 1296 (delta 1090), reused 1238 (delta 1034)
Receiving objects: 100% (1296/1296), 197.24 KiB | 215 KiB/s, done.
Resolving deltas: 100% (1090/1090), completed with 218 local objects.
[/usr/src/valdis/x86.git] git-checkout git-foo
error: pathspec 'git-foo' did not match any file(s) known to git.
Did you forget to 'git add'?
[/usr/src/valdis/x86.git] git-bisect start
won't bisect on seeked tree
[/usr/src/valdis/x86.git] git-checkout -b git-foo git-foo
git checkout: updating paths is incompatible with switching branches/forcing
Did you intend to checkout 'git-foo' which can not be resolved as commit?

Trying again against a totally clean new directory:

[/usr/src/valdis] git --version
git version 1.5.4.3
[/usr/src/valdis] rm -rf x86.git
[/usr/src/valdis] mkdir x86.git
[/usr/src/valdis] cd x86.git
[/usr/src/valdis/x86.git] git-init-db
Initialized empty Git repository in .git/
[/usr/src/valdis/x86.git] git-remote add linus git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
[/usr/src/valdis/x86.git] git-remote update
Updating linus
warning: no common commits
remote: Counting objects: 721254, done.
remote: Compressing objects: 100% (130309/130309), done.
remote: Total 721254 (delta 598318), reused 711930 (delta 589976)
Receiving objects: 100% (721254/721254), 175.04 MiB | 3535 KiB/s, done.
Resolving deltas: 100% (598318/598318), done.
From git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
* [new branch] master -> linus/master
remote: Counting objects: 105, done.
remote: Compressing objects: 100% (105/105), done.
remote: Total 105 (delta 0), reused 102 (delta 0)
Receiving objects: 100% (105/105), 30.40 KiB, ...
the best way to bisect the x86.git-only commits is to do:

git-bisect bad x86/latest
git-bisect good x86/base

the 'base' branch is the upstream tree that x86.git is based against. This will minimize the number of bisection points as well, because you'll only bisect x86.git patches. [ and make sure you test x86/base first to establish that it's truly 'good' :-) ]

Ingo
--
OK, *that* got the bisect running. However, after a few bisections, things are getting weird... (Note - I haven't done a git pull or update for a week and a bit, so the tree is as of 03/14 or so...)

'git bisect log' reports:

git-bisect start
# bad: [21a418440c44b6a2cdf38fea2533a5398d6fd939] Move mp_bus_id_to_node to numa.c
git-bisect bad 21a418440c44b6a2cdf38fea2533a5398d6fd939
# good: [dba92d3bc49c036056a48661d2d8fefe4c78375a] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
git-bisect good dba92d3bc49c036056a48661d2d8fefe4c78375a
# good: [53f0f2bc547fd13a70a6adb86592301ec83b9fc7] x86 mmiotrace: comment about user space ABI
git-bisect good 53f0f2bc547fd13a70a6adb86592301ec83b9fc7
# good: [53f0f2bc547fd13a70a6adb86592301ec83b9fc7] x86 mmiotrace: comment about user space ABI
git-bisect good 53f0f2bc547fd13a70a6adb86592301ec83b9fc7
# good: [53f0f2bc547fd13a70a6adb86592301ec83b9fc7] x86 mmiotrace: comment about user space ABI
git-bisect good 53f0f2bc547fd13a70a6adb86592301ec83b9fc7
# good: [2702dd1be087ac7307b731d884ee48db6e1cdff6] x86: create smpcommon.c
git-bisect good 2702dd1be087ac7307b731d884ee48db6e1cdff6
# good: [ad42b55d36238ebb9fa4d7a538ef691a76397c46] x86: add KERN_INFO to show_unhandled_signals printout
git-bisect good ad42b55d36238ebb9fa4d7a538ef691a76397c46
# good: [56b412e63863ea82a5720315076c7dbd1d9888cd] x86: change x86 to use generic find_next_bit
git-bisect good 56b412e63863ea82a5720315076c7dbd1d9888cd
# good: [42de918f25dc9a49fb9688e22c2a3f2b156cc1bf] x86: prevent unconditional writes to DebugCtl MSR
git-bisect good 42de918f25dc9a49fb9688e22c2a3f2b156cc1bf

At this point, 'git bisect visualize' shows 9 commits left to bisect through, and all are dated 03/10 or later. However, since 25-rc3-mm1 had the problem, it had to be something in-tree as of 03/05.
Is it possible that the problem code was in the git-x86 tree when Andrew pulled for -rc3-mm1 and -rc5-mm1, but had been reverted by the time I grabbed the tree, ...
no, we frequently regenerate the x86.git tree so the dates have little relevance. If for any particular pull, x86/base is good and x86/latest is bad, then the bug is somewhere in those 200-300 patches inbetween. They are lined up linearly so should be perfectly bisectable. Ingo --
OK, off to go try the last few bisects then...
well ... your git bisection log does look suspiciously 'good', so something is wrong there I think :-( the chance to get 8 'good' bisection points in a row is 1:256. OTOH, the freshest x86 patches are always at the 'end' of the queue - which are also the ones most likely to break anything.

Are you sure the x86/base point is indeed 'good'? You can check it via:

git-checkout -b tmp x86/base

and build+boot it.

Ingo
--
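The 1:256 figure treats each bisection verdict as a coin flip, so eight 'good' answers in a row have probability (1/2)^8. The sketch below, in Python, simulates bisecting a linear patch queue on top of a known-good base and counts the test builds. The function first_bad and the queue length of 256 are assumptions picked for illustration; this is not how git-bisect itself is implemented.

```python
def first_bad(n_patches, is_bad):
    """Find the first bad patch in a linear queue of n_patches.

    is_bad(i) reports whether the tree with patches 0..i applied is broken.
    Assumes the base (no patches) is good and the full queue is bad.
    Returns (index of the first bad patch, number of trees tested).
    """
    good, bad = -1, n_patches - 1   # -1 stands for the bare base tree
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1
        if is_bad(mid):
            bad = mid
        else:
            good = mid
    return bad, tested


if __name__ == "__main__":
    # Culprit is the very last patch in a 256-patch queue: every one of
    # the eight test points comes back 'good'.
    culprit, tested = first_bad(256, lambda i: i >= 255)
    print(culprit, tested)          # -> 255 8
    print(0.5 ** tested)            # -> 0.00390625, i.e. 1/256
```

An all-'good' run is what a culprit at the very end of the queue looks like, which is why it is consistent both with Ingo's suspicion and with his remark that the freshest patches sit at the end.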
On the other hand, this was broken in 25-rc3-mm1, so it's not a "fresh"

Did that, and it's good (as in 'shutdown -h now' powers off rather than BUG and hanging).

"You're at Witt's End" -- Adventure, c. 1978

OK.. so far I've got:

25-rc3-mm1 is bad
25-rc5-mm1 is bad, and bisected down to git-x86.patch
x86/base as pulled last week is good
bisected to within the last 9 entries of x86/latest is good.

So I can't seem to replicate it using the git-x86 tree, but bisecting -mm implicates it. How very strange.

I even went and pulled Andrew's mmotm pile as of this afternoon, and got that to build after having to heave only a dozen patches over the side and one or two hand-fixes of patches - and *that* one is good too.

So I'm thinking that it was some "bump in the night" that was broken in the x86 tree when Andrew pulled it for 25-rc5-mm1, but was fixed by the time I pulled it a few days later to start git-bisecting it. Given that -mmotm isn't showing the problem, I'm having a hard time coming up with enthusiasm to keep chasing it. If I see it happen again in a -mm or Linus kernel, I'll restart the chase then....
if it ever reappears then please check x86/latest first (without any other -mm bits) and notify us. Ingo --
