Re: [PATCH -mm] x86 allnoconfig memory model

From: Dave Young
Date: Tuesday, November 20, 2007 - 10:51 pm

Hi, andrew

modpost failed for me:
  MODPOST 360 modules
ERROR: "empty_zero_page" [drivers/kvm/kvm.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2

Regards
dave
-

From: Andrew Morton
Date: Tuesday, November 20, 2007 - 11:00 pm

You're a victim of the hasty unexporting fad.  Which architecture?
x86_64 I guess?
-

From: Dave Young
Date: Tuesday, November 20, 2007 - 11:03 pm

Hi,
ia32 instead.

Regards
dave
-

From: Andrew Morton
Date: Tuesday, November 20, 2007 - 11:15 pm

oic.  Like this, I guess.

--- a/arch/x86/kernel/i386_ksyms_32.c~git-x86-i386-export-empty_zero_page
+++ a/arch/x86/kernel/i386_ksyms_32.c
@@ -2,6 +2,7 @@
 #include <asm/semaphore.h>
 #include <asm/checksum.h>
 #include <asm/desc.h>
+#include <asm/pgtable.h>
 
 EXPORT_SYMBOL(__down_failed);
 EXPORT_SYMBOL(__down_failed_interruptible);
@@ -22,3 +23,4 @@ EXPORT_SYMBOL(__put_user_8);
 EXPORT_SYMBOL(strstr);
 
 EXPORT_SYMBOL(csum_partial);
+EXPORT_SYMBOL(empty_zero_page);
_

-

From: Dave Young
Date: Tuesday, November 20, 2007 - 11:22 pm

Yes, passed :)
-

From: Kirill A. Shutemov
Date: Wednesday, November 21, 2007 - 11:35 am

The init_level4_pgt symbol is needed by the nvidia module. Is it really necessary to
unexport it?

-- 
Regards,  Kirill A. Shutemov
 + Belarus, Minsk
 + Velesys LLC, http://www.velesys.com/
 + ALT Linux Team, http://www.altlinux.com/
From: Andrew Morton
Date: Wednesday, November 21, 2007 - 3:25 pm

On Wed, 21 Nov 2007 20:35:13 +0200

It's our clever way of reducing the tester base so we don't get so many
bug reports.
-

From: Rik van Riel
Date: Monday, November 26, 2007 - 11:48 am

On Wed, 21 Nov 2007 14:03:34 +0800

FYI, x86_64 has the exact same issue.

----
KVM needs the empty_zero_page export reinstated.

Signed-off-by: Rik van Riel <riel@redhat.com>

diff -up linux-2.6.24-rc3-mm1/arch/x86/kernel/x8664_ksyms_64.c.export-empty-zero-page linux-2.6.24-rc3-mm1/arch/x86/kernel/x8664_ksyms_64.c
--- linux-2.6.24-rc3-mm1/arch/x86/kernel/x8664_ksyms_64.c.export-empty-zero-page	2007-11-26 13:47:53.000000000 -0500
+++ linux-2.6.24-rc3-mm1/arch/x86/kernel/x8664_ksyms_64.c	2007-11-26 13:41:32.000000000 -0500
@@ -33,6 +33,7 @@ EXPORT_SYMBOL(__copy_from_user_inatomic)
 
 EXPORT_SYMBOL(copy_page);
 EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(empty_zero_page);
 
 /* Export string functions. We normally rely on gcc builtin for most of these,
    but gcc sometimes decides not to inline them. */    
-

From: Jiri Slaby
Date: Monday, November 26, 2007 - 12:33 pm

yes:
hot-fixes/git-x86-dont-unexport-empty_zero_page.patch

regards,
-- 
Jiri Slaby (jirislaby@gmail.com)
Faculty of Informatics, Masaryk University
-

From: KAMEZAWA Hiroyuki
Date: Tuesday, November 20, 2007 - 10:58 pm

I hit the following build failure:

  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
  CALL    scripts/checksyscalls.sh
<stdin>:1389:2: warning: #warning syscall revokeat not implemented
<stdin>:1393:2: warning: #warning syscall frevoke not implemented
  CHK     include/linux/compile.h
make[1]: *** No rule to make target `arch/ia64/lib/copy_page-export.o', needed by `arch/ia64/lib/built-in.o'.  Stop.
make: *** [arch/ia64/lib] Error 2

A fix (for my config?) is attached.

=
This was necessary to build.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 arch/ia64/lib/Makefile |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.24-rc3-mm1/arch/ia64/lib/Makefile
===================================================================
--- linux-2.6.24-rc3-mm1.orig/arch/ia64/lib/Makefile
+++ linux-2.6.24-rc3-mm1/arch/ia64/lib/Makefile
@@ -2,7 +2,7 @@
 # Makefile for ia64-specific library routines..
 #
 
-obj-y := io.o copy_page-export.o
+obj-y := io.o
 
 lib-y := __divsi3.o __udivsi3.o __modsi3.o __umodsi3.o			\
 	__divdi3.o __udivdi3.o __moddi3.o __umoddi3.o			\

-

From: Andrew Morton
Date: Tuesday, November 20, 2007 - 11:08 pm

erp.  Actually, it should be this:

--- a/arch/ia64/lib/Makefile~ia64-export-copy_page-to-modules-fix-fix
+++ a/arch/ia64/lib/Makefile
@@ -2,7 +2,7 @@
 # Makefile for ia64-specific library routines..
 #
 
-obj-y := io.o copy_page-export.o
+obj-y := io.o
 
 lib-y := __divsi3.o __udivsi3.o __modsi3.o __umodsi3.o			\
 	__divdi3.o __udivdi3.o __moddi3.o __umoddi3.o			\
_

-

From: Kamalesh Babulal
Date: Tuesday, November 20, 2007 - 10:56 pm

Hi Andrew,

The kernel build fails on S390x, with

arch/s390/kernel/ipl.c: In function `ipl_register_fcp_files':
arch/s390/kernel/ipl.c:415: error: `ipl_subsys' undeclared (first use in this function)
arch/s390/kernel/ipl.c:415: error: (Each undeclared identifier is reported only once
arch/s390/kernel/ipl.c:415: error: for each function it appears in.)
arch/s390/kernel/ipl.c: In function `ipl_init':
arch/s390/kernel/ipl.c:449: error: implicit declaration of function `firmware_register'
arch/s390/kernel/ipl.c:449: error: `ipl_subsys' undeclared (first use in this function)
arch/s390/kernel/ipl.c: In function `on_panic_show':
arch/s390/kernel/ipl.c:764: error: implicit declaration of function `shutdown_action_str'
arch/s390/kernel/ipl.c:764: error: `on_panic_action' undeclared (first use in this function)
arch/s390/kernel/ipl.c:764: warning: format argument is not a pointer (arg 3)
arch/s390/kernel/ipl.c:764: warning: format argument is not a pointer (arg 3)
arch/s390/kernel/ipl.c: In function `on_panic_store':
arch/s390/kernel/ipl.c:771: error: `SHUTDOWN_REIPL_STR' undeclared (first use in this function)
arch/s390/kernel/ipl.c:772: error: `on_panic_action' undeclared (first use in this function)
arch/s390/kernel/ipl.c:772: error: `SHUTDOWN_REIPL' undeclared (first use in this function)
arch/s390/kernel/ipl.c:773: error: `SHUTDOWN_DUMP_STR' undeclared (first use in this function)
arch/s390/kernel/ipl.c:775: error: `SHUTDOWN_DUMP' undeclared (first use in this function)
arch/s390/kernel/ipl.c:776: error: `SHUTDOWN_STOP_STR' undeclared (first use in this function)
arch/s390/kernel/ipl.c:778: error: `SHUTDOWN_STOP' undeclared (first use in this function)
arch/s390/kernel/ipl.c: At top level:
arch/s390/kernel/ipl.c:879: error: redefinition of 'ipl_register_fcp_files'
arch/s390/kernel/ipl.c:412: error: previous definition of 'ipl_register_fcp_files' was here
arch/s390/kernel/ipl.c:904: error: redefinition of 'ipl_init'
arch/s390/kernel/ipl.c:446: error: previous definition of 'ipl_init' ...
From: Andrew Morton
Date: Tuesday, November 20, 2007 - 11:04 pm

Yes, sorry, I forgot to mention that.  I got a large patch reject
between Greg's driver tree and the s390 tree and I couldn't be bothered
fixing it.  s390 is busted in 2.6.24-rc3-mm1.
-

From: Kamalesh Babulal
Date: Tuesday, November 20, 2007 - 11:11 pm

Hi Andrew,

The kernel panics across different architectures, such as powerpc and x86_64:

Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes)
Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes)
Mount-cache hash table entries: 256
SMP alternatives: switching to UP code
ACPI: Core revision 20070126
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter

-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
-

From: Kamalesh Babulal
Date: Wednesday, November 21, 2007 - 2:22 am

Hi Andrew,

Passing noapic works, but the kernel then oopses:

[   97.161103] Unable to handle kernel NULL pointer dereference at 0000000000000009 RIP:
[   97.193973]  [<ffffffff802341df>] cpu_to_allnodes_group+0x69/0x7c
[   97.245359] PGD 0
[   97.257611] Oops: 0000 [1] SMP
[   97.276638] last sysfs file:
[   97.294417] CPU 0
[   97.306620] Modules linked in:
[   97.325066] Pid: 1, comm: swapper Not tainted 2.6.24-rc3-mm1 #1
[   97.360514] RIP: 0010:[<ffffffff802341df>]  [<ffffffff802341df>] cpu_to_allnodes_group+0x69/0x7c
[   97.413287] RSP: 0000:ffff81012fabb650  EFLAGS: 00010286
[   97.445363] RAX: ffffffff809bb060 RBX: ffff81012fabb650 RCX: 00000000000000ff
[   97.488378] RDX: 0000000000000001 RSI: 000000000000013e RDI: 0000000000000100
[   97.531413] RBP: ffff81012fabb680 R08: ffff81012fa88180 R09: 0000000000000000
[   97.574428] R10: 0000000000000000 R11: 0000000000000000 R12: ffff810001005f50
[   97.617394] R13: 0000000000000000 R14: ffff81012fa88180 R15: ffff810001005f40
[   97.660421] FS:  0000000000000000(0000) GS:ffffffff806c3000(0000) knlGS:0000000000000000
[   97.709327] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[   97.743995] CR2: 0000000000000009 CR3: 0000000000201000 CR4: 00000000000006a0
[   97.787021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   97.830053] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   97.873036] Process swapper (pid: 1, threadinfo FFFF81012FABA000, task FFFF81012FAB8040)
[   97.921993] Stack:  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   97.971056]  ffff810001005f40 ffff81012fabb700 ffff81012fabbdf0 ffffffff80235487
[   98.016420]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   98.060324] Call Trace:
[   98.076657]  [<ffffffff80235487>] build_sched_domains+0x1e1/0xc19
[   98.113383]  [<ffffffff8025072a>] __kernel_text_address+0x22/0x30
[   98.150173]  [<ffffffff8025b127>] check_chain_key+0x9c/0x15f
[   98.184355]  [<ffffffff8025d544>] ...
From: Andrew Morton
Date: Wednesday, November 21, 2007 - 2:29 am

urgh, mess.  Enabling frame pointers might help here.

But we're cc'ing the right guy ;)

-

From: Kamalesh Babulal
Date: Wednesday, November 21, 2007 - 2:43 am

The kernel was compiled with frame pointers enabled.
-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
-

From: Torsten Kaiser
Date: Wednesday, November 21, 2007 - 12:33 pm

CONFIG_FRAME_POINTER=y

The oops/panic that happens with noapic:
[   35.866758] Initializing CPU#3
[   35.868769] Stuck ??
[   35.874043] Inquiring remote APIC #3...
[   35.877896] ... APIC #3 ID: 03000000
[   35.881523] ... APIC #3 VERSION: 80050010
[   35.885587] ... APIC #3 SPIV: 000001ff
[   35.889390] Brought up 1 CPUs
[   35.892375] Unable to handle kernel NULL pointer dereference at 0000000000000009 RIP:
[   35.897868]  [<ffffffff8022fc5b>] cpu_to_allnodes_group+0x4b/0x60
[   35.906464] PGD 0
[   35.908523] Oops: 0000 [1] SMP
[   35.911757] last sysfs file:
[   35.914740] CPU 0
[   35.916798] Modules linked in:
[   35.919990] Pid: 1, comm: swapper Not tainted 2.6.24-rc3-mm1 #2
[   35.925914] RIP: 0010:[<ffffffff8022fc5b>]  [<ffffffff8022fc5b>] cpu_to_allnodes_group+0x4b/0x60
[   35.934734] RSP: 0000:ffff81011ff2bdb0  EFLAGS: 00010282
[   35.940053] RAX: ffffffff8084d870 RBX: ffff810001005810 RCX: 0000000000000004
[   35.947188] RDX: 0000000000000001 RSI: ffff81011ff26f68 RDI: ffff81011ff2bdb0
[   35.954323] RBP: ffff81011ff2bdd0 R08: 2222222222222222 R09: 0000000000000000
[   35.961457] R10: ffff81007ff1c200 R11: 0000000000000200 R12: ffff810001005800
[   35.968592] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   35.975727] FS:  0000000000000000(0000) GS:ffffffff807d4000(0000) knlGS:0000000000000000
[   35.983951] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[   35.989701] CR2: 0000000000000009 CR3: 0000000000201000 CR4: 00000000000006a0
[   35.996836] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   36.003971] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   36.011105] Process swapper (pid: 1, threadinfo FFFF81011FF2A000, task FFFF81007FF2A000)
[   36.019191] Stack:  0000000000000000 ffffffff807e8f98 0000000000000000 ffff810001005800
[   36.027373]  ffff81011ff2be80 ffffffff80230580 ffffffff8084d640 ffffffff8084d6e0
[   36.034922]  ffffffff8084d780 ffffffff8084d800 ...
From: Kirill A. Shutemov
Date: Thursday, November 22, 2007 - 3:04 am

This bug is also reproducible with qemu.

-- 
Regards,  Kirill A. Shutemov
 + Belarus, Minsk
 + Velesys LLC, http://www.velesys.com/
 + ALT Linux Team, http://www.altlinux.com/
From: Len Brown
Date: Wednesday, November 21, 2007 - 12:22 pm

If you suspect ACPI breakage, then try "acpi=off" or "acpi=noirq".

thanks,
-Len
-

From: Torsten Kaiser
Date: Wednesday, November 21, 2007 - 12:48 pm

Not since my last BIOS upgrade.

This is from the dmesg logs I still had lying around:
2.6.22-rc6-mm1: No
2.6.23-rc1-mm1, 2.6.23-rc2-mm1, 2.6.23-rc3-mm1: Yes
2.6.23-rc3-mm1 after BIOS upgrade: No

ACPI doesn't look guilty.
acpi=noirq:
[   39.905884] Freeing SMP alternatives: 28k freed
[   39.910674] ACPI: Core revision 20070126
[   39.916542] ACPI: setting ELCR to 0e20 (from 0c20)
[   39.921855] ExtINT not setup in hardware but reported by MP table
[   39.928244] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[   39.934586] Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter
[   39.934587]

acpi=off:
[    0.000000] Freeing SMP alternatives: 28k freed
[    0.000000] ExtINT not setup in hardware but reported by MP table
[    0.000000] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[    0.000000] Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter
[    0.000000]

Torsten
-

From: Alexey Dobriyan
Date: Friday, November 23, 2007 - 5:49 pm

No! The box freezes somewhere after "Freeing unused kernel memory"...

Bisection points to git-x86.patch, though.

git-bisect start
# good: [f05092637dc0d9a3f2249c9b283b973e6e96b7d2] Linux 2.6.24-rc3
git-bisect good f05092637dc0d9a3f2249c9b283b973e6e96b7d2
# bad: [46c8c396d2c87b786a7fac615c289f85a18e53ce] w1-build-fix
git-bisect bad 46c8c396d2c87b786a7fac615c289f85a18e53ce
# bad: [4e22f4852c48e1eddfe04299e78c0456164abe86] frv-move-dma-macros-to-scatterlisth-for-consistency
git-bisect bad 4e22f4852c48e1eddfe04299e78c0456164abe86
# bad: [4e22f4852c48e1eddfe04299e78c0456164abe86] frv-move-dma-macros-to-scatterlisth-for-consistency
git-bisect bad 4e22f4852c48e1eddfe04299e78c0456164abe86
# good: [d5135f31313af2be37d8ccb71e2a42f8e221d8c4] ide-mm-ide-disk-extend-timeout-for-pio-out-commands
git-bisect good d5135f31313af2be37d8ccb71e2a42f8e221d8c4
# good: [6be815e83f506f4c39a46cf59014e29a95c5e6c4] iommu-sg-merging-call-blk_queue_segment_boundary-in-__scsi_alloc_queue
git-bisect good 6be815e83f506f4c39a46cf59014e29a95c5e6c4
# good: [6be815e83f506f4c39a46cf59014e29a95c5e6c4] iommu-sg-merging-call-blk_queue_segment_boundary-in-__scsi_alloc_queue
git-bisect good 6be815e83f506f4c39a46cf59014e29a95c5e6c4
# bad: [c792db6d06114a85e33a27c89e9e979f11b951c4] slub-fix-coding-style-violations
git-bisect bad c792db6d06114a85e33a27c89e9e979f11b951c4
# bad: [c792db6d06114a85e33a27c89e9e979f11b951c4] slub-fix-coding-style-violations
git-bisect bad c792db6d06114a85e33a27c89e9e979f11b951c4
# bad: [76f3939b76ff557f73720b57a16716196f04e407] x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2
git-bisect bad 76f3939b76ff557f73720b57a16716196f04e407
# good: [b8ba611566d8799a979b190d4bb14305ca64ee0e] sis-fb-driver-_ioctl32_conversion-functions-do-not-exist-in-recent-kernels
git-bisect good b8ba611566d8799a979b190d4bb14305ca64ee0e
# good: [e34995928859308d2abef1709332e2b12d36db2f] git-ipwireless_cs
git-bisect good e34995928859308d2abef1709332e2b12d36db2f
# bad: [f520abbbe11bc8253714bcd34aaaf19bdf82189e] ...
From: Rik van Riel
Date: Monday, November 26, 2007 - 12:39 pm

On Tue, 20 Nov 2007 22:18:39 -0800

I got the same bug as above, 'noapic' gets past that point and right to the
next oops.  I'm posting it here because this one is different from the others
in the thread, yet looks vaguely related:

Unable to handle kernel NULL pointer dereference at 0000000000000021 RIP:
 [<ffffffff8108382a>] refresh_zone_stat_thresholds+0x6d/0x90
PGD 0
Oops: 0002 [1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.24-rc3-mm1 #2
RIP: 0010:[<ffffffff8108382a>]  [<ffffffff8108382a>] refresh_zone_stat_thresholds+0x6d/0x90
RSP: 0000:ffff81007fb59ec0  EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffffffff8146fb38 RDI: 0000000000000001
RBP: ffff81000000c000 R08: 0000000000000000 R09: 0000000000000000
R10: ffff81007fb59e60 R11: 0000000000000028 R12: ffffffff814d4558
R13: 0000000000000000 R14: ffffffff814b62c0 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff813d9000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000021 CR3: 0000000000201000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo FFFF81007FB58000, task FFFF81007FB56000)
Stack:  0000000000000000 0000000000000000 0000000000000000 ffffffff814a3839
 0000000000000000 ffffffff8148e626 ffff81007fb56000 ffffffff8126d36a
 0000000000000000 ffffffffffffffff ffffffff8105786b 0000000000000000
Call Trace:
 [<ffffffff814a3839>] setup_vmstat+0x6/0x40
 [<ffffffff8148e626>] kernel_init+0x169/0x2d8
 [<ffffffff8126d36a>] trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8105786b>] trace_hardirqs_on+0x115/0x138
 [<ffffffff8100ce48>] child_rip+0xa/0x12
 [<ffffffff8100c55f>] restore_args+0x0/0x30
 [<ffffffff8148e4bd>] kernel_init+0x0/0x2d8
 [<ffffffff8100ce3e>] child_rip+0x0/0x12

INFO: lockdep is turned off.

-- 
All Rights ...
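
A note on reading the "Oops:" field in these reports: on x86 it is the
page-fault error code, and the 0002 above differs from the 0000 in the
cpu_to_allnodes_group oopses only in the write bit.  A minimal userspace
sketch of the decoding follows.  The helper names (pf_info,
decode_pf_error, describe_pf_error) are made up for illustration and are
not kernel APIs; only the meaning of the three low bits is taken from the
architecture.

```c
#include <stdbool.h>
#include <stdio.h>

/* Low bits of the x86 page-fault error code printed after "Oops:". */
#define PF_PROT  0x1UL /* clear: page not present; set: protection violation */
#define PF_WRITE 0x2UL /* clear: read access; set: write access */
#define PF_USER  0x4UL /* clear: CPU was in kernel mode; set: user mode */

/* Hypothetical helper type for this sketch, not a kernel structure. */
struct pf_info {
	bool protection;
	bool write;
	bool user;
};

struct pf_info decode_pf_error(unsigned long code)
{
	struct pf_info info;

	info.protection = (code & PF_PROT) != 0;
	info.write = (code & PF_WRITE) != 0;
	info.user = (code & PF_USER) != 0;
	return info;
}

/* Write a one-line description into buf; returns snprintf()'s result. */
int describe_pf_error(unsigned long code, char *buf, size_t len)
{
	struct pf_info info = decode_pf_error(code);

	return snprintf(buf, len, "%s-mode %s, %s",
			info.user ? "user" : "kernel",
			info.write ? "write" : "read",
			info.protection ? "protection violation"
					: "page not present");
}
```

Read this way, 0000 is a kernel-mode read of a not-present page and 0002
is a kernel-mode write to one, which is consistent with both being NULL
pointer dereferences at small offsets.
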
From: Andrew Morton
Date: Monday, November 26, 2007 - 1:33 pm

On Mon, 26 Nov 2007 14:39:43 -0500


hm.  This smells like a startup ordering problem, but everything which
refresh_zone_stat_thresholds() uses should be set up by the time we run
initcalls.  Maybe the zone lists are bad?

-

From: Ingo Molnar
Date: Monday, November 26, 2007 - 1:45 pm

yes. Is it a regression? If yes, could someone try to bisect it so that 
we can fix it? If it's caused by x86.git then the 'mm' branch of the x86 
git tree can be used for bisection:

   git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

it's supposed to build and boot fine at every bisection point. The 
bisection run can be cut significantly by narrowing the bisection to the 
arch/x86 changes only:

  git-bisect start arch/x86 include/asm-x86/

(and if it finds a nonsensical commit, i.e. the breakage is not caused 
by the x86 commits, save the "git-bisect log" output into a file, 
restart the git bisection and use "git-bisect replay" to insert all the 
test points into a fuller bisection run - this saves quite some time.)

	Ingo
-

From: Jiri Slaby
Date: Monday, November 26, 2007 - 3:08 pm

I did, but it's hard if you don't know the bad point. HEAD boots fine, and so does
'x86: randomize brk' (the top of git-x86.patch). Andrew, how do you pull it? The git
#mm branch doesn't match the IDs from the patch.

Maybe you could publish a broken-out tree with the fresh pull for testing?

regards,
-- 
Jiri Slaby (jirislaby@gmail.com)
Faculty of Informatics, Masaryk University
-

From: Andrew Morton
Date: Monday, November 26, 2007 - 3:17 pm

On Mon, 26 Nov 2007 23:08:33 +0100

So the bug wasn't in git-x86 in 2.6.24-rc3-mm1.

But it might be in there now, as some patches got moved over.


The -mm git tree reimports the plain git-foo.patch files back into a new
git tree, so the commit IDs won't line up.

The way to find the culprit patch in 2.6.24-rc3-mm1 is
http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt.  It

http://userweb.kernel.org/~akpm/mmotm/ is current.  But it probably won't
compile.  I'd suggest bisecting 2.6.24-rc3-mm1 would be easier.  
-

From: Jiri Slaby
Date: Monday, November 26, 2007 - 3:22 pm

Yes, I've bisected this and it pointed to git-x86.patch plus 2 pushed fixes from the
series. Then I tried x86 git, but its HEAD was OK.
-

From: Jiri Slaby
Date: Monday, November 26, 2007 - 4:14 pm

Yes it did :). And it worked. Both in qemu and on my desktop...

qemu output at:
http://www.fi.muni.cz/~xslaby/sklad/qemu-output.txt

thanks,
-- 
Jiri Slaby (jirislaby@gmail.com)
Faculty of Informatics, Masaryk University
-

From: Andrew Morton
Date: Monday, November 26, 2007 - 4:28 pm

On Tue, 27 Nov 2007 00:14:17 +0100


Thanks for testing.

-

From: Rik van Riel
Date: Tuesday, November 27, 2007 - 10:50 am

On Mon, 26 Nov 2007 15:28:32 -0800

No worries, the mmotm compiling issue seems to have been fixed:

  CC [M]  drivers/scsi/libsas/sas_ata.o
drivers/scsi/libsas/sas_ata.c:39: error: field ‘rphy’ has incomplete type
drivers/scsi/libsas/sas_ata.c: In function ‘sas_discover_sata’:
drivers/scsi/libsas/sas_ata.c:773: error: implicit declaration of function ‘ata_sas_rphy_alloc’
drivers/scsi/libsas/sas_ata.c:775: error: dereferencing pointer to incomplete type
drivers/scsi/libsas/sas_ata.c:775: warning: assignment makes pointer from integer without a cast
drivers/scsi/libsas/sas_ata.c:781: error: dereferencing pointer to incomplete type
drivers/scsi/libsas/sas_ata.c:782: error: dereferencing pointer to incomplete type
drivers/scsi/libsas/sas_ata.c:784: warning: type defaults to ‘int’ in declaration of ‘__mptr’
drivers/scsi/libsas/sas_ata.c:784: warning: initialization from incompatible pointer type
drivers/scsi/libsas/sas_ata.c:791: error: implicit declaration of function ‘ata_sas_rphy_add’
drivers/scsi/libsas/sas_ata.c:807: error: implicit declaration of function ‘ata_sas_rphy_delete’
drivers/scsi/libsas/sas_ata.c:809: error: implicit declaration of function ‘ata_sas_rphy_free’
make[3]: *** [drivers/scsi/libsas/sas_ata.o] Error 1
make[2]: *** [drivers/scsi/libsas] Error 2
make[1]: *** [drivers/scsi] Error 2
make: *** [drivers] Error 2

So much for continuing the bisect with that tree, to find the
cause of the second bug :)

Guess I'll extract an x86 tree changeset first, to place into
the 2.6.23-rc3-mm1 broken out tree and work from there...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
-

From: Christoph Lameter
Date: Monday, November 26, 2007 - 1:56 pm

refresh_zone_stat_thresholds goes through each zone and updates
the stat threshold for every per cpu structure in each zone.

So this could be a processor marked online where the pcp structures have 
not been allocated or a zone NULL pointer.
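
The two failure modes described above can be modelled in userspace.  This
is only a sketch under stated assumptions: pcp_model, zone_model and
refresh_thresholds_model are hypothetical stand-ins, and the real kernel
structures and loop differ.  The model walks every zone and every online
CPU and counts the slots where an unchecked store would have gone through
a NULL pointer.

```c
#include <stddef.h>

#define MAX_CPUS 4 /* hypothetical CPU count for the model */

/* Hypothetical stand-in for a per-CPU structure; not the kernel layout. */
struct pcp_model {
	long vm_stat_diff[4];
	signed char stat_threshold;
};

/* Hypothetical stand-in for a zone: one per-CPU slot each, may be NULL. */
struct zone_model {
	struct pcp_model *pageset[MAX_CPUS];
};

/*
 * Store the threshold into every per-CPU structure of every zone, for
 * online CPUs only.  Returns the number of (zone, cpu) slots skipped
 * because the zone pointer or the per-CPU pointer was NULL.
 */
int refresh_thresholds_model(struct zone_model **zones, int nr_zones,
			     const int *cpu_online, int threshold)
{
	int skipped = 0;

	for (int z = 0; z < nr_zones; z++) {
		for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
			if (!cpu_online[cpu])
				continue;
			if (!zones[z] || !zones[z]->pageset[cpu]) {
				skipped++; /* an unchecked store would fault here */
				continue;
			}
			zones[z]->pageset[cpu]->stat_threshold =
				(signed char)threshold;
		}
	}
	return skipped;
}

/* Offset an unchecked store through a NULL per-CPU pointer would hit. */
size_t null_store_fault_offset(void)
{
	return offsetof(struct pcp_model, stat_threshold);
}
```

The second helper illustrates why such an oops reports a small non-zero
address rather than 0: the faulting address is the member's offset from
the NULL base.  The offset in this model comes from the made-up layout
and is not meant to reproduce the address in the report.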

-

From: Kamalesh Babulal
Date: Wednesday, November 21, 2007 - 1:24 am

Hi Andrew,

make headers_check fails:

  CHECK   include/linux/usb/gadgetfs.h
  CHECK   include/linux/usb/ch9.h
  CHECK   include/linux/usb/cdc.h
  CHECK   include/linux/usb/audio.h
  CHECK   include/linux/kvm.h
/root/kernels/linux-2.6.24-rc3/usr/include/linux/kvm.h requires asm/kvm.h, which does not exist in exported headers
make[2]: *** [/root/kernels/linux-2.6.24-rc3/usr/include/linux/.check.kvm.h] Error 1
make[1]: *** [linux] Error 2
make: *** [headers_check] Error 2


-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
-

From: Andrew Morton
Date: Tuesday, November 20, 2007 - 5:32 pm

hm, works for me, on i386 and x86_64.  What's different over there?
-

From: Kamalesh Babulal
Date: Wednesday, November 21, 2007 - 1:41 am

Hi Andrew,

It fails on the powerpc box, with allyesconfig option.

-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
-

From: Avi Kivity
Date: Wednesday, November 21, 2007 - 1:44 am

How do we fix this?  Export linux/kvm.h only on x86?  Seems ugly.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.

-

From: Robert P. J. Day
Date: Wednesday, November 21, 2007 - 1:52 am

i'm sure i'm going to humiliate myself for asking this, but shouldn't
i be able to reproduce the above by just running:

  $ make ARCH=powerpc headers_install/headers_check

we've sort of had this discussion before where, IIRC, you should be
able to generate the appropriate arch-specific headers without having
the corresponding toolchain, no?  so why can't i reproduce that error
on my x86 box?

rday
--

========================================================================
Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://crashcourse.ca
========================================================================
-

From: Andrew Morton
Date: Wednesday, November 21, 2007 - 2:04 am

I can.

setenv ARCH powerpc
make mrproper
make allmodconfig
make headers_check
-

From: Robert P. J. Day
Date: Wednesday, November 21, 2007 - 2:06 am

ack.  never mind, i just noticed that this is with the rc3-mm1 tree.
i was confused since, in the latest git tree, there is absolutely *no*
inclusion of <asm/kvm.h> anywhere in the tree, so clearly something
like that has been added in the mm tree.

sorry for the noise.

rday
--

========================================================================
Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://crashcourse.ca
========================================================================
-

From: Sam Ravnborg
Date: Wednesday, November 21, 2007 - 2:58 am

Is kvm x86 specific? Then move the .h file to asm-x86.
Otherwise no good idea...

	Sam
-

From: Avi Kivity
Date: Wednesday, November 21, 2007 - 3:00 am

kvm.h is x86 specific today, but will be s390, ppc, ia64, and x86 
specific tomorrow.

What about having an asm-generic/kvm.h with a nice #error?  Would that
suit?


-- 
error compiling committee.c: too many arguments to function

-

From: Avi Kivity
Date: Wednesday, November 21, 2007 - 3:17 am

headers_check continues to complain.  Is the only recourse to add 
asm/kvm.h for all archs?

-- 
error compiling committee.c: too many arguments to function

-

From: Robert P. J. Day
Date: Wednesday, November 21, 2007 - 3:31 am

that's what's happened with other header files.  see asm-*/auxvec.h,
for example.

rday
--
========================================================================
Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://crashcourse.ca
========================================================================
-

From: Andrew Morton
Date: Tuesday, November 27, 2007 - 10:02 pm

That would work.

Meanwhile my recourse is to drop the kvm tree ;)
-

From: Avi Kivity
Date: Sunday, December 2, 2007 - 1:56 am

Since you put it this way...

I committed the attached (sorry) patch to kvm.git.  Rather than
touching 2*($NARCH - 1) files, I changed include/linux/Kbuild to only
export kvm.h if the arch actually supports it.  Currently that's just x86.


-- 
error compiling committee.c: too many arguments to function

From: KAMEZAWA Hiroyuki
Date: Wednesday, November 21, 2007 - 1:42 am

Hi, Andrew

I got the following result from the 'sync' command.
It was too slow. (The memory controller config is off ;)
I have attached my .config.
==
[2.6.24-rc3-mm1]
[kamezawa@dr-test2 ~]$ dd if=/dev/zero of=./tmpfile bs=4096 count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 1.46706 seconds, 279 MB/s
[kamezawa@dr-test2 ~]$ time sync

real    3m6.440s
user    0m0.000s
sys     0m0.133s


On 2.6.23-rc2-mm1 and 2.6.23-rc3 there was no problem.
==
[2.6.24-rc3]
[kamezawa@dr-test2 ~]$ dd if=/dev/zero of=tmpfile bs=4096 count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 2.07717 seconds, 197 MB/s
[kamezawa@dr-test2 ~]$ time sync

real    0m9.935s
user    0m0.001s
sys     0m0.113s

[2.6.24-rc2-mm1]
[kamezawa@dr-test2 ~]$ dd if=/dev/zero of=./tmpfile bs=4096 count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 1.37147 seconds, 299 MB/s
[kamezawa@dr-test2 ~]$ time sync


real    0m11.718s
user    0m0.000s
sys     0m0.138s

==
-Kame
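
To put the figures above side by side, here is a small sketch of the
arithmetic.  The function names are made up for illustration; the only
assumption is that dd's "MB" means 1,000,000 bytes, which matches the
"409600000 bytes (410 MB)" line.

```c
/* dd reports decimal megabytes: 1 MB = 1,000,000 bytes. */
double throughput_mb_per_s(double bytes, double seconds)
{
	return bytes / seconds / 1e6;
}

/* Convert a time(1) "real" value such as 3m6.440s into seconds. */
double real_to_seconds(int minutes, double seconds)
{
	return minutes * 60.0 + seconds;
}

/* How many times slower "slow_s" is than "fast_s". */
double slowdown(double slow_s, double fast_s)
{
	return slow_s / fast_s;
}
```

With the numbers from the three runs, the dd phase writes at a similar
rate everywhere (about 200 to 300 MB/s), while sync takes 186.44 s on
2.6.24-rc3-mm1 against 9.935 s and 11.718 s in the comparison runs, that
is, roughly 16 to 19 times longer.
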
From: KAMEZAWA Hiroyuki
Date: Wednesday, November 21, 2007 - 1:49 am

On Wed, 21 Nov 2007 17:42:15 +0900
Ah, one of the CPUs shows 100% iowait in 'top' while this is happening.

-Kame

-

From: KAMEZAWA Hiroyuki
Date: Wednesday, November 21, 2007 - 8:06 pm

On Wed, 21 Nov 2007 00:49:09 -0800

I confirmed that this slowdown is caused by git-scsi-misc.patch.
I'm sorry that I can't chase this further; I will be offline this weekend.

This is the scsi_mod information from /proc/modules:
=
scsi_mod 409416 8 mptctl,sg,lpfc,scsi_transport_fc,mptspi,mptscsih,scsi_transport_spi,sd_mod, Live 0xa000000202818000
=

What other information should I provide?

Thanks,
-Kame


-

From: kosaki
Date: Saturday, November 24, 2007 - 5:04 am

I tested on x86, ext3 on SATA (/dev/sda).
It seems to work well.

Hmm...


-- 
kosaki


-

From: KAMEZAWA Hiroyuki
Date: Monday, November 26, 2007 - 12:06 am

On Sat, 24 Nov 2007 19:04:34 +0100

Thank you!
The problem was fixed by reverting the patch you pointed out.

-Kame

-

From: Kamalesh Babulal
Date: Wednesday, November 21, 2007 - 1:06 am

Hi Andrew,

The kernel build fails on powerpc while linking,

  AS      .tmp_kallsyms3.o
  LD      vmlinux.o
ld: TOC section size exceeds 64k
make: *** [vmlinux.o] Error 1

The patch posted at http://lkml.org/lkml/2007/11/13/414, solves this 
failure.

-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
-

From: Stephen Rothwell
Date: Wednesday, November 21, 2007 - 3:52 pm

On Wed, 21 Nov 2007 13:36:30 +0530 Kamalesh Babulal <kamalesh@linux.vnet.ib=

Only for allyesconfig (or maybe some other config that builds a lot of

However, that patch needs more testing especially to figure out what
performance effects it has.  i.e. not for merging, yet.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
From: Kirill A. Shutemov
Date: Wednesday, November 21, 2007 - 11:23 am

USB mouse (Logitech M-BT58) doesn't work. The touchpad works.
dmesg after rmmod usbcore && modprobe uhci_hcd:

usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
USB Universal Host Controller Interface driver v3.0
ACPI: PCI Interrupt 0000:00:1d.0[A] -> Link [LNKE] -> GSI 10 (level, low) -> IRQ 10
PCI: Setting latency timer of device 0000:00:1d.0 to 64
uhci_hcd 0000:00:1d.0: UHCI Host Controller
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 1
uhci_hcd 0000:00:1d.0: irq 10, io base 0x0000bf80
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
usb usb1: new device found, idVendor=0000, idProduct=0000
usb usb1: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb1: Product: UHCI Host Controller
usb usb1: Manufacturer: Linux 2.6.24-kas-alt1 uhci_hcd
usb usb1: SerialNumber: 0000:00:1d.0
ACPI: PCI Interrupt 0000:00:1d.1[B] -> Link [LNKF] -> GSI 11 (level, low) -> IRQ 11
PCI: Setting latency timer of device 0000:00:1d.1 to 64
uhci_hcd 0000:00:1d.1: UHCI Host Controller
uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 2
uhci_hcd 0000:00:1d.1: irq 11, io base 0x0000bf60
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
usb usb2: new device found, idVendor=0000, idProduct=0000
usb usb2: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb2: Product: UHCI Host Controller
usb usb2: Manufacturer: Linux 2.6.24-kas-alt1 uhci_hcd
usb usb2: SerialNumber: 0000:00:1d.1
ACPI: PCI Interrupt 0000:00:1d.2[C] -> Link [LNKG] -> GSI 9 (level, low) -> IRQ 9
PCI: Setting latency timer of device 0000:00:1d.2 to 64
uhci_hcd 0000:00:1d.2: UHCI Host Controller
uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 3
uhci_hcd 0000:00:1d.2: irq 9, io base 0x0000bf40
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub ...
From: Kirill A. Shutemov
Date: Thursday, November 22, 2007 - 3:17 am

No. But I have new messages in dmesg:

uhci_hcd 0000:00:1d.3: FGR not stopped yet!
uhci_hcd 0000:00:1d.2: FGR not stopped yet!
uhci_hcd 0000:00:1d.1: FGR not stopped yet!

It is a new message since 2.6.24-rc3. I have never tried the -mm tree before.

-- 
Regards,  Kirill A. Shutemov
 + Belarus, Minsk
 + Velesys LLC, http://www.velesys.com/
 + ALT Linux Team, http://www.altlinux.com/
From: Marin Mitov
Date: Thursday, November 22, 2007 - 10:41 am

udelay() _is_ OK for 2.6.24-rc3, so it is not the cause of the problem



-

From: Alan Stern
Date: Thursday, November 22, 2007 - 7:51 pm

But is it OK for 2.6.24-rc3-mm1?  Kirill said specifically that 
2.6.24-rc3 does not display the message but 2.6.24-rc3-mm1 does.

Alan Stern

-

From: Alan Stern
Date: Friday, November 23, 2007 - 9:21 am

Add some printk statements to the wakeup_rh() routine: Display the 
value of inw(uhci->io_addr + USBCMD) in hex, both before and after the 
udelay(4) line.  Try doing this for both 2.6.24-rc3 and 2.6.24-rc3-mm1 
so you can compare the values for the two different kernels.

Also, enable CONFIG_PRINTK_TIME.

Alan Stern

-

From: Alan Stern
Date: Monday, December 31, 2007 - 2:06 pm

Any progress?  How about more recent kernels?

Alan Stern

--

From: Laurent Riffard
Date: Wednesday, November 21, 2007 - 2:45 pm

On 21.11.2007 05:45, Andrew Morton wrote:
From: Andrew Morton
Date: Wednesday, November 21, 2007 - 3:41 pm

On Wed, 21 Nov 2007 22:45:22 +0100

Could be - libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
and pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
touch pata_via.c.
-

From: Laurent Riffard
Date: Friday, November 23, 2007 - 12:29 am

None of the above...

I did a bisection; it pointed to git-scsi-misc.patch.
I just ran 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.

I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not 
requeue requests if REQ_FAILFAST is set" is the real culprit. The other 
commits are touching documentation or drivers I don't use. I'll try 
to revert only this one this evening.

-- 
laurent


-

From: Hannes Reinecke
Date: Friday, November 23, 2007 - 4:38 am

>> On 21.11.2007 23:41, Andrew Morton wrote:
From: Laurent Riffard
Date: Friday, November 23, 2007 - 10:52 am

I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 

Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O errors.

-- 
laurent

-

From: James Bottomley
Date: Friday, November 23, 2007 - 11:42 pm

I think the problem is the way we treat BLOCKED and QUIESCED (the latter
is the state that the domain validation uses and which we cannot kill
fastfail on).  It's definitely wrong to kill fastfail requests when the
state is QUIESCE.

This patch (which is applied on top of Hannes' original) separates the
BLOCK and QUIESCE states correctly ... does this fix the problem?

James

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 13e7e09..a7cf23a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1279,18 +1279,21 @@ int scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 				    "rejecting I/O to dead device\n");
 			ret = BLKPREP_KILL;
 			break;
-		case SDEV_QUIESCE:
 		case SDEV_BLOCK:
 			/*
-			 * If the devices is blocked we defer normal commands.
-			 */
-			if (!(req->cmd_flags & REQ_PREEMPT))
-				ret = BLKPREP_DEFER;
-			/*
 			 * Return failfast requests immediately
 			 */
 			if (req->cmd_flags & REQ_FAILFAST)
 				ret = BLKPREP_KILL;
+
+			/* fall through */
+
+		case SDEV_QUIESCE:
+			/*
+			 * If the devices is blocked we defer normal commands.
+			 */
+			if (!(req->cmd_flags & REQ_PREEMPT))
+				ret = BLKPREP_DEFER;
 			break;
 		default:
 			/*
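
The control flow of the two arms in the hunk above can be modelled in user space. This is only a sketch: the enum values and flag bits are hypothetical stand-ins rather than the kernel's definitions, and the initial value of ret is assumed to be BLKPREP_OK.

```c
/* Hypothetical stand-ins for the kernel constants; values are illustrative only. */
enum sdev_state { SDEV_RUNNING, SDEV_QUIESCE, SDEV_BLOCK };
enum blkprep { BLKPREP_OK, BLKPREP_KILL, BLKPREP_DEFER };

#define REQ_FAILFAST (1u << 0)	/* hypothetical bit position */
#define REQ_PREEMPT  (1u << 1)	/* hypothetical bit position */

/* Mirrors the SDEV_BLOCK and SDEV_QUIESCE arms of the switch as posted. */
static enum blkprep prep_state_check(enum sdev_state state, unsigned int cmd_flags)
{
	enum blkprep ret = BLKPREP_OK;	/* assumed initial value */

	switch (state) {
	case SDEV_BLOCK:
		/* Failfast requests are marked to be killed... */
		if (cmd_flags & REQ_FAILFAST)
			ret = BLKPREP_KILL;
		/* fall through */
	case SDEV_QUIESCE:
		/* ...but any non-preempt request is then deferred. */
		if (!(cmd_flags & REQ_PREEMPT))
			ret = BLKPREP_DEFER;
		break;
	default:
		break;
	}
	return ret;
}
```

In this model a failfast request to a quiesced device is deferred rather than killed, which is the stated goal. With the arms in this order, a non-preempt failfast request to a blocked device also ends up deferred, because the fall-through check overwrites BLKPREP_KILL; whether that matches the intent is worth checking against the real tree.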


-

From: Laurent Riffard
Date: Saturday, November 24, 2007 - 5:57 am

From: Gabriel C
Date: Saturday, November 24, 2007 - 10:54 am

Are the patches meant to fix that problem as well?


Gabriel 

-

From: James Bottomley
Date: Saturday, November 24, 2007 - 11:04 am

That dmesg is from an unknown SCSI card exhibiting Domain Validation
problems, so it's a reasonable probability, yes ... but you'll need the
additional hack I just did to prevent further intermittent failures.

James


-

From: Gabriel C
Date: Saturday, November 24, 2007 - 11:08 am

My controller is:

03:0e.0 SCSI storage controller [0100]: Adaptec AIC-7892P U160/m [9005:008f] (rev 02)


Gabriel
-

From: Gabriel C
Date: Saturday, November 24, 2007 - 11:28 am

With your patches my problem(s) are solved. Domain Validation works again.

...

[   32.179521] scsi 0:0:0:0: Direct-Access     SEAGATE  ST318406LW       0109 PQ: 0 ANSI: 3
[   32.179540] scsi0:A:0:0: Tagged Queuing enabled.  Depth 32
[   32.179554]  target0:0:0: Beginning Domain Validation
[   32.188553]  target0:0:0: wide asynchronous
[   32.195302]  target0:0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 63)
[   32.206510]  target0:0:0: Ending Domain Validation
[   32.211699] scsi 0:0:1:0: Direct-Access     FUJITSU  MAH3182MP        0114 PQ: 0 ANSI: 4
[   32.211707] scsi0:A:1:0: Tagged Queuing enabled.  Depth 32
[   32.211717]  target0:0:1: Beginning Domain Validation
[   32.213980]  target0:0:1: wide asynchronous
[   32.215682]  target0:0:1: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127)
[   32.220205]  target0:0:1: Ending Domain Validation

...

 Thx James :)


Gabriel

-

From: Laurent Riffard
Date: Saturday, November 24, 2007 - 3:59 pm

On 24.11.2007 14:26, James Bottomley wrote:
From: James Bottomley
Date: Sunday, November 25, 2007 - 12:37 am

I think this one's quite easy:  PATA devices in libata are queue depth 1
(since they don't do NCQ).  Thus, they're peculiarly sensitive to the
bug where we fail over queue depth requests.

On the other hand, I don't see how a filesystem request is getting
REQ_FAILFAST ... unless there's a bio or readahead issue involved.
Anyway, could you try this patch:

http://marc.info/?l=linux-scsi&m=119592627425498

Which should fix the queue depth issue, and see if the errors go away?

Thanks,

James


-

From: Laurent Riffard
Date: Sunday, November 25, 2007 - 1:39 pm

No, this one doesn't help...

-- 
laurent
-

From: Laurent Riffard
Date: Wednesday, November 28, 2007 - 2:38 pm

still happens with 2.6.24-rc3-mm2...
-- 
laurent
-

From: James Bottomley
Date: Saturday, November 24, 2007 - 10:44 am

Probing intermittent failures in Domain Validation, even with the fixes
applied leads me to the conclusion that there are further problems with
this commit:

commit fc5eb4facedbd6d7117905e775cee1975f894e79
Author: Hannes Reinecke <hare@suse.de>
Date:   Tue Nov 6 09:23:40 2007 +0100

    [SCSI] Do not requeue requests if REQ_FAILFAST is set
 
The essence of the problem is that you're causing REQ_FAILFAST to
terminate commands with an error on requeuing conditions, some of which are
relatively common on most SCSI devices.  While this may be the correct
behaviour for multi-path, it's certainly wrong for the previously
understood meaning of REQ_FAILFAST, which was "don't retry on error".
That is why domain validation and other applications, which use the flag
to control error handling but don't expect to get failures for a simple
requeue, are now spitting errors.

I honestly can't see that, even for the multi-path case, returning an
error when we're over queue depth is the correct thing to do (it may not
matter to something like a symmetrix, but an array that has a non-zero
cost associated with a path change, like a CPQ HSV or the AVT
controllers, will show fairly large slow downs if you do this).  Even if
this is the desired behaviour (and I think that's a policy issue),
DID_NO_CONNECT is almost certainly the wrong error to be sending back.

This patch fixes up domain validation to work again correctly, however,
I really think it's just a bandaid.  Do you want to rethink the above
commit?

James

Index: BUILD-2.6/drivers/scsi/scsi_lib.c
===================================================================
--- BUILD-2.6.orig/drivers/scsi/scsi_lib.c	2007-11-24 11:25:20.000000000 -0600
+++ BUILD-2.6/drivers/scsi/scsi_lib.c	2007-11-24 11:26:22.000000000 -0600
@@ -1552,7 +1552,8 @@ static void scsi_request_fn(struct reque
 			break;
 
 		if (!scsi_dev_queue_ready(q, sdev)) {
-			if (req->cmd_flags & REQ_FAILFAST) {
+			if ((req->cmd_flags & REQ_FAILFAST) &&
+			    !(req->cmd_flags & ...
From: Hannes Reinecke
Date: Monday, November 26, 2007 - 12:54 am

Given the amount of errors, yes, I'll have to.
But we still face the initial problem that requeued requests will be
stuck in the queue forever (ie until the timeout catches it), causing
failover to be painfully slow.

Anyway, I'll think it over.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
-

From: Kirill A. Shutemov
Date: Thursday, November 22, 2007 - 3:22 am

On x86_64, 'uname -m' returns 'x86'.  This breaks many userspace programs,
apt and rpm for example.

-- 
Regards,  Kirill A. Shutemov
 + Belarus, Minsk
 + Velesys LLC, http://www.velesys.com/
 + ALT Linux Team, http://www.altlinux.com/
From: Andrew Morton
Date: Thursday, November 22, 2007 - 5:18 pm

Yes, there have been various discussions about this.  I think Sam is cooking up
a fix?
-

From: Thomas Gleixner
Date: Thursday, November 22, 2007 - 5:48 pm

http://lkml.org/lkml/2007/11/19/323

I'll push it Linus-wards ASAP.

Thanks,

	tglx
-

From: Kirill A. Shutemov
Date: Thursday, November 22, 2007 - 11:05 pm

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 116b03a..7aa1dc6 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -11,10 +11,9 @@ endif
 $(srctree)/arch/x86/Makefile%: ;
 
 ifeq ($(CONFIG_X86_32),y)
+        UTS_MACHINE := i386
         include $(srctree)/arch/x86/Makefile_32
 else
+        UTS_MACHINE := x86_64
         include $(srctree)/arch/x86/Makefile_64
 endif

Many programs expect i686 on Pentium II.

-- 
Regards,  Kirill A. Shutemov
 + Belarus, Minsk
 + Velesys LLC, http://www.velesys.com/
 + ALT Linux Team, http://www.altlinux.com/
From: Andreas Herrmann
Date: Friday, November 23, 2007 - 1:59 am

Yes, but this is done during boot.
Then the kernel overwrites "i386" to become "i686" for such CPUs.
That is why I've seen "x66" after boot when UTS_MACHINE at build-time
was "x86" with 'make ARCH=x86'.
For more details see:

http://marc.info/?l=linux-kernel&m=119521309415545&w=2
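
The "x66" result is what one would expect if the boot-time fix-up simply overwrites the second character of the machine string with the CPU family digit. A minimal user-space sketch of that idea follows; patch_machine() is a hypothetical helper written for illustration, not the kernel routine, and the cap at family 6 is an assumption.

```c
#include <string.h>

/*
 * Hypothetical model of the boot-time fix-up: replace the second
 * character of the machine string with the CPU family digit, capped
 * at 6 (assumption). Strings shorter than two characters are left alone.
 */
static void patch_machine(char *machine, int family)
{
	if (strlen(machine) < 2)
		return;
	machine[1] = (char)('0' + (family > 6 ? 6 : family));
}
```

Applied to "i386" with family 6 this yields "i686", which is what userspace expects; applied to a build-time value of "x86" it yields "x66".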


Regards,

Andreas

-

From: Gabriel C
Date: Thursday, November 22, 2007 - 6:39 pm

I have some warnings on each SCSI disc:


...

[   30.724410] scsi 0:0:0:0: Direct-Access     SEAGATE  ST318406LW       0109 PQ: 0 ANSI: 3
[   30.724419] scsi0:A:0:0: Tagged Queuing enabled.  Depth 32
[   30.724435]  target0:0:0: Beginning Domain Validation
[   30.724446]  target0:0:0: Domain Validation Initial Inquiry Failed <--
[   30.724572]  target0:0:0: Ending Domain Validation
[   30.729747] scsi 0:0:1:0: Direct-Access     FUJITSU  MAH3182MP        0114 PQ: 0 ANSI: 4
[   30.729754] scsi0:A:1:0: Tagged Queuing enabled.  Depth 32
[   30.729771]  target0:0:1: Beginning Domain Validation
[   30.729780]  target0:0:1: Domain Validation Initial Inquiry Failed <--
[   30.729908]  target0:0:1: Ending Domain Validation

...

No idea whether this is related, but buffered disk reads are 2.XX MB/sec and the box is somewhat laggy.

hdparm -t on sda and sdb reports :

/dev/sda:
 Timing buffered disk reads:    8 MB in  3.26 seconds =   2.46 MB/sec

/dev/sdb:
 Timing buffered disk reads:    8 MB in  3.56 seconds =   2.25 MB/sec

My IDE discs are fine.

Please let me know if you need my config or any other information.


Gabriel
-

From: Andrew Morton
Date: Thursday, November 22, 2007 - 9:12 pm

Don't know what would have caused that.  But yes, something is wrong in

And you're the second to report very slow scsi throughput in 2.6.24-rc3-mm1.
-

From: Gabriel C
Date: Thursday, November 22, 2007 - 10:55 pm

I found the commit which causes these problems; it is in the git-scsi-misc patch, and reverting it fixes both problems for me.

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff_plain;h=...


-

From: Andrew Morton
Date: Monday, November 26, 2007 - 11:15 pm

OK, thanks.  I'll assume that James and Hannes have this in hand (or will
have, by mid-week) and I won't do anything here.

-

From: James Bottomley
Date: Tuesday, December 11, 2007 - 9:33 am

Just to confirm what I think I'm going to be doing:  rebasing the
scsi-misc tree to remove this commit:

commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
Author: Hannes Reinecke <hare@suse.de>
Date:   Tue Nov 6 09:23:40 2007 +0100

    [SCSI] Do not requeue requests if REQ_FAILFAST is set

And its allied fix ups:

commit 983289045faa96fba8841d3c51b98bb8623d9504
Author: James Bottomley <James.Bottomley@HansenPartnership.com>
Date:   Sat Nov 24 19:47:25 2007 +0200

    [SCSI] fix up REQ_FASTFAIL not to fail when state is QUIESCE

commit 9dd15a13b332e9f5c8ee752b1ccd9b84cb5bdf17
Author: James Bottomley <James.Bottomley@HansenPartnership.com>
Date:   Sat Nov 24 19:55:53 2007 +0200

    [SCSI] fix domain validation to work again

James


--

From: Boaz Harrosh
Date: Wednesday, December 12, 2007 - 4:03 am

- BIO flags bio->bi_rw and REQ flags req->cmd_flags no longer match.
   Remove comments and do a proper translation between the 2 systems.

 (Please look in ll_rw_blk.c/blk_rq_bio_prep() below if we need more flags)

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 block/ll_rw_blk.c            |   23 +++++++++++++++++------
 include/linux/blktrace_api.h |    8 +++++++-
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 8b91994..c6a84bb 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1990,10 +1990,6 @@ blk_alloc_request(struct request_queue *q, int rw, int priv, gfp_t gfp_mask)
 	if (!rq)
 		return NULL;
 
-	/*
-	 * first three bits are identical in rq->cmd_flags and bio->bi_rw,
-	 * see bio.h and blkdev.h
-	 */
 	rq->cmd_flags = rw | REQ_ALLOCED;
 
 	if (priv) {
@@ -3772,8 +3768,23 @@ EXPORT_SYMBOL(end_request);
 static void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			    struct bio *bio)
 {
-	/* first two bits are identical in rq->cmd_flags and bio->bi_rw */
-	rq->cmd_flags |= (bio->bi_rw & 3);
+	if (bio_data_dir(bio))
+		rq->cmd_flags |= REQ_RW;
+	else
+		rq->cmd_flags &= ~REQ_RW;
+
+	if (bio->bi_rw & (1<<BIO_RW_SYNC))
+		rq->cmd_flags |= REQ_RW_SYNC;
+	else
+		rq->cmd_flags &= ~REQ_RW_SYNC;
+	/* FIXME: what about other flags, should we sync these too? */
+	/*
+	BIO_RW_AHEAD	==> ??
+	BIO_RW_BARRIER	==> REQ_SOFTBARRIER/REQ_HARDBARRIER
+	BIO_RW_FAILFAST	==> REQ_FAILFAST
+	BIO_RW_SYNC	==> REQ_RW_SYNC
+	BIO_RW_META	==> REQ_RW_META
+	*/
 
 	rq->nr_phys_segments = bio_phys_segments(q, bio);
 	rq->nr_hw_segments = bio_hw_segments(q, bio);
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index 7e11d23..9e7ce65 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -165,7 +165,13 @@ static inline void blk_add_trace_rq(struct request_queue *q, struct request *rq,
 				    u32 what)
 {
 	struct blk_trace *bt = ...
From: Matthew Wilcox
Date: Wednesday, December 12, 2007 - 8:18 am

I'd rather see them resynchronised ... in a way that makes it obvious
that they shouldn't be desynced again:

I don't know whether BIO_RW_BARRIER is __REQ_SOFTBARRIER or
__REQ_HARDBARRIER, so I didn't include that in this patch.  There also
doesn't seem to be a __REQ equivalent to BIO_RW_AHEAD, but we can do
the other four bits (and leave gaps for those two).

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d18ee67..6aef34b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -167,11 +167,13 @@ enum {
 };
 
 /*
- * request type modified bits. first three bits match BIO_RW* bits, important
+ * request type modified bits.  Don't change without looking at bi_rw flags
  */
 enum rq_flag_bits {
-	__REQ_RW,		/* not set, read. set, write */
-	__REQ_FAILFAST,		/* no low level driver retries */
+	__REQ_RW = BIO_RW,	/* not set, read. set, write */
+	__REQ_FAILFAST = BIO_RW_FAILFAST, /* no low level driver retries */
+	__REQ_RW_SYNC = BIO_RW_SYNC,	/* request is sync (O_DIRECT) */
+	__REQ_RW_META = BIO_RW_META,	/* metadata io request */
 	__REQ_SORTED,		/* elevator knows about this request */
 	__REQ_SOFTBARRIER,	/* may not be passed by ioscheduler */
 	__REQ_HARDBARRIER,	/* may not be passed by drive either */
@@ -185,9 +187,7 @@ enum rq_flag_bits {
 	__REQ_QUIET,		/* don't worry about errors */
 	__REQ_PREEMPT,		/* set for "ide_preempt" requests */
 	__REQ_ORDERED_COLOR,	/* is before or after barrier */
-	__REQ_RW_SYNC,		/* request is sync (O_DIRECT) */
 	__REQ_ALLOCED,		/* request came from our alloc pool */
-	__REQ_RW_META,		/* metadata io request */
 	__REQ_NR_BITS,		/* stops here */
 };
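
The idea of pinning request bits to bio bits can be checked at compile time. The following is a user-space sketch only: the BIO_RW* bit positions are assumptions (check bio.h in your tree), SHARED_MASK and rq_flags_from_bio() are hypothetical names, and the enum is cut down to the entries relevant here.

```c
/* Assumed bio->bi_rw bit positions; illustrative, not copied from a real tree. */
enum bio_rw_bits {
	BIO_RW = 0,
	BIO_RW_AHEAD = 1,
	BIO_RW_BARRIER = 2,
	BIO_RW_FAILFAST = 3,
	BIO_RW_SYNC = 4,
	BIO_RW_META = 5,
};

/* Request bits pinned to the bio bits, in the style of the patch above. */
enum rq_flag_bits {
	__REQ_RW = BIO_RW,
	__REQ_FAILFAST = BIO_RW_FAILFAST,
	__REQ_RW_SYNC = BIO_RW_SYNC,
	__REQ_RW_META = BIO_RW_META,
	__REQ_SORTED,		/* continues at BIO_RW_META + 1 */
};

/* Any future renumbering that breaks the match fails the build. */
_Static_assert(__REQ_RW == BIO_RW, "RW bits diverged");
_Static_assert(__REQ_FAILFAST == BIO_RW_FAILFAST, "FAILFAST bits diverged");
_Static_assert(__REQ_RW_SYNC == BIO_RW_SYNC, "SYNC bits diverged");
_Static_assert(__REQ_RW_META == BIO_RW_META, "META bits diverged");

/* Hypothetical mask of the four bits that mean the same thing on both sides. */
#define SHARED_MASK ((1ul << BIO_RW) | (1ul << BIO_RW_FAILFAST) | \
		     (1ul << BIO_RW_SYNC) | (1ul << BIO_RW_META))

/* Copy the shared bits from bi_rw, leaving all other request flags untouched. */
static unsigned long rq_flags_from_bio(unsigned long cmd_flags, unsigned long bi_rw)
{
	return (cmd_flags & ~SHARED_MASK) | (bi_rw & SHARED_MASK);
}
```

With matching bit positions the translation reduces to a mask-and-copy, and the two unmatched bits (AHEAD and BARRIER under the assumed numbering) are simply left out of the mask.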
 

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: David Chinner
Date: Wednesday, December 12, 2007 - 10:36 pm

That would say to me that READA is not hooked up correctly. i.e:

#define READ 0
#define WRITE 1
#define READA 2         /* read-ahead  - don't block if no resources */
#define SWRITE 3        /* for ll_rw_block() - wait for buffer lock */
#define READ_SYNC       (READ | (1 << BIO_RW_SYNC))
#define READ_META       (READ | (1 << BIO_RW_META))
#define WRITE_SYNC      (WRITE | (1 << BIO_RW_SYNC))
#define WRITE_BARRIER   ((1 << BIO_RW) | (1 << BIO_RW_BARRIER))

i.e. it should be:

#define READA		(1 << BIO_RW_AHEAD)

Right?

FWIW, dm does this:

		if (bio_rw(bio) != READA)

Which really should be if (bio_rw_ahead(bio)).....
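
The difference between comparing against READA and testing the readahead bit can be shown with a small user-space sketch. The definitions below are assumptions, not taken from this thread: bio_rw() is assumed to mask the two low bits, BIO_RW_AHEAD is assumed to be bit 1, and struct bio is reduced to the one field needed.

```c
/* Assumed definitions for illustration only. */
#define BIO_RW_AHEAD 1		/* assumed bit position */
#define RW_MASK      1ul
#define RWA_MASK     2ul
#define READA        2ul

struct bio {
	unsigned long bi_rw;	/* only the field needed for this sketch */
};

/* Value-style accessor: returns the two low bits together. */
static unsigned long bio_rw(const struct bio *bio)
{
	return bio->bi_rw & (RW_MASK | RWA_MASK);
}

/* Bit-style accessor: tests only the readahead bit. */
static int bio_rw_ahead(const struct bio *bio)
{
	return (bio->bi_rw & (1ul << BIO_RW_AHEAD)) != 0;
}
```

Under these assumptions the two checks agree for a plain readahead read (bi_rw == 2), but they disagree when the write bit is also set (bi_rw == 3): the equality test says "not READA" while the bit test still sees readahead.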

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

From: Boaz Harrosh
Date: Wednesday, December 12, 2007 - 9:06 am

That's not enough. You still need to fix the code in ll_rw_blk(); I would
define a rq_flags_bio_match_mask = 0xf for that
(and also add what Jens called "needed", where
BIO_RW_AHEAD selects REQ_FAILFAST).

And I still don't understand why, for example, "Domain Validation" fails
with the original patch. What sets BIO_RW_FAILFAST and then panics
on errors?
(All I see is this flag set in dm/multipath.c & dm-mpath.c)

Boaz

--

From: Matthew Wilcox
Date: Wednesday, December 12, 2007 - 9:33 am

Yes, I appreciate it's not enough; that's why I didn't sign-off on it.

Nobody currently sets BIO_RW_AHEAD, so I don't understand why we need to

No idea ...

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Jens Axboe
Date: Wednesday, December 12, 2007 - 4:36 am

But that's actually on purpose, though the comment is pretty much crap.
We don't want to be retrying readahead requests, those should always
just be tossable.

-- 
Jens Axboe

--

From: Hannes Reinecke
Date: Friday, December 14, 2007 - 2:00 am

Or just apply my latest patch (cf Undo __scsi_kill_request).
The main point is that we shouldn't retry requests
with FAILFAST set when the queue is blocked. AFAICS
only FC and iSCSI transports set the queue to blocked,
and use this to indicate a loss of connection. So any
retry with queue blocked is futile.

Cheers,

Hannes

-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--

From: James Bottomley
Date: Friday, December 14, 2007 - 7:26 am

I still don't think this is the right approach.

For link up/down events, those are direct pathing events and should be
signalled along a kernel notifier, not by mucking with the SCSI state
machine.  However, there's still devloss_tmo to consider ... even in
multipath, I don't think you want to signal path failure until
devloss_tmo has fired otherwise you'll get too many transient up/down
events which damage performance if the array has an expensive failover
model.

The other problem is what to do with in-flight commands at the time the
link went down.  With your current patch, they're still stuck until they
time out ... surely there needs to be some type of recovery mechanism
for these?

James


--

From: Hannes Reinecke
Date: Monday, January 7, 2008 - 7:05 am

Of course they will be signalled. And eventually we should patch up
multipath-tools to read the existing events from the uevent socket.
But even with that patch there is a quite largish window during
which IOs will be sent to the blocked device, and hence will be
Yes. But currently we have a very high failover latency as we always have
to wait for the requeued commands to time-out.
Well, the in-flight commands are owned by the HBA driver, which should
have the proper code to terminate / return those commands with the
appropriate codes. They will then be rescheduled and will be caught
like 'normal' IO requests.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--

From: James Bottomley
Date: Monday, January 7, 2008 - 10:57 am

But the assumption your code makes is that if REQ_FAILFAST is set then
it's a dm request ... and that's not true.  The code in question
negatively impacts other users of REQ_FAILFAST.  For every user other

If it's a either/or choice between the two that's showing our current

But my point is that if a driver goes blocked, those commands will be
forced to wait the blocked timeout anyway, so your proposed patch does
nothing to improve the case for dm anyway ... you only avoid commands
stuck when a device goes blocked if by chance its request queue was
empty.

James


--

From: Mike Christie
Date: Monday, January 7, 2008 - 11:24 am

How about my patches to use new transport error values and make 
iSCSI and FC behave the same?

The problem I think Hannes and I are both trying to solve is this:

1. We do not want to wait for dev_loss_tmo seconds for failover.

2. The FC drivers can hook into fast_io_fail_tmo related callouts and, 
when using multipath, set that tmo to a very low value like a couple of 
seconds, so failovers are fast. However, there is a bug: when 
fast_io_fail_tmo fires, requests that made it to the driver get failed 
and returned to the multipath layer, but commands in the blocked request 
queue are stuck there until dev_loss_tmo fires.

With my patches here (need to be rediffed and for FC I need to handle 
JamesS's comments about not using a new field for the fast_fail_timeout 
state bit):

http://marc.info/?l=linux-scsi&m=117399843216280&w=2
http://marc.info/?l=linux-scsi&m=117399544112073&w=2
http://marc.info/?l=linux-scsi&m=117399844316771&w=2
http://marc.info/?l=linux-scsi&m=117400203324693&w=2
http://marc.info/?l=linux-scsi&m=117400203324690&w=2

For FC we can use the fast_io_fail_tmo for fast failovers, and commands 
will not get stuck in a blocked queue for dev_loss_tmo seconds because 
when the fast_io_fail_tmo fires the target's queues are unblocked and 
fc_remote_port_chkready() kicks in (iSCSI does the same with the 
patches in the links). And with the patches, if multipath-tools is 
sending its path testing IO it will get a DID_TRANSPORT_* error code 
that it can use to make a decent path failing decision.
--

From: Randy Dunlap
Date: Monday, November 26, 2007 - 12:13 pm

allnoconfig on x86_64 gives:

arch/x86/mm/init_64.c:84: error: implicit declaration of function 'pfn_valid'
mm/page_alloc.c:2533: error: implicit declaration of function 'pfn_valid'
mm/vmstat.c:518: error: implicit declaration of function 'pfn_valid'
mm/memory.c:400: error: implicit declaration of function 'pfn_valid'
drivers/char/mem.c:312: error: implicit declaration of function 'pfn_valid'


---
~Randy
-

From: Christoph Lameter
Date: Monday, November 26, 2007 - 12:34 pm

Hmmm... CONFIG_SPARSEMEM is not set if you do allnoconfig

config SPARSEMEM
        def_bool y
        depends on SPARSEMEM_MANUAL

So I guess we need to set SPARSEMEM_MANUAL

But arch/x86/Kconfig has

config SPARSEMEM_MANUAL
        bool "Sparse Memory"
        depends on ARCH_SPARSEMEM_ENABLE
        help
          This will be the only option for some systems, including
          memory hotplug systems.  This is normal.

It needs to be not deselectable for x86_64. 

Inserting

	def_bool y if X86_64

did not help....

Somehow make menuconfig did not give me an ability to even enable this 
again.

-

From: Randy Dunlap
Date: Monday, November 26, 2007 - 1:40 pm

Thanks for the hint.

ARCH_SELECT_MEMORY_MODEL depends on X86_32.  Is that too restrictive?

config ARCH_SELECT_MEMORY_MODEL
	def_bool y
	depends on X86_32 && ARCH_SPARSEMEM_ENABLE

---
~Randy
-

From: Christoph Lameter
Date: Monday, November 26, 2007 - 1:56 pm

No. X86_64 only has one memory model.
-

From: Randy Dunlap
Date: Monday, November 26, 2007 - 1:47 pm

This patch allows allnoconfig to build cleanly.

---

From: Randy Dunlap <randy.dunlap@oracle.com>

Make allnoconfig on x86_64 build by allowing ARCH_SELECT_MEMORY_MODEL
to be enabled on X86 32/64, not just X86_32.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
---
 arch/x86/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.24-rc3-mm1.orig/arch/x86/Kconfig
+++ linux-2.6.24-rc3-mm1/arch/x86/Kconfig
@@ -937,7 +937,7 @@ config ARCH_SPARSEMEM_ENABLE
 
 config ARCH_SELECT_MEMORY_MODEL
 	def_bool y
-	depends on X86_32 && ARCH_SPARSEMEM_ENABLE
+	depends on ARCH_SPARSEMEM_ENABLE
 
 config ARCH_MEMORY_PROBE
 	def_bool X86_64

-

From: Christoph Lameter
Date: Monday, November 26, 2007 - 2:00 pm

Well, this sort of works.

One can again select a memory model but there is only one to choose from.
It would be best if the memory model selection would not occur.

-

From: Randy Dunlap
Date: Monday, November 26, 2007 - 2:17 pm

My attempt at that gave a warning from kconfig:

mm/Kconfig:70:warning: defaults for choice values not supported


Other than that, it seemed to work.
Maybe someone else can have a go at it.

-- 
~Randy
-

From: Andrew Morton
Date: Monday, November 26, 2007 - 2:20 pm

On Mon, 26 Nov 2007 13:00:03 -0800 (PST)

Unfortunately I just dropped that patch because git-x86 has gone and
combined include/asm-x86/sparsemem_32.h and include/asm-x86/sparsemem_64.h
into the same file.

-

From: Christoph Lameter
Date: Monday, November 26, 2007 - 2:52 pm

git-x86 still contains separate sparsemem_32/64.h here.
git lag?

christoph@stapp:~/x86/linux-2.6-x86/.git$ cat config
[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
[remote "origin"]
        url = git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
        fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
        remote = origin
        merge = refs/heads/master

-

From: Andrew Morton
Date: Monday, November 26, 2007 - 2:57 pm

[Empty message]
From: Christoph Lameter
Date: Monday, November 26, 2007 - 4:19 pm

Updated patch (including Randy's fix) against git-x86 mm.



x86_64: Make sparsemem/vmemmap the only memory model V3

V2->V3:
- Rediff against mm git-x86

V1->V2:
- Rediff against new upstream x86 code that unifies the Kconfig files.

Use sparsemem as the only memory model for UP, SMP and NUMA.
Measurements indicate that DISCONTIGMEM has a higher overhead
than sparsemem, and FLATMEM's benefits are minimal. So I think it's
best to simply standardize on sparsemem.

Results of page allocator tests (test can be had via git from slab git
tree branch tests)

Measurements in cycle counts. 1000 allocations were performed and then the
average cycle count was calculated.

Order	FlatMem	Discontig	SparseMem
0	  639	  665		  641
1	  567	  647		  593
2	  679	  774		  692
3	  763	  967		  781
4	  961	 1501		  962
5	 1356	 2344		 1392
6	 2224	 3982		 2336
7	 4869	 7225		 5074
8	12500	14048		12732
9	27926	28223		28165
10	58578	58714		58682

(Note that FlatMem is an SMP config and the rest NUMA configurations)
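
To put the table in relative terms, the cycle counts can be turned into percentage overhead versus FlatMem. This is a small illustrative sketch: the arrays are copied from the table above, while overhead_percent() and worst_order() are hypothetical helper names.

```c
#include <stddef.h>

/* Average cycle counts per allocation order 0..10, copied from the table. */
static const unsigned flatmem[]   = {639, 567, 679, 763, 961, 1356, 2224, 4869, 12500, 27926, 58578};
static const unsigned discontig[] = {665, 647, 774, 967, 1501, 2344, 3982, 7225, 14048, 28223, 58714};
static const unsigned sparsemem[] = {641, 593, 692, 781, 962, 1392, 2336, 5074, 12732, 28165, 58682};

/* Relative overhead of `other` versus `base`, in percent. */
static double overhead_percent(unsigned base, unsigned other)
{
	return ((double)other - (double)base) / (double)base * 100.0;
}

/* Index (allocation order) at which `other` has the largest relative overhead. */
static size_t worst_order(const unsigned *base, const unsigned *other, size_t n)
{
	size_t worst = 0;

	for (size_t i = 1; i < n; i++)
		if (overhead_percent(base[i], other[i]) >
		    overhead_percent(base[worst], other[worst]))
			worst = i;
	return worst;
}
```

On these numbers Discontig peaks at roughly 79% over FlatMem at order 6, while SparseMem stays within about 5% at every order. The caveat in the note above still applies: the FlatMem column comes from an SMP config and the other two from NUMA configs.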

Memory use:

SMP Sparsemem
-------------

Kernel size:

   text    data     bss     dec     hex filename
3849268  397739 1264856 5511863  541ab7 vmlinux

             total       used       free     shared    buffers     cached
Mem:       8242252      41164    8201088          0        352      11512
-/+ buffers/cache:      29300    8212952
Swap:      9775512          0    9775512

SMP Flatmem
-----------

Kernel size:

   text    data     bss     dec     hex filename
3844612  397739 1264536 5506887  540747 vmlinux

So 4.5k growth in text size vs. FLATMEM.

             total       used       free     shared    buffers     cached
Mem:       8244052      40544    8203508          0        352      11484
-/+ buffers/cache:      28708    8215344

2k growth in overall memory use after boot.



NUMA discontig:

   text    data     bss     dec     hex filename
3888124  470659 1276504 5635287  55fcd7 vmlinux

             total       used       free    ...
From: Valdis.Kletnieks
Date: Tuesday, November 27, 2007 - 12:16 am

Finally got both time and motivation to at least start a bisect..

2.6.23-mm1 works on my D820 (x86_64 kernel, Core2 Duo T7200)

24-rc3-mm1 (plus 3 patches from hotfixes/) bricks *instantly* at boot - grub
prints its 3 or 4 lines saying what it loaded, the screen clears, and *blam*
dead. No serial console output, no pair of penguins on the monitor, no
netconsole, no earlyprintk=vga output, no alt-sysrq, only thing that does
*anything* is "hold the power button for 5 seconds".  Whatever it is, it
happens *very* early (before we get as far as the 'Linux version 2.6.mumble'
banner), and happens *hard*.

I've bisected it down this far:

git-ipwireless_cs.patch GOOD
git-x86.patch
git-x86-fixup.patch
git-x86-thread_order-borkage.patch
git-x86-thread_order-borkage-fix.patch
git-x86-identify_cpu-fix.patch
git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
git-x86-inlining-borkage.patch
x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD

Anybody got any good debugging ideas before I go through and do the final
3 or 4 bisects?  I suspect I'll need them once I find the offending patch
to tell *why* said patch dies on my box - I've seen enough traffic regarding
-rc3-mm1 dying *later* to know it's probably a subtle issue and not one
that will be obvious once I finger a specific patch.  For example, it's
probably not the IO-APIC panic that people are seeing, because their kernels
live long enough to panic. ;)

From: Andrew Morton
Date: Tuesday, November 27, 2007 - 12:27 am

You could try http://userweb.kernel.org/~akpm/mmotm/ - we might have already
fixed it.

Otherwise, please proceed to work out which diff I need to drop and hope like
hell that it isn't git-x86..
-

From: Valdis.Kletnieks
Date: Tuesday, November 27, 2007 - 12:54 am

I suspect that trying -rc3-mm1 but refreshing just the 10 patches above

That's a 41,240 line diff, the rest *total* to about 400 lines.  I don't have
warm-n-fuzzies about my odds here. ;)

I'm a git-idiot, but *do* know how to git-bisect through Linus tree - what
would I need to do to git-bisect through git-x86.patch? (I do *not* know how
to deal with more than 1 source git tree, so if the magic is just "get a
linus tree, merge git-x86, then bisect as usual", I'm stuck on "merge git-x86")..

From: Andrew Morton
Date: Tuesday, November 27, 2007 - 1:17 am

All the above are no longer in -mm.  They got merged, dropped,


umm, I'm minimally git-afflicted hence am the wrong person to ask. 
Something like:


- checkout Linus's tree

- echo 'git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git#mm' > .git/branches/git-x86

- git-fetch git-x86

- git-checkout git-x86

- start bisecting.
-

From: Ingo Molnar
Date: Tuesday, November 27, 2007 - 3:25 am

hm? x86.git is fully bisectable - so a more accurate statement would be 
"and hope that it's x86.git, so that it can be properly bisected" :-) 
For x86.git bisection, pull the 'mm' branch from:

   git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

	Ingo
-

From: Dave Young
Date: Tuesday, November 27, 2007 - 1:25 am

Hi,
does boot_delay help?

Regards
dave
-

From: Valdis.Kletnieks
Date: Tuesday, November 27, 2007 - 1:46 am

It might, if the kernel lived long enough to output a first printk for
us to delay after.  :)

Shooting this one would be *easy* if the problem was a boot-time oops that
would otherwise scroll off the screen without a boot_delay...