ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc6...
- This kernel doesn't work on i386!

It oopses late in boot due to an unrevertable change (e3c1b141) in git-x86
which I stared at for a while, then I ran out of time and gave up.

I would have just abandoned this release until it was fixed but I'll be
largely offline for ten days starting tomorrow.

The culprits have been notified and hopefully we'll have a patch for
hot-fixes/ tomorrow.

x86_64 and powerpc work OK though.

- git-block is dropped due to more conflicts than I'm prepared to repair
with git-scsi-misc

- git-perfmon is dropped due to conflicts with git-x86
- git-kgdb is dropped due to conflicts with git-x86
- git-newsetup is dropped due to conflicts with git-x86
- Andi's x86 quilt tree is dropped due to conflicts with git-x86
- Someone broke suspend-to-RAM on the t61p again. It just instantly resumes
itself.

Boilerplate:
- See the `hot-fixes' directory for any important updates to this patchset.
- To fetch an -mm tree using git, use (for example)

git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git tag v2.6.16-rc2-mm1
git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1

- -mm kernel commit activity can be reviewed by subscribing to the
mm-commits mailing list.

echo "subscribe mm-commits" | mail majordomo@vger.kernel.org

- If you hit a bug in -mm and it is not obvious which patch caused it, it is
most valuable if you can perform a bisection search to identify which patch
introduced the bug. Instructions for this process are at

http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt

But beware that this process takes some time (around ten rebuilds and
reboots), so consider reporting the bug first and if we cannot immediately
identify the faulty patch, then perform the bisection search.

- When reporting bugs, please try to Cc: the relevant maintainer an...
There are some section mismatch warnings:
MODPOST vmlinux.o
WARNING: vmlinux.o(.text+0x1685c): Section mismatch: reference to
.init.data: (between 'check_dev_quirk' and 'apm_error')
WARNING: vmlinux.o(.text+0x1687b): Section mismatch: reference to
.init.data: (between 'check_dev_quirk' and 'apm_error')
WARNING: vmlinux.o(.text+0x16885): Section mismatch: reference to
.init.data:early_qrk (between 'check_dev_quirk' and 'apm_error')
WARNING: vmlinux.o(.text+0x16890): Section mismatch: reference to
.init.data: (between 'check_dev_quirk' and 'apm_error')
WARNING: vmlinux.o(.text+0x168a3): Section mismatch: reference to
.init.data: (between 'check_dev_quirk' and 'apm_error')
WARNING: vmlinux.o(.text+0x168ab): Section mismatch: reference to
.init.data: (between 'check_dev_quirk' and 'apm_error')
WARNING: vmlinux.o(.text+0x168b3): Section mismatch: reference to
.init.data: (between 'check_dev_quirk' and 'apm_error')
WARNING: vmlinux.o(.text+0x168cd): Section mismatch: reference to
.init.data: (between 'check_dev_quirk' and 'apm_error')
WARNING: vmlinux.o(.text+0x168d3): Section mismatch: reference to
.init.data: (between 'check_dev_quirk' and 'apm_error')
WARNING: vmlinux.o(.text+0x168dc): Section mismatch: reference to
.init.data: (between 'check_dev_quirk' and 'apm_error')
WARNING: vmlinux.o(.text+0x168e5): Section mismatch: reference to
.init.data: (between 'check_dev_quirk' and 'apm_error')

config file:
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.24-rc6-mm1
# Tue Dec 25 14:47:52 2007
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA...
I have finally given up on using 2.6.24-rc3-mm2 with slub_debug=FZP to
get more information out of the random crashes I had seen with that
version. (It did not crash once with slub_debug, so no new information on
what the cause was.)

2.6.24-rc6-mm1 does not boot for me.
It starts my initrd, but when this wants to start the md devices it crashes:
[ 12.900887] Freeing unused kernel memory: 356k freed
[ 15.290320] Clocksource tsc unstable (delta = -558415384 ns)
[ 34.284845] md: Autodetecting RAID arrays.
[ 34.154076] md: Scanned 5 and added 5 devices.
[ 34.154076] md: autorun ...
[ 34.154080] md: considering sdc2 ...
[ 34.155472] md: adding sdc2 ...
[ 34.156728] md: adding sdb2 ...
[ 34.164080] md: sdb1 has different UUID to sdc2
[ 34.165836] md: adding sda2 ...
[ 34.174080] md: sda1 has different UUID to sdc2
[ 34.175852] md: created md1
[ 34.176938] md: bind<sda2>
[ 34.184147] md: bind<sdb2>
[ 34.185219] md: bind<sdc2>
[ 34.186284] md: running: <sdc2><sdb2><sda2>
[ 34.194604] md: do_md_run() returned -22
[ 34.196123] md: md1 stopped.
[ 34.197267] md: unbind<sdc2>
[ 34.204105] md: export_rdev(sdc2)
[ 34.205426] md: unbind<sdb2>
[ 34.206548] md: export_rdev(sdb2)
[ 34.214102] md: unbind<sda2>
[ 34.215223] md: export_rdev(sda2)
[ 34.216544] md: considering sdb1 ...
[ 34.224083] md: adding sdb1 ...
[ 34.225337] md: adding sda1 ...
[ 34.226696] Unable to handle kernel paging request at 0000000034333545 RIP:
[ 34.228481] [<ffffffff803b49a1>] kref_put+0x31/0x80
[ 34.231378] PGD 7e402067 PUD 7e924067 PMD 0
[ 34.233084] Oops: 0002 [1] SMP
[ 34.234076] last sysfs file: /sys/devices/virtual/block/md1/dev
[ 34.234076] CPU 3
[ 34.234076] Modules linked in:
[ 34.234076] Pid: 18, comm: events/3 Not tainted 2.6.24-rc6-mm1 #1
[ 34.234076] RIP: 0010:[<ffffffff803b49a1>] [<ffffffff803b49a1>]
kref_put+0x31/0x80
[ 34.234076] RSP: 0018:ffff81...
Murphy: Just after sending that mail the system crashed two times with
slub_debug=FZP, but did not show any new information.
No debug output from slub, only this stacktrace: (It's the same one I
already reported in the 2.6.24-rc3-mm2 thread)

[ 7620.673012] ------------[ cut here ]------------
[ 7620.676291] kernel BUG at lib/list_debug.c:33!
[ 7620.679440] invalid opcode: 0000 [1] SMP
[ 7620.682319] last sysfs file:
/sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[ 7620.687845] CPU 0
[ 7620.689300] Modules linked in: radeon drm nfsd exportfs w83792d
ipv6 tuner tea5767 tda8290 tuner_xc2028 tda9887 tuner_simple mt20xx
tea5761 tvaudio msp3400 bttv ir_common compat_ioctl32 videobuf_dma_sg
videobuf_core btcx_risc tveeprom videodev usbhid v4l2_common
v4l1_compat hid i2c_nforce2 sg pata_amd
[ 7620.708561] Pid: 5698, comm: nfsv4-svc Not tainted 2.6.24-rc3-mm2 #2
[ 7620.713080] RIP: 0010:[<ffffffff803bae54>] [<ffffffff803bae54>]
__list_add+0x54/0x60
[ 7620.718667] RSP: 0018:ffff81011bca1dc0 EFLAGS: 00010282
[ 7620.722439] RAX: 0000000000000088 RBX: ffff81011c862c48 RCX: 0000000000000002
[ 7620.727504] RDX: ffff81011bc82ef0 RSI: 0000000000000001 RDI: ffffffff807590c0
[ 7620.732581] RBP: ffff81011bca1dc0 R08: 0000000000000001 R09: 0000000000000000
[ 7620.737658] R10: ffff810080058d48 R11: 0000000000000001 R12: ffff81011ed8d1c8
[ 7620.742711] R13: ffff81011ed8d200 R14: ffff81011ed8d200 R15: ffff81011cc0e578
[ 7620.747806] FS: 00007ffe400116f0(0000) GS:ffffffff807d4000(0000)
knlGS:00000000f73558e0
[ 7620.753535] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 7620.757607] CR2: 00000000017071dc CR3: 00000001188b5000 CR4: 00000000000006e0
[ 7620.762677] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7620.767748] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 7620.772808] Process nfsv4-svc (pid: 5698, threadinfo
FFFF81011BCA0000, task FFFF81011BC82EF0)
[ 7620.778872] Stack: ffff81011bca1e00 ffffffff805be26e
ffff81011ed8d1d0 ff...
That looks like a sunrpc bug. git-nfsd has been mucking around in there a
--
Please note that this report is still against 2.6.24-rc3-mm2. The
only new thing about that was that slub_debug=FZP does not catch the

From code inspection I would blame the patch "[SKBUFF]: Free old skb
properly in skb_morph" from Herbert Xu. (CC added)

Mostly it only shuffles code around, the only real change seems to be this hunk:
@@ -441,7 +446,7 @@ static struct sk_buff *__skb_clone(struct sk_buff
*n, struct sk_buff *skb)
*/
struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
{
- skb_release_data(dst);
+ skb_release_all(dst);
return __skb_clone(dst, src);
}
EXPORT_SYMBOL_GPL(skb_morph);

Using skb_release_all instead of only skb_release_data (which is called
automatically from the new skb_release_all) will add a new call to
dst_release(skb->dst); (first line in skb_release_all)
Could that explain the above underflow warning?

(I do not have any clue about the inner workings of the network core,
I would hope this OOPS was caused by the same error, trying to release
the same list twice.

Torsten
--
I doubt it. skb_morph is only used on IP fragments so I don't see how
you could attribute an error from a Unix domain socket to this patch.

In any case, Unix socket packets should not have a dst at all so the
very fact that you're in that path means that you have some sort of
memory corruption.

Is this the very first OOPS/warning that you see? If not you should
ignore all but the very first one as that may have left your system
in an inconsistent state which may render all subsequent OOPSes and
warnings useless.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
... I did not know about the fact that there should not have been a dst.
It's just that this warning was the first nice clue about the memory
corruption related to networking that I have seen since 2.6.24-rc3-mm2.
The time of the patch (Mon, 26 Nov 2007 15:11:19) even fits into the
window between -rc3-mm1 and -rc3-mm2.

I doubt that the memory corruption is a hardware problem, because the
system in question is using ECC ram and I did not see any messages

I looked into the log in question and the only other warning was a
circular locking dependency that lockdep detected around 1.5 hour
before this warning.

As reported in my original mail, immediately after the warning the
system OOPSed and hung:
[93436.947241] general protection fault: 0000 [1] SMP
-> first OOPS
[93436.947243] last sysfs file:
/sys/devices/pci0000:00/0000:00:0f.0/0000:01:00.1/irq
[93436.947245] CPU 1
[93436.947246] Modules linked in: radeon drm nfsd exportfs w83792d
ipv6 tuner tea5767 tda8290 tuner_xc2
028 tda9887 tuner_simple mt20xx tea5761 tvaudio msp3400 bttv ir_common
compat_ioctl32 videobuf_dma_sg v
ideobuf_core btcx_risc tveeprom usbhid videodev v4l2_common hid
v4l1_compat pata_amd sg i2c_nforce2
[93436.947257] Pid: 8079, comm: konqueror Not tainted 2.6.24-rc6-mm1 #11
-> not tainted by a previous OOPS
[93436.947259] RIP: 0010:[<ffffffff80531438>] [<ffffffff80531438>]
skb_drop_list+0x18/0x30
[93436.947262] RSP: 0018:ffff810005f4fda8 EFLAGS: 00010286
[93436.947263] RAX: ab1ed5ca5b74e7de RBX: ab1ed5ca5b74e7de RCX: 000000000000d135
[93436.947265] RDX: ffff81011d089a80 RSI: 0000000000000001 RDI: ffff81011d089a88
[93436.947266] RBP: ffff810005f4fdb8 R08: 0000000000000001 R09: 0000000000000006
[93436.947268] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8100de02c500
[93436.947269] R13: ffff81011c188a00 R14: 0000000000000001 R15: ffff81011c189198
[93436.947271] FS: 00007fb5bde0d700(0000) GS:ffff81007ff22000(0000)
knlGS:0000000000000000
[93436.947273] CS: 0010 DS: 0000 ES: 0000 CR0: ...
After testing the patch from http://lkml.org/lkml/2007/12/30/210 the
system hung again after building ~10 packages from the last kde4
release candidate. (see other mail)

I then tried to "fix" it with this suspect.
I changed "skb_release_all(dst);" back to "skb_release_data(dst);" in
skb_morph() (net/core/skbuff.c).

I'm now at 205 of 210 packages completed without a further hang. I
also do not see an obvious memory leak.

(All of these tests were done on 2.6.24-rc3-mm2, as I'm relatively
sure that doing these compiles will trigger the error on that kernel
version)

Torsten
--
Check /proc/net/snmp to see if you're getting any fragments, if not
In any case, I suspect the cause of your problem is that somebody
somewhere is doing a double-free on an skb.

Since you're the only person who can reproduce this, we really need
your help to track this down. Since bisecting the mm tree is not
practical, you could start by checking whether the bug is in mm only
or whether it affects rc6 too.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Vanilla 2.6.24-rc6 seems stable. I did not see any crash or warnings.
Torsten
--
OK that's great. The next step would be to try excluding specific git
trees from mm to see if they make a difference.

The two specific trees of interest would be git-nfsd and git-net.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
git-nfsd from git://git.linux-nfs.org/projects/bfields/linux.git#for-mm
-> compiling and installing 54 packages worked without crashes.

git-net from git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.25.git
-> compiling and installing 95 packages worked without crashes.

The only thing in the announcements of 2.6.24-rc3-mm1/2 that stands out for me is:
+iommu-sg-merging-add-device_dma_parameters-structure.patch
+iommu-sg-merging-pci-add-device_dma_parameters-support.patch
+iommu-sg-merging-x86-make-pci-gart-iommu-respect-the-segment-size-limits.patch
+iommu-sg-merging-ppc-make-iommu-respect-the-segment-size-limits.patch
+iommu-sg-merging-ia64-make-sba_iommu-respect-the-segment-size-limits.patch
+iommu-sg-merging-alpha-make-pci_iommu-respect-the-segment-size-limits.patch
+iommu-sg-merging-sparc64-make-iommu-respect-the-segment-size-limits.patch
+iommu-sg-merging-parisc-make-iommu-respect-the-segment-size-limits.patch
+iommu-sg-merging-call-blk_queue_segment_boundary-in-__scsi_alloc_queue.patch
+iommu-sg-merging-sata_inic162x-use-pci_set_dma_max_seg_size.patch
+iommu-sg-merging-aacraid-use-pci_set_dma_max_seg_size.patch

iommu work
I will enable CONFIG_IOMMU_DEBUG in -rc6-mm1 and see, as otherwise I
have no clue where to look...

Torsten
--
Hi,
A few questions/suggestions:
- is it still vanilla -rc6-mm1? I've seen on the kernel list that you tried
some fixes around raid.

- could you remind me of this lockdep warning: is it always the same,
always before a crash, or is there no rule?

- I've seen you looked for double freeing, but this last debug list
warning could suggest locking problems during list modification too.

- the above git-nfsd and git-net tests should probably be repeated with
the -rc6-mm1 git versions: so vanilla rc6 plus both these -mm patches
only, and if the bug triggers, with one reverted; btw., since in a previous
message you mentioned that 50 packages might not be enough to trigger
this, these 54 above could still leave too little margin.

Regards,
Jarek P.
--
I'm open for any suggestions and will try to answer any questions.
The only thing that is sadly not practical is bisecting the broken-out
mm-patches, as triggering this error is too unreliable /

Yes, without these fixes I can't boot.
But they should only be run during starting the arrays, so I doubt
that this is the cause.
(Also -rc3-mm2 did not need this fix)

???
I see no lockdep warning before the crashes.
I have seen a warning about the dst->__refcnt in dst_release and
different warnings about list operations.

I think I have always posted everything I have seen before the
crashes. (captured via serial console)

(If you mean the lockdep-problem in -rc6: That is more or less a
missing annotation during early bootup. The only problem with that is
that it will cause lockdep to be turned off and so it can not be used
to find any real problem. A fix for that is in -mm so I do have

Yes, but Herbert mentioned double freeing a skb explicitly and so I
tried to catch this.
I do not know enough about the network core to verify the locking of

Yes, I think I really need to redo the git-nfsd-test.
With IOMMU_DEBUG enabled rc6-mm1 worked for 52 packages, only a second
run of kde-packages triggered it after only 5 packages.
I don't know what this bug hates about kdeartwork-wallpaper (triggered
it this time) or kdeartwork-styles.

Output from the crash with IOMMU_DEBUG (lockdep was enabled, but did
not trigger):
[15593.236374] Unable to handle kernel NULL pointer
dereference<3>list_add corruption. prev->next should be next
(ffffffff8078a410), but was ffff81011ec01e68. (prev=ffff81011ec01e68).
[15593.236374] at 0000000000000000 RIP:
[15593.236374] [<0000000000000000>]
[15593.236374] PGD 79d22067 PUD 7acd7067 PMD 0
[15593.236374] Oops: 0010 [1] SMP
[15593.236374] last sysfs file:
/sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[15593.236374] CPU 2
[15593.236374] Modules linked in: radeon drm w83792d ipv6 tuner
tea5767 tda8290 tuner_xc2028 tda9887 tu...
49 more (kde-)packages did work too. Still looks like it is only in -mm.
Torsten
--
You've written vanilla -rc6 is OK. Does it mean -rc6 with these fixes?
I think it would be easier just to start with this working -rc6 and
simply check if we have the 'right' suspects, so: git-net.patch and
git-nfsd.patch from -mm1-broken-out, as suggested by Herbert (I hope it
can compile - otherwise you could try the other way: add the whole -mm
and revert these two). Using current gits could complicate this

So, you mean there are no more of these?:
"looked into the log in question and the only other warning was a
circular locking dependency that lockdep detected around 1.5 hour
before this warning."
...

I didn't read all this thread, so probably I miss many points, but are
you sure there are no problems with filesystem corruption around these

Fine! I'll try to look at this. BTW, I guess/hope DEBUG_SLAB etc. are
also on...

Thanks,
Jarek P.
--
vanilla -rc6 is fine without these fixes.
The raid-bugs from -rc6-mm1 are probably introduced by
md-allow-devices-to-be-shared-between-md-arrays.patch and that patch

Aha, I had forgotten about that one.
Looking at all the crashlogs, I do not find another one of this lockdep warning.

I had hoped that I could catch use-after-freeing by using
slub_debug=FZP, but that did not help.
(first oops in http://lkml.org/lkml/2007/12/28/159 )

I think that the main skb structs come from slub and should be
poisoned by this, so it might be some other data structure that is

For my setup: It's a gentoo system, so compiling packages is the
normal way of installing something.
The compile itself is done on a tmpfs so a filesystem corruption there
should be rather impossible. ;)
(The system has 4Gb RAM, so it doesn't even need to swap)
The sources are taken from a nfsv4 share that is served from a
different system. Also gentoo checksums all sources it will use.

After the crashes I also did a checksum of the last installed
packages. Only in one instance there was corruption, all new files
were completely empty. Obviously XFS did not have the time to write
them back to disk before the system crashed.
Also as all crashes show network related traces and the system is
working fine otherwise, I doubt any permanent filesystem problems.

For the raid problems: I was just unable to even start the raid that
has / on it, because of a wrong check in the raid-autostart code.

DEBUG_SLAB is off, because of:
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y

But I currently do not have the slub_debug option in my kernel
commandline, because:
a) slub_debug=FZP did not prevent the bug in -rc3-mm2
b) but it took a much longer time to trigger it
c) it's a serious slowdown for these compiles

If you think some other slub_debug might catch it, I would try this...
Torsten
--
It seems that this last report adds a third one, ieee1394, to the pack,
so probably you can hold on a "minute" - this all needs some rethinking.

Yes, since this was no problem with vanilla 2.6.24-rc6, I've probably
OK! But, in the meantime could you send your current .config? I wonder
e.g. if this new ieee1394 code from init_ohci1394_dma.c could be
in use?

You are really helpful, thanks,
Jarek P.
--
I don't think ieee1394 is to blame here. See http://lkml.org/lkml/2007/11/29/372
This was the first report of these crashes.
The first one is a similar crash in the ieee1394 code and my first try
was to blame it. But switching to a real network card did not solve
this, as the second crash in that mail shows.
Also Stefan Richter said in http://lkml.org/lkml/2007/11/29/419 this:
"FWIW, eth1394 and the entire rest of the 1394 stack beneath eth1394
are identical between -mm and Linus' tree."

I'm still using the old ieee1394-stack and not the new firewire one,
as eth1394 had not been ported at that time.

It might be possible that these are two different bugs, but two bugs
with the same symptoms of corrupted lists at the same time seem unlikely.
(Especially this last report of the oops in 1394 looks rather
strange. Things can only go onto hpsbpkt_queue if they have a non NULL
complete_routine. (see queue_packet_complete() in
drivers/ieee1394/ieee1394_core.c). But a call to a NULL
complete_routine seems to be the cause of one of the two oopses. So it
looks like the hpsbpkt_queue list got mangled. But this list is only
used in this file and all three places that access this list are
protected by spinlocking pending_packets_lock.)

So my personal conclusion would be that someone is writing to memory
that he no longer owns. Most probably 0-bytes. (the complete_routine
got NULLed and the warning about dst->__refcnt being 0).

Use-after-free or something else?
Attached. (Last one I was using with 2.6.24-rc6-mm1. For all other
Interesting. I didn't even know about this file / option.
But four things make an involvement rather doubtful:
a) I do not find a single line like "init_ohci1394_dma: initializing
OHCI-1394" in any of the syslogs.
b) I do not have the parameter ohci1394_dma=early set
c) # CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
d) I have seen the crash in svc_xprt_enqueue() without eth1394 and at
that try there was not a single firewire device attached.

I will no...
working on it...
2.6.24-rc6 + mm-patches up to git.battery (includes git-net and
git-netdev-all) worked for 110 packages, then I proclaimed it good.
2.6.24-rc6 + mm-patches up to (including) git.nfsd is currently
getting tested (9 packages done...)

But the cause of my mail is the following question:
Regarding my "iommu-sg-merging-patches are new in -rc3-mm and could be
the cause"-suspicion I looked at these patches and came across these
hunks:

This is removed from arch/x86/lib/bitstr_64.c:
-/* Find string of zero bits in a bitmap */
-unsigned long
-find_next_zero_string(unsigned long *bitmap, long start, long nbits, int len)
-{
-	unsigned long n, end, i;
-
- again:
-	n = find_next_zero_bit(bitmap, nbits, start);
-	if (n == -1)
-		return -1;
-
-	/* could test bitsliced, but it's hardly worth it */
-	end = n+len;
-	if (end > nbits)
-		return -1;
-	for (i = n+1; i < end; i++) {
-		if (test_bit(i, bitmap)) {
-			start = i+1;
-			goto again;
-		}
-	}
-	return n;
-}

This is added to lib/iommu-helper.c:
+static unsigned long find_next_zero_area(unsigned long *map,
+					 unsigned long size,
+					 unsigned long start,
+					 unsigned int nr)
+{
+	unsigned long index, end, i;
+again:
+	index = find_next_zero_bit(map, size, start);
+	end = index + nr;
+	if (end > size)
+		return -1;
+	for (i = index + 1; i < end; i++) {
+		if (test_bit(i, map)) {
+			start = i+1;
+			goto again;
+		}
+	}
+	return index;
+}

The old version checks if find_next_zero_bit returns -1, the new
version doesn't do this.
Is this intended and can find_next_zero_bit never fail?
Hmm... but in the worst case it s...
That kernel did also work for all 110 packages.
2.6.24-rc6 + mm-patches up to (including) git.xfs -> crash
[ 576.899332] ------------[ cut here ]------------
[ 576.903661] kernel BUG at lib/list_debug.c:33!
[ 576.903661] invalid opcode: 0000 [1] SMP
[ 576.903661] last sysfs file:
/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[ 576.903661] CPU 3
[ 576.903661] Modules linked in: radeon drm w83792d ipv6 tuner
tea5767 tda8290 tuner_xc2028 tda9887 tuner_simple mt20xx tea5761
tvaudio msp3400 bttv ir_common compat_ioctl32 videobuf_dma_sg
videobuf_core btcx_risc tveeprom videodev v4l2_common usbhid
v4l1_compat sg hid i2c_nforce2 pata_amd
[ 576.903661] Pid: 5559, comm: nfsv4-svc Not tainted 2.6.24-rc6-mm-git.xfs #2
[ 576.903661] RIP: 0010:[<ffffffff803c16e4>] [<ffffffff803c16e4>]
__list_add+0x54/0x60
[ 576.903661] RSP: 0018:ffff81007d4e1dc0 EFLAGS: 00010282
[ 576.903661] RAX: 0000000000000088 RBX: ffff81007e955800 RCX: fffffffffc6c7900
[ 576.903661] RDX: ffff81007d53eef0 RSI: 0000000000000001 RDI: ffffffff80760140
[ 576.903661] RBP: ffff81007d4e1dc0 R08: 0000000000000001 R09: 0000000000000000
[ 576.903661] R10: ffff810080062008 R11: 0000000000000001 R12: ffff81007ed00900
[ 576.903661] R13: ffff81007ed00938 R14: ffff81007ed00938 R15: ffff81007dd6f100
[ 576.903661] FS: 00007f1b7e6a36f0(0000) GS:ffff81011ff1b780(0000)
knlGS:0000000000000000
[ 576.903661] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 576.903661] CR2: 00007ffb28c2c000 CR3: 00000000741ab000 CR4: 00000000000006e0
[ 576.903661] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 576.903661] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 576.903661] Process nfsv4-svc (pid: 5559, threadinfo
ffff81007d4e0000, task ffff81007d53eef0)
[ 576.903661] Stack: ffff81007d4e1e00 ffffffff805c4dbb
ffff81007ed00908 ffff81007dd6f100
[ 576.903661] ffff81011ad7bc00 ffff81007d458000 ffff81007e955800
ffff81007dd6f110
[ 576.903661] ffff81007d4e1e10 ffffff...
I suspect these are doing different things.
iommu-sg-kill-__clear_bit_string-and-find_next_zero_string.patch says:

This kills unused __clear_bit_string and find_next_zero_string (they
were used by only gart and calgary IOMMUs).

iommu-sg-add-iommu-helper-functions-for-the-free-area-management.patch says:

This adds IOMMU helper functions for the free area management. These
functions take care of LLD's segment boundary limit for IOMMUs. They

Let's cc the author of those changes.

Thanks for persisting with this bug.
--
On Sat, 5 Jan 2008 17:25:24 -0800
find_next_zero_bit returns -1?
It seems that x86_64 doesn't. POWER and SPARC64 IOMMUs use
find_next_zero_bit too but neither checks if find_next_zero_bit
returns -1. If find_next_zero_bit fails, it returns size. So it
doesn't lead to an endless loop.

But this patch has other bugs that break POWER IOMMUs.
If you use the IOMMUs on POWER, please try the following patch:
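
The return-value point above can be modelled in userspace. What follows is only an illustrative sketch, not the POWER patch referred to in this mail (which is not reproduced here). The model_* names are made up for the example, and model_find_next_zero_bit() is a naive stand-in that assumes the "return size when nothing is found" behaviour described above. With that behaviour, the unchecked index simply makes end exceed size, so the search gives up instead of looping.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for test_bit(): is bit nr set in map? */
static int model_test_bit(unsigned long nr, const unsigned long *map)
{
	return (int)((map[nr / MODEL_BITS_PER_LONG] >>
		      (nr % MODEL_BITS_PER_LONG)) & 1UL);
}

/*
 * Hypothetical stand-in for find_next_zero_bit(): first clear bit at or
 * after offset, or size if there is none (assumed "not found" value).
 */
static unsigned long model_find_next_zero_bit(const unsigned long *map,
					      unsigned long size,
					      unsigned long offset)
{
	unsigned long i;

	for (i = offset; i < size; i++)
		if (!model_test_bit(i, map))
			return i;
	return size;
}

/* Same control flow as the find_next_zero_area() hunk quoted earlier. */
static unsigned long model_find_next_zero_area(const unsigned long *map,
					       unsigned long size,
					       unsigned long start,
					       unsigned int nr)
{
	unsigned long index, end, i;

	assert(nr > 0);
again:
	index = model_find_next_zero_bit(map, size, start);
	end = index + nr;
	/* index == size (nothing free) always lands here for nr > 0 */
	if (end > size)
		return (unsigned long)-1;
	for (i = index + 1; i < end; i++) {
		if (model_test_bit(i, map)) {
			start = i + 1;
			goto again;
		}
	}
	return index;
}
```

Calling model_find_next_zero_area() on a completely full map returns the (unsigned long)-1 failure value on the first pass, without any explicit "not found" check on index.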
--
I'm sorry. I didn't look into find_next_zero_bit, I only noted that
the old version did check for -1 and the new one didn't.

I also noted the line "index = (index + align_mask) & ~align_mask;" in
iommu_area_alloc() and didn't understand what this was trying to do
and how this should work, but as arch/x86/kernel/pci-gart_64.c always
uses 0 as align_mask I just ignored it.

I will apply your patch and see if this hunk from
find_next_zero_area() makes a difference:

 	end = index + nr;
-	if (end > size)
+	if (end >= size)
 		return -1;
-	for (i = index + 1; i < end; i++) {
+	for (i = index; i < end; i++) {
 		if (test_bit(i, map)) {

Torsten
--
On Sun, 6 Jan 2008 11:41:10 +0100
Yeah, it's for only POWER IOMMUs. It's meaningless for gart and
The patch should not make a difference for X86_64.
Can you try the patch to revert my IOMMU changes?
http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg12694.html
--
Hmm...
arch/x86/kernel/pci-gart_64.c:
alloc_iommu() calls iommu_area_alloc()
lib/iommu-helper.c:
iommu_area_alloc() calls find_next_zero_area()
-> so the above code should be called even on X86_64

And the change in the for loop means that 'index' will now be tested,
but with the old code it was not.

Testing for this bug is a little bit slow, as I'm compiling ~100
packages trying to trigger it.
If my current testrun with the patch from
http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg12702.html
crashes, I will revert the whole IOMMU changes with the above patch and try again.

Torsten
--
On Sun, 6 Jan 2008 12:35:35 +0100
Oops, I meant that the patch fixes the align allocation (non zero
With the old code, 'index' is tested by find_next_zero_bit.
With the new code and the non-zero align_mask case, 'index' is not tested
by find_next_zero_bit. So test_bit needs to start with 'index'.

So if I understand correctly, this patch should not make a
Thanks for testing,
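
The alignment argument in this mail can be checked with a small userspace model. This is a sketch under assumptions, not the kernel code: the model_* helpers and the skip_first switch are invented for the illustration, the bounds check is kept as "end > size", and the rounding line is the one quoted earlier in the thread. With a non-zero align_mask the rounded-up index may land on a set bit that find_next_zero_bit never looked at, so a loop starting at index + 1 can hand out an area whose first bit is already in use.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for test_bit() */
static int model_test_bit(unsigned long nr, const unsigned long *map)
{
	return (int)((map[nr / MODEL_BITS_PER_LONG] >>
		      (nr % MODEL_BITS_PER_LONG)) & 1UL);
}

/* Hypothetical stand-in for find_next_zero_bit(); returns size if none */
static unsigned long model_find_next_zero_bit(const unsigned long *map,
					      unsigned long size,
					      unsigned long offset)
{
	unsigned long i;

	for (i = offset; i < size; i++)
		if (!model_test_bit(i, map))
			return i;
	return size;
}

/*
 * skip_first != 0: start the test_bit() loop at index + 1 (old loop)
 * skip_first == 0: start the test_bit() loop at index (patched loop)
 */
static unsigned long model_find_area(const unsigned long *map,
				     unsigned long size, unsigned long start,
				     unsigned int nr, unsigned long align_mask,
				     int skip_first)
{
	unsigned long index, end, i;

	/* align_mask is expected to be 2^k - 1 */
	assert((align_mask & (align_mask + 1)) == 0);
again:
	index = model_find_next_zero_bit(map, size, start);
	/* Round up; the rounded index was never tested by the search above */
	index = (index + align_mask) & ~align_mask;
	end = index + nr;
	if (end > size)
		return (unsigned long)-1;
	for (i = index + (skip_first ? 1 : 0); i < end; i++) {
		if (model_test_bit(i, map)) {
			start = i + 1;
			goto again;
		}
	}
	return index;
}
```

In this model, with bits 0 and 4 set, nr = 2 and align_mask = 3, the old loop returns 4 (a bit that is in use) while the patched loop moves on to 8. With align_mask = 0 both variants return the same index, which matches the statement that the loop change only matters for the aligned case.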
--
... as you say below, the test for the index position is only needed
You did not miss anything.
OK, I'm still testing this, but after 95 completed packages I'm rather
certain that reverting the IOMMU changes with this patch fixes my
problem.
I didn't have time to look more into this, so I can't offer any
concrete ideas where the bug is.

If you send more patches, I'm willing to test them, but it might take
some more time during the next week.

Thanks for looking into this.
Torsten
--
On Sun, 6 Jan 2008 21:03:42 +0100
Can you try 2.6.24-rc7 + the IOMMU changes?
The patches are available at:
http://www.kernel.org/pub/linux/kernel/people/tomo/iommu/
Or if you prefer the git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tomo/linux-2.6-misc.git iommu-sg-fixes
I've looked at the changes to GART but they are straightforward and
don't look wrong...

Thanks,
--
Sorry for the *really* late answer, but I did not have any time to do
linux things the last weeks. :-(

Until my last mail from 7. Jan this was true, that I was not able to
crash 2.6.24-rc6-mm1 with above patch.
But after testing 2.6.24-rc7 with only the IOMMU changes applied it
did crash once again.

After looking at the patch that seems rather expected as it only
touches powerpc code.
(I only looked at its diffstat after testing it, so I was not aware of

The resulting 2.6.24-rc7 kernel worked for me. I compiled 146 packages
without a crash.

Today I finally had some time for debugging again and tried the new
2.6.24-rc8-mm1.
The crash is still there, I will report that crash in current thread.

Torsten
--
btw., these improvements to the IOMMU code are in -mm and will go into
v2.6.25, right? The changes look robust to me.

Ingo
--
On Tue, 8 Jan 2008 16:59:48 +0100
Thanks, they have been in -mm though the iommu helper fix hasn't
yet. Balbir Singh found the bug in 2.6.24-rc6-mm1. I've just checked
mmotm and found that the IOMMU helper patch doesn't include the fix.

Andrew, can you replace
iommu-sg-add-iommu-helper-functions-for-the-free-area-management.patch
with the updated patch:
http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048997.html
For your convenience I've attached the updated patch too.
Hopefully, they will go into v2.6.25. At least, I hope that the
patches (0001-0011) that make the IOMMUs respect segment size limits
when merging sg lists will be merged. They are simple and I got ACKs
on POWER and PARISC.

Thanks,
=
From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Subject: [PATCH] add IOMMU helper functions for the free area management

This adds IOMMU helper functions for the free area management. These
functions take care of LLD's segment boundary limit for IOMMUs. They would be
useful for IOMMUs that use bitmap for the free area management.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
include/linux/iommu-helper.h | 7 ++++
lib/Makefile | 1 +
lib/iommu-helper.c | 80 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 88 insertions(+), 0 deletions(-)
create mode 100644 include/linux/iommu-helper.h
create mode 100644 lib/iommu-helper.c

diff --git a/include/linux/iommu-helper.h b/include/linux/iommu-helper.h
new file mode 100644
index 0000000..4dd4c04
--- /dev/null
+++ b/include/linux/iommu-helper.h
@@ -0,0 +1,7 @@
+extern unsigned long iommu_area_alloc(unsigned long *map, unsigned long size,
+ unsigned long start, unsigned int nr,
+ unsigned long shift,
+ unsigned long boundary_size,
+ unsigned long align_mask);
+extern void iommu_area_free(unsigned long *map, unsigned long start,
+ unsigned int nr);
diff --git a/lib/Makefile b/lib/Ma...
On Wed, Jan 09, 2008 at 08:57:53AM +0900, FUJITA Tomonori wrote:
This '>=' looks doubtful to me, e.g.:
map points to 0s only, size = 64, nr = 64,
we get: index = 0; end = 64;
and: return -1 ?!

Regards,
Jarek P.
--
On Wed, 9 Jan 2008 10:04:42 +0100
You are right. I did it only because I didn't want to change the
original code (iommu_range_alloc in arch/powerpc/kernel/iommu.c). I
thought that there might be a mysterious reason for it so I let it
alone since it's a tiny loss.

Thanks,
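
A minimal sketch of the boundary case Jarek describes, assuming the questioned check has the form `end >= size` where `end = index + nr` is one past the last slot used (the body of the patch is not shown in this thread, so both helper names below are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper using the questioned '>=' comparison:
 * a run that ends exactly at the end of the map is rejected. */
static bool range_fits_ge(unsigned long index, unsigned long nr,
                          unsigned long size)
{
    unsigned long end = index + nr;   /* one past the last slot */
    return !(end >= size);
}

/* Hypothetical helper using '>' instead: a run may use the last slot. */
static bool range_fits_gt(unsigned long index, unsigned long nr,
                          unsigned long size)
{
    unsigned long end = index + nr;
    return !(end > size);
}
```

With size = 64 and nr = 64 on an empty map, the first form refuses the only possible placement (index 0), while the second accepts it. The "tiny loss" mentioned in the reply is that last slot.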
--
On Wed, 09 Jan 2008 08:57:53 +0900
The ALIGN() macro is the approved way of doing this.
(I don't think ALIGN adds much value really, especially given that you've
commented what's going on, but I guess it does make reviewing and reading a
--
On Tue, 8 Jan 2008 16:27:39 -0800
Would it be better to use __ALIGN_MASK? I can find only one user that
directly uses __ALIGN_MASK. The POWER IOMMU calculates align_mask by
itself so it's easier to pass align_mask as an argument.
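
For comparison, a small userspace sketch of the two forms being discussed. The macros are modeled on the kernel's ALIGN() and __ALIGN_MASK() but renamed here (SK_ALIGN, SK_ALIGN_MASK) to make clear they are local copies; the two wrapper functions are hypothetical. Alignment must be a power of two.

```c
#include <assert.h>

/* Local copies modeled on the kernel macros (assumption: same shape). */
#define SK_ALIGN_MASK(x, mask)  (((x) + (mask)) & ~(mask))
#define SK_ALIGN(x, a)          SK_ALIGN_MASK(x, (__typeof__(x))(a) - 1)

/* Caller passes the alignment itself (a power of two). */
static unsigned long align_by_size(unsigned long index, unsigned long align)
{
    return SK_ALIGN(index, align);
}

/* Caller passes a precomputed mask (alignment - 1), as the POWER IOMMU does. */
static unsigned long align_by_mask(unsigned long index, unsigned long align_mask)
{
    return SK_ALIGN_MASK(index, align_mask);
}
```

Both forms give the same result when `align_mask == align - 1`; the only difference is who computes the mask.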
--
On Wed, 09 Jan 2008 09:54:45 +0900
ALIGN() should be OK - its additional type coercion isn't useful in this
case but ALIGN() is the official interface.

I don't see any reason why vermilion.c had to reach for __ALIGN_MASK. I'll
switch it to ALIGN().
--
On Sat, Jan 05, 2008 at 03:52:32PM +0100, Torsten Kaiser wrote:
I agree: your conclusion seems to be the most probable explanation for
this. Then it could be really hard to solve this without bisection or
something similar. But there is some probability this something could

You can try to add "U" to these other slub_debug options. As a matter
of fact, if your above diagnosis is right, it seems you risk damaging
your system or even the box with these tests, so if you want to
continue, you should probably turn any possible debugging on (not in
mm only).

BTW, you've written that some debugging options seem to delay the bug.
Since they often change the sizes of some structures, such wrong
writes could then land at some 'safer' offsets. So, this could really delay
e.g. these list bugs, but maybe this could also let the system stay 'alive'
after such a wrong kfree?

Cheers,
Jarek P.
--
As for example in the case when it dies in ieee1394-thread the list is
so corrupted that it will die anyway.

I did not add U, because I thought that would only be needed to trace memory leaks.
I think this bug is highly timing dependent. It's not always the same
package that dies and as this is an SMP system I would guess two CPUs
using the same data will trigger this.
And using the poison option will definitely slow the system down and
mess up the timings.

What also speaks against the 'safer' offsets is that after adding my
notfreed-byte to skbuff the bug still triggered in the same way.

I'm currently looking at
http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg12702.html
, trying to understand if this is relevant for me on x86_64.

Torsten
--
On Sun, Jan 06, 2008 at 11:30:48AM +0100, Torsten Kaiser wrote:
Of course it looks like using the same data, but it seems there is no
reason to think it needs the same time: e.g. some timer or workqueue
could retrigger after it's supposed to be killed. Any additional
debugging/poisoning might help to see it earlier, so this should be
safer for your system, but, most probably this would show data from

We are not even sure skbuffs were directly affected by this or they
were incorrectly freed because of other structures being damaged?

IMHO, e.g. starting your system with limited memory should cause
faster memory reclaiming, and thus more frequent triggering of these bugs,
but of course I can be wrong.

Jarek P.
--
Also, if it's git-nfsd, it'd be useful to test with the current git-nfsd
from the for-mm branch at:

git://linux-nfs.org/~bfields/linus.git for-mm

and then any bisection results (even partial) from that tree would help
immensely....

--b.
--
Wrong URL, it's (now?) at git://git.linux-nfs.org/projects/bfields/linux.git
Using "HEAD is now at cd7e1c9... Merge commit 'server-xprt-switch^' into for-mm"
I was able to compile&install 54 packages, so this seems to be working.

Now git-fetch'ing
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.25.git

Torsten
--
Whoops, apologies. Actually, the only change required should have been
to change "linus" to "linux". I should cut-n-paste and not rely on

Hm, OK, thanks again for all this followup.
--
The problem with that is that triggering the bug is not easy, so
marking anything 'good' is questionable.
This time I needed to compile over 50 packages until it triggered.

I was using 2.6.24-rc6-mm1 again, but with a crude hack (see end of
mail) that I hope should catch any double-frees of skbs.
None of my warnings triggered, only a list corruption again in
svc_xprt_enqueue(), but this time with an additional output about
what's wrong with the list:
[17023.029519] list_add corruption. prev->next should be next
(ffff8100d20ec1c8), but was ffff81009c5a6c28. (prev=ffff81009c5a6c28).
[17023.029537] ------------[ cut here ]------------
[17023.031445] kernel BUG at lib/list_debug.c:33!
[17023.033280] invalid opcode: 0000 [1] SMP
[17023.034967] last sysfs file:
/sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[17023.038209] CPU 3
[17023.039047] Modules linked in: radeon drm w83792d ipv6 tuner
tea5767 tda8290 tuner_xc2028 tda9887 tuner_simple mt20xx tea5761
tvaudio msp3400 bttv ir_common compat_ioctl32 videobuf_dma_sg
videobuf_core btcx_risc usbhid tveeprom videodev hid v4l2_common
v4l1_compat sg pata_amd i2c_nforce2
[17023.039519] Pid: 20564, comm: nfsv4-svc Not tainted 2.6.24-rc6-mm1 #14
[17023.039519] RIP: 0010:[<ffffffff803bd834>] [<ffffffff803bd834>]
__list_add+0x54/0x60
[17023.039519] RSP: 0018:ffff8101002c9dc0 EFLAGS: 00010282
[17023.039519] RAX: 0000000000000088 RBX: ffff810110125c00 RCX: 0000000000000000
[17023.039519] RDX: ffff81010067c000 RSI: 0000000000000001 RDI: ffffffff80764140
[17023.039519] RBP: ffff8101002c9dc0 R08: 0000000000000001 R09: 0000000000000000
[17023.039519] R10: ffff81000100a088 R11: 0000000000000001 R12: ffff8100d20ec180
[17023.039519] R13: ffff8100d20ec1b8 R14: ffff8100d20ec1b8 R15: ffff8101188e4600
[17023.039519] FS: 00007ff7a870c6f0(0000) GS:ffff81011ff0cd00(0000)
knlGS:0000000000000000
[17023.039519] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[17023.039519] CR2: 00000000024df510 CR3: 0000000032539000 CR4: 0000000...
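
The "list_add corruption" line at the top of this report is printed by the list debugging check in __list_add(). A minimal userspace sketch of that kind of check, assuming a plain circular doubly linked list (init_list_head and list_add_checked are hypothetical names, and the sketch returns -1 where the kernel would BUG()):

```c
#include <assert.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An empty list head points to itself in both directions. */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert new_node between prev and next, but only if the two neighbours
 * still agree that they are adjacent. Returns 0 on success, -1 if not. */
static int list_add_checked(struct list_head *new_node,
                            struct list_head *prev,
                            struct list_head *next)
{
    if (next->prev != prev)
        return -1;              /* "next->prev should be prev" */
    if (prev->next != next)
        return -1;              /* "prev->next should be next" */
    next->prev = new_node;
    new_node->next = next;
    new_node->prev = prev;
    prev->next = new_node;
    return 0;
}
```

In the report, the value found in prev->next is prev itself, which is how a self-linked (freshly initialised or emptied) node looks. That is at least consistent with the node having been re-initialised while another list still pointed at it, though the log alone does not prove it.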
OK, thanks for that hint.
Is there any debug option I could turn on to catch this?
Hmm... __alloc_skb() uses kmem_cache_alloc_node() and I did run
-rc3-mm2 a long time with slub_debug=FZP and that did not catch
anything. Shouldn't the poisoning catch that? (Sorry if this question

I will try -rc6-mm1 and vanilla -rc6 and report back.

Torsten
--
I can't explain why this seems to fix 2.6.24-rc3-mm2 for me, but at

During normal work I did not see the frag counters increase.
I used ping -s 10000 to create some frags, worked perfectly.
I used netio -b 63k -u [target] to create around half a million frags,
worked too.

And what really is strange is that I changed skb_morph into this:
struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
{
	printk(KERN_ERR "morph %p:%p",dst,src);
	WARN_ON(1);
	skb_release_all(dst);
	return __skb_clone(dst, src);
}

The problem with bisecting this is that I can't seem to trigger it on
demand. Today I was just about giving up on triggering it in -rc6-mm1
with doing package compiles when it did happen again. But that was after

As noted above, my WARN_ON(1) in skb_morph did not trigger once before
the system died with this OOPS:
[18663.909931] Unable to handle kernel NULL pointer dereference at
0000000000000000 RIP:
[18663.915489] [<ffffffff8055f2e8>] tcp_read_sock+0x58/0x1b0
[18663.918652] PGD 73442067 PUD 7480e067 PMD 0
[18663.918652] Oops: 0000 [1] SMP
[18663.918652] last sysfs file:
/sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[18663.918652] CPU 1
[18663.918652] Modules linked in: radeon drm nfsd exportfs w83792d
ipv6 tuner tea5767 tda8290 tuner_xc2028 tda9887 tuner_simple mt20xx
tea5761 tvaudio msp3400 bttv ir_common compat_ioctl32 videobuf_dma_sg
videobuf_core btcx_risc tveeprom usbhid videodev v4l2_common
v4l1_compat hid sg pata_amd i2c_nforce2
[18663.918652] Pid: 0, comm: swapper Not tainted 2.6.24-rc6-mm1 #13
[18663.918652] RIP: 0010:[<ffffffff8055f2e8>] [<ffffffff8055f2e8>]
tcp_read_sock+0x58/0x1b0
[18663.918652] RSP: 0018:ffff81007ff4fb60 EFLAGS: 00010286
[18663.918652] RAX: 0000000000000038 RBX: 0000000000000000 RCX: 0000000000000000
[18663.918652] RDX: ffff8100141a40b0 RSI: ffff81007ff4fbc0 RDI: 0000000000000000
[18663.918652] RBP: ffff81007ff4fbb0 R08: 0000000000000002 R09: 0000000000000000
[18663.9186...
---
~Randy
desserts: http://www.xenotime.net/linux/recipes/
--
Can you still reproduce this? Tom thought there was a chance the
following could fix it.

--b.

From: Tom Tucker <tom@opengridcomputing.com>
Date: Sun, 30 Dec 2007 10:07:17 -0600

Bruce/Aime:
Here is what I believe to be the fix for the crashes/svc_xprt BUG_ON
that people are seeing. It would be great if those who have seen this
problem could apply this patch and see if it resolves their problem.

The common code calls svc_xprt_received on behalf of the transport.
Since the provider was calling it as well, this resulted in clearing the
busy bit/resetting xpt_pool when the BUSY bit wasn't held.

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 4628881..4d39db1 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1272,7 +1272,6 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
if ((svsk = svc_setup_socket(serv, sock, &error, flags)) != NULL) {
svc_xprt_set_local(&svsk->sk_xprt, newsin, newlen);
- svc_xprt_received(&svsk->sk_xprt);
return (struct svc_xprt *)svsk;
}
-
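
A toy model of the failure mode Tom describes, not the sunrpc code itself: a transport carries a busy flag that must be held by exactly one party, and a stray "received" call clears it while someone else still owns it. All names (toy_xprt, toy_enqueue, toy_received) are hypothetical, and the real code uses atomic bit operations rather than a plain int.

```c
#include <assert.h>

struct toy_xprt {
    int busy;       /* 1 while someone owns the transport */
    int enqueued;   /* how many times it has been put on a queue */
};

/* Queue the transport unless it is already busy. Returns 1 if queued. */
static int toy_enqueue(struct toy_xprt *x)
{
    if (x->busy)
        return 0;
    x->busy = 1;
    x->enqueued++;
    return 1;
}

/* Owner is done with the transport: release the busy flag. */
static void toy_received(struct toy_xprt *x)
{
    x->busy = 0;
}
```

If toy_received() is called by a party that does not hold the flag, the next toy_enqueue() succeeds while the first owner is still working, so the same object ends up queued twice. In a real linked list that kind of double insertion is one way to get the list corruption reported earlier in the thread.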
--
Please see also http://lkml.org/lkml/2007/12/29/76
Just wanted to say that slub_debug did not help to get more info.
I will try to reproduce this with rc3-mm2 and the below patch tomorrow.
Without slub_debug this seemed to trigger rather reliably when trying

I will send a mail when I'm done with testing this...
Thanks for the patch.

Torsten
--
Removing this line from 2.6.24-rc3-mm2 does not solve my crash.
FYI, the code part from net/sunrpc/svcsock.c / svc_create_socket() where
I removed this:
	if (protocol == IPPROTO_TCP) {
		if ((error = kernel_listen(sock, 64)) < 0)
			goto bummer;
	}

	if ((svsk = svc_setup_socket(serv, sock, &error, flags)) != NULL) {
		memcpy(&svsk->sk_xprt.xpt_local, newsin, newlen);
		//svc_xprt_received(&svsk->sk_xprt);
		return (struct svc_xprt *)svsk;
	}

bummer:
	dprintk("svc: svc_create_socket error = %d\n", -error);

The crash itself:
[11166.565362] ------------[ cut here ]------------
[11166.568595] kernel BUG at lib/list_debug.c:33!
[11166.571696] invalid opcode: 0000 [1] SMP
[11166.574527] last sysfs file:
/sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[11166.580017] CPU 3
[11166.581442] Modules linked in: radeon drm nfsd exportfs w83792d
ipv6 tuner tea5767 tda8290 tuner_xc2028 tda9887 tuner_simple mt20xx
tea5761 tvaudio msp3400 bttv ir_common compat_ioctl32 videobuf_dma_sg
videobuf_core btcx_risc tveeprom videodev usbhid v4l2_common hid
v4l1_compat sg pata_amd i2c_nforce2
[11166.600470] Pid: 5548, comm: nfsv4-svc Not tainted 2.6.24-rc3-mm2 #3
[11166.604912] RIP: 0010:[<ffffffff803bae54>] [<ffffffff803bae54>]
__list_add+0x54/0x60
[11166.610408] RSP: 0000:ffff81007d83fdc0 EFLAGS: 00010282
[11166.614144] RAX: 0000000000000088 RBX: ffff81007f2e0400 RCX: 0000000000000002
[11166.619113] RDX: ffff81007dc6eed0 RSI: 0000000000000001 RDI: ffffffff807590c0
[11166.624130] RBP: ffff81007d83fdc0 R08: 0000000000000001 R09: 0000000000000000
[11166.629124] R10: ffff810080058d48 R11: 0000000000000001 R12: ffff81007e444680
[11166.634129] R13: ffff81007e4446b8 R14: ffff81007e4446b8 R15: ffff81011ff50100
[11166.639128] FS: 00007fb815abc6f0(0000) GS:ffff81011ff13280(0000)
knlGS:0000000000000000
[11166.644786] CS: 0010 DS: 0000 ES: 0000 CR0:...
I don't know. It could even be that both patch series are OK but when they
are combined, things fail.

Greg, Alasdair: the above looks like a preview of 2.6.25-rc1 :(
--
OK, I debugged this some more. It looks like two bugs meshed together.
One new bug: "do_md_run() returned -22"
I can't seem to start my raid anymore.
The following part of md-allow-devices-to-be-shared-between-md-arrays
adds a new check to do_md_run() (drivers/md/md.c) that fails for my system:
@@ -3213,8 +3283,11 @@ static int do_md_run(mddev_t * mddev)
/*
* Analyze all RAID superblock(s)
*/
- if (!mddev->raid_disks)
+ if (!mddev->raid_disks) {
+ if (!mddev->persistent)
+ return -EINVAL;
analyze_sbs(mddev);
+ }
chunk_size = mddev->chunk_size;

The raid gets started normally with any other kernel I tried.
I did not investigate the cause of this failure further, because I was
looking into why a failure to start a raid was causing event/3 to oops.

This looks like a second, but rather old bug.
do_md_stop (from drivers/md/md.c) does the following:
3691 /* make sure all delayed_delete calls have finished */
3692 flush_scheduled_work();
3693
3694 export_array(mddev);
3695
But: Only the callchain export_array -> kick_rdev_from_array ->
unbind_rdev_from_array schedules the delayed_delete's!

After adding a second flush_scheduled_work() below the export_array()
the resulting kernel no longer oopses and my initrd normally asks for
an alternative root-fs; because of the first bug the raid still does
not get started.

I don't know if this flush_scheduled_work() has been misplaced since it was
introduced, or if it really was only ever trying to flush delayed
deletes from previously stopped arrays.

When investigating this, I got these debug outputs:
first try, with my second flush_scheduled_work removed again:
[ 34.290576] md: Autodetecting RAID arrays.
[ 34.125649] md: Scanned 5 and added 5 devices.
[ 34.125649] md: autorun ...
[ 34.125649] md: considering sdc2 ...
[ 34.125658] md: adding sdc...
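
The ordering problem in do_md_stop() described above can be sketched with a toy work counter. This is only a model of the argument, not the md or workqueue code: toy_flush() stands in for flush_scheduled_work(), toy_export_array() for the export_array() call chain that schedules one delayed delete per device, and all names are hypothetical.

```c
#include <assert.h>

struct toy_workqueue {
    int pending;    /* number of scheduled but not yet run work items */
};

static void toy_schedule(struct toy_workqueue *wq)
{
    wq->pending++;
}

/* Run everything that is pending right now. */
static void toy_flush(struct toy_workqueue *wq)
{
    wq->pending = 0;
}

/* Stand-in for export_array(): schedules one delayed delete per device. */
static void toy_export_array(struct toy_workqueue *wq, int nr_devs)
{
    for (int i = 0; i < nr_devs; i++)
        toy_schedule(wq);
}

/* Flush only before exporting: work scheduled by the export is left over. */
static int toy_stop_flush_before(struct toy_workqueue *wq, int nr_devs)
{
    toy_flush(wq);
    toy_export_array(wq, nr_devs);
    return wq->pending;
}

/* Flush again after exporting, as in the workaround described above. */
static int toy_stop_flush_both(struct toy_workqueue *wq, int nr_devs)
{
    toy_flush(wq);
    toy_export_array(wq, nr_devs);
    toy_flush(wq);
    return wq->pending;
}
```

The first variant returns with work still pending, which in the real code means delayed deletes can run after the caller has moved on. The second variant returns with nothing pending.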
[author CCed]
This hunk is indeed buggy.
analyze_sbs() calls load_super() and validate_super() and only the
validate function is setting mddev->persistent, so this new check
needs to be after the call to analyze_sbs(mddev).

Changing this allows my system to boot correctly, including starting KDE.
Please note that this is not a fix for the OOPS in delayed_delete;
the OOPS just doesn't happen, because the buggy error path is no
longer used.

Torsten
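
A reduced model of the ordering issue, assuming (as stated above) that `persistent` is only set while the superblocks are analysed. The struct and function names are hypothetical stand-ins, not the md.c code:

```c
#include <assert.h>
#include <errno.h>

struct toy_mddev {
    int raid_disks;
    int persistent;
};

/* Stand-in for analyze_sbs(): validating the superblocks is what
 * fills in persistent (and raid_disks) in this model. */
static void toy_analyze_sbs(struct toy_mddev *m)
{
    m->persistent = 1;
    m->raid_disks = 2;
}

/* Ordering from the quoted hunk: the check runs before the field is set. */
static int toy_run_check_first(struct toy_mddev *m)
{
    if (!m->raid_disks) {
        if (!m->persistent)
            return -EINVAL;
        toy_analyze_sbs(m);
    }
    return 0;
}

/* Ordering suggested in the reply: analyse first, then check. */
static int toy_run_check_after(struct toy_mddev *m)
{
    if (!m->raid_disks) {
        toy_analyze_sbs(m);
        if (!m->persistent)
            return -EINVAL;
    }
    return 0;
}
```

For an array that has not been analysed yet, the first ordering always fails with -EINVAL, which matches the "do_md_run() returned -22" message on Linux, where EINVAL is 22.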
--
Hi Andrew,
The 2.6.24-rc6-mm1 kernel with hotfix x86-fix-system-gate-related-crash.patch applied
panics while booting on an x86_64 box

Unable to handle kernel NULL pointer dereference at 0000000000000046 RIP:
[<ffffffff80369a0b>] rb_erase+0xe7/0x2a3
PGD 17ff65067 PUD 17f1c7067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:0a.0/0000:02:04.0/host0/target0:0:6/0:0:6:0/type
CPU 0
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.24-rc6-mm1-autokern1 #1
RIP: 0010:[<ffffffff80369a0b>] [<ffffffff80369a0b>] rb_erase+0xe7/0x2a3
RSP: 0000:ffffffff80650e00 EFLAGS: 00010002
RAX: ffff8101fe9568c8 RBX: ffff8100010062a8 RCX: ffff8101fe9568b0
RDX: ffff8101fe9568c8 RSI: 0000000000000046 RDI: 0000000000000000
RBP: ffffffff80650e10 R08: ffff8101fe9568c8 R09: 0000000000000086
R10: 0000000000000000 R11: 00000000000001e8 R12: ffff8100010062b8
R13: 0000000000000002 R14: ffff810001006260 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffffffff805dc000(0000) knlGS:00000000f31ffbb0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000046 CR3: 000000017f0ab000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff805f6000, task ffffffff805a2080)
Stack: ffff8100010062a8 ffff8101fe9568b0 ffffffff80650e40 ffffffff8024be16
ffffffff80369d65 ffffffff80369d65 ffff8101fe9568b0 ffff8100010062a8
ffffffff80650eb0 ffffffff8024c1d5 ffffffffb88cc28e 0000000006e73eff
Call Trace:
<IRQ> [<ffffffff8024be16>] __remove_hrtimer+0x2e/0x3c
[<ffffffff80369d65>] __down_read_trylock+0x16/0x42
[<ffffffff80369d65>] __down_read_trylock+0x16/0x42
[<ffffffff8024c1d5>] hrtimer_run_queues+0x130/0x191
[<ffffffff8023fd09>] run_timer_softirq+0x28/0x1a7
[<ffffffff8023c018>] __do_softirq+0x55/0xc2
[<ffffffff8020c73c>] call_softirq+0x1c/0x28
[<ffffffff8020e71...
It does seem to be mostly hrtimer-related. But surely the hrtimer system
is initialised by the time this happens.

The usual refrain: is it possible to run a bisection search?
--
Hi Andrew,
While doing the git bisect, following panic was seen
Unable to handle kernel paging request at 000000000000401e RIP:
[<ffffffff80232ec8>] load_balance_monitor+0x15e/0x2a4
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:0a.0/0000:02:04.0/host0/target0:0:6/0:0:6:0/type
CPU 1
Modules linked in:
Pid: 15, comm: load_balance_mo Not tainted 2.6.24-rc6-mm1-autokern1 #1
RIP: 0010:[<ffffffff80232ec8>] [<ffffffff80232ec8>] load_balance_monitor+0x15e/0x2a4
RSP: 0000:ffff81007ffb7eb0 EFLAGS: 00010297
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 000000000000401e RSI: ffff81007ffb7ed8 RDI: 0000000000000000
RBP: ffff81007ffb7f20 R08: ffff81007ffb6000 R09: ffff81007ffb6000
R10: ffff81007ffb6000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000003 R14: 0000000000000800 R15: ffff8101fe997f00
FS: 0000000000000000(0000) GS:ffff8100e3b10000(0000) knlGS:00000000f73e1bb0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000000401e CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process load_balance_mo (pid: 15, threadinfo ffff81007ffb6000, task ffff81007ff94790)
Stack: 0000000000002000 0000000000000000 ffff810001009cc0 00000001e3b29d90
0000008000000000 000000000000000f ffff81007f0be780 000000000000000f
000000017ffb7f20 0000000000000000 00000000fffffffc ffffffffffffffff
Call Trace:
[<ffffffff80232d6a>] load_balance_monitor+0x0/0x2a4
[<ffffffff80247830>] kthread+0x3d/0x63
[<ffffffff8020c2b8>] child_rip+0xa/0x12
[<ffffffff802477f3>] kthread+0x0/0x63
[<ffffffff8020c2ae>] child_rip+0x0/0x12

Code: 48 8b 04 c2 48 8b 10 48 01 55 98 e8 ce 40 12 00 83 f8 07 41
RIP [<ffffffff80232ec8>] load_balance_monitor+0x15e/0x2a4
RSP <ffff81007ffb7eb0>
CR2: 000000000000401e

The git-sched.patch is causing this panic, and i am se...
Hmmm. Looking into it :-).
--
regards,
Dhaval
--
I will do the bisect and update.
--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--
Happened to be looking more closely than usual at my dmesg looking for something
else, and spotted this:

[ 6.079043] power_supply BAT0: 11 dynamic props
[ 6.079045] power_supply BAT0: prop STATUS=Full
[ 6.079047] power_supply BAT0: prop PRESENT=1
[ 6.079049] power_supply BAT0: prop TECHNOLOGY=Li-ion
[ 6.079052] power_supply BAT0: prop VOLTAGE_MIN_DESIGN=11100000
[ 6.079054] power_supply BAT0: prop VOLTAGE_NOW=11793000
[ 6.079056] power_supply BAT0: prop CURRENT_NOW=1000
[ 6.079058] power_supply BAT0: prop CHARGE_FULL_DESIGN=7800000
[ 6.079061] power_supply BAT0: prop CHARGE_FULL=3110000
[ 6.079063] power_supply BAT0: prop CHARGE_NOW=7800000
[ 6.079065] power_supply BAT0: prop TIME_TO_FULL_AVG=DELL FF2316
[ 6.079067] power_supply BAT0: prop MODEL_NAME=Sanyo
[ 6.079301] ACPI: SSDT 7FE82138, 0244 (r1 PmRef Cpu0Ist 3000 INTL 20050624)
[ 6.079488] ACPI: SSDT 7FE81EED, 01C6 (r1 PmRef Cpu0Cst 3001 INTL 20050624)

What's with that TIME_TO_FULL_AVG value? Is the battery on crack, or my BIOS,
or the driver? I expected time units, not a Dell part number ;) (Yes, I know
CHARGE_FULL is low, the battery is pretty beat, and Latitudes seem to always
report CHARGE_NOW as "design full" when running off the AC power brick)

(For the record, dmidecode says:
Handle 0x1600, DMI type 22, 26 bytes
Portable Battery
Location: Sys. Battery Bay
Manufacturer: Sanyo
Name: DELL FF2316A
Design Capacity: 78000 mWh
Design Voltage: 11100 mV
SBDS Version: 1.0
Maximum Error: 0%
SBDS Serial Number: 0355
SBDS Manufacture Date: 2006-10-26
SBDS Chemistry: LION
OEM-specific Information: 0x00000001

So it even managed to lose the trailing 'A'.. ;)
I've bisected it down this far:
kvm-ist-kaput.patch GOOD
git-lblnet.patch
git-lblnet-fixup.patch
git-leds.patch
git-libata-all.patch
git-libata-all-fix-pata_winbond-borkage.patch
git-libata-all-wtf.patch BAD

and somehow, I doubt the leds or libata trees horked up networking. ;)
Symptoms - semi-sporadic failures in making network connections. The test
case that tripped it up was the 'make test' from the Tcl 8.5 - several of the
test cases will create a listening socket, and then try to connect to it.
Under 2.6.24-rc5-mm1, it works just fine, but I'm seeing hangs under -rc6-mm1.
Doing a 'netstat -n -a -A inet -p' while it's hung shows me this:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:34118 0.0.0.0:* LISTEN 2236/tcltest
tcp 0 1 127.0.0.1:59460 127.0.0.1:34118 SYN_SENT 2236/tcltest
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:47842 0.0.0.0:* LISTEN 2352/tcltest
tcp 0 1 127.0.0.1:46510 127.0.0.1:47842 SYN_SENT 2352/tcltest
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:47842 0.0.0.0:* LISTEN 2352/tcltest
tcp 0 1 127.0.0.1:46510 127.0.0.1:47842 SYN_SENT 2352/tcltest

Pretty consistent failure mode - a socket is in 'listen', and the connection
gets hung in 'SYN_SENT'. There are 3 outputs listed - the first one from one run
of the test case, the second 2 are some 20 seconds...
Can you post your .config ?
Also, is that the plain upstream Tcl package you're compiling, or a distro
package?

--
James Morris
<jmorris@namei.org>
--
The gzip'ed config as of when I quit bisecting is attached. It's probably
not directly usable unless you have a quilt tree that's positioned fairly

It's actually a CVS pull of the upstream, but tcl 8.5 was released back on
12/19, and there's nothing obvious in the 4 commits since then. So you should
be able to snarf a 8.5 source tarball, untar it, 'cd tcl/unix', run
./configure, make, make test, and that should replicate it - the 'socket'
test hangs quite consistently for me, and a few earlier ones *sometimes*
hang.

[attachment: config-lblnet.gz (application/x-gzip, base64 data omitted)]
What does the following say ?
# sestatus && rpm -q selinux-policy
Do you see anything unusual in the audit log or syslog?
Try
# ausearch -hn 127.0.0.1
and
# ausearch -x tcltest
- James
--
James Morris
<jmorris@namei.org>
--
Don't worry about that -- I reproduced it with Paul Moore's git tree:
git://git.infradead.org/users/pcmoore/lblnet-2.6_testing

(under current -mm, the e1000 driver doesn't find my ethernet card & the
tcl tests won't run without an external interface).

The offending commit is when SELinux is converted to the new ifindex
interface:

9c6ad8f6895db7a517c04c2147cb5e7ffb83a315 is first bad commit
commit 9c6ad8f6895db7a517c04c2147cb5e7ffb83a315
Author: Paul Moore <paul.moore@hp.com>
Date: Fri Dec 21 11:44:26 2007 -0500

SELinux: Convert the netif code to use ifindex values
[...]
In some cases (not yet fully identified -- also happens when avahi starts
up, although seemingly silently & without obvious issues), SELinux is
passed an ifindex of 1515870810, which corresponds to 0x5a5a5a5a, the slab
poison value, suggesting a race in the calling code where we're being
asked to check an skb which has been freed.

The SELinux code is erroring out before performing an access check
(perhaps there should be a WARN_ON, at least), so this will affect both
permissive and enforcing mode without generating any log messages.

Andrew: I suggest dropping the patchset from -mm until Paul gets back from
vacation.

- James
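
The decimal-to-hex step above is easy to check: 1515870810 is 0x5a repeated in all four bytes. A small sketch of a helper that recognises such byte-fill patterns (is_byte_fill is a hypothetical name; the fill byte is a parameter so other patterns can be tested too):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if every byte of the 32-bit value v equals `byte`.
 * Multiplying a byte by 0x01010101 replicates it into all four bytes. */
static bool is_byte_fill(uint32_t v, uint8_t byte)
{
    return v == (uint32_t)byte * UINT32_C(0x01010101);
}
```

A check of this kind is one cheap way to turn "implausible ifindex" into a loud warning instead of a silent error return, which is the gap the message above points out.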
--
James Morris
<jmorris@namei.org>
--
OK, thanks.
--
Indeed, it works for me.
- James
--
James Morris
<jmorris@namei.org>
--
I'm running MLS in permissive mode, so there shouldn't be any SElinux
denials happening.
Hi Andrew, Ingo, Thomas, Peter,
x86: revert i386: handle an initrd in highmem
The patch caused a failure while booting a kexec kernel.
(http://lkml.org/lkml/2008/1/7/42 has the bisect details.)

The following patch reverts it.
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
---
arch/x86/boot/header.S | 5 -
arch/x86/kernel/setup_32.c | 113 +++++++--------------------------------------
2 files changed, 19 insertions(+), 99 deletions(-)

Index: linux-2.6.24-rc6/arch/x86/boot/header.S
===================================================================
--- linux-2.6.24-rc6.orig/arch/x86/boot/header.S
+++ linux-2.6.24-rc6/arch/x86/boot/header.S
@@ -195,13 +195,10 @@ cmd_line_ptr: .long 0 # (Header version
# can be located anywhere in
# low memory 0x10000 or higher.

-ramdisk_max: .long 0x7fffffff
+ramdisk_max: .long (-__PAGE_OFFSET-(512 << 20)-1) & 0x7fffffff
# (Header version 0x0203 or later)
# The highest safe address for
# the contents of an initrd
- # The current kernel allows up to 4 GB,
- # but leave it at 2 GB to avoid
- # possible bootloader bugs.

kernel_alignment: .long CONFIG_PHYSICAL_ALIGN #physical addr alignment
#required for protected mode
Index: linux-2.6.24-rc6/arch/x86/kernel/setup_32.c
===================================================================
--- linux-2.6.24-rc6.orig/arch/x86/kernel/setup_32.c
+++ linux-2.6.24-rc6/arch/x86/kernel/setup_32.c
@@ -583,95 +583,6 @@ static void __init relocate_initrd(void)

#endif /* CONFIG_BLK_DEV_INITRD */
-#ifdef CONFIG_BLK_DEV_INITRD
-
-static bool do_relocate_initrd = false;
-
-static void __init reserve_initrd(void)
-{
- unsigned long ramdisk_image = boot_params.hdr.ramdisk_image;
- unsigned long ramdisk_size = boot_params.hdr.ramdisk_size;
- unsigned long ramdisk_end = ramdisk_imag...
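
To see what the restored ramdisk_max expression evaluates to, here is the same arithmetic in C with 32-bit unsigned wraparound. The value 0xC0000000 for __PAGE_OFFSET is an assumption (the common 3G/1G split on i386); ramdisk_max_for is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* (-page_offset - (512 << 20) - 1) & 0x7fffffff, computed modulo 2^32
 * to mirror a 32-bit .long expression (assumption about the assembler's
 * arithmetic width). */
static uint32_t ramdisk_max_for(uint32_t page_offset)
{
    uint32_t v = 0u - page_offset;      /* size of the kernel's address range */
    v -= UINT32_C(512) << 20;           /* keep 512 MB in reserve */
    v -= 1u;                            /* highest usable byte address */
    return v & UINT32_C(0x7fffffff);
}
```

With a page offset of 0xC0000000 this gives 0x1fffffff, i.e. the initrd has to end below 512 MB, compared with the 0x7fffffff (2 GB) limit used by the patch being reverted.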
Thanks for tracking this down. I'll pull the patch from the x86 git
tree as well.

tglx
--
Dhaval, how about the other problem you had - do you have any guess
what it might be related to?

i'm also wondering - what would be the easiest way to integrate kexec
into an automated test environment. If i have a bzImage kernel, is kexec
still supposed to work? Could i for example do a reboot into a new
(kexec-enabled) kernel via kexec in essence?

Ingo
--
other problem? The load_balance_monitor one? (We are still looking into

Yes, I use a bzImage kernel to reboot using kexec. I use a script which
just sets it up for me. (I can send it to you separately).

--
regards,
Dhaval
--
My daily/nightly kernel test runs use kexec to boot the test kernel.
Well, they did through 2.6.24-rc6-git9, but they fail after that.
Hopefully this patch fixes things.

---
~Randy
--
The patch in question is not in mainline, it's in the mm branch of
x86.git. So the problem in mainline is a different one. Could you
bisect it please?

Thanks,
tglx
--
Ugh. As happens during demos, this (kexec failure) won't happen for me
when I want it to. I'll keep an eye out for it...

---
~Randy
--
hmmm. I don't think so. This revert is from the x86 git tree (-mm) (I think
targeted for 2.6.25). Probably a bisect might help there.

--
regards,
Dhaval
--
Hello,
This is from allnoconfig on sparc64:
LD .tmp_vmlinux1
arch/sparc64/kernel/head.o: In function `kvmap_vmemmap':
(.text+0x34ec): undefined reference to `vmemmap_table'
arch/sparc64/kernel/head.o: In function `kvmap_vmemmap':
(.text+0x34f4): undefined reference to `vmemmap_table'
make: *** [.tmp_vmlinux1] Error 1

Regards,
Mariusz
Linux sparc64 2.6.23 #2 SMP PREEMPT Fri Dec 21 21:20:01 CET 2007 sparc64 sun4u TI UltraSparc II (BlackBird) GNU/Linux
Gnu C 4.1.2
Gnu make 3.81
binutils 2.18
util-linux 2.12r
mount 2.12r
module-init-tools 3.4
e2fsprogs 1.40.3
Linux C Library 2.6.1
Dynamic linker (ldd) 2.6.1
Procps 3.2.7
Net-tools 1.60
Kbd 1.13
Sh-utils 6.9
udev 115
--
Happens in mainline too. Maybe arch/sparc64/kernel/ktlb.S needs to be
taught about CONFIG_SPARSEMEM_VMEMMAP=n.
--
From: Andrew Morton <akpm@linux-foundation.org>
It's pointless to support this thing being off. If possible
I'd like a method to force it always to be enabled and I'll
look into doing that.
--
With CONFIG_BLOCK=n:
LD drivers/block/built-in.o
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/base/core.c: In function 'device_add_class_symlinks':
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/base/core.c:707: error: 'part_type' undeclared (first use in this function)
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/base/core.c: In function 'device_remove_class_symlinks':
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/base/core.c:746: error: 'part_type' undeclared (first use in this function)
make[3]: *** [drivers/base/core.o] Error 1

and that is after fixing (in some sense) the first CONFIG_BLOCK=n
problem with the patch below. Please test lots of configs.
and/or use 'make randconfig' (automated, scripted, e.g., etc.).
maybe check Documentation/SubmitChecklist. :)

---
From: Randy Dunlap <randy.dunlap@oracle.com>
Parts of driver core use blk_lookup_devt() when CONFIG_BLOCK=n,
so provide a short inline version of it for that case.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
---
include/linux/genhd.h | 7 +++++++
1 file changed, 7 insertions(+)

--- linux-2.6.24-rc6-mm1.orig/include/linux/genhd.h
+++ linux-2.6.24-rc6-mm1/include/linux/genhd.h
@@ -10,6 +10,7 @@
*/

#include <linux/types.h>
+#include <linux/kdev_t.h>

#ifdef CONFIG_BLOCK
@@ -443,6 +444,12 @@ static inline struct block_device *bdget
static inline void printk_all_partitions(void) { }
+static inline dev_t blk_lookup_devt(const char *name)
+{
+ dev_t devt = MKDEV(0, 0);
+ return devt;
+}
+
#endif /* CONFIG_BLOCK */

#endif
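
The pattern used by this patch, a real declaration when the feature is configured in and a trivial inline stub otherwise, can be tried in userspace. In this sketch `toy_dev_t` and the local MINORBITS/MKDEV definitions are stand-ins modeled on the kernel's (assumption), and CONFIG_BLOCK is deliberately left undefined so the stub branch is the one compiled:

```c
#include <assert.h>

typedef unsigned int toy_dev_t;                 /* stand-in for dev_t */

#define MINORBITS 20
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

#ifdef CONFIG_BLOCK
/* Block layer present: the real lookup would be defined elsewhere. */
toy_dev_t blk_lookup_devt(const char *name);
#else
/* Block layer absent: callers still compile and get "no device". */
static inline toy_dev_t blk_lookup_devt(const char *name)
{
    (void)name;
    return MKDEV(0, 0);
}
#endif
```

Callers such as the driver core can then call blk_lookup_devt() unconditionally, without their own #ifdef CONFIG_BLOCK blocks; with the stub they simply see device number 0:0.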
--
Ingo seems to be saying that he has some kind of "automated" build
system to do this kind of checking. Ingo, did you ever post how you did
this anywhere? I have enough spare machines here that I should be able
to set up something to test my stuff this way easier than doing it by

Thanks for this patch, I've merged it with the original block patch so
there is no regression along the way.

greg k-h
--
the crux of it is this patch:
http://redhat.com/~mingo/auto-qa-patches/Kconfig-qa.patch
(ontop of x86.git)
adjust your arch/x86/Kconfig.needed whitelist (should already work on
typical systems) and do a 'make randconfig'. Every config is supposed to
build and boot fine, including 'make allnoconfig' and 'make
allyesconfig'. And please let me know about any blacklist items as well.
(right now they are a bit hacky via a "depends on 0" line and a small
comment explaining why they are not suitable in a bzImage kernel.)

( the CONFIG_BOOTPARAM stuff is there to easily randomize boot options -
we frequently have regressions that only trigger with certain boot
option combinations. )

it's somewhat hackish in places. The rest of my scripts (to scp a new
kernel image, to reboot a testbox, etc.) tie in to my specific
environment quite closely and make no sense to be posted. (they are also
quite ugly)

Ingo
--
config X86_ELAN
bool "AMD Elan"
depends on X86_32
+
+ # dangerous to boot on non-Elan CPUs
+ depends on 0
+
help
Select this for an AMD Elan processor.

Hmmm... Most options like "support 386" are of "add support for 386,
but do not break support for pentium". ELAN etc seem to be
exceptions. Perhaps options that _take away_ functionality (like ELAN
-- takes ability to boot on normal 386) should be specifically marked
somehow?

depends on EXCLUSIVE_FEATURE
?
depends on NOT_A_FEATURE
?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
CC drivers/w1/masters/w1-gpio.o
In file included from /local/linsrc/linux-2.6.24-rc6-mm1/drivers/w1/masters/w1-gpio.c:19:
include2/asm/gpio.h:4:18: error: gpio.h: No such file or directory
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/w1/masters/w1-gpio.c: In function 'w1_gpio_write_bit_dir':
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/w1/masters/w1-gpio.c:26: error: implicit declaration of function 'gpio_direction_input'
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/w1/masters/w1-gpio.c:28: error: implicit declaration of function 'gpio_direction_output'
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/w1/masters/w1-gpio.c: In function 'w1_gpio_write_bit_val':
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/w1/masters/w1-gpio.c:35: error: implicit declaration of function 'gpio_set_value'
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/w1/masters/w1-gpio.c: In function 'w1_gpio_read_bit':
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/w1/masters/w1-gpio.c:42: error: implicit declaration of function 'gpio_get_value'
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/w1/masters/w1-gpio.c: In function 'w1_gpio_probe':
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/w1/masters/w1-gpio.c:58: error: implicit declaration of function 'gpio_request'
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/w1/masters/w1-gpio.c:82: error: implicit declaration of function 'gpio_free'
make[4]: *** [drivers/w1/masters/w1-gpio.o] Error 1

---
~Randy
desserts: http://www.xenotime.net/linux/recipes/
The dependency is there:
config W1_MASTER_GPIO
tristate "GPIO 1-wire busmaster"
depends on GENERIC_GPIO

So it looks like the arch Kconfig has selected GENERIC_GPIO but failed
to provide the implementation.

--
Ville Syrjälä
syrjala@sci.fi
http://www.sci.fi/~syrjala/
--
There was a follow-up patch in the thread that limits X86_RDC321X
to X86_32 instead of any X86.

--
~Randy
desserts: http://www.xenotime.net/linux/recipes/
--
CC drivers/input/keyboard/gpio_keys.o
In file included from /local/linsrc/linux-2.6.24-rc6-mm1/drivers/input/keyboard/gpio_keys.c:27:
include2/asm/gpio.h:4:18: error: gpio.h: No such file or directory
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/input/keyboard/gpio_keys.c: In function 'gpio_keys_isr':
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/input/keyboard/gpio_keys.c:40: error: implicit declaration of function 'gpio_to_irq'
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/input/keyboard/gpio_keys.c:42: error: implicit declaration of function 'gpio_get_value'
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/input/keyboard/gpio_keys.c: In function 'gpio_keys_probe':
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/input/keyboard/gpio_keys.c:81: error: implicit declaration of function 'gpio_request'
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/input/keyboard/gpio_keys.c:88: error: implicit declaration of function 'gpio_direction_input'
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/input/keyboard/gpio_keys.c:93: error: implicit declaration of function 'gpio_free'
make[4]: *** [drivers/input/keyboard/gpio_keys.o] Error 1

CC drivers/leds/leds-gpio.o
In file included from /local/linsrc/linux-2.6.24-rc6-mm1/drivers/leds/leds-gpio.c:18:
include2/asm/gpio.h:4:18: error: gpio.h: No such file or directory
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/leds/leds-gpio.c: In function 'gpio_led_work':
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/leds/leds-gpio.c:34: error: implicit declaration of function 'gpio_set_value_cansleep'
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/leds/leds-gpio.c: In function 'gpio_led_set':
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/leds/leds-gpio.c:60: error: implicit declaration of function 'gpio_set_value'
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/leds/leds-gpio.c: In function 'gpio_led_probe':
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/leds/leds-gpio.c:85: error: implicit declaration of function 'gpio_cansleep'
/local/linsrc/linux-2.6.24-rc6-mm1/drivers/leds/leds-gpio.c:90: error: implici...
Find whatever broken patch selected (on x86_64)
CONFIG_GENERIC_GPIO=y
without actually providing that support (by providing <asm/gpio.h> and
an implementation backing it up). That's the patch which broke those
various GPIO-dependent drivers.

- Dave
--
OK, thanks for the direction.
---
From: Randy Dunlap <randy.dunlap@oracle.com>
X86_RDC321X is X86_32, so make it depend on X86_32 so that
X86_64 random configs don't try to build RDC and fail.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)

--- linux-2.6.24-rc6-mm1.orig/arch/x86/Kconfig
+++ linux-2.6.24-rc6-mm1/arch/x86/Kconfig
@@ -297,6 +297,7 @@ config X86_ES7000

config X86_RDC321X
bool "RDC R-321x SoC"
+ depends on X86_32
select M486
select X86_REBOOTFIXUPS
select GENERIC_GPIO
--
thanks Randy, i have applied your fix to x86.git.
Ingo
--
MODPOST 120 modules
ERROR: "i2c_attach_client" [drivers/media/video/v4l2-common.ko] undefined!
make[2]: *** [__modpost] Error 1

---
~Randy
desserts: http://www.xenotime.net/linux/recipes/
I fixed this problem in this changeset:
http://linuxtv.org/hg/v4l-dvb/rev/64e0c78821c4
Mauro, can you send this upstream?
for mm: here is the raw patch:
http://linuxtv.org/hg/v4l-dvb/raw-rev/64e0c78821c4
Regards,
Mike
--
Hmm, I apologize -- I think this was an unrelated issue. Sorry for the
confusion.

-Mike
--
From: Randy Dunlap <randy.dunlap@oracle.com>
When SYSFS=n and MODULES=y, build ends with:
linux-2.6.24-rc6-mm1/drivers/base/module.c: In function 'module_add_driver':
linux-2.6.24-rc6-mm1/drivers/base/module.c:49: error: 'module_kset' undeclared (first use in this function)
make[3]: *** [drivers/base/module.o] Error 1

Below is one possible fix.
Build-tested with all 4 config combinations of SYSFS & MODULES.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
---
drivers/base/Makefile | 2 ++
drivers/base/base.h | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)

--- linux-2.6.24-rc6-mm1.orig/drivers/base/Makefile
+++ linux-2.6.24-rc6-mm1/drivers/base/Makefile
@@ -11,7 +11,9 @@ obj-$(CONFIG_FW_LOADER) += firmware_clas
obj-$(CONFIG_NUMA) += node.o
obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o
obj-$(CONFIG_SMP) += topology.o
+ifeq ($(CONFIG_SYSFS),y)
obj-$(CONFIG_MODULES) += module.o
+endif
obj-$(CONFIG_SYS_HYPERVISOR) += hypervisor.o

ifeq ($(CONFIG_DEBUG_DRIVER),y)
--- linux-2.6.24-rc6-mm1.orig/drivers/base/base.h
+++ linux-2.6.24-rc6-mm1/drivers/base/base.h
@@ -79,7 +79,7 @@ extern char *make_class_name(const char

extern int devres_release_all(struct device *dev);
-#ifdef CONFIG_MODULES
+#if defined(CONFIG_MODULES) && defined(CONFIG_SYSFS)
extern void module_add_driver(struct module *mod, struct device_driver *drv);
extern void module_remove_driver(struct device_driver *drv);
#else
--
From: Randy Dunlap <randy.dunlap@oracle.com>
When CONFIG_PREEMPT_NONE=y, scatterwalk.h still uses cond_resched()
so it needs to include sched.h:

linux-2.6.24-rc6-mm1/include/crypto/scatterwalk.h:52: error: implicit declaration of function 'cond_resched'
make[2]: *** [crypto/digest.o] Error 1

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
---
include/crypto/scatterwalk.h | 1 +
1 file changed, 1 insertion(+)

--- linux-2.6.24-rc6-mm1.orig/include/crypto/scatterwalk.h
+++ linux-2.6.24-rc6-mm1/include/crypto/scatterwalk.h
@@ -23,6 +23,7 @@
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
+#include <linux/sched.h>

static inline enum km_type crypto_kmap_type(int out)
{
--
Thanks. This is already in cryptodev-2.6.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
This doesn't build on powerpc with my .config:
In file included from arch/powerpc/kernel/asm-offsets.c:17:
include/linux/sched.h: In function ‘spin_needbreak’:
include/linux/sched.h:1947: error: implicit declaration of function ‘__raw_spin_is_contended’

I don't see where __raw_spin_is_contended is defined for any arch
other than x86, so I guess this will happen on any non-x86 arch when
SMP=y and PREEMPT=y are set?

This comes from "spinlock: lockbreak cleanup" in git-x86.
--
Joseph Fannin
jfannin@gmail.com
--
And CONFIG_GENERIC_LOCKBREAK is not defined, which is what powerpc needs.
Thanks for reporting,
Nick

---
Index: linux-2.6/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.orig/arch/powerpc/Kconfig
+++ linux-2.6/arch/powerpc/Kconfig
@@ -53,6 +53,11 @@ config RWSEM_XCHGADD_ALGORITHM
bool
default y

+config GENERIC_LOCKBREAK
+ bool
+ default y
+ depends on SMP && PREEMPT
+
config ARCH_HAS_ILOG2_U32
bool
default y
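
For readers unfamiliar with the failure: the generic header calls an arch-provided hook unless the architecture opts into a generic fallback, and powerpc had neither. The sketch below models that selection in userspace C. DEMO_GENERIC_LOCKBREAK, DEMO_ARCH_HAS_IS_CONTENDED, struct demo_spinlock and the demo_* helpers are invented for illustration and are not the kernel's definitions.

```c
#include <assert.h>

/*
 * Models the "default y" in the Kconfig hunk above: the generic fallback is
 * on unless the build claims the arch has its own hook. Both macro names
 * are hypothetical.
 */
#ifndef DEMO_ARCH_HAS_IS_CONTENDED
#define DEMO_GENERIC_LOCKBREAK 1
#endif

struct demo_spinlock {
	int locked;
#ifdef DEMO_GENERIC_LOCKBREAK
	int break_lock;	/* set by a waiter that wants the holder to yield */
#endif
};

#ifdef DEMO_GENERIC_LOCKBREAK
/* Generic fallback: contention is whatever the waiter flagged. */
static inline int demo_spin_is_contended(const struct demo_spinlock *lock)
{
	return lock->break_lock;
}
#else
/*
 * The arch must define this. If it does not, the link fails, which is
 * loosely analogous to the powerpc build error reported in this thread.
 */
int demo_arch_spin_is_contended(const struct demo_spinlock *lock);

static inline int demo_spin_is_contended(const struct demo_spinlock *lock)
{
	return demo_arch_spin_is_contended(lock);
}
#endif

/* Same shape as spin_needbreak(): should the holder drop the lock? */
static inline int demo_spin_needbreak(const struct demo_spinlock *lock)
{
	assert(lock != 0);
	return demo_spin_is_contended(lock);
}
```

With the default build the generic branch is taken, so demo_spin_needbreak() simply reports the break_lock flag.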
--
Hello,
WARNING: vmlinux.o(.text+0x46b04): Section mismatch: reference to .init.text:sun4v_ktsb_register (between 'smp_callin' and 'smp_fill_in_sib_core_maps')
WARNING: vmlinux.o(.text+0x4756c): Section mismatch: reference to .init.text:sun4v_register_mondo_queues (between 'after_lock_tlb' and 'hv_cpu_startup')
WARNING: vmlinux.o(.text+0x477ac): Section mismatch: reference to .init.text:sun4v_register_mondo_queues (between 'hv_cpu_startup' and 'sys32_exit')
WARNING: vmlinux.o(.text+0x55258): Section mismatch: reference to .init.text:__alloc_bootmem (between 'kernel_map_range' and 'kernel_map_pages')
WARNING: vmlinux.o(.text+0x55278): Section mismatch: reference to .init.text:__alloc_bootmem (between 'kernel_map_range' and 'kernel_map_pages')
WARNING: vmlinux.o(.text+0x1fdfe4): Section mismatch: reference to .init.text:sunserial_console_match (between 'hv_probe' and 'serial_in')
WARNING: vmlinux.o(.text+0x20011c): Section mismatch: reference to .init.text:sunserial_console_match (between 'su_probe' and 'sunsu_console_putchar')
WARNING: vmlinux.o(.sun4v_2insn_patch+0x3d8): Section mismatch: reference to .init.text:
WARNING: vmlinux.o(__ksymtab+0x62c0): Section mismatch: reference to .init.text:sunserial_console_match (between '__ksymtab_sunserial_console_match' and '__ksymtab_sunserial_unregister_minors')

Regards,
Mariusz
--
From: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Well known and I see them every build and so does everyone
else on sparc64.

They are harmless and as time allows I try to find ways
to get rid of them but it's very low priority.
--
At least the sunserial_console_match() one is an obvious Oops
(EXPORT_SYMBOL of an __init function).

The comment in the description of
commit 58d784a5c754cd66ecd4791222162504d3c16c74 that the warning was bogus
is bullshit.

I'm not sure whether this might count as a 2.6.24-rc regression or
whether 2.6.23 is simply differently but similarly broken (does anyone
actually use the Sun console drivers modular?).

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

--
From: Adrian Bunk <bunk@kernel.org>
You can't do that, the FOO_CONSOLE config options depend upon
FOO=y.

That's why I'm not worried about this issue and it's not critical at
all.
--
Looking closer, the problem isn't the FOO_CONSOLE options themselves,
the problem is that with FOO_CONSOLE=n sunserial_console_match() still

If a module calls sunserial_console_match() that's an Oops.
I removed the EXPORT_SYMBOL(sunserial_console_match), and this is the
result:
MODPOST 136 modules
ERROR: "sunserial_console_match" [drivers/serial/sunzilog.ko] undefined!
ERROR: "sunserial_console_match" [drivers/serial/sunsu.ko] undefined!
ERROR: "sunserial_console_match" [drivers/serial/sunsab.ko] undefined!

-ENOHARDWARE, but looking at the code you could call me _very_ surprised
if you manage to load a modular sunsab from 2.6.24-rc6 on a machine with
the hardware without getting an Oops.

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

--
From: Adrian Bunk <bunk@kernel.org>
That's true.
I'm trying to figure out a way to fix this.
--
#ifdef FOO_CONSOLE around the sunserial_console_match() calls in the
drivers should work.

If you consider this too many #ifdef's, an alternative solution would be
doing the following in drivers/serial/suncore.h:

#ifndef MODULE
extern int sunserial_console_match(struct console *, struct device_node *,
                                   struct uart_driver *, int);
#else
/* modular build: no console matching, always report "no match" */
static inline int sunserial_console_match(struct console *con,
                                          struct device_node *dp,
                                          struct uart_driver *drv, int line)
{ return 0; }
#endif

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

--
From: Adrian Bunk <bunk@kernel.org>
It absolutely doesn't work, I tried this, see my other reply.
The issue is add_preferred_console() is __init, driver probe calls are
__devinit which are either __init or not __init.

So even with the FOO_CONSOLE ifdef (or something similar like the

Just removing the __init tag from add_preferred_console() (and
subsequently sunserial_console_match()) is probably the easiest way to
fix all of this.
--
Sorry, I shouldn't suggest stuff I haven't tried myself... :-(
cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

--
From: David Miller <davem@davemloft.net>
At the end of this email is one idea I came up with but it still
results in:

WARNING: vmlinux.o(.text+0x19e52c): Section mismatch: reference to .init.text:sunserial_console_match (between 'hv_probe' and 'sunzilog_get_mctrl')
WARNING: vmlinux.o(.text+0x19fd1c): Section mismatch: reference to .init.text:sunserial_console_match (between 'zs_probe' and 'serial_in')
WARNING: vmlinux.o(.text+0x19fd5c): Section mismatch: reference to .init.text:sunserial_console_match (between 'zs_probe' and 'serial_in')
WARNING: vmlinux.o(.text+0x1a19e0): Section mismatch: reference to .init.text:sunserial_console_match (between 'su_probe' and 'sunsu_console_putchar')
WARNING: vmlinux.o(.text+0x1a307c): Section mismatch: reference to .init.text:sunserial_console_match (between 'sab_probe' and 'sunsab_send_xchar')
WARNING: vmlinux.o(.text+0x1a3090): Section mismatch: reference to .init.text:sunserial_console_match (between 'sab_probe' and 'sunsab_send_xchar')
WARNING: vmlinux.o(.sun4v_2insn_patch+0x4f8): Section mismatch: reference to .init.text:

if CONFIG_HOTPLUG is set because driver initialization code has to be
marked with __devinit and with HOTPLUG that isn't __init.

This means it's impossible to call add_preferred_console() (either
directly or indirectly via a helper like sunserial_console_match())
from a driver init routine.

The only way I can think of to "work around" this is to mark
sunserial_console_match() as __init_refok, and use some static
variable in suncore which starts as "0" and gets set to "1" via a
late_initcall() to block the call to add_preferred_console().

But that's just gross.
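
For illustration only, the static-flag workaround described above can be modelled in userspace C as follows. Every demo_* name is a hypothetical stand-in: there is no real __init section here, so an assert() plays the role of "this function must not run once init is over". This is not the patch that was tried in the thread.

```c
#include <assert.h>

static int demo_init_done;		/* starts as 0 */
static int demo_preferred_calls;	/* counts calls into the init-only helper */

/* Stand-in for an __init function: only valid while init is still running. */
static int demo_add_preferred_console(int index)
{
	assert(!demo_init_done);
	demo_preferred_calls++;
	return index;
}

/* Stand-in for the late_initcall() hook that flips the flag to 1. */
static int demo_late_init(void)
{
	demo_init_done = 1;
	return 0;
}

/*
 * Stand-in for sunserial_console_match(): once the flag is set it returns
 * "no match" without touching the init-only helper.
 */
static int demo_console_match(int index)
{
	if (demo_init_done)
		return 0;
	return demo_add_preferred_console(index) == index;
}
```

Calling demo_console_match() before demo_late_init() reaches the helper; calling it afterwards returns 0 and leaves the helper alone, which is the blocking behaviour the paragraph describes.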
Probably the thing to do to untangle this is to make
add_preferred_console() not be __init. I just tested and that seems
to make everything happy. Again, below is the first thing I
tried just for reference.

diff --git a/drivers/serial/suncore.c b/drivers/serial/suncore.c
index 707c5b0..a4cbd17 100644
--- a/drivers/serial/suncore.c
+++ b/drivers/serial/s...
From: David Miller <davem@davemloft.net>
Adrian, if you're interested in tackling this "fun" problem,
have a look at add_preferred_console() and find a way to make
that not marked __init. (it's called by sunserial_console_match)

That's what causes this dependency chain of __init problems for the
Sun serial console drivers.

It's problematic, furthermore, because even if one could call
add_preferred_console() from a module properly, it doesn't have the
desired effect of changing init's stdin/stdout/stderr
--
Hi,
another one most likely related to the recent NFS_V4 define build error
saga:

CC fs/nfs/super.o
fs/nfs/super.c: In function 'nfs_sb_deactive':
fs/nfs/super.c:338: error: 'TASK_NORMAL' undeclared (first use in this function)
fs/nfs/super.c:338: error: (Each undeclared identifier is reported only once
fs/nfs/super.c:338: error: for each function it appears in.)
fs/nfs/super.c: In function 'nfs_put_super':
fs/nfs/super.c:349: error: 'TASK_UNINTERRUPTIBLE' undeclared (first use in this function)
fs/nfs/super.c:349: error: implicit declaration of function 'schedule'
make[3]: *** [fs/nfs/super.o] Error 1
make[2]: *** [fs/nfs] Error 2
make[1]: *** [fs] Error 2
make[1]: Leaving directory `/usr/src/linux-2.6.24-rc6-mm1.system-gate-patch'
make: *** [debian/stamp-build-kernel] Error 2

This was hand-patched from earlier kernel versions, however I wouldn't
think there was any problem due to this (a cleanly extracted version
doesn't show any md5sum difference for fs/nfs/super.c).

[plus hotfix x86-fix-system-gate-related-crash.patch]
I'm circa 120% sure there must be a sched.h include missing there, given the
whereabouts of these APIs ;)

CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
# CONFIG_NFS_V3_ACL is not set
CONFIG_NFS_V4=y
# CONFIG_NFS_DIRECTIO is not set
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
# CONFIG_NFSD_V3_ACL is not set
CONFIG_NFSD_V4=y
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_NFS_COMMON=y

i386 K6-III@150, gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
Thanks,
Andreas Mohr
--
Suspend is also broken on my HP nx6325 (hangs hard in the last phase of
suspend) and git-cpufreq.patch is responsible for that (as shown by bisection).

Reverting git-cpufreq.patch makes suspend work again, although it still is a
bit flaky (it takes well more than 5 seconds to suspend and the sound adapter
doesn't work right after the resume, but it starts to work again about 10s
later).

Thanks,
Rafael
--
hm. There have been some suspend changes in the alsa tree.
And yes, I noticed that suspend has become slower too - looks like about ten
seconds, which is a pretty significant usability irritant.

--
At Sun, 23 Dec 2007 14:50:03 -0800,
Not really. The usb-audio suspend support is the only addition on
mm. It should be irrelevant with on-board HD-audio...

Takashi
--
Well, I'm suspecting some ACPI changes, but will be only able to debug it
further in a couple of days.

Rafael
--
On Sun, Dec 23, 2007 at 02:50:03PM -0800, Andrew Morton wrote:
> > > - Someone broke suspend-to-RAM on the t61p again. It just instantly resumes
> > > itself.
> >
> > Suspend is also broken on my HP nx6325 (hangs hard in the last phase of
> > suspend) and git-cpufreq.patch is responsible for that (as shown by bisection).
> >
> > Reverting git-cpufreq.patch makes suspend work again,
>
> ah. Thanks.

I'm not sure how this is 'new' breakage, because git-cpufreq hasn't changed
in a while, other than the integration of that missing #include diff
that sat in -mm. Maybe some bad interaction with something else that
changed perhaps. *shrug*.

I'm on vacation until the new year, so I'm going out of my way not to look
at bugs for a change. But I'm not ignoring this completely, I'll make a
note to look at it in January.

Dave
Well it doesn't build on x86-64 for me:
CHK include/linux/compile.h
CC arch/x86/ia32/../../../fs/compat_binfmt_elf.o
Assembler messages:
Fatal error: can't create arch/x86/ia32/../../../fs/.tmp_compat_binfmt_elf.o: No such file or directory
make[2]: *** [arch/x86/ia32/../../../fs/compat_binfmt_elf.o] Error 2

I will post the .config if anyone is interested.
Thanks,
Rafael
--
It's a Kbuild race -- if you keep re-building it will eventually build
the right file.

Not excusable, but that's what's going on.
-hpa
--
yes, please send the .config.
Ingo
--
Attached.
It also may be relevant that I compile the kernel with "make O=../build".
Thanks,
Rafael
I ran the compilation once again and it worked. Strange.
Thanks,
Rafael
--
Try to delete your fs/ directory in your output dir.
Then I expect the same bug to surface again.

I guess it is because arch/x86/ia32/ is built before fs/ and
gcc cannot create directories for the output files and
it is the dependency file that triggers the error as this
is the first file to be generated.
The right fix is to move the build of compat_binfmt_elf to
fs/Makefile as already discussed.

Sam
--
I think you are right.
Greetings,
Rafael
--
could you try the patch from Sam below - does it fix the problem?
Thanks,

Ingo
---------->
Subject: x86 compat_binfmt_elf, Makefile fixes
From: Sam Ravnborg <sam@ravnborg.org>

fix the build rules of compat-binfmt_elf.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/Kconfig | 1 +
arch/x86/ia32/Makefile | 5 ++---
fs/Kconfig.binfmt | 10 ++++++++++
fs/Makefile | 1 +
4 files changed, 14 insertions(+), 3 deletions(-)

Index: linux-x86.q/arch/x86/Kconfig
===================================================================
--- linux-x86.q.orig/arch/x86/Kconfig
+++ linux-x86.q/arch/x86/Kconfig
@@ -1546,6 +1546,7 @@ source "fs/Kconfig.binfmt"
config IA32_EMULATION
bool "IA32 Emulation"
depends on X86_64
+ select HAVE_COMPAT_BINFMT_ELF
help
Include code to run 32-bit programs under a 64-bit kernel. You should
likely turn this on, unless you're 100% sure that you don't have any
Index: linux-x86.q/arch/x86/ia32/Makefile
===================================================================
--- linux-x86.q.orig/arch/x86/ia32/Makefile
+++ linux-x86.q/arch/x86/ia32/Makefile
@@ -2,7 +2,8 @@
# Makefile for the ia32 kernel emulation subsystem.
#

-obj-$(CONFIG_IA32_EMULATION) := ia32entry.o sys_ia32.o ia32_signal.o
+obj-$(CONFIG_IA32_EMULATION) := ia32entry.o sys_ia32.o ia32_signal.o \
+ ia32_binfmt.o

sysv-$(CONFIG_SYSVIPC) := ipc32.o
obj-$(CONFIG_IA32_EMULATION) += $(sysv-y)
@@ -11,5 +12,3 @@ obj-$(CONFIG_IA32_AOUT) += ia32_aout.o

audit-class-$(CONFIG_AUDIT) := audit.o
obj-$(CONFIG_IA32_EMULATION) += $(audit-class-y)
-
-obj-$(CONFIG_IA32_EMULATION) += ../../../fs/compat_binfmt_elf.o
Index: linux-x86.q/fs/Kconfig.binfmt
===================================================================
--- linux-x86.q.orig/fs/Kconfig.binfmt
+++ linux-x86.q/fs/Kconfig.binfmt
@@ -23,6 +23,16 @@ config BINFMT_ELF
ld.so (check the file <file:Documentation/Changes> for location and
latest v...
Well, with this patch applied the compilation reliably fails with:
No rule to make target `arch/x86/ia32/ia32_binfmt.o', needed by `arch/x86/ia32/built-in.o'.
[Do you want the .config, btw?]
Rafael
--
i think i'll wait for Roland and Sam to sort it out.
Ingo
--
hm, the fix for that is in x86.git already - perhaps you got an older
copy?

Ingo
--
hm, e3c1b141 is already the latest one.
Ingo
--
"already in", I assume.
You can always tell what I have by looking at the patch:
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc6...
It includes the head commit ID at the first line (it's in a
machine-readable form - Matthias's scripts which prepare the mm git tree
actually get the git-foo.patch info direct from the original repo rather
than by applying the diff from broken-out/)

Still. The crash is 100% repeatable and is the same every time. Happens
on both my i386 test boxes.

http://userweb.kernel.org/~akpm/config-sony.txt
http://userweb.kernel.org/~akpm/config-vmm.txt

and I bisected it down to e3c1b141.
--
ok, can reproduce it - the patch below fixes it for me.
Ingo
------------------------->
Subject: x86: fix system gate related crash
From: Ingo Molnar <mingo@elte.hu>

on 32-bit, system gates are traps.
on 64-bit, they are interrupts (which disable hardirqs).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/asm-x86/desc.h | 4 ++++
1 file changed, 4 insertions(+)

Index: linux-x86.q/include/asm-x86/desc.h
===================================================================
--- linux-x86.q.orig/include/asm-x86/desc.h
+++ linux-x86.q/include/asm-x86/desc.h
@@ -310,7 +310,11 @@ static inline void set_trap_gate(unsigne
static inline void set_system_gate(unsigned int n, void *addr)
{
BUG_ON((unsigned)n > 0xFF);
+#ifdef CONFIG_X86_32
+ _set_gate(n, GATE_TRAP, addr, 0x3, 0, __KERNEL_CS);
+#else
_set_gate(n, GATE_INTERRUPT, addr, 0x3, 0, __KERNEL_CS);
+#endif
}

static inline void set_task_gate(unsigned int n, unsigned int gdt_entry)
--
This would be a lot cleaner with entirely separate implementations
of set_system_gate for 32 vs 64 bit. Especially if the file already
has a large ifdef block for 32 vs 64.
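
As a rough userspace sketch of what "entirely separate implementations" could look like: the function is defined once per word size instead of carrying an #ifdef inside its body. DEMO_X86_32, demo_set_gate() and demo_idt are hypothetical stand-ins that only record the request; nothing here touches a real IDT, and this is not the layout that was eventually merged.

```c
#include <assert.h>

/* Type-field values for x86 interrupt and trap gate descriptors. */
enum demo_gate_type { DEMO_GATE_INTERRUPT = 0xE, DEMO_GATE_TRAP = 0xF };

struct demo_gate {
	unsigned int type;
	unsigned int dpl;
	void *addr;
};

static struct demo_gate demo_idt[256];

/* Hypothetical stand-in for _set_gate(): records what was requested. */
static void demo_set_gate(unsigned int n, unsigned int type, void *addr,
			  unsigned int dpl)
{
	demo_idt[n].type = type;
	demo_idt[n].dpl = dpl;
	demo_idt[n].addr = addr;
}

#ifdef DEMO_X86_32
/* 32-bit flavour: system gates are trap gates. */
static inline void demo_set_system_gate(unsigned int n, void *addr)
{
	assert(n <= 0xFF);	/* plays the role of BUG_ON() */
	demo_set_gate(n, DEMO_GATE_TRAP, addr, 0x3);
}
#else
/* 64-bit flavour: system gates are interrupt gates (hardirqs disabled). */
static inline void demo_set_system_gate(unsigned int n, void *addr)
{
	assert(n <= 0xFF);	/* plays the role of BUG_ON() */
	demo_set_gate(n, DEMO_GATE_INTERRUPT, addr, 0x3);
}
#endif
```

The trade-off is some duplication (the range check appears twice) in exchange for each variant reading straight through.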
--
I seem to be on a roll here... :)
X86_64 kernel, Dell Latitude D820, Core2 T7200 processor...
(Yes, I know it's tainted. If I *have* to, I'll try to get an untainted one,
which will be a pain - the oss 'nv' driver is a tad busticated on my box at
the moment).

Took me a bunch of hours to find a way to repeat this one, but it seems
pretty consistent - I exit an 'Eterm' process and ka-blam. First guess
I have is that we hit a race and the timer popped right in the middle of
a process exiting and we try to update process times on an already-defunct
process table entry?

This ring any bells or brown-paper-bag D'Oh!s before I go bisecting?
[15345.901919] Unable to handle kernel paging request at 000000af008c00cd RIP:
[15345.901934] [<ffffffff802310d9>] scheduler_tick+0xdb/0x1c4
[15345.901952] PGD 0
[15345.901959] Oops: 0000 [1] PREEMPT SMP
[15345.901972] last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
[15345.901978] CPU 1
[15345.901984] Modules linked in: irnet ppp_generic slhc irtty_sir sir_dev ircomm_tty ircomm irda crc_ccitt coretemp nf_conntrack_ftp xt_pkttype ipt_REJECT ipt_osf nf_conntrack_ipv4 xt_ipisforif ipt_recent ipt_LOG xt_u32 iptable_filter ip_tables xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6t_LOG xt_limit ip6table_filter ip6_tables x_tables sha256_generic aes_generic acpi_cpufreq tpm_tis pcmcia gspca(U) iwl3945 firmware_class iTCO_wdt yenta_socket compat_ioctl32 ohci1394 rsrc_nonstatic iTCO_vendor_support mac80211 ieee1394 pcmcia_core nvidia(P)(U) watchdog_core battery videodev ac watchdog_dev v4l2_common snd_hda_intel v4l1_compat thermal power_supply cfg80211 button intel_agp processor rtc
[15345.902170] Pid: 0, comm: Eterm Tainted: P 2.6.24-rc6-mm1 #4
[15345.902176] RIP: 0010:[<ffffffff802310d9>] [<ffffffff802310d9>] scheduler_tick+0xdb/0x1c4
[15345.902189] RSP: 0018:ffff81007f8a3eb8 EFLAGS: 00010083
[15345.902195] RAX: 000000af008c005d RBX: 00000df4fefac7d9 RCX: 0000000000000004
[15345.902202] RDX: 0000000000000004 ...
In case it makes a difference, the Eterm that causes the issue on exit is
a 32-bit binary, with a 64-bit kernel (though I did have one kernel lockup
with xpdf, which is a 64-bit binary, but I can't prove that was/wasn't this
same issue)....

Bisection says:
git-ipwireless_cs.patch GOOD
#
git-x86.patch
git-x86-fixup.patch
git-x86-arch-x86-math-emu-errorsc-fix-printk-warnings.patch
git-x86-drivers-pnp-pnpbios-bioscallsc-build-fix.patch
git-x86-fix-doubly-merged-patch.patch
git-x86-export-leave_mm.patch BAD

and that's where bisection comes to a halt...
Time to bisect through git-x86, or somebody got a better idea? Looking at
the commits listed in git-x86.patch, I didn't see anything that jumped out,
but I'm pretty sure the problem is in there somewhere...
Yup. But please do try to get the cc's right. Especially when I'm lying
--
/2.6.24-rc6-mm1/
Looks like an uninitialized variable dereference for SEPARATOR events:
# mount -t securityfs none /sys/kernel/security/
# ls /sys/kernel/security/
tpm0
# l /sys/kernel/security/tpm0/
total 0
0 -r--r----- 1 root root 0 2007-12-26 23:28 ascii_bios_measurements
0 -r--r----- 1 root root 0 2007-12-26 23:28 binary_bios_measurements
# cat /sys/kernel/security/tpm0/ascii_bios_measurements
0 0000000000000000000000000000000000000000 07 [S-CRTM Contents]
0 0000000000000000000000000000000000000000 07 [S-CRTM Contents]
0 0000000000000000000000000000000000000000 07 [S-CRTM Contents]
0 0000000000000000000000000000000000000000 07 [S-CRTM Contents]
4 c1e25c3f6b0dc78d57296aa2870ca6f782ccf80f 05 [Calling INT 19h]
0 85e53271e14006f0265921d02d4d736cdc580b0b 04 [ÿ]
1 85e53271e14006f0265921d02d4d736cdc580b0b 04 [ÿ]
2 85e53271e14006f0265921d02d4d736cdc580b0b 04 [ÿ]
3 85e53271e14006f0265921d02d4d736cdc580b0b 04 [ÿ]
4 85e53271e14006f0265921d02d4d736cdc580b0b 04 [ÿ]
5 85e53271e14006f0265921d02d4d736cdc580b0b 04 [ÿ]
6 85e53271e14006f0265921d02d4d736cdc580b0b 04 [ÿ]
7 85e53271e14006f0265921d02d4d736cdc580b0b 04 [ÿ]
4 38f30a0a967fcf2bfee1e3b2971de540115048c8 05 [Returned INT 19h]
4 f9d3a33e4ba6109fb60e8df6ec0f10330733c8b2 0c [Compact Hash]
5 9bd5c812613f67ce1c75d0ea48b9933a547683cb 0c [Compact Hash]

Looks like the problem is likely in get_event_name:

	case NONHOST_INFO:
		name = tcpa_event_type_strings[event->event_type];
		n_len = strlen(name);
		break;
	case SEPARATOR:
	case ACTION:
		if (MAX_TEXT_EVENT > event->event_size) {
			name = event_entry;
			n_len = event->event_size;
		}
		break;

Should there be a 'break;' after the SEPARATOR line? Given the name, it
probably doesn't have a name/length pair attached to an event, right?
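
To make the question concrete, here is a small userspace sketch of the variant being asked about: name and n_len get safe defaults up front, and SEPARATOR gets its own 'break;' so it never picks up the event data as a name. The demo_* types, DEMO_MAX_TEXT_EVENT and the "Non-Host Info" string are assumptions for the demo; whether this is the right behaviour for the real driver is exactly the open question above.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_TEXT_EVENT ((size_t)1000)	/* assumed limit, demo only */

enum demo_event_type { DEMO_NONHOST_INFO, DEMO_SEPARATOR, DEMO_ACTION };

struct demo_event {
	enum demo_event_type event_type;
	size_t event_size;
};

/*
 * Chooses the text to print for an event and returns its length.
 * Defaults are set first so that no path leaves name/n_len unset.
 */
static size_t demo_get_event_name(const struct demo_event *event,
				  const char *event_entry,
				  const char **name_out)
{
	const char *name = "";
	size_t n_len = 0;

	assert(event != NULL && name_out != NULL);

	switch (event->event_type) {
	case DEMO_NONHOST_INFO:
		name = "Non-Host Info";
		n_len = strlen(name);
		break;
	case DEMO_SEPARATOR:
		/* own break: a separator carries no printable name */
		break;
	case DEMO_ACTION:
		if (DEMO_MAX_TEXT_EVENT > event->event_size) {
			name = event_entry;
			n_len = event->event_size;
		}
		break;
	}

	*name_out = name;
	return n_len;
}
```

In this sketch a separator, or an action whose size exceeds the limit, yields an empty name of length 0 rather than whatever happened to be in the variables.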
