Re: [PATCH 1/1] x86: fix text_poke

Previous thread: [2.6.25 REGRESSION] Lenovo 3000 N100 does not wake up from ACPI S3 state by Thomas Bächler on Saturday, April 19, 2008 - 8:49 am. (1 message)

Next thread: [GIT PULL] sh updates for 2.6.26-rc1 by Paul Mundt on Saturday, April 19, 2008 - 9:25 am. (1 message)
To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, LKML <linux-kernel@...>
Date: Saturday, April 19, 2008 - 9:22 am

Hi,

2.6.25-git1 hung solid on my HP nx6325 when I was trying to run KMail.

That happened after running it for several hours with a couple of suspend
and hibernation cycles in between, but I hadn't been doing anything unusual
with it.

Well, that has never happened since 2.6.25-rc1 (at least) on this box, so it
looks worrisome. I guess one of the x86 changes is responsible for it.

Thanks,
Rafael

--
"Premature optimization is the root of all evil." - Donald Knuth
--

To: LKML <linux-kernel@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <linux-ext4@...>
Date: Sunday, April 20, 2008 - 3:04 pm

Hi,

I've just got the following traces from 2.6.25-git2 on HP nx6325 (64-bit).
I think they are related to the hang I described yesterday:

[12844.066757] BUG: unable to handle kernel paging request at ffffffffffffffff
[12844.066765] IP: [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117
[12844.066775] PGD 203067 PUD 204067 PMD 0
[12844.066778] Oops: 0000 [1] SMP DEBUG_PAGEALLOC
[12844.066782] CPU 1
[12844.066784] Modules linked in: ip6t_LOG nf_conntrack_ipv6 xt_pkttype ipt_LOG xt_limit af_packet rfkill_input snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter ip6table_mangle nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter cpufreq_conservative ip6_tables x_tables cpufreq_ondemand cpufreq_userspace ipv6 cpufreq_powersave powernow_k8 freq_table fuse dm_crypt loop dm_mod arc4 ecb crypto_blkcipher b43 rfkill mac80211 cfg80211 led_class rfcomm input_polldev l2cap fan ssb thermal pcmcia joydev snd_hda_intel snd_pcm rtc_cmos yenta_socket usbhid rtc_core hci_usb processor rsrc_nonstatic snd_timer shpchp psmouse i2c_piix4 sdhci ohci1394 battery pcmcia_core snd_page_alloc snd_hwdep tifm_7xx1 pci_hotplug serio_raw ide_cd_mod ac button i2c_core backlight output ieee1394 tifm_core mmc_core rtc_lib ff_memless bluetooth snd soundcore firmware_class k8temp cdrom tg3 sg ohci_hcd ehci_hcd usbcore edd ext3 jbd atiixp ide_core
[12844.066854] Pid: 13078, comm: kio_file Tainted: G M 2.6.25 #401
[12844.066857] RIP: 0010:[<ffffffff802a7b3c>] [<ffffffff802a7b3c>] __d_lookup+0xf1/0x117
[12844.066861] RSP: 0018:ffff810064c5dc08 EFLAGS: 00010286
[12844.066863] RAX: ffffffffffffffff RBX: ffff8100f0bd7e10 RCX: 0000000000000012
[12844.066866] RDX: ffffffffffffffff RSI: ffff810064c5dd08 RDI: ffff810053304000
[12844.066868] RBP: ffff810064c5dc58 R08: 0000000000000003 R09: 0000000000000001
[12844.066871] R10: 0000000000000000 R11: 0000000000000246 R12: ffff810053304000
[12844.066873] R13: ffff810064c5dd08 R14: 00...

To: Rafael J. Wysocki <rjw@...>
Cc: LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, Herbert Xu <herbert@...>, Paul E. McKenney <paulmck@...>
Date: Sunday, April 20, 2008 - 5:31 pm

Something has added a dentry pointer that has the value -1 to the dentry
hash list. The access that oopses seems to be the

prefetch(pos->next)

which is part of hlist_for_each_entry_rcu(), where "pos" is -1.

I suspect it's an RCU error, ie somebody has released a dentry entry, and
free'd it without waiting for the RCU grace period.

Talking about RCU I also think that whoever did those "rcu_dereference()"
macros in <linux/list.h> was insane. It's totally pointless to do
"rcu_dereference()" on a local variable. It simply *cannot* make sense.
Herbert, Paul, you guys should look at it.

As far as I can tell, rcu_dereference() should _always_ be done when we
access the "next" pointer (except for when prefetching, where we simply
don't care).

Paul? Herbert? Totally untested patch appended.

NOTE! I do not expect this patch to matter for this oops. There's
something else going on there.

Linus

---
include/linux/list.h | 34 +++++++++++++++++-----------------
1 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index 75ce2cb..4a851ba 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -631,14 +631,14 @@ static inline void list_splice_init_rcu(struct list_head *list,
* as long as the traversal is guarded by rcu_read_lock().
*/
#define list_for_each_rcu(pos, head) \
- for (pos = (head)->next; \
- prefetch(rcu_dereference(pos)->next), pos != (head); \
- pos = pos->next)
+ for (pos = rcu_dereference((head)->next); \
+ prefetch(pos->next), pos != (head); \
+ pos = rcu_dereference(pos->next))

#define __list_for_each_rcu(pos, head) \
- for (pos = (head)->next; \
- rcu_dereference(pos) != (head); \
- pos = pos->next)
+ for (pos = rcu_dereference((head)->next); \
+ pos != (head); \
+ pos = rcu_dereference(pos->next))

/**
* list_for_each_safe_rcu
@@ -653,8 +653,8 @@ static inline void list_splice_in...

To: Linus Torvalds <torvalds@...>
Cc: LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, Herbert Xu <herbert@...>, Paul E. McKenney <paulmck@...>
Date: Monday, April 21, 2008 - 12:12 pm

Well, it seems that the oops is actually known from -mm:

http://lkml.org/lkml/2008/4/21/55

and something similar was observed with 2.6.25-rc8-mm2.

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, Herbert Xu <herbert@...>, Paul E. McKenney <paulmck@...>, Jiri Slaby <jirislaby@...>, David S. Miller <davem@...>
Date: Monday, April 21, 2008 - 12:54 pm

Hmm. Sadly, I doubt that really cuts down the suspect list very much. Most
of what has been merged since 2.6.25 has been in -mm, so while I agree
that it looks very similar, the fact that it was possibly already in
-rc8-mm2 doesn't much _help_.

And in fact, those oopses in rc8-mm2 don't look _that_ similar. Those are
a corrupt f_mapping structure, it looks like (ie it looks like either
"struct address_space" or a "struct filp" rather than a "struct dentry").

What is interesting about Jiri's version of the bug is that he has another
value for the corruption than you do: you had either all-ones, or a value
that *looked* like possibly a single nybble got cleared.

Jiri, in contrast, has a value of 00f0000000000000. Which is a bit
interesting in that it's again a *nybble* that looks corrupt, but it's a
different one.

But assuming Jiri's two oopses are related (which is not entirely
unlikely), and assuming that this is a SLUB bucket re-use, then it's quite
likely that the reason that his -rc8-mm2 oops looks different just because
it was yet _another_ allocation that was in the same bucket. If so, the
most likely one is "struct filp", because it has the right size: for me a
filp is in the 192-byte bucket, which is very close to the 208-byte bucket
of dentry.

So I could imagine that some config option or other change just changed
the sizes around so that the two types ended up in different buckets in
rc8-mm2 and in 2.6.25-mm1 (ie neither the dentry nor the filp necessarily
changed sizes, but the *corrupting* type perhaps did?)

What I find interesting is that at least for me, I have the SLAB bucket
size for nf_conntrack_expect being 208 bytes. And the *biggest* merge by
far after 2.6.25 so far has been networking (and conntrack in particular)

Is that a smoking gun? Not necessarily. But it *is* intriguing. But there
are other possible clashes (the 192-byte bucket has several different
suspects, and not all of them are in networking).1

Jiri and Davem ...

To: Linus Torvalds <torvalds@...>
Cc: Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, Herbert Xu <herbert@...>, Paul E. McKenney <paulmck@...>, David S. Miller <davem@...>
Date: Monday, April 21, 2008 - 1:06 pm

Yeah, I'm using slub. Going to boot to slub_debug.

Thanks so far.

BTW. I haven't see this without suspend/resume cycle, do you, Rafael? It doesn't
mean anything, since it needs longer time to trigger, but anyway, it might be a
clue.
--

To: Jiri Slaby <jirislaby@...>
Cc: Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, Herbert Xu <herbert@...>, Paul E. McKenney <paulmck@...>, David S. Miller <davem@...>
Date: Monday, April 21, 2008 - 1:48 pm

There's a separate (and very different-looking) bug-report about the atl1
driver having problems when doing an "ifconfig down" on it. In fact, the

where that "or oopses at filp_close()" thing is somewhat interesting,
since your original bug was about something that looked like file pointer
corruption.

Now, I doubt you have an ATL chip, and I doubt the two are _really_
related in any way (the ATL bug was actually triggered by enabling 64-bit
DMA), but the filp_close thing makes me go "hmm".

The two affected corrupted SLUB areas were the 2kB allocation (1560-byte
ethernet packets plus skb_shared_info overhead, anyone?) and apparently
the one that filp's are in (perhaps a 20-byte TCP ACK packet or other
"small" packet + the skb_shared_info overhead would be a common case that
might be in that 200-byte range?)

Maybe the ATL bug isn't ATL-specific at all, but somehow connected to
NETIF_F_HIGHDMA. Do you have 4GB+ of RAM?

And one thing that suspend/resume does, which is not necessarily commonly
done during normal operation, is that ifconfig down/up pattern. Maybe
there is something broken in general there?

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, Herbert Xu <herbert@...>, Paul E. McKenney <paulmck@...>, David S. Miller <davem@...>
Date: Monday, April 21, 2008 - 3:38 pm

Who knows, unfortunately it seems so.

I've found another two oopses related to this in logs (they are below). Again
dentry + offsetof(dentry, name) address is broken here and it fires up in
memcmp. I suspect somebody still uses that bucket (assigned now to dentry) as it
hasn't ever be freed and overwrites its members.

I also had corrupted include/linux/irq.h file. There was
irq_has_<some_ugly_utf_char>ction or something like that. I don't remember the
the exact function name, but compilation failed and it didn't when I compiled
the kernel for the first time -- I use that tree everyday, the corruption must
happen that day. Anyway I have no idea if this is related.

BUG: unable to handle kernel paging request at ffff81f02003f16c
IP: [<ffffffff802ad7d5>] __d_lookup+0x155/0x160
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
CPU 1
Modules linked in: ppdev parport tun bitrev ipv6 test arc4 ecb crypto_blkcipher
cryptomgr crypto_algapi ath5k mac80211 crc32 rtc_cmos sr_mod ohci1394 rtc_core
usbhid rtc_lib ieee1394 cdrom cfg80211 hid usblp ehci_hcd ff_memless floppy
[last unloaded: vmnet]
Pid: 3710, comm: sensors-applet Tainted: P 2.6.25-rc8-mm2_64 #399
RIP: 0010:[<ffffffff802ad7d5>] [<ffffffff802ad7d5>] __d_lookup+0x155/0x160
RSP: 0018:ffff810057973b98 EFLAGS: 00010246
RAX: 0000000000000017 RBX: ffff81002003f0e0 RCX: 0000000000000017
RDX: 0000000000000017 RSI: ffff81f02003f16c RDI: ffff8100036f7022
RBP: ffff810057973bf8 R08: ffff810057973ca8 R09: 0000000000000000
R10: 00000000000000d8 R11: 0000000000000246 R12: ffff81002003f0c8
R13: 00000000910b9880 R14: ffff810035a5ded8 R15: ffff810057973bc8
FS: 00007f6e2b7266f0(0000) GS:ffff81007d006580(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff81f02003f16c CR3: 000000005788a000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000...

To: Linus Torvalds <torvalds@...>
Cc: Jiri Slaby <jirislaby@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, Herbert Xu <herbert@...>, Paul E. McKenney <paulmck@...>, David S. Miller <davem@...>
Date: Monday, April 21, 2008 - 2:22 pm

Hm, that may be the case.

In fact, I've cut the messages that precede the oops from the dmesg output,
but they are from the b43 driver and the firewall (the full oops below is
reproduced for completness):

[12736.964336] b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)
[12737.692435] b43-phy0 debug: Chip initialized
[12737.692659] b43-phy0 debug: 32-bit DMA initialized
[12742.213601] Registered led device: b43-phy0::tx
[12742.216372] Registered led device: b43-phy0::rx
[12742.216559] Registered led device: b43-phy0::radio
[12742.216587] b43-phy0 debug: Wireless interface started
[12737.724614] b43-phy0 ERROR: PHY transmission error
[12737.764440] b43-phy0 ERROR: PHY transmission error
[12738.469683] b43-phy0 debug: Switching to 2.4-GHz band
[12738.469755] b43-phy0 debug: Wireless interface stopped
[12738.469958] b43-phy0 debug: DMA-32 rx_ring: Used slots 0/64, Failed frames 0/0 = 0.0%, Average tries 0.00
[12738.470020] b43-phy0 debug: DMA-32 tx_ring_AC_BK: Used slots 0/128, Failed frames 0/0 = 0.0%, Average tries 0.00
[12738.476448] b43-phy0 debug: DMA-32 tx_ring_AC_BE: Used slots 0/128, Failed frames 0/0 = 0.0%, Average tries 0.00
[12738.484436] b43-phy0 debug: DMA-32 tx_ring_AC_VI: Used slots 0/128, Failed frames 0/0 = 0.0%, Average tries 0.00
[12738.492433] b43-phy0 debug: DMA-32 tx_ring_AC_VO: Used slots 2/128, Failed frames 0/13 = 0.0%, Average tries 1.00
[12738.500433] b43-phy0 debug: DMA-32 tx_ring_mcast: Used slots 0/128, Failed frames 0/0 = 0.0%, Average tries 0.00
[12738.668447] b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)
[12739.892834] b43-phy0 debug: Chip initialized
[12739.893099] b43-phy0 debug: 32-bit DMA initialized
[12739.916479] Registered led device: b43-phy0::tx
[12739.919263] Registered led device: b43-phy0::rx
[12739.919329] Registered led device: b43-phy0::radio
[12739.919372] b43-phy0 debug: Wireless interface started
[12739.968824] wlan0: Initial auth_alg=0
[12739.968832] wlan0: authenticate with AP 00:17:9a:f3:b5:75...

To: Jiri Slaby <jirislaby@...>
Cc: Linus Torvalds <torvalds@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, Herbert Xu <herbert@...>, Paul E. McKenney <paulmck@...>, David S. Miller <davem@...>
Date: Monday, April 21, 2008 - 1:19 pm

I think we need some more data anyway.

Thanks,
Rafael
--

To: <torvalds@...>
Cc: <rjw@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <paulmck@...>, <jirislaby@...>
Date: Monday, April 21, 2008 - 4:39 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

I think you might be onto something here.

The "mask" member of struct nf_conntrack_expect could be reasonably
all 1's like the value reported in the crash that begins this
thread.

Do we know the offset within the object at which this all 1's
value is found?

My rough calculations show that on 32-bit that expect->mask member is
at offset 56 and on 64-bit it should be at offset 72. Does that
match up to the offset of the filp or whatever bit being corrupted?

I'll scan through the netfilter changesets in post 2.6.25 to see if
anything stands out.

--

To: David Miller <davem@...>
Cc: <rjw@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <paulmck@...>, <jirislaby@...>
Date: Monday, April 21, 2008 - 5:19 pm

No, I think that the d_hash list is at offset 24 (64-bit).

But that changes if any of

- GENERIC_LOCKBREAK
- DEBUG_SPINLOCK
- DEBUG_LOCK_ALLOC (and if so, LOCK_STAT)

is set, and then you might actually get to 72.

However, the Code: line for one of the oopses shows that in that
particular case, it was at offset 0x18 (ie the normal 24), so at least one
of the oopses had no such thing going on.

Linus
--

To: <torvalds@...>
Cc: <rjw@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <paulmck@...>, <jirislaby@...>
Date: Monday, April 21, 2008 - 5:54 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

On 64-bit x86_64, which I believe the case you are referring to
is, that would be right in the middle of an hlist_node.

We would expect to see a valid pointer, NULL, or a list poison
value. Which we're not.

But I don't think networking or even netfilter can in any way be ruled
out yet.

A lot of the speculation is because of the SLUB cache sharing between
different object types. Is there some way to disable that and see how
that influences the bug?

Of course, even with sharing disabled a cache's page could get freed
and then re-allocated into another cache, but the likelyhood of it
happening exactly to a filp or dentry cache from a netfilter or whatever
one is extremely unlikely :)

--

To: David Miller <davem@...>
Cc: <torvalds@...>, <rjw@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <paulmck@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Monday, April 21, 2008 - 5:18 pm

dentry.d_name.name is 56 on 64-bit (my memcmp crashes)
dentry.d_hash.next is 24 (crashed at least 3 times here, rafael's one)
dentry.d_op is 136 (crash below)

It's spreading :/.

---------- Forwarded message ----------
From: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Date: 21.4.2008 11:14
Subject: BUG: unable to handle kernel NULL pointer at d_free+0x18/0x80
To: Kernel development list <linux-kernel@vger.kernel.org>

Hello

This oops appeared in my log - unsure how it is related to my DVB-T
tuner test before.
But I've also seen another weird resume with some similar crash.

Happens with 2.6.25 - commit 48a86f548fb74928f9a466f52527e83fecdb4575
(T61, 2GB)

BUG: unable to handle kernel NULL pointer dereference at 0000000000000110
IP: [d_free+24/128] d_free+0x18/0x80
PGD 0
Oops: 0000 [1] PREEMPT SMP
CPU 0
Modules linked in: usb_storage dvb_usb_af9015 dvb_usb_dibusb_common
dib3000mc dibx000_common dvb_usb dvb_core tun nls_iso8859_2 nls_cp852
vfat fat i915 drm ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4
xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables
x_tables bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4
sunrpc binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput
kvm_intel kvm snd_hda_intel arc4 ecb snd_seq_oss crypto_blkcipher
snd_seq_midi_event snd_seq cryptomgr snd_seq_device snd_pcm_oss
crypto_algapi iwl3945 mac80211 e1000e psmouse snd_mixer_oss rtc_cmos
evdev rtc_core thinkpad_acpi video snd_pcm mmc_block sdhci mmc_core
snd_timer iTCO_wdt iTCO_vendor_support battery backlight nvram rtc_lib
i2c_i801 i2c_core ac snd soundcore snd_page_alloc intel_agp output
serio_raw cfg80211 button uhci_hcd ohci_hcd ehci_hcd usbcore [last
unloaded: dvb_core]
Pid: 210, comm: kswapd0 Not tainted 2.6.25 #56
RIP: 0010:[d_free+24/128] [d_free+24/128] d_free+0x18/0x80
RSP: 0018:ffff81007ced9cf0 EFLAGS: 00010206
RAX: 00000000000000f0 RBX: ffff8100202723d8 RCX: 0000000000000132
...

To: David Miller <davem@...>
Cc: <torvalds@...>, <rjw@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <paulmck@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Monday, April 21, 2008 - 5:58 pm

Leaving untouched.

file.f_mapping is 176 (the another one from -rc8-mm2)

the one at:
http://www.opensubscriber.com/message/linux-kernel@vger.kernel.org/90082...

--

To: David Miller <davem@...>
Cc: <torvalds@...>, <rjw@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <paulmck@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Monday, April 21, 2008 - 6:26 pm

Sorry, one more entry:

00000000000000f0 dentry.d_op (Zdenek, offset ? around 136)
00f0000000000000 dentry.d_hash.next (me, offset 24)
ffff81f02003f16c dentry.d_name.name (me, offset 56)
memory ORed by 000000f000000000
fffff0002004c1b0 file.f_mapping (me, offset 176)
memory hole, it was something like
(ffff81002004c1b0 & ~00000f0000000000) | 0000f00000000000?
ffffffffffffffff dentry.d_hash.next (Rafael, offset ? around 24)
-1, ~0ULL

--

To: Jiri Slaby <jirislaby@...>
Cc: David Miller <davem@...>, <torvalds@...>, <rjw@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Monday, April 21, 2008 - 6:54 pm

Are these running with CONFIG_PREEMPT_RCU? Grasping at straws, but
there are a couple of patches that need to move from -rt to mainline,
but mostly related to SELinux. So if both PREEMPT_RCU and SELinux
were in use, we might be missing "rcu-various-fixups.patch" from:

http://www.kernel.org/pub/linux/kernel/projects/rt/patch-2.6.24.4-rt4-br...

--

To: <paulmck@...>
Cc: David Miller <davem@...>, <torvalds@...>, <rjw@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Monday, April 21, 2008 - 7:02 pm

$ grep RCU .config
CONFIG_CLASSIC_RCU=y
# CONFIG_RCU_TORTURE_TEST is not set
$ grep SECU .config
# CONFIG_EXT4DEV_FS_SECURITY is not set
# CONFIG_SECURITY is not set
# CONFIG_SECURITY_FILE_CAPABILITIES is not set

I guess not.

BTW the corruption I mentioned earlier was char 'ð' and it's ('p' | 0xf0) in
latin2. I think it was set_ðending_irq IIRC. Whatever, it won't help us.
--

To: Jiri Slaby <jirislaby@...>
Cc: <paulmck@...>, David Miller <davem@...>, <torvalds@...>, <rjw@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>
Date: Monday, April 21, 2008 - 7:11 pm

I've kernel compiled with preemptible RCU & Security - but usually
using selinux=off as a kernel parameter

Zdenek
--

To: <paulmck@...>
Cc: David Miller <davem@...>, <torvalds@...>, <rjw@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Monday, April 21, 2008 - 7:17 pm

The same place, dentry.d_hash.next is 1. No slub debug clues... I think, I'll
give slab a try. Any other clues?

Is this enough?
$ grep SLUB ../my_64/.config
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
$ cat /proc/cmdline
root=/dev/md1 vga=1 ro reboot=a,w slub_debug

BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
IP: [<ffffffff802aca27>] __d_lookup+0x97/0x160
PGD 4510b067 PUD 6768d067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /sys/devices/virtual/net/tun0/statistics/collisions
CPU 0
Modules linked in: test ipv6 tun bitrev arc4 ecb crypto_blkcipher cryptomgr
crypto_algapi ath5k mac80211 rtc_cmos crc32 sr_mod usbhid ohci1394 ehci_hcd
rtc_core hid ieee1394 floppy cdrom cfg80211 rtc_lib evdev ff_memless
Pid: 18600, comm: git-status Not tainted 2.6.25-mm1_64 #403
RIP: 0010:[<ffffffff802aca27>] [<ffffffff802aca27>] __d_lookup+0x97/0x160
RSP: 0018:ffff81006096bbf8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000012
RDX: ffff8100200f3568 RSI: ffff81006096bd08 RDI: ffff810020c0c880
RBP: ffff81006096bc58 R08: ffff81006096bd08 R09: 000000000000002c
R10: 000000000000002d R11: ffff81006428c200 R12: ffff810021f0a770
R13: 000000001b820c0e R14: ffff810020c0c880 R15: ffff81006096bc28
FS: 00007f2aa905e710(0000) GS:ffffffff80664000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000001 CR3: 0000000008fba000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process git-status (pid: 18600, threadinfo ffff81006096a000, task ffff810007988fc0)
Stack: ffff81006096bd08 0000000000000009 ffff810020c0c888 000000098026c2fd
ffff81006428c21c 0000000000000000 0000000000000001 0000000000000001
ffff81006096be38 ffff81006096be38 ffff81006096bd08 ffff81006096bd18
Call Trace:
[<ffffffff802a1e85>] do_lookup+0x35/0x...

To: Jiri Slaby <jirislaby@...>
Cc: <paulmck@...>, David Miller <davem@...>, <torvalds@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Monday, April 21, 2008 - 8:54 pm

Well, SLUB uses some per CPU data structures. Is it possible that they get
corrupted and which leads to the observed symptoms?
--

To: Rafael J. Wysocki <rjw@...>
Cc: Jiri Slaby <jirislaby@...>, <paulmck@...>, David Miller <davem@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Monday, April 21, 2008 - 9:14 pm

It really doesn't look like the slub allocations themselves would be
corrupted. It very much looks like wild pointers corrupting allocations
that themselves were fine.

The nybble pattern looked intriguing (especially as it apparently also hit
a normal page cache page!) but obviously not everything matches that
pattern (eg your value of 1).

What do you do to trigger this? Any particular load? Is it still just
doing suspend/resume, or do you have something else that you are playing
with?

Also, have you tried CONFIG_DEBUG_PAGEALLOC? That can also be a very
powerful way to find memory corruption.

Does anybody see any other patterns? Looking at the modules linked in in
the oopses from Zdenek, Rafael and Jiri, I don't see anything odd. You
both all have 80211 support, maybe the corruption comes from the wireless
layer?

Or maybe it's the x86 code changes themselves, and it really is about the
suspend/resume sequence itself. Are all the people who see this doing
suspends?

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Tuesday, April 22, 2008 - 5:49 am

Yesterday I did 2 suspend/resumes after 1 hour of uptime and ran git-status
for a fraction of a second until it was killed. So I can perfectly reproduce
it when I suspend, resume and produce some io load. I guess it's time to
bisect 2.6.25-rc8-mm2 as I'm able to reproduce it the best and haven't seen

May be, however I don't use that stack, it's a desktop machine, it's only
sitting there not turned on, but sure, it's loaded.
--

To: Jiri Slaby <jirislaby@...>
Cc: Linus Torvalds <torvalds@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Tuesday, April 22, 2008 - 5:53 am

the most dangerous x86 change we added was the PAT stuff. Does it
influence the crashes in any way if you boot with 'nopat' or if you
disable CONFIG_X86_PAT=y into the .config?

the other area was the DMA ops change - that should be rather trivial on
64-bit though.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Jiri Slaby <jirislaby@...>, Linus Torvalds <torvalds@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>
Date: Tuesday, April 22, 2008 - 2:35 pm

Unsure how it is related to my orginal Oops post - but now when I've
debug pagealloc enabled this appeared in my log after resume - should
I open new bug for this - or could this be part of the problem I've
experienced later?

(Note - now I'm running commit: 8a81f2738f10ca817c975cec893aa58497e873b2

sd 0:0:0:0: [sda] Starting disk
mmc0: new SD card at address 5a61
mmc mmc0:5a61: parent mmc0 is sleeping, will not add
------------[ cut here ]------------
WARNING: at drivers/base/power/main.c:78 device_pm_add+0x6c/0xf0()
Modules linked in: tda18271 nls_iso8859_2 nls_cp852 vfat fat i915 drm
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state
nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables
bridge llc nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc
binfmt_misc dm_mirror dm_log dm_multipath dm_mod uinput kvm_intel kvm
snd_hda_intel snd_seq_oss snd_seq_midi_event snd_seq arc4
snd_seq_device snd_pcm_oss ecb crypto_blkcipher cryptomgr
crypto_algapi iwl3945 snd_mixer_oss mac80211 snd_pcm mmc_block video
sdhci thinkpad_acpi mmc_core i2c_i801 snd_timer rtc_cmos rtc_core
backlight iTCO_wdt cfg80211 evdev snd i2c_core e1000e psmouse
soundcore snd_page_alloc nvram intel_agp rtc_lib iTCO_vendor_support
output serio_raw ac battery button uhci_hcd ohci_hcd ehci_hcd usbcore
[last unloaded: microcode]
Pid: 1240, comm: kmmcd Not tainted 2.6.25 #57

Call Trace:
[warn_on_slowpath+95/144] warn_on_slowpath+0x5f/0x90
[device_pm_add+24/240] ? device_pm_add+0x18/0xf0
[device_pm_add+108/240] device_pm_add+0x6c/0xf0
[device_add+1092/1376] device_add+0x444/0x560
[_end+510110570/2109230024] :mmc_core:mmc_add_card+0xa2/0x140
[_end+510117927/2109230024] :mmc_core:mmc_attach_sd+0x17f/0x860
[_end+510109176/2109230024] ? :mmc_core:mmc_rescan+0x0/0x1c0
[_end+510109545/2109230024] :mmc_core:mmc_rescan+0x171/0x1c0
[run_workqueue+246/560] run_workqueue+0xf6/0x230
[worker_thread+167/288] worker_thread+0xa7/0x120
[autoremove_wake_function+0/64] ? aut...

To: Zdenek Kabelac <zdenek.kabelac@...>
Cc: Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>
Date: Tuesday, April 22, 2008 - 2:48 pm

This is unrelated to the other issue, I think.

Your warning comes from commit 58aca23226a19983571bd3b65167521fc64f5869,
which admittedly looks like total crap. Rafael, what's the point of that
commit?

I read the commit message, but I can't make myself agree with the commit
code itself. If it's a "checking that the order is correct" thing, it
should be a warning, but not change the actual _action_ of the code.

Because the commit refused to add the device, it is also then the direct

So I would suggest reverting that commit, or at least just making it a
warning (while still registering the device).

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>
Date: Wednesday, April 23, 2008 - 4:50 am

Hi

This time I've got slightly larger mess with some other oopses - I'm
not sure if they are just a consequence of the PM bad commit - or they
are a separate issue ?

Is there actually some patch I should test from those posted in the list ?

Here goes the oops log:

(SPIN LOCK already disabled is my personal trace ooops which is just
checking if the spin_lock_irq is already called with disabled irq - in
this place probably irqsave version should be used instead, otherwice
it's not properly restored)

PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.46 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
Suspending console(s)
drm_sysfs_suspend
ACPI: PCI interrupt for device 0000:00:02.0 disabled
sd 0:0:0:0: [sda] Synchronizing SCSI cache
sd 0:0:0:0: [sda] Stopping disk
mmc0: card 5a61 removed
MMC: killing requests for dead queue
ACPI: PCI interrupt for device 0000:15:00.2 disabled
ACPI: PCI interrupt for device 0000:00:1f.1 disabled
ACPI: PCI interrupt for device 0000:00:1d.7 disabled
ACPI: PCI interrupt for device 0000:00:1d.2 disabled
ACPI: PCI interrupt for device 0000:00:1d.1 disabled
ACPI: PCI interrupt for device 0000:00:1d.0 disabled
ACPI: PCI interrupt for device 0000:00:1b.0 disabled
ACPI: PCI interrupt for device 0000:00:1a.7 disabled
ACPI: PCI interrupt for device 0000:00:1a.1 disabled
ACPI: PCI interrupt for device 0000:00:1a.0 disabled
ACPI: PCI interrupt for device 0000:00:19.0 disabled
ACPI: Preparing to enter system sleep state S3
Disabling non-boot CPUs ...
kvm: disabling virtualization on CPU1
CPU 1 is now offline
lockdep: fixing up alternatives.
SMP alternatives: switching to UP code
CPU1 is down
Extended CMOS year: 2000
hwsleep-0322 [00] enter_sleep_state : Entering sleep state [S3]
x86: PAT support disabled.
Extended CMOS year: 2000
Enabling non-boot CPUs ...
lockdep: fixing up alternatives.
SMP alternatives: switching to SMP code
Booting pr...

To: Zdenek Kabelac <zdenek.kabelac@...>
Cc: Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Pekka Enberg <penberg@...>, Christoph Lameter <clameter@...>
Date: Wednesday, April 23, 2008 - 11:53 am

Goodie, two of the backtraces (the parent-is-sleeping warning and the
immediately subsequent oops) look like the same thing that should already

This is indeed an interesting issue: arch/x86/kernel/smpboot.c does an IPI
call to start_secondary, and yes, it looks suspicious to have that
lock_ipi_call_lock there (and in particular the unlock_ipi_call_lock that
enables interrupts within it). Ingo?

But the really interesting one is the later kmalloc() debugging triggers,
because this one is, I suspect, very much a sign of the memory corruption
bug you see.

There's two reasons that make me say that:

- the callback is in networking code and wireless, which was one of the
possible suspects.

- the padding pattern which *should* have been POISON_INUSE (0x5a) has
been overwritten with:

Padding 0xffff8100201a0000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
....
Padding 0xffff8100201a71a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk¥
Padding 0xffff8100201a71b0: cc cc cc cc cc cc cc cc 00 00 1a 20 00 81 ff ff ÌÌÌÌÌÌÌÌ......ÿÿ
Padding 0xffff8100201a71c0: cd 70 17 a0 ff ff ff ff 00 00 00 00 73 05 00 00 Íp..ÿÿÿÿ....s...
Padding 0xffff8100201a71d0: b6 54 58 00 01 00 00 00 d5 71 26 81 ff ff ff ff ¶TX.....Õq&.ÿÿÿÿ
Padding 0xffff8100201a71e0: 00 00 00 00 7c 05 00 00 97 54 58 00 01 00 00 00 ....|....TX.....

which in turn is interesting because it very much looks like SLUB
re-used a page for something else (the values that things got
overwritten by are largely SLUB's own poison bytes: 6b is POISON_FREE,
the a5 at the end of the list of 6b's is POISON_END, while cc is
SLUB_RED_ACTIVE).

To me, that pattern looks like an order-3 allocation (correct: that's what
kmalloc-4096 is supposed to be using!) got released, and the stuff at the
end (with slub debugging, there's only room for 7 4096-byte allocations
there, so 71b0 is past the end) in that SLUB debug...

To: Linus Torvalds <torvalds@...>
Cc: Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Christoph Lameter <clameter@...>
Date: Wednesday, April 23, 2008 - 12:58 pm

On Wed, Apr 23, 2008 at 6:53 PM, Linus Torvalds

Is the POISON_FREE ("6b") region really contiguous Zdenek? The problem
here is that the object looks to be 29104 bytes that is subject to
kmalloc_large() which by-passes SLUB poisoning completely.

Pekka

To: Pekka Enberg <penberg@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Christoph Lameter <clameter@...>
Date: Wednesday, April 23, 2008 - 1:28 pm

To: Linus Torvalds <torvalds@...>
Cc: Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Pekka Enberg <penberg@...>, Christoph Lameter <clameter@...>
Date: Thursday, April 24, 2008 - 6:26 pm

And this too :/:

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAGIC 0xff00aa00deadcc22ULL

int main()
{
unsigned int a, b, c = 0;
unsigned long long *ch;

while (1) {
ch = malloc(1000000000);
if (!ch)
err(1, "malloc");

for (a = 0; a < 1000000000/sizeof(*ch); a++)
ch[a] = MAGIC;

printf("alloced %u\n", c);
sleep(10);

for (a = 0; a < 1000000000/sizeof(*ch); a++)
if (ch[a] != MAGIC) {
printf("WHAT THE HELL (%.8lx):\n", a *
sizeof(*ch));
for (b = a - a % 10; b < (a - a % 10) + 100; b++) {
printf("%.16llx ", ch[b]);
if (!((b + 1) % 10))
puts("");
}
exit(1);
}
free(ch);
printf("freed %u\n", c);
sleep(10);
c++;
}

return 0;
}

20 arpings running on wlan0 (don't know if this is related so far), suspend,
resume, right after resume:

freed 114
alloced 115
freed 115
alloced 116
WHAT THE HELL (000a3ff8):
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadccf0
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadccf0 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00...

To: Jiri Slaby <jirislaby@...>
Cc: Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Pekka Enberg <penberg@...>, Christoph Lameter <clameter@...>
Date: Thursday, April 24, 2008 - 6:41 pm

Do you actually see this with _plain_ 2.6.25? So far I've assumed that all
the reports are about post-2.6.25 issues.

Also, it does seem like you can re-create this at will in ways that others
can not. Could you try to bisect it a bit? Right now we have no real clue
what it is all about, except that it seems to be related to suspend/resume
and there are some indications that it's about networking (and _perhaps_
wireless in particular).

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Pekka Enberg <penberg@...>, Christoph Lameter <clameter@...>
Date: Friday, April 25, 2008 - 1:10 pm

Yes! I'm able to reproduce it 90%! We can exclude X, 80211 and so ath5k too (I
removed them from /lib/modules, so they won't ever load).

I set the water mark at 1700 megabytes to allocate by the testing programs, so
that it eats almost all free available memory. Then, I wait to "alloced 1" and
then pm-suspend on second console. After resume, it spits out the corruption.

I'm going to bisect it, will be back in few hours ;).
--

To: <jirislaby@...>
Cc: <torvalds@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>
Date: Friday, April 25, 2008 - 5:13 am

From: Jiri Slaby <jirislaby@gmail.com>

Thanks for all of this hard work and investigation Jiri!
--

To: David Miller <davem@...>
Cc: <jirislaby@...>, <torvalds@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>
Date: Friday, April 25, 2008 - 8:15 am

Well just to show it's not happing only to Jiri:

It's actually shows immediately on my box after suspend-resume...

./testf (Jiri's test code from this thread)
alloced 0
WHAT THE HELL (015130f8):
ff00aa00deadcc22 ff00aa00f0adcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
--

To: David Miller <davem@...>
Cc: <jirislaby@...>, <torvalds@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>
Date: Friday, April 25, 2008 - 8:27 am

And now tested without the iwl wifi driver loaded:

./testf
alloced 0
freed 0
alloced 1
freed 1
alloced 2
WHAT THE HELL (38cdabe8):
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadccf0
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 fff0aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00f0adcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22 ff00aa00deadcc22
ff00aa00deadcc22 ff00aa00deadcc22
[kabi@localhost ~]$ lsmod
Module Size Used by
nls_iso8859_2 6592 1
nls_cp852 6848 1
vfat 15808 1
fat ...

To: David Miller <davem@...>
Cc: <torvalds@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Jiri Slaby <jirislaby@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>
Date: Sunday, April 27, 2008 - 8:51 pm

Thanks. Bisected mm down to git-x86.patch, bisected git-x86-latest down to
x86: enhance DEBUG_RODATA support - alternatives
The patch below fixes the problem for me. Comments welcome.

The 0xf0 pattern comes from alternatives_smp_lock:
text_poke(*ptr, ((unsigned char []){0xf0}), 1);

I grepped for it a long time ago, but not in a form of coumpound literal :/.

*Never* more :).

--

kernel_text_address returns true even for modules which is not wanted
in text_poke. Use core_kernel_text instead.

This is a regression introduced in e587cadd8f47e202a30712e2906a65a0606d5865
which caused occasionaly crashes after suspend/resume.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
CC: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <andi@firstfloor.org>
CC: pageexec@freemail.hu
CC: H. Peter Anvin <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Ingo Molnar <mingo@elte.hu>
---
arch/x86/kernel/alternative.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5412fd7..0b074cb 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -515,7 +515,7 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
BUG_ON(len > sizeof(long));
BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1))
- ((long)addr & ~(sizeof(long) - 1)));
- if (kernel_text_address((unsigned long)addr)) {
+ if (core_kernel_text((unsigned long)addr)) {
struct page *pages[2] = { virt_to_page(addr),
virt_to_page(addr + PAGE_SIZE) };
if (!pages[1])
--
1.5.4.5

--

To: Jiri Slaby <jirislaby@...>
Cc: David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>
Date: Friday, April 25, 2008 - 11:03 am

You're a hero, Jiri.

And that also explains why I didn't see it - I don't do modules.

And we should really add a lot more sanity checking there.

Linus
--

To: <torvalds@...>
Cc: <jirislaby@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <mathieu.desnoyers@...>, <andi@...>, <pageexec@...>, <hpa@...>, <jeremy@...>, <mingo@...>
Date: Friday, April 25, 2008 - 4:18 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

Indeed, what a heroic effort to fix a bug, thanks Jiri!!
--

To: Linus Torvalds <torvalds@...>
Cc: Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 11:19 am

neither does my auto-test :-/

Suspend/resume goes from SMP to UP and then back - and triggers all the
instrument patching code. I suspect we should/could have seen similar

yeah.

incidentally, this bug was fixed by Mathieu yesterday but the full
impact of the bug was not realized. Below is that patch from
sched-devel.

i'm wondering what the best sanity checking would be. What we want is to
be sure the patch we modify is truly a kernel or module text page.

Perhaps we should start marking all kernel/module text pages with
PageReserved? That way we can not corrupt any userspace/pagecache page.
(and we'd clear PageReserved on module unload)

Ingo

------------------------->
Subject: Fix sched-devel text_poke
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Date: Thu, 24 Apr 2008 11:03:33 -0400

Use core_text_address() instead of kernel_text_address(). Deal with modules in
the same way used for the core kernel.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/kernel/alternative.c | 38 ++++++++++++++++++--------------------
1 file changed, 18 insertions(+), 20 deletions(-)

Index: linux/arch/x86/kernel/alternative.c
===================================================================
--- linux.orig/arch/x86/kernel/alternative.c
+++ linux/arch/x86/kernel/alternative.c
@@ -511,31 +511,29 @@ void *__kprobes text_poke(void *addr, co
unsigned long flags;
char *vaddr;
int nr_pages = 2;
+ struct page *pages[2];
+ int i;

- BUG_ON(len > sizeof(long));
- BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1))
- - ((long)addr & ~(sizeof(long) - 1)));
- if (kernel_text_address((unsigned long)addr)) {
- struct page *pages[2] = { virt_to_page(addr),
- virt_to_page(addr + PAGE_SIZE) };
- if (!pages[1])
- nr_pages = 1;
- vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
- BUG_ON(!vaddr);
- local_irq_save(flags);
- memcpy(&vadd...

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 11:27 am

> i'm wondering what the best sanity checking would be. What we want is to

If you enable VIRTUAL_BUG_ON on x86-64 in mmzone_64.h it would have
caught it on a NUMA kernel I think.

-Andi
--

To: Linus Torvalds <torvalds@...>
Cc: Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 11:26 am

something like the patch below? (untested)

Ingo

--------------->
Subject: harden kernel code patching
From: Ingo Molnar <mingo@elte.hu>
Date: Fri Apr 25 17:07:03 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/kernel/alternative.c | 5 +++++
mm/vmalloc.c | 3 +++
2 files changed, 8 insertions(+)

Index: linux/arch/x86/kernel/alternative.c
===================================================================
--- linux.orig/arch/x86/kernel/alternative.c
+++ linux/arch/x86/kernel/alternative.c
@@ -518,6 +518,11 @@ void *__kprobes text_poke(void *addr, co
if (core_kernel_text((unsigned long)addr)) {
struct page *pages[2] = { virt_to_page(addr),
virt_to_page(addr + PAGE_SIZE) };
+ /*
+ * Module text pages are PageReserved:
+ */
+ WARN_ON(pages[0] && !PageReserved(pages[0]))
+ WARN_ON(pages[1] && !PageReserved(pages[1]))
if (!pages[1])
nr_pages = 1;
vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
Index: linux/mm/vmalloc.c
===================================================================
--- linux.orig/mm/vmalloc.c
+++ linux/mm/vmalloc.c
@@ -391,6 +391,7 @@ static void __vunmap(const void *addr, i
struct page *page = area->pages[i];

BUG_ON(!page);
+ ClearPageReserved(page);
__free_page(page);
}

@@ -507,6 +508,8 @@ static void *__vmalloc_area_node(struct
area->nr_pages = i;
goto fail;
}
+ if (prot == PAGE_KERNEL_EXEC)
+ SetPageReserved(page);
area->pages[i] = page;
}

--

To: Ingo Molnar <mingo@...>
Cc: Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 11:33 am

No. That whole code sequence is total and utter crap. It needs to be
rewritten.

It first does a BUG_ON() if it's not naturally aligned (because that
wouldn't be atomic), and then it has code for page crossing! What a TOTAL
PIECE OF SH*T!

Hint:
- if it's naturally aligned, it couldn't be page crossing ANYWAY
- and if it was a page-crosser, it sure as hell couldn't be atomic!

The code is just crap, crap, crap. It needs to be rewritten from scratch.
I'll have a patch soonish.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 11:54 am

Woooow, just a sec here. I removed the atomicity test _because_ there
happen to be a case where it's safe to do non-atomic instruction
modification. If we do :

1) replace the instruction first byte by a breakpoint, execute an
instruction bypass (see the immediate values patches for detail)
2) modify the instruction non-atomically
3) put back the original instruction first byte.

That's why I removed the BUG_ONs at the beginning of the function.
That's also why it's required to deal with page crossing.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Linus Torvalds <torvalds@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 11:59 am

but the code as-is is nonsensical. It checks for:

BUG_ON(len > sizeof(long));

but then deals with page crossing...

it should also rename text_poke_early() to text_poke_core(), and call
_that_ from text_poke() if core_kernel_text(). From that alone the whole
poke_text() function would look a whole lot cleaner.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:11 pm

That was in the initial version, before my patch, yes. I dealt with page
crossing at first, then added a more restrictive test to "play safe" (I
should have removed the page-crossing code at that point), but later on
noticed that there was a single case where it's valid to do non-atomic
updates, and it's when the execution flow is bypassed by a breakpoint
(as the immediate values are doing), so the last patch you have removes

hrm, I am not convinced it's safe to call vmap() very early at boot
time. In the immediate values implementation, I do call text_poke very
very early at boot to populate the initial values. Or maybe are you
proposing something different from what I currently understand ?

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Linus Torvalds <torvalds@...>
Cc: Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 11:50 am

yeah :(

it seems that this code only worked because text_poke_early() [which can
take arbitrary length and alignment] does most of the patching, it is
the real code-patching machinery that is used during early bootup - and
that's not used later on.

text_poke() itself only applies/unapplies the LOCK prefix - a single
byte. We shouldnt be doing that at all: the cost of LOCK is
insignificant (a few cycles) and most systems are SMP anyway.

any other type of code patching should use stop_machine_run(), where
every CPU is stopped with irqs disabled.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:11 pm

No, the cost of LOCK is quite high on a lot of systems.

On P4's in particular, since LOCK is serializing, it's about 140 cycles or
so (and breaks all speculation). So we definitely want to remove it for
any generic kernels.

(lock is fairly cheap on AMD K8's, and reportedly on Intel's upcoming
Nehalem too, but on Core 2 it's about 35 cycles - quite noticeable,
although not nearly the disaster that netburst is)

Oh, and text_poke() is also used for inserting the debug instruction for
kprobes (and restoring the original byte), but yes, that is always just a
single byte too.

Linus
--

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 11:57 am

Alas, on older CPUs the cost of LOCK can be massive. The question is
how much we really care - the embedded people (who would definitely be
affected) will simply build UP kernels, and this only affects booting
SMP kernels on UP.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 2:53 pm

Like... say distros on older hardware?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 11:48 am

With the current code it doesn't need to be atomic anyways because
all patching is done with other CPUs stopped, except for kprobes
but those only ever write a single byte.

So all these checks can be just removed.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:06 pm

Quite frankly, I'd rather tighten them up. All the callers actually seem
to do just a single-byte one.

So I'd suggest really tightening it up to require total natural alignment
(rather than the weaker version that required that it fit in an aligned
unsigned long or whatever). And I'd suggest using FIXMAP's instead of
vmap. Maybe something like the appended (TOTALLY UNTESTED!)

Linus

---
arch/x86/kernel/alternative.c | 32 ++++++++++++++++----------------
include/asm-x86/fixmap_32.h | 1 +
include/asm-x86/fixmap_64.h | 1 +
3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index df4099d..6172e40 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -508,24 +508,24 @@ void *text_poke_early(void *addr, const void *opcode, size_t len)
*/
void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
{
- unsigned long flags;
- char *vaddr;
- int nr_pages = 2;
+ static DEFINE_SPINLOCK(poke_lock);
+ unsigned long flags, bits;

+ bits = (unsigned long) addr;
BUG_ON(len > sizeof(long));
- BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1))
- - ((long)addr & ~(sizeof(long) - 1)));
- if (kernel_text_address((unsigned long)addr)) {
- struct page *pages[2] = { virt_to_page(addr),
- virt_to_page(addr + PAGE_SIZE) };
- if (!pages[1])
- nr_pages = 1;
- vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
- BUG_ON(!vaddr);
- local_irq_save(flags);
- memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
- local_irq_restore(flags);
- vunmap(vaddr);
+ BUG_ON(len & (len-1));
+ BUG_ON(bits & (len-1));
+
+ if (core_kernel_text(bits)) {
+ unsigned long phys = __pa(addr);
+ unsigned long offset = phys & ~PAGE_MASK;
+ unsigned long virt = fix_to_virt(FIX_POKE);
+ phys &= PAGE_MASK;
+
+ spin_lock_irqsave(&poke_lock, flags);
+ set_fixmap(FIX_POKE, phys);
+ memcpy((void *)(vi...

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:22 pm

hm, right now we've got a debug protection in set_fixmap() to make sure
it's only ever called once. So it's going to be a noisy bootup. (but
it's a warning only) The patch below removes that.

Ingo

------------->
Subject: x86: remove set_fixmap() warning
From: Ingo Molnar <mingo@elte.hu>
Date: Fri Apr 25 18:05:57 CEST 2008

set_fixmap() is safe as long as it's explicitly serialized between
all users.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/mm/init_64.c | 3 ---
1 file changed, 3 deletions(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -173,9 +173,6 @@ set_pte_phys(unsigned long vaddr, unsign
new_pte = pfn_pte(phys >> PAGE_SHIFT, prot);

pte = pte_offset_kernel(pmd, vaddr);
- if (!pte_none(*pte) &&
- pte_val(*pte) != (pte_val(new_pte) & __supported_pte_mask))
- pte_ERROR(*pte);
set_pte(pte, new_pte);

/*
--

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:37 pm

No, I think the warning is good, I should have done some kind of
clear_fixmap() after doing the mmap.

But there was actually a much worse problem with my patch: __set_fixmap()
is __init. Which means that my patch was just totally broken.

What I really wanted to do was to just follow the page tables and mark it
writable temporarily over the whole loop, and get rid of the whole mess.

(We'd need to make __set_fixmap() non-init, and probably return the pte_t
pointer that it used, so that we could then just use "native_pte_clear()"
on the thing after having done the memcpy()).

I suspect I should have just kept using vmap(), even if I do dislike just
how insanely expensive that likely is.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:52 pm

clear_fixmap() is OK. I've made a tree with all these fixlets, in the
proper order and with the commit logs tidied up:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-x86-fixes3.git for-linus

[ i integrated Jiri's commit to before your fix because he really
deserves that commit (and more) for his relentless debugging effort. ]

below is the full shortlog and diff. Minimally tested on 64-bit so far.

Ingo

------------------>

Ingo Molnar (3):
x86: make clear_fixmap() available on 64-bit as well
x86: make __set_fixmap() non-init
x86: harden kernel code patching

Jiri Slaby (1):
x86: fix text_poke()

Linus Torvalds (1):
x86: clean up text_poke()

arch/x86/kernel/alternative.c | 35 +++++++++++++++++++----------------
arch/x86/mm/init_64.c | 5 ++---
include/asm-x86/fixmap.h | 8 ++++++++
include/asm-x86/fixmap_32.h | 8 +++-----
include/asm-x86/fixmap_64.h | 5 +++--
5 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index df4099d..2e39830 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -508,24 +508,27 @@ void *text_poke_early(void *addr, const void *opcode, size_t len)
*/
void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
{
- unsigned long flags;
- char *vaddr;
- int nr_pages = 2;
+ static DEFINE_SPINLOCK(poke_lock);
+ unsigned long flags, bits;

+ bits = (unsigned long) addr;
BUG_ON(len > sizeof(long));
- BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1))
- - ((long)addr & ~(sizeof(long) - 1)));
- if (kernel_text_address((unsigned long)addr)) {
- struct page *pages[2] = { virt_to_page(addr),
- virt_to_page(addr + PAGE_SIZE) };
- if (!pages[1])
- nr_pages = 1;
- vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
- BUG_ON(!vaddr);
- local_irq_save(flags);
- memcpy(&vaddr[(unsigned long)addr & ~PAGE_MA...

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:56 pm

If it's really a problem it would be better to just batch it and extract
it into a separate function. The larger scale callers of text_poke() are
loops, so you could just map it once before the loop and then unmap after.
But I haven't heard about anyone complaining about this.

-Andi
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:45 pm

ah, on 64-bit. That we better make consistent anyway, via the patch
below. set_pte_phys() needs to become non-init as well.

Ingo

----------->
Subject: x86: make __set_fixmap() non-init
From: Ingo Molnar <mingo@elte.hu>
Date: Fri Apr 25 18:28:21 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/mm/init_64.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -135,7 +135,7 @@ static __init void *spp_getpage(void)
return ptr;
}

-static __init void
+static void
set_pte_phys(unsigned long vaddr, unsigned long phys, pgprot_t prot)
{
pgd_t *pgd;
@@ -214,8 +214,7 @@ void __init cleanup_highmap(void)
}

/* NOTE: this is meant to be run only at boot */
-void __init
-__set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t prot)
+void __set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t prot)
{
unsigned long address = __fix_to_virt(idx);

--

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:51 pm

Make it return the "pte_t *", and now you don't have to walk the page
tables twice to just clear it immediately afterwards. At that point I
think my patch will be happy and useful, but I also worry a bit whether it
was worth the changes..

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 1:02 pm

performance i dont think we should be too worried about at this moment -
this code is so rarely used that it should be driven by robustness i
think.

one theoretical worry i have is that we've got the pending immediate
values changes from Mathieu. Those end up removing the original
BUG_ON(len > sizeof(long)) restriction (and the alignment check) and
uses a carefully crafted (but scary as hell) sequence of text_poke()
sequences to turn a marker into a single-instruction NOP when the marker
is inactive.

Single-instruction NOP markers is a rather ... tempting goal and it can
(and must be able to) patch instructions across page boundaries as well.

i think with the PageReserved WARN_ON() we should be sufficiently
protected against stray scribbles so Mathieu's fix might be usable as
well - see it below.

Note that the BUG_ON()s at the end of the text_poke() version below
should have caught this bug too i think - because the bug was due to
mis-mapping the pages due to the incorrect kernel_text_address()
condition so we'd have noticed that the expected bits did not end up in
the right place.

Ingo

----------------------->
Subject: Fix sched-devel text_poke
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Date: Thu, 24 Apr 2008 11:03:33 -0400

Use core_text_address() instead of kernel_text_address(). Deal with modules in
the same way used for the core kernel.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/kernel/alternative.c | 38 ++++++++++++++++++--------------------
1 file changed, 18 insertions(+), 20 deletions(-)

Index: linux/arch/x86/kernel/alternative.c
===================================================================
--- linux.orig/arch/x86/kernel/alternative.c
+++ linux/arch/x86/kernel/alternative.c
@@ -511,31 +511,29 @@ void *__kprobes text_poke(void *addr, co
unsigned long flags;
char *vaddr;
int nr_pages = 2;
+ struct page *...

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 1:13 pm

That really isn't true. This isn't done just once. It's done many
thousands of times.

I agree that it has to be robust, but if we want to make suspend/resume
be instantaneous (and we do), performance does actually matter. Yes, this
is probably much less of a problem than waiting for devices, and no, I
haven't timed it, but if I counted right, we'll literally be going almost
ten thousand of these calls over a suspend/resume cycle.

That's not "rarely used".

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 1:53 pm

yeah, it's done 2800 times on my box with a distro .config.

no strong feeling either way - but i dont think there's any cross-CPU
TLB flush done in this case within vmap()/vunmap(). Why? Because when
alternative_instructions() runs then we have just a single CPU in
cpu_online_map.

So i think it's only direct vmap()/vunmap() overhead, on a single CPU.
We do a kmalloc/kfree which is rather fast - sub-microsecond. We install
the pages in the pte's - this is rather fast as well - sub-microsecond.
Even assuming cache-cold lines (which they are most of the time) and
taken thousands of times that's at most a few milliseconds IMO.

In fact, most of the actual vmap() related overhead should be
well-cached (the kmalloc bits) - the main cost should come from trashing
through all the instruction sites and modifying them.

i just measured the actual costs, and the UP/SMP offline/online
transition time (with Jiri's patch applied) is:

# time echo 0 > /sys/devices/system/cpu/cpu1/online

real 0m0.116s
user 0m0.000s
sys 0m0.008s

# time echo 1 > /sys/devices/system/cpu/cpu1/online

real 0m0.095s
user 0m0.000s
sys 0m0.069s

with your fixmap patch:

# time echo 0 > /sys/devices/system/cpu/cpu1/online

real 0m0.110s
user 0m0.001s
sys 0m0.003s

# time echo 1 > /sys/devices/system/cpu/cpu1/online

real 0m0.099s
user 0m0.000s
sys 0m0.072s

(i ran it multiple times and picked a representative run)

i also did a third control run with a kernel that had
alternative_instructions() disabled. The offline/online cost is:

# time echo 0 > /sys/devices/system/cpu/cpu1/online

real 0m0.108s
user 0m0.000s
sys 0m0.000s

# time echo 1 > /sys/devices/system/cpu/cpu1/online

real 0m0.096s
user 0m0.000s
sys 0m0.068s

_perhaps_ there's a decrease in time but i couldnt say it for sure,
because in the 'go online' case the numbers are so similar.

...

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 2:13 pm

i just did some direct measurements of alternatives_smp_switch() itself:

alternatives took: 7374 usecs
alternatives took: 8775 usecs
alternatives took: 7498 usecs
alternatives took: 8776 usecs

that's on a ~2GHz Athlon64 X2 - so not the latest hw.

i also added a sysctl to turn alternatives patching on/off, and the CPU
offline+online cycle:

# alternatives on:
real 0m0.152s
real 0m0.172s

# alternatives off:
real 0m0.146s
real 0m0.168s

so it's measurable and it is in the few milliseconds range. (But there
seems to be strong dependency on the kernel image layout or some other
detail - compare these timings to my previous timings - they were
radically different.)

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 2:09 pm

Ok, fair enough. Without the IPI, I don't think there's a big deal. And
you have the numbers to prove it. Consider me convinced.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 2:19 pm

great - i've lined up all the fixes into this git tree which you can
pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-x86-fixes4.git for-linus

this has Jiri's fix followed by Mathieu's vmap logic cleanups, plus a
bit of extra checks and the API extensions for set_fixmap (we didnt end
up using them but they make sense nevertheless).

Lightly tested though, so even if you agree with the changes you might
want to wait an hour with the pull just in case some trivial build issue
slipped in. Shortlog and diff below.

Ingo

------------------>
Ingo Molnar (4):
x86: make clear_fixmap() available on 64-bit as well
x86: make __set_fixmap() non-init
x86: remove set_fixmap() warning
x86: harden kernel code patching

Jiri Slaby (1):
x86: fix text_poke()

Mathieu Desnoyers (1):
x86: clean up text_poke()

arch/x86/kernel/alternative.c | 39 +++++++++++++++++++--------------------
arch/x86/mm/init_64.c | 7 +++----
include/asm-x86/fixmap.h | 8 ++++++++
include/asm-x86/fixmap_32.h | 7 ++-----
include/asm-x86/fixmap_64.h | 4 ++--
5 files changed, 34 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index df4099d..65c7857 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -511,31 +511,30 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
unsigned long flags;
char *vaddr;
int nr_pages = 2;
+ struct page *pages[2];
+ int i;

- BUG_ON(len > sizeof(long));
- BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1))
- - ((long)addr & ~(sizeof(long) - 1)));
- if (kernel_text_address((unsigned long)addr)) {
- struct page *pages[2] = { virt_to_page(addr),
- virt_to_page(addr + PAGE_SIZE) };
- if (!pages[1])
- nr_pages = 1;
- vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
- BUG_ON(!vaddr);
- local_irq_save(flags);
- memcpy(&vaddr[(un...

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 2:56 pm

ok, it's better tested now - 10 random bootups, amongst them 64/32-bit
allyesconfig bootups, and some targeted testing as well with
offlining/onlining CPUs and no problems found so far. Please pull.

Ingo
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 2:04 pm

i mean, alternatives_smp_switch().

Ingo
--

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 1:26 pm

For suspend/resume we can actually just disable all the text_poke()s.
They are not needed because we don't expect to stay in single CPU
mode for long after wake up and they will just be undone again.

I guess if it really was a problem (but really I haven't heard about it)
the easiest fix would be to just extended system_state to SYSTEM_SUSPEND
and then skip them if that is true.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 1:29 pm

I do agree that we might decide to just not do this at all except for the
actual physical bootup phase (which can use early_text_poke()). There may
not be a whole lot of point to ever play with smp_alterinatives() at any

Our device suspend right now takes about 3.5 seconds (that's using the
debug thing, which adds a 5-second timeout). That *is* a problem, but it's
historically been hidden by the fact that people are happy that suspend
works at all when it does.

These days, we're getting to the point (I think) that a lot more people
are going to take suspend for granted. And I'd actually like to use it as
a sleep state for desktop like usage (let's face it, when the screen goes
dark, the CPU should just go into suspend too if it's used as a desktop by
non-technical users).

And for that to be useful, it needs to come up quickly. Not add another
second on top of the already-irritating delay of the screen waking up.

Are we there yet? Hell no. But I don't think we're too far off.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:43 pm

yeah - then you need the patch below that makes clear_fixmap() available
on 64-bit as well.

Ingo

--------------->
Subject: x86: make clear_fixmap() available on 64-bit as well
From: Ingo Molnar <mingo@elte.hu>
Date: Fri Apr 25 18:25:25 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/asm-x86/fixmap.h | 8 ++++++++
include/asm-x86/fixmap_32.h | 7 ++-----
include/asm-x86/fixmap_64.h | 4 ++--
3 files changed, 12 insertions(+), 7 deletions(-)

Index: linux/include/asm-x86/fixmap.h
===================================================================
--- linux.orig/include/asm-x86/fixmap.h
+++ linux/include/asm-x86/fixmap.h
@@ -1,5 +1,13 @@
+#ifndef _ASM_FIXMAP_H
+#define _ASM_FIXMAP_H
+
#ifdef CONFIG_X86_32
# include "fixmap_32.h"
#else
# include "fixmap_64.h"
#endif
+
+#define clear_fixmap(idx) \
+ __set_fixmap(idx, 0, __pgprot(0))
+
+#endif
Index: linux/include/asm-x86/fixmap_32.h
===================================================================
--- linux.orig/include/asm-x86/fixmap_32.h
+++ linux/include/asm-x86/fixmap_32.h
@@ -10,8 +10,8 @@
* Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999
*/

-#ifndef _ASM_FIXMAP_H
-#define _ASM_FIXMAP_H
+#ifndef _ASM_FIXMAP_32_H
+#define _ASM_FIXMAP_32_H

/* used by vmalloc.c, vsyscall.lds.S.
@@ -121,9 +121,6 @@ extern void reserve_top_address(unsigned
#define set_fixmap_nocache(idx, phys) \
__set_fixmap(idx, phys, PAGE_KERNEL_NOCACHE)

-#define clear_fixmap(idx) \
- __set_fixmap(idx, 0, __pgprot(0))
-
#define FIXADDR_TOP ((unsigned long)__FIXADDR_TOP)

#define __FIXADDR_SIZE (__end_of_permanent_fixed_addresses << PAGE_SHIFT)
Index: linux/include/asm-x86/fixmap_64.h
===================================================================
--- linux.orig/include/asm-x86/fixmap_64.h
+++ linux/include/asm-x86/fixmap_64.h
@@ -8,8 +8,8 @@
* Copyright (C) 1998 Ingo Molnar
*/

-#ifndef _ASM_FIXMAP_H
-#define _ASM_...

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:19 pm

I think Mathieu did them to prepare for his immediate values which
need to write more bytes (although actually it would be quite
possible to have immediate values only for byte immediates too)

For the common (everything but kprobes) "other code not running"
it doesn't matter and I don't think natural alignment works for the
other cases anyways.

Not sure how the fixmap is better. It's pretty much equivalent, isn't it?
Perhaps a little cheaper, but the code shouldn't be performance critical.

-Andi

--

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:24 pm

I have no really strong opinions. However, we do have a *lot* of lock
prefixes in the kernel, and fixmaps are a lot cheaper than vmap(). It may
not be performance-critical, but for me the "locks" section for the kernel
is 0x8060 bytes long, which would seem to say that this is called four
thousand times for each suspend and resume.

With each invocation being thousands of instructions and a cross-CPU IPI
for the tlb flush, that kind of stuff adds up. We're likely talking real
fractions of a second, rather than milliseconds.

But no, I didn't time it or really think very deeply about it.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Nick Piggin <nickpiggin@...>
Date: Friday, April 25, 2008 - 2:13 pm

Doesn't vunmap batch the cross-CPU tlb flushes to amortize the cost?
Hm, no, it doesn't seem to. Oh, right, it was one of Nick's TBDs.

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>
Date: Sunday, May 4, 2008 - 10:36 pm

Yeah and I do have patches... posted a few months ago IIRC. I forget
offhand the exact details of the batching, but yes it should batch this
case of Linus's. I think I set it to batch 1024 vunmaps or xxxxKB per
IPI flush. With those patches it basically removed IPI flushing
completely from profiles of vmap intensive workloads.

The problem with vmap (outside this particular issue of text poking,
which should be single-threaded anyway), really is that it is a single
threaded allocator. The locking actually ends up hurting much more than
the IPIs for non-trivial uses of vmap (like xfs with directories larger
than PAGE_SIZE). Fortunately I have patches for that too ;)
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:33 pm

the other thing is atomicity - your new version of text_poke() is
evidently atomic - while vmap() does a kmalloc which might sleep.
Atomicity for something as fragile as code-patching never hurts, so i
definitely like your version more.

it's also the more familar API - set_fixmap() is used more frequently
than vmap() - hence less danger of doing something wrong.

in fact i'd do the extra sanity check below as well on top of your patch
- all core kernel text pages are PageReserved so the one below would
have caught the memory corruption right at its source.

Ingo

---------------->
Subject: x86: harden kernel code patching
From: Ingo Molnar <mingo@elte.hu>
Date: Fri Apr 25 17:07:03 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/kernel/alternative.c | 2 ++
1 file changed, 2 insertions(+)

Index: linux/arch/x86/kernel/alternative.c
===================================================================
--- linux.orig/arch/x86/kernel/alternative.c
+++ linux/arch/x86/kernel/alternative.c
@@ -522,6 +522,8 @@ void *__kprobes text_poke(void *addr, co
unsigned long virt = fix_to_virt(FIX_POKE);
phys &= PAGE_MASK;

+ WARN_ON(!PageReserved(virt_to_page(addr)));
+
spin_lock_irqsave(&poke_lock, flags);
set_fixmap(FIX_POKE, phys);
memcpy((void *)(virt + offset), opcode, len);
--

To: Andi Kleen <andi@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:30 pm

Yes, the immediate values, in general, only need to do atomic writes,
because I have taken care of placing the mov instruction in the correct
alignment so its immediate value happens to be aligned in memory.
However, the latest optimisation I did to change a conditional branch
into a jump when the correct code pattern is detected :

mov, test, bne short
into a
nop2, nop2, nop1, jmp short

or

mov, test, bne near
into a
nop2, nop2, nop1, jmp near

"replace_instruction_safe" is used for that. It puts a breakpoint in
lieue of each instruction's first byte before changing the rest of the
(potentially non aligned) instruction non atomically, and only then,
after issuing a sync_core on every CPUs to flush the trace cache, does
it put back the first byte, so it's done safely wrt intel's erratas
regarding code modification on SMP. Also note that it changes a 6 bytes
branch instruction into a 1 byte nop + 5 byte jump in the near jump
case, which is ok : you can split an instruction in multiple smaller
instructions safely on a live system wrt any execution context, but the
opposite is _not_ ok, since there could be a return address pointing in
the middle of the grouped instructions sitting on some other kernel
thread or interrupt stack (we should also take into account hypervisor
interaction here).

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 12:42 pm

And how, pray tell, do you deal with the fact that:

a) the EFLAGS may be live on exit;
b) there might be a jump into the middle of this instruction sequence?

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 1:09 pm

Actually, not only EFLAGS can be live on exit, but also the immediate
value itself.

If we take the mov, test, jne short case into account, I force the mov
to populate the %al register with some immediate value. Then, this value
is extracted from the inline assembly and feeded to an if() c statement
under the form of a variable. So, I check precisely for a mov %al,0,
followed by test and bne. If I don't find it (due to gcc optimizations),
then I leave the original immediate value there. I start the pattern
matching from the address of the movb instruction, which I extract from
the inline assembly. So, about the EFLAGS : given that I first change
the jne for an unconditional jump, I just don't care about the status of
the ZF : jump does not change the EFLAGS, and it does not depend on any.
However, it is still valid to leave the mov and test instructions there,
because ZF is considered "live" by gcc across the test+jne instructions.

Then, I patch mov and test in any order, because we just don't care
about the status of the ZF, or do we... ? The only limitation is that a
given imv_cond(var) should only be used in the following pattern :

if (imv_cond(var)) ...

Trying to save the result of imv_cond(var) and use it in multiple if()
statements would cause the compiler to duplicate tests and branches on
that variable and the pattern matching would not see that. I think it's
what you fear. Now that you speak of it, it might be better to leave the
movb and test instruction there to make sure we don't kill the ZF which

If we change that, as discussed above, so the liveliness of ZF and of
the %al register is still insured by leaving the mov and test
instructions in place, we end up only modifying a single instruction and
the problem fades away. We would end up changing a jne for a jmp.

Thanks,

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: H. Peter Anvin <hpa@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 2:37 pm

Thinking about it, there could be a way to insure limited ZF and %al
liveliness: adding an epilogue to the expected instruction sequence
formed by an asm statement which clobbers the flags (flags are clobbered
in any asm statement on x86) and clobbers %al.

From that point, we just have to find a specific signature that gcc
could not imitate to put in this asm statement, so we can detect if
other instructions have been placed in the middle of our sequence by
gcc. Actually, I think the best thing to do with this asm statement is
to put the instruction pointer in a special section, so we know that
this code location marks the end of ZF and %al liveliness. There would
be therefore no added code, just asm constraints.

This epilogue should then be used on both branches of the condition,
like this :

if (unlikely(imv_cond(var))) {
imv_cond_end();
...
} else {
imv_cond_end();
...
}

Where imv_cond_end() would look like this :

+/*
+ * Puts a test and branch make sure the %al register and ZF are not live
+ * anymore.
+ * All asm statements clobbers the flags, but add "cc" clobber just to be sure.
+ * Clobbers %al.
+ */
+#define imv_cond_end() \
+ do { \
+ asm (".section __imv_cond_end,\"a\",@progbits\n\t" \
+ _ASM_PTR "1f\n\t" \
+ ".previous\n\t" \
+ "1:\n\t" \
+ : : : "a", "cc"); \
+ } while (0)
+

The pattern to test for will therefore become :

mov, test, branch, address following branch should be in the
__imv_cond_end table.
The address of the branch target site would also have to be in the

So, if we do is I propose here, we have to take into account this
question too. Any jump that jumps in the middle of ...

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 4:18 pm

As far as this is concerned, all you accomplish here is that gcc, if it
wants to re-use the %al value, will copy it into another register before
doing your imv_conv_end().

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 4:37 pm

Exactly, and by doing so, it will have to add instructions (mov, push..)
in the instruction pattern I am looking for and therefore I will detect
this and fall back on standard immediate values.

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 4:41 pm

So what you're saying is you'll follow all the branches of code until
you detect an immediate value (and eflags) kill.

Yes, that should work. It's still ugly, and I have to say I find the
complexity rather distasteful. I am willing to be convinced it's worth
it, but I would really like to see hard numbers.

Personally, I wouldn't be all that surprised if you lost more in
constraining gcc scheduling than you gain.

-hpa
--

To: <hpa@...>
Cc: <mathieu.desnoyers@...>, <andi@...>, <torvalds@...>, <mingo@...>, <jirislaby@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, <jeremy@...>
Date: Friday, April 25, 2008 - 5:02 pm

From: "H. Peter Anvin" <hpa@zytor.com>

This stuff would have been a lot easier if it just worked with
normal relocations generated by the assembler, and that would
work in such a straightforward way on EVERY architecture.

The immediate instance generators could just use macros that
architectures define, which are given a range of legal values for the
immediate, and the macro emits the inline asm sequence that can
support an immediate value of that range.

Then we do a half-link of the kernel, collect the unresolved
relocations from generated by the immediate macros into a table which
gets linked into the kernel, then resolve them in the final link all
to zero or some defined initial value.

Then it's just a matter of running through the relocation handling
we already have for module loading when changing an immediate
value.

None of this crazy instruction parsing and branch following crap.
I can't believe we're seriously considering this crud. :-/
--

To: David Miller <davem@...>
Cc: <mathieu.desnoyers@...>, <andi@...>, <torvalds@...>, <mingo@...>, <jirislaby@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, <jeremy@...>
Date: Friday, April 25, 2008 - 5:11 pm

That's already there, for all practical purposes. The point of
contention here is trying to go from immediate value rewriting to branch
rewriting, which is probably the vast majority of all desired uses.
However, branch rewriting affects the flow of control, and flow of
control is inherently vital for gcc to understand.

I'm not inherently opposed to branch target rewriting, but I believe gcc
really needs to be involved in the process. On systems compiled with
older compilers, we just won't use that feature -- similar to most other
features introduced in a new compiler.

-hpa

--

To: H. Peter Anvin <hpa@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 4:51 pm

I really cannot imagine that this kind of pain is *ever* worth it.

Please give an example of something so important that we'd want to do
complex code rewriting on the fly. What _is_ the point of imv_cond()?

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 5:12 pm

The point is to provide a way to dynamically enable code at runtime
without noticeable performance impact on the system. It's principally
useful to control the markers in the kernel, which can be placed in very
frequently executed code paths. The original markers add a memory read,
test and conditional branch at each marker site. By using the immediate
values patchset, it goes down to a load immediate value, test and branch.

However, Ingo was still unhappy with the conditional branch, so I cooked
this jump patching optimization on top of the immediate values. It
looks for an expected pattern which limits the liveliness of the %al and
ZF registers to the 3 instructions and, if it finds it, patches a jump
located just before the mov instruction to skip the whole pattern and
behave exactly like the conditional branch.

So basically we get code dynamically actvated by patching a single jump.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>
Date: Saturday, April 26, 2008 - 2:50 am

I think all this demonstrates that the conditional branch is a bearable
cost compared to the alternative. A conditional branch which almost
always branches the same way is very predictable, and really shouldn't
cost very much.

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>
Date: Sunday, April 27, 2008 - 8:49 pm

I agree with you.

When I measured the performance of a tracer (LKST) which used conditional
branches, the overhead of the conditional branch itself was very small
(less than 1%).
Moreover, some benchmarks showed the performance of the patched kernel
became ~1% faster than before :-) (I guessed that came from changing of
memory access pattern and timing.)

I think, if someone is considering about the actual performance impacts,
we'd better discuss the effects of the individual trace points, based
on the actual results of some benchmarks.
Thus, we can improve it step by step.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 6:04 pm

Quite frankly, maybe I'm a bit dense, but why don't you just recompile the
whole original function (at run-time), load that new version of a function
as a mini-module, and then insert a marker at the top of the old function
that just does a "jmp replacementfunction".

That has _zero_ cost for the non-marker case, and allows you to do pretty
much any arbitrary code changes for the marker case.

It's also a much simpler replacement.

Yeah, that "jmp replacementfunction" is five or more bytes, but you can
trivially do the actual _replacement_ write by writing it first as a
single-byte debug trap, and after that has been written, write the target
address after it, and then write the first byte of the "jmp" instruction
last. In the (very unlikely) case that another CPU hits that debug trap,
you just fix it up in the debug handler - you only need a single datum of
"this is where that debug trap should relocate", because you simply create
a triial spinlock around the code-sequence that does the instruction
rewrite.

When undoing it, just do the same thing in reverse.

Yeah, this requires you to basically recompile some function snippet when
you insert a probe, but if that scares people, you could basically do it
using the old code and inserting the markers and "relinking" it - avoiding
the C compiler, and just basically have an "assembly recompiler".

And yeah, maybe you want to do without the use of modules, and you'd just
have a memory area that is kept free for these kinds of code replacement
issues. And you can optimize it to not recompile the whole function, but
do it on a finer granularity if you want.

And sure, you want to really make sure that there is security in place so
that this isn't used for rootkits, but isn't that true of pretty much
*any* trace facility?

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, <peterz@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Thursday, June 5, 2008 - 1:44 pm

Linus, was it your intention to signal that you would veto any uses of
the current trace_mark mechanism, and wait for this hypothetical
function-recompilation-splicing widget as a replacement? This is how
some people are interpreting this old thread.

A number of problems with the new idea were brought up, and no one
appears to have taken interest in trying to build it to see if they
can be overcome or if there are more.

On the other hand, a number of concerns with the markers have been
dealt with since, such as performance numbers showing near-zero
impact, and a variety of experience with the few dozen lttng markers
and the tools that consume the data. The current debate appears to be
stuck on fuzzier aesthetic issues.

How are we to move forward? Do you see any *harm* in letting in the
lttng markers soon?

Could it be that once this "recompile function with instrumentation on
the fly" machinery comes into existence eventually, then these exact
same marker points could be reinterpreted as one particular potential
instrumentation spot? (This could be something as simple as building
the kernel with CONFIG_MARKERS=n by default so the markers are
compiled out, then having selected alternative functions built with
CONFIG_MARKERS=y.)

- FChE
--

To: Linus Torvalds <torvalds@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 10:12 pm

You mentioned possible solutions to some of the problems this
ambitious an approach would cause. Here are a few more complications:

- instrumenting inlined functions

- proper sharing of static function data amongst multiple live
copies of same function

- unknown implications of violating long-standing assumptions about
functions not changing addresses

- interaction with other code modification machinery (kprobes, ...)

- necessity to carry kernel sources & compilers on machines; slow
marker activation

- FChE
--

To: Linus Torvalds <torvalds@...>
Cc: H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>, Frank Ch. Eigler <fche@...>, <systemtap@...>
Date: Friday, April 25, 2008 - 7:00 pm

This idea has been considered a few years ago at OLS in the tracing BOF
if I remember well. The results were this : First, there is no way to
guarantee that no code path, nor any return address from any function,
interrupt, sleeping thread, will return to the "old" version of the
function. Nor is it possible to determine when a quiescent state is
reached. Therefore, we couldn't see how we can do the teardown.

The second point is dependency between execution flow and variables. If
we don't do a complete copy of the variables (which I don't see how we
can do atomically), we will have to share the variables between the old
and the new copies of the functions. However, some variables might
encode information about the execution flow of the program and depend on
the actual address at which the code is linked (function pointers for

Then dealing with multiple code patching infrastructures (kprobes,
alternatives, paravirt) would become hellish. If a kprobe is planted in
the original version of the function, we have to insert it in the new

Yep.

The discussion I refer to took place at OLS a few years ago. Other
participants might remember some other details I forgot.

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Frank Ch. Eigler <fche@...>, <systemtap@...>
Date: Friday, April 25, 2008 - 7:13 pm

Does that matter? The new function is semantically identical to the old
one, and the old code will remain in place. If there's still users in
the old function it may take a while for them to get flushed out (and
won't be traced in the meantime), but you have to expect some missed
events if you're shoving any kind of dynamic marker into the code. The
main problem is if there's something still depending on the first 5
bytes of the function (most likely if there's a loop head somewhere near
the top of the function).

Updating the markers would mean you'd leave a trail of old versions

Obviously you'd only pick up new callers of the function, which would
mean that they'd pick up the new versions of those function-local
things. Though you'd need to make sure that the new versions of the

The module machinery already deals with patching paravirt and
alternatives into loaded modules. Your bespoke module would get dealt
with like any other module.

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Frank Ch. Eigler <fche@...>, <systemtap@...>
Date: Friday, April 25, 2008 - 7:34 pm

I think we have to ensure no threads sleeping or being interrupted on
the function when removing new function. How would you check it?

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

To: Masami Hiramatsu <mhiramat@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Frank Ch. Eigler <fche@...>, <systemtap@...>
Date: Saturday, April 26, 2008 - 2:21 am

Not sure I follow you. You'd never remove any code. But you'd only
start tracing new callers of the function. If the function loops
indefinitely, you could potentially have some users which never end up
getting traced. Also, if those users depend on instructions in the
first 5 bytes of the function, they would crash because of the jump to
the new function patched on top of them.

Overall, it doesn't seem like a very satisfactory mechanism...

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Masami Hiramatsu <mhiramat@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Frank Ch. Eigler <fche@...>, <systemtap@...>
Date: Saturday, April 26, 2008 - 7:56 am

You do, when you decide to stop tracing. He is not talking about the old
function, that one, indeed will always be there, but what about the new
one? When tracing stops we want to remove it and revert to using the old
one...

But perhaps you are suggesting that the new one, once loaded, stays
there forever, that would work, but after several tracing sessions one
would have to eventually reboot the machine due to many modules left
loaded.

- Arnaldo
--

To: Arnaldo Carvalho de Melo <acme@...>, Jeremy Fitzhardinge <jeremy@...>, Masami Hiramatsu <mhiramat@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Frank Ch. Eigler <fche@...>, <systemtap@...>
Date: Saturday, April 26, 2008 - 7:38 pm

As I said, it doesn't seem like a very satisfactory way to solve the
problem.

J

--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Arnaldo Carvalho de Melo <acme@...>, Masami Hiramatsu <mhiramat@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Frank Ch. Eigler <fche@...>, <systemtap@...>
Date: Saturday, April 26, 2008 - 9:00 pm

Indeed :-)

- Arnaldo
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 5:15 pm

Note that all these optimizations only make sense if the case where we
*take* the "marker" is frequent, *and* the marker itself is not too
expensive.

If that is not the case, just put in a noop that is dynamically patched
to an INT3 or ICEBP instruction (one byte) or an INT instruction (two
bytes), take the exception, look up the source address and revector to
the marker code.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 5:47 pm

Yes, this is the case. Using breakpoints for markers quickly becomes
noticeable for thing such as scheduler instrumentation, page fault
handler instrumentation, etc. And yes, I have developed kernel tracer,
LTTng, which takes care of writing the data to trace buffers
efficiently. The last time I took performance measurements, it was
performing locking and writing to the memory buffer in about 270ns on a
3GHz Pentium 4. It might be a tiny bit slower now that it parses the
markers format strings dynamically, but nothing very significant.

But there is another point that markers do which the breakpoint won't
give you : they extract local variables from functions and they identify
them with field names which separates the instrumentation from the
actual kernel implementation details. In order to do that, I rely on gcc
building a stack frame for a function call, which I don't want to build
unnecessarity when the marker is disabled. This is why I use a jump to
skip passing the arguments on the stack and the function call.

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 6:07 pm

Well, debuggers do it, and that's ultimately what why we have debugging
annotation formats like DWARF2 - to be able to take an arbitrary state
and decode local variables from the combined register-memory state.
This is often done by an interpreter, but that's not necessary; a
compiler can use the debugging information and build appropriate capture
code, which would be able to execute very quickly. Not only is this
capable of extracting arbitrary information, but it also guarantees that
the extraction code is out of line.

The act of building a stack frame not only preturbs the generated code
(gcc has to guarantee liveness, which you can see as a pro or a con),
but it also puts a fair amount of code in the icache path of the function.

Now, if a breakpoint is too expensive, one can do exactly the same trick
with a naked call instruction, with a higher icache impact in the unused
case (five bytes instead of one or two). However, the key to low impact
is to use the debugging information to recover state.

(Liveness at the probe point is still possible to enforce with this
technique: give gcc a "g" read constraint as part of the probe
instruction. That makes gcc ensure the information is *somewhere*. The
debugging information will tell you where to pick it up from.
Obviously, any time liveness is enforce you suffer a potential cost.)

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 6:30 pm

DWARF2 is capable of extracting information only when not optimized away
by the compiler. That's the whole point of markers : liveness is good in
this case because we make sure the variable is there, not that it
*might* be there. The latter case might be good enough for a debugger,

if (unlikely(condition))
function_call(params);

The builtin expect will take care to put the instructions out of the
hot paths and therefore leave them out of the icache with gcc
-freorder-blocks (in -O2). The only addition to the frequently used
icache is, in this case, the 5 bytes jump, 2 bytes mov, 2 bytes test and
2 (or 6) bytes conditional branch, for a total of 11 bytes for small

The runtime cost of function call is bigger than the jump. I don't see

It could be possible to do so. However, passing a variable argument list
to a marker is rather more flexible than those inline assembly
constraints. And you are still tied to the variable names and offer no
abstraction between the kernel implementation and the conceptual name
associated to a traced variable.

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 6:36 pm

This is why you really do want to recompile the function entirely if
you're debugging it. Because it might simply not be debuggable in its
normal state.

I'd much rather see something truly generic that doesn't need any
pre-inserted "markers" at all that disable optimizations, and that allows
just about anything. Including live system bug-fixes etc (imagine finding
a bug - and not at somethign that was previously already "marked" - and
just replacing the buggy function with a non-buggy one).

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Monday, April 28, 2008 - 4:43 pm

Markers, with immediate values, only clobbers the eax register and the
ZF. It does not restrain inlining nor loop unrolling. It also requires
gcc to leave the variables in which the marker is interested "live". Are

kprobes is already doing a good job at probing a live system without
rebooting. Markers are best suited to export information about kernel
events which stays stable between releases so the information is readily
available in the kernel, with low overhead even when tracing is enabled
(which kprobes doesn't provide) which allows a user to flip the "on"
switch and get a trace of all system calls, scheduling, traps,
interrupts that happen on the system.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>
Date: Monday, April 28, 2008 - 5:02 pm

That in itself is pretty significant. If that value would otherwise be
constant folded or strength-reduced away, you're putting a big
limitation on what the compiler can do. The mere fact that its
necessary to do something to preserve many values shows how much the
compiler transforms the code away from what's in the source, and
specifically referencing otherwise unused intermediates inhibits that.

In other words, if you weren't preventing optimisations, you wouldn't
need to preserve values as much, because the optimiser wouldn't be
getting rid of them. If you need to preserve lots of values, you're
necessarily preventing the optimiser from doing its job.

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>
Date: Sunday, May 4, 2008 - 11:03 am

I am not saying that the standard marker will have to inhibit
optimizations. Actually, it's the contrary : a well-thought marker
should _not_ modify that kind of optimization, and we should put markers
in code locations less likely to inhibit gcc optimizations. However, in
the case where we happen to be interested in information otherwise
optimized away by GCC, it makes sense to inhibit this optimization in
order to have the information available for tracing.

I expect this to happen rarely, but I think we must deal with
optimizations to make sure we never trace garbage due to some unexpected
gcc optimization. I think it's a small (e.g. undetectable at the
macrobenchmark level) price to pay to get correct tracing information.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>
Date: Sunday, May 4, 2008 - 12:18 pm

That's a pretty flippant reply... liveness causes register pressure
which can cause rapid degradation in code quality on a register-starved
architecture like x86.

-hpa
--

To: Linus Torvalds <torvalds@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Monday, April 28, 2008 - 4:21 pm

Ob'plug: with the pending dyn-ftrace function tracer feature we do
something rather close to that already: we have a 5 byte NOP in the
prologue of every function that can be used as a non-destructive 'branch
away' place.

Right now we use that to trace a (regex-ish pattern identified) set of
functions. The regex pattern can be configured runtime via
/debug/tracing/function_filter is not parsed runtime in any fastpath -
it is used to activate/deactivate the tracepoints and patches them from
NOPs into CALLs.

_But_ the same mechanism could perhaps be used to patch the function as
well.

The cost is +5 bytes of NOP for every function in the system, but in
practice we've not been able to measure any actual runtime costs of
these NOPs - neither in micro-benchmarks nor in macro-benchmarks. (the
only real cost here is the +5 bytes of I$ cost - otherwise the NOP will
just be skipped by the decoder.)

the patching of these NOPs is inherently safe because they are inserted
at build time. There's no negative impact to gcc optimizations at all.
We get a nice selection of 75,000 tracepoints in an allmodconfig kernel
- without _any_ source code level impact in the functions.

On the other hand, i'm not opposed to a handful of static markers either
- i think the best model is to have both of these facilities. There are
a couple of 'core events' that are not expressed via function calls, and
even where they are expressed via function calls the function call
layout is not stable while markers are stable across kernel versions.
The notion of "a context-switch happened from task X to task Y" or "task
X woke up task Y" is not going to change anytime soon so i'm not opposed
to exposing that kind of information. And once we accept the static
markers, we might as well make them as cheap as possible.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, H. Peter Anvin <hpa@...>, Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>
Date: Monday, April 28, 2008 - 4:55 pm

Sure, so long as you take "as cheap as possible" to mean cheap in both
implementation complexity as well as runtime cost.

I don't have any specific objections to any of the stuff that Mathieu is
working on, but it does worry me that each time a problem is addressed
it ends up being an even more subtle piece of code. I just haven't seen
enough concrete justification to make me feel comfortable with it all.

It seems to me that a relatively simple implementation which allows the
desired tracing/marking functionality is the first step. If that proves
to cause a significant performance deficit then enabled then we can work
out how to address it in due course. But doing it all at once before
merging anything seems like overkill, particularly when we're talking
about specifics of gcc's codegen patterns, disassembling code fragments,
etc.

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>
Date: Monday, April 28, 2008 - 5:01 pm

I really feel that the latest information that has come up has indicated
that things are really not what they should be. They are in line, have
a substantial probe cost, and we're messing around with how to jump
around them.

That's not the problem.

I maintain what I said before: a call instruction (which defaults to a
NOP), and then extract the state based on debugging info or assembler
annotations.

As far as patchable static jumps, I can see the utility of them, but I
don't think this project is one of them. However, I believe the right
way to do them is via compiler support.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>
Date: Monday, April 28, 2008 - 6:42 pm

Do you consider all unlikely blocks to be in line ? If the real issue is
to make sure they don't share cache lines with the body of the function,
that could be arranged. However, I assume that using an unlikely branch
to let gcc with -freorder-blocks put the instructions at the end of the

When disabled : 0 cycles ? It additionnally clobbers eax and the EFLAGS.
For the parameters passed to the marker, I think the marker location
should be chosen carefully so most of the variables would be live anyway

I was perfectly happy with the immediate value + conditional branch, but

Let's consider this option :

First of all, I wouldn't like to require tracing users to get the
kernel debuginfos each time they want to trace. I think it should be a
the "on" switch kind of infrastructure. Getting a few hundreds MB worth
of data isn't exactly that.

If I get your idea right, you propose to use an inline assembly with "g"
constraints to make sure gcc lets them alive. I just did some testing of
your approach applied to a marker in schedule() that shows that as soon
as you need to dereference a pointer in the parameters, this adds
operations in the fast path, which is not the case for markers because,
as Ingo explained, this is done in a block outside the fast path.

So your assembly constraint solution works fine only if the information
happens to be there, in a register, at the inline assembly site. Then
there is no added cost for register preparation. However, given it won't
always be true, you have to bear the cost of setting up the registers
from the stack or, worse, from a pointer read in the function fast path.

The markers offloads this to the jump target located outside of the fast
path. Therefore, in the general case which includes parameters not

If you suppose the information is always live in registers at the
instrumented site, then yes, I guess your constraint+call approach is
good, modulo the fact that users will depend on hundreds of megabytes of
debuginfo.

However, in ...

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 6:38 pm

You get zero instructions and five bytes of NOP in the non-taken case.

"Rather more flexible?" Surely you're joking, Mr. Feynman? There is no
difference, none, nada.

Furthermore, your capture stub compiler, or trace data extractor, can do
any kind of mapping it pleases; so I'm utterly confused what you're
talking about "still tied to variable names."

-hpa
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 3:19 pm

I wanted to point out that this, in particular, is utter nonsense.
Consider a sequence that looks something like this:

if (foo ? bar : imv_cond(var)) {
blah();
}

An entirely sane transformation of this (as far as gcc is concerned), is
something like:

cmpl $0,foo
je 1f
cmpl $0,bar
jmp 2f
1:
#APP
movb var,%al /* This is your imv */
#NO_APP
testb %al,%al
2:
je 3f
call blah
3:

Your code would take the movb-testb-je sequence and combine them, then
we jump into the middle of the new instruction when jumping at 2!

There are only two ways to deal with this - extensive analysis of the
entire flow of control, or telling the compiler exactly what is
*actually* going on. The latter is the preferred way, obviously.

-hpa

--

To: H. Peter Anvin <hpa@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 4:04 pm

I am glad you come up with a counter argument. Let's look at what would
happen here with my modified code :

cmpl $0,foo
je 1f
cmpl $0,bar
jmp 2f
1:
#APP
mov %esi, %esi /* nop 2 bytes */
#NO_APP
mov %esi, %esi /* nop 2 bytes */
2:
jmp 3f /* 2 bytes short jump */
call blah
3:

First of all, I do not "combine" the instructions.. that would be really
dangerous (and bug-prone, since any interrupt could iret to an invalid
instruction). No, all I do is to swap instructions for other
instructions of the same size (or smaller in the case of jne 6 bytes ->
nop1 + jmp 5 bytes).

I see the problem you show here : it's dangerous to change an
instruction generated by gcc because it can be re-used for other
purposes, as in your example.

Then, what I propose is the following : instead of modifying the
conditional branch instruction, I prefix my inline assembly with a 5
bytes jump. I can then have the exact same behavior as the original
conditional branch; I either jump at the address following the
conditional branch or at the target address. I would still have to check
for ZF and %al liveliness as I proposed earlier, because I would skip

Yes, in an ideal world, gcc would help here.

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 4:09 pm

gcc is a free software project, and the gcc maintainers are around and
can be approached. A good proposal will go a long way, and patches will
go even longer.

-hpa
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, <pageexec@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 2:47 pm

I don't think so. You're making way too many assumptions about the code
generated by gcc.

This kind of stuff absolutely can be done, *BUT* it requires the
cooperation of the compiler. The right way to do this is to negotiate a
set of appropriate builtins with the gcc people, and use them. This
means this optimization will only work when compiled with the new gcc,
so there is a substantial lag, but it's the only sane way to do this
kind of stuff.

-hpa
--

To: Linus Torvalds <torvalds@...>
Cc: Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Friday, April 25, 2008 - 11:32 am

the one below even builds and boots.

this assumes that all modules areas are allocated via PAGE_KERNEL_EXEC -
but that is generally true on x86 due to NX. 32-bit uses vmalloc_exec(),
64-bit uses __vmalloc_area(..., PAGE_KERNEL_EXEC).

Jiri ... if you have any desire/stamina to still test this code - does
the patch below produce any warnings if you unapply your fix as well,
during suspend/resume?

Ingo

--------------->
Subject: x86: harden kernel code patching
From: Ingo Molnar <mingo@elte.hu>
Date: Fri Apr 25 17:07:03 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/kernel/alternative.c | 5 +++++
mm/vmalloc.c | 3 +++
2 files changed, 8 insertions(+)

Index: linux/arch/x86/kernel/alternative.c
===================================================================
--- linux.orig/arch/x86/kernel/alternative.c
+++ linux/arch/x86/kernel/alternative.c
@@ -518,6 +518,11 @@ void *__kprobes text_poke(void *addr, co
if (core_kernel_text((unsigned long)addr)) {
struct page *pages[2] = { virt_to_page(addr),
virt_to_page(addr + PAGE_SIZE) };
+ /*
+ * Module text pages are PageReserved:
+ */
+ WARN_ON(pages[0] && !PageReserved(pages[0]));
+ WARN_ON(pages[1] && !PageReserved(pages[1]));
if (!pages[1])
nr_pages = 1;
vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
Index: linux/mm/vmalloc.c
===================================================================
--- linux.orig/mm/vmalloc.c
+++ linux/mm/vmalloc.c
@@ -391,6 +391,7 @@ static void __vunmap(const void *addr, i
struct page *page = area->pages[i];

BUG_ON(!page);
+ ClearPageReserved(page);
__free_page(page);
}

@@ -507,6 +508,8 @@ static void *__vmalloc_area_node(struct
area->nr_pages = i;
goto fail;
}
+ if (pgprot_val(prot) == pgprot_val(PAGE_KERNEL_EXEC))
+ SetPageReserved(page);
area->pages[i] = page;
}

--

To: Linus Torvalds <torvalds@...>
Cc: Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Andi Kleen <andi@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>
Date: Friday, April 25, 2008 - 11:17 am

> And we should really add a lot more sanity checking there.

A debug mode for virt_to_page(),__pa,__va et.al. would probably make sense
and would have caught it.

I used to have that partly in the x86-64 port with VIRTUAL_BUG_ON.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Linus Torvalds <torvalds@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>
Date: Friday, April 25, 2008 - 3:36 pm

Good idea! Do you have a patch?

--

To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>
Date: Saturday, April 26, 2008 - 5:59 am

Yes. Appended. But it just enables the old NUMA VIRTUAL_BUG_ON()s, more
work could be done e.g. by instrumenting pa/va and the non NUMA and i386
case too.

-Andi

---

Add CONFIG option to enable VIRTUAL_BUG_ON()

VIRTUAL_BUG_ON was used in the early days of x86-64 NUMA to debug the
virtual address to struct page code.

Later it was noped, but the call kept intact.

Add a CONFIG option to enable it as a BUG_ON again. This would have
likely caught the recent text_poke bug.

Signed-off-by: Andi Kleen <andi@firstfloor.org>

Index: linux/arch/x86/Kconfig.debug
===================================================================
--- linux.orig/arch/x86/Kconfig.debug
+++ linux/arch/x86/Kconfig.debug
@@ -245,4 +245,11 @@ config CPA_DEBUG
help
Do change_page_attr() self-tests every 30 seconds.

+config DEBUG_VIRTUAL
+ bool "Virtual memory translation debugging"
+ depends on DEBUG_KERNEL && NUMA && X86_64
+ help
+ Enable some costly sanity checks in the NUMA virtual to page
+ code. This can catch mistakes with virt_to_page() and friends.
+
endmenu
Index: linux/include/asm-x86/mmzone_64.h
===================================================================
--- linux.orig/include/asm-x86/mmzone_64.h
+++ linux/include/asm-x86/mmzone_64.h
@@ -7,7 +7,11 @@

#ifdef CONFIG_NUMA

+#ifdef CONFIG_DEBUG_VIRTUAL
+#define VIRTUAL_BUG_ON(x) BUG_ON(x)
+#else
#define VIRTUAL_BUG_ON(x)
+#endif

#include <asm/smp.h>

--

To: Andi Kleen <andi@...>
Cc: Linus Torvalds <torvalds@...>, Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>
Date: Monday, April 28, 2008 - 4:24 pm

Hmmmm.. No hooks yet? I have some pieces here that do something similar:

Subject: Add checks for virtual addresses

Add checks to insure that virtual addresses are not used in invalid contexts.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
arch/x86/mm/ioremap.c | 1 +
include/asm-x86/page_32.h | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6.25-mm1/arch/x86/mm/ioremap.c
===================================================================
--- linux-2.6.25-mm1.orig/arch/x86/mm/ioremap.c 2008-04-25 23:17:31.872390404 -0700
+++ linux-2.6.25-mm1/arch/x86/mm/ioremap.c 2008-04-25 23:37:43.202391820 -0700
@@ -25,6 +25,7 @@

unsigned long __phys_addr(unsigned long x)
{
+ VM_BUG_ON(is_vmalloc_addr((void *)x));
if (x >= __START_KERNEL_map)
return x - __START_KERNEL_map + phys_base;
return x - PAGE_OFFSET;
Index: linux-2.6.25-mm1/include/asm-x86/page_32.h
===================================================================
--- linux-2.6.25-mm1.orig/include/asm-x86/page_32.h 2008-04-25 23:17:31.882389317 -0700
+++ linux-2.6.25-mm1/include/asm-x86/page_32.h 2008-04-25 23:37:43.202391820 -0700
@@ -64,7 +64,12 @@ typedef struct page *pgtable_t;
#endif

#ifndef __ASSEMBLY__
-#define __phys_addr(x) ((x) - PAGE_OFFSET)
+static inline unsigned long __phys_addr(unsigned long x)
+{
+ VM_BUG_ON(is_vmalloc_addr((void *)x));
+ return x - PAGE_OFFSET;
+}
+
#define __phys_reloc_hide(x) RELOC_HIDE((x), 0)

#ifdef CONFIG_FLATMEM
--

To: Christoph Lameter <clameter@...>
Cc: <linux-mm@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, <pageexec@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <herbert@...>, <penberg@...>, <akpm@...>, <linux-ext4@...>, <paulmck@...>, <rjw@...>, <zdenek.kabelac@...>, David Miller <davem@...>, Linus Torvalds <torvalds@...>, <linux-kernel@...>, Jiri Slaby <jirislaby@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>
Date: Thursday, May 1, 2008 - 3:22 pm

> typedef

To: Jiri Slaby <jirislaby@...>
Cc: <linux-mm@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, <pageexec@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <herbert@...>, <penberg@...>, <akpm@...>, <linux-ext4@...>, <paulmck@...>, <rjw@...>, <zdenek.kabelac@...>, David Miller <davem@...>, Linus Torvalds <torvalds@...>, <linux-kernel@...>, Andi Kleen <andi@...>
Date: Thursday, May 1, 2008 - 4:18 pm

The 64 bit piece works fine here and I used it for debugging the vmalloc

Great! Someone else picks this up. You can probably do a more thorough

Hmmm.. We could use include/linux/bounds.h to make
VMALLOC_START/VMALLOC_END (or whatever you need for checking the memory
boundaries) a cpp constant which may allow the use in page_32.h without
circular dependencies.

--

To: Christoph Lameter <clameter@...>
Cc: <linux-mm@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, <pageexec@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <herbert@...>, <penberg@...>, <akpm@...>, <linux-ext4@...>, <paulmck@...>, <linux-kernel@...>, Andi Kleen <andi@...>
Date: Tuesday, May 13, 2008 - 10:38 am

Hrm, not that easy. I ended up in splitting fixmap_32.h (VMALLOC constants
depends on it on 32-bit), moving around constants from over all the tree
(NR_CPUS, FIX_ACPI_PAGES...) to not include files which would create loops,
but still not having e.g. PMD_MASK available on all configurations. I think
it's not worth it. Objections to merging the patch as was
(http://lkml.org/lkml/2008/5/1/300)?

Thanks.
--

To: Christoph Lameter <clameter@...>
Cc: <linux-mm@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, <pageexec@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <herbert@...>, <penberg@...>, <akpm@...>, <linux-ext4@...>, <paulmck@...>, <rjw@...>, <zdenek.kabelac@...>, David Miller <davem@...>, Linus Torvalds <torvalds@...>, <linux-kernel@...>, Andi Kleen <andi@...>
Date: Tuesday, May 6, 2008 - 5:54 pm

I like the idea, I'll get back with a patch in few days (sorry, too busy).
Anyway bounds.h should be include/asm/ thing though.
--

To: Jiri Slaby <jirislaby@...>
Cc: <linux-mm@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, <pageexec@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <herbert@...>, <penberg@...>, <akpm@...>, <linux-ext4@...>, <paulmck@...>, <rjw@...>, <zdenek.kabelac@...>, David Miller <davem@...>, Linus Torvalds <torvalds@...>, <linux-kernel@...>, Andi Kleen <andi@...>
Date: Wednesday, May 7, 2008 - 1:30 pm

For arch specific stuff use asm-offsets.h. It would have to be included in
page_xx.h.

--

To: Andi Kleen <andi@...>
Cc: Christoph Lameter <clameter@...>, Linus Torvalds <torvalds@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>
Date: Saturday, April 26, 2008 - 7:16 am

--

To: Jiri Slaby <jirislaby@...>
Cc: Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Linus Torvalds <torvalds@...>, David Miller <davem@...>, <zdenek.kabelac@...>, <rjw@...>, <paulmck@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <pageexec@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>
Date: Saturday, April 26, 2008 - 7:34 am

> Is anybody working on that? I would volunteer to do it.

Feel free to take it.

-Andi
--

To: Linus Torvalds <torvalds@...>
Cc: Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Pekka Enberg <penberg@...>, Christoph Lameter <clameter@...>
Date: Thursday, April 24, 2008 - 8:57 pm

Blah, sorry, I lived in a theory, that Rafael got one from 2.6.25 and no,

Not really. I have no idea what triggers it. Seems like suspend is some kind of
catalyzer not working every time. I can't get it to crash for 1 whole day.
Probably with the program (I hacked it a hour ago or so) it would be easier to
reveal the error far before the crash itself. Will keep you informed. Now going
to catch some sleep.
--

To: Jiri Slaby <jirislaby@...>
Cc: Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Pekka Enberg <penberg@...>, Christoph Lameter <clameter@...>
Date: Thursday, April 24, 2008 - 7:45 pm

No problem. I do think this is a post-2.6.25 thing, because I don't think
we've ever seen it before that.

There is a 2.6.25-rc8-mm2 report, but that was a -mm tree that had a lot
of the stuff that was merged after 2.6.25, so I'm pretty sure that counts

I don't think suspend/resume is sufficient, because I've tried to
reproduce it here (and I tried your test program too) on my macmini, and
it's not happening. So there almost certainly something else too required
to trigger it.

Btw, how do you suspend/resume? That matters, because I've been testing
just the normal

echo mem > /sys/power/state

and with a kernel where everything is compiled-in. But if you use the GUI
suspend, on a common distro, I think that one ends up doing a whole lot
more, including doing things like unloading and reloading modules, and for
all we know the problem is not about suspend itself, but about the things
going on around it.

And it might be a module unload issue, rather than the suspend itself (the
same way I theorized that it might be a ifconfig down/up rather than the
suspend code itself). Who knows..

It might also very well be hardware-specific. You guys seem to have
different wireless setups (ath5k vs b43) but it migth be generic 80211
code, but it might also be something *totally* unrelated, and the only
reason wireless has shown up might be that networking is just in use when
the problem happens.

Jiri, Zdenek, Rafael, could you try to compare hardware with each other
and see if there is some pattern there?

(And btw, the program you used that allocates a hundred meg and tries to
find it - I'm assuming you're not paging or anythign like that, ie you're
not even close to out-of-memory. If that isn't correct, holler. I'm trying
to reproduce this thing).

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Pekka Enberg <penberg@...>, Christoph Lameter <clameter@...>
Date: Friday, April 25, 2008 - 3:36 am

pm-suspend without suspend package -- i.e. it writes mem > state, but does some
processing before and after that. However no module loads or removes.

Particualry I have
hibernate|suspend)
service autofs stop >/dev/null
service vmware stop >/dev/null
;;
thaw|resume)
service autofs start >/dev/null
;;

While vmware is not running, autofs is.

The rest of scripts is from
http://download.opensuse.org/distribution/SL-OSS-factory/inst-source/sus...

[I see now that suse added autofs stopping to their scripts too.]
Not using networkmanager.
Nothing in any pm confs, no VIDEO s3 quirks, no unload modules.
No bluetooth, no pcmcia, no batteries, no cpufreq, no backlight. -- It's desktop.
/proc/acpi/fan/*/state doesn't exist

The probably only done handling is hwclock.
lrwxrwxrwx 1 root root 0 Apr 25 02:44 /sys/class/rtc/rtc0/device/driver ->

00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
(rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82G33/G31 Express
Integrated Graphics Controller (rev 02)
00:03.0 Communication controller: Intel Corporation 82G33/G31/P35/P31 Express
MEI Controller (rev 02)
00:03.1 Communication controller: Intel Corporation 82G33/G31/P35/P31 Express
MEI Controller (rev 02)
00:03.2 IDE interface: Intel Corporation 82G33/G31/P35/P31 Express PT IDER
Controller (rev 02)
00:03.3 Serial controller: Intel Corporation 82G33/G31/P35/P31 Express Serial KT
Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network
Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI
Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI
Controller #5 (rev 02)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI
Controller #6 (rev 02)
00...

To: Jiri Slaby <jirislaby@...>
Cc: Linus Torvalds <torvalds@...>, Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Pekka Enberg <penberg@...>, Christoph Lameter <clameter@...>
Date: Friday, April 25, 2008 - 11:30 am

Well, my machine is based on Athlon 64 X2 with an ATI chipset.

The only two common things it has with your machine is probably that we both use
64-bit kernels and wireless adapters (different ones, for that matter).

I do use NetworkManager, BTW.

Well, one thing that suspend does and which is not done routinely is CPU
hotplugging. Could you please check if you are able to provoke the symptoms
to appear by offlining-onlining CPU1?

Thanks,
Rafael
--

To: Jiri Slaby <jirislaby@...>
Cc: Linus Torvalds <torvalds@...>, Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Pekka Enberg <penberg@...>, Christoph Lameter <clameter@...>
Date: Friday, April 25, 2008 - 10:09 am

One useful trick is trying to boot with init=/bin/bash, and see if the
problem persists. If it is gone, it is likely one of the modules, and
those may be binary-searched -- which is often easier than bisect.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

To: <jirislaby@...>
Cc: <torvalds@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>
Date: Thursday, April 24, 2008 - 9:35 pm

From: Jiri Slaby <jirislaby@gmail.com>

0xf0... Is this a 4-cpu machine?

I doubt it, because this is on a laptop as far as I can tell,
but I thought I'd ask. :-)

So the clue is setting some byte at ((offset % 8) == 0) into a
structure with 0xf0...

--

To: David Miller <davem@...>
Cc: <jirislaby@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>
Date: Thursday, April 24, 2008 - 9:48 pm

It's not always at (offset % 8) == 0. We've seen that 0xf0 pattern in
other oopses, but it's not always 8-byte aligned. In fact, when we've seen
it in oopses, it has generally been in the higher bytes (eg offset 5
within a 64-bit word, causing an invalid pointer on x86-64).

But that 0xf0 definitely has shown up before. It's not the *only*
corruption, but it's definitely a very interesting pattern. And the other
ones that didn't show the 0xf0 pattern could obviously be due to pointers
that were corrupted by 0xf0 in low bytes, so it _may_ be the source of the
other corruptions too that didn't have an obvious 0xf0 directly in them.

Linus
--

To: <torvalds@...>
Cc: <jirislaby@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>
Date: Thursday, April 24, 2008 - 9:57 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

Ok.

Do we know of any pattern of the wireless device type in use?
If there is a pattern to that, it would be a huge clue.

And if it is predominantly one particular wireless device type, we
should be able to come up with a patch to test.
--

To: David Miller <davem@...>
Cc: <torvalds@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, Johannes Berg <johannes@...>, Michael Wu <flamingice@...>, Jiri Benc <jbenc@...>
Date: Friday, April 25, 2008 - 3:41 am

Added 3 80211 experts.

Johannes, Michael, Jiri? Someone writes to freed memory patterns 0xf0 (not
aligned to anything, addressed per byte), one of suspects is mac80211, don't you
know that pattern from anywhere?

Thanks, Jiri.
--

To: <jirislaby@...>
Cc: <torvalds@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <johannes@...>, <flamingice@...>, <jbenc@...>
Date: Friday, April 25, 2008 - 3:45 am

From: Jiri Slaby <jirislaby@gmail.com>

I notice Jiri, in your hardware list, you have an ath5k Atheros AR5212 chip
in there.

I took a look at the resume code for ath5k but nothing really suspicious
there except:

err = pci_enable_device(pdev);
if (err)
return err;

pci_restore_state(pdev);

Shouldn't we restore state before we turn the chip back on and thus
potentially let it start DMA'ing all over the place?
--

To: David Miller <davem@...>
Cc: <linux-pci@...>, <linux-kernel@...>, Jesse Barnes <jbarnes@...>
Date: Friday, April 25, 2008 - 4:18 am

Hmm, actually every second wireless driver do this :/. I think it's wrong too.
Jesse?

BTW pci_set_power_state(pdev, PCI_D0); in resume isn't needed at all, right?
It's done in pci_enable_device, isn't it?
--

To: <linux-pci@...>
Cc: Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <linux-kernel@...>
Date: Friday, April 25, 2008 - 1:11 pm

It might be a little safer to enable the device after restoring its state, but
if your device starts DMAing randomly when going from D3->D0 it's probably

Right, pci_enable_device will put the device in D0, so setting it again is
redundant (looks like lots of drivers do this).

Jesse
--

To: David Miller <davem@...>
Cc: <torvalds@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <johannes@...>, <flamingice@...>, <jbenc@...>
Date: Friday, April 25, 2008 - 4:02 am

Hmm, I cut&pasted that code from somewhere, it seems to be broken. Anyway it
worked for a half a year or so. Mending locally. Thanks.
--

To: David Miller <davem@...>
Cc: <jirislaby@...>, <torvalds@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>, <johannes@...>, <flamingice@...>, <jbenc@...>
Date: Friday, April 25, 2008 - 6:53 am

On Fri, Apr 25, 2008 at 12:45:23AM -0700, David Miller wrote:

I'm not sure how much code is shared between AR5212 and AR2413 but I
saw kmalloc poison issues about a month back with a fedora rawhide
kernel on a machine with a AR2413 chipset ... I filed the bug here:

http://madwifi.org/ticket/1856

I don't use that machine much so I haven't tried other kernels but
perhaps the above info helps in some way.

bye,

--Craig
--

To: David Miller <davem@...>
Cc: <torvalds@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>
Date: Friday, April 25, 2008 - 3:42 am

Well, it's not :), it's a desktop.

Jiri
--

To: <jirislaby@...>
Cc: <torvalds@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>
Date: Friday, April 25, 2008 - 3:49 am

From: Jiri Slaby <jirislaby@gmail.com>

Two hyperthreads per-core?

If so that could match up to the pattern. It is just one theory
though. The wireless possibility holds just as much weight.
--

To: David Miller <davem@...>
Cc: <torvalds@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>
Date: Friday, April 25, 2008 - 3:56 am

Hmm, how to find out? I suppose it will show up 4 (virtual) processors in
cpuinfo, right? Although there is ht bit in cpuinfo on each core and

The mac80211 is theory too so far :).

--

To: <jirislaby@...>
Cc: <torvalds@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>
Date: Friday, April 25, 2008 - 3:58 am

From: Jiri Slaby <jirislaby@gmail.com>

I'll try to do some commit mining of my own.

Thanks for all of the info so far.
--

To: David Miller <davem@...>
Cc: <torvalds@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>
Date: Friday, April 25, 2008 - 4:00 am

BTW Doesn't exist any tool to compare diffs? Particularly
2.6.28-rc8-mm1..2.6.28-rc8-mm2 with 2.6.25..2.6.25-git2... I would just give it
a try.
--

To: Jiri Slaby <jirislaby@...>
Cc: David Miller <davem@...>, <torvalds@...>, <zdenek.kabelac@...>, <mingo@...>, <rjw@...>, <paulmck@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, <penberg@...>, <clameter@...>
Date: Friday, April 25, 2008 - 11:47 am

'interdiff', part of the patchutils package:
http://cyberelk.net/tim/software/patchutils/

---
~Randy
--

To: Linus Torvalds <torvalds@...>
Cc: Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Christoph Lameter <clameter@...>
Date: Wednesday, April 23, 2008 - 2:52 pm

Okay, this doesn't make sense to me. The code does:

u8 *start;
u8 *fault;

/* ... */

start = page_address(page);

/* ... */

fault = check_bytes(start + length, POISON_INUSE, remainder);
if (!fault)
return 1;
while (end > fault && end[-1] == POISON_INUSE)
end--;

slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1);

So how come we're printing out 'fault' as zero and 'end' at 4 GB? Christoph?

Zdenek, can you please send the full dmesg?

Pekka
--

To: Pekka Enberg <penberg@...>
Cc: Linus Torvalds <torvalds@...>, Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>
Date: Wednesday, April 23, 2008 - 3:05 pm

fault == NULL if the check was successful. Otherwise it contains the first

We should have returned from the function and not printed this message. If
we somehow skipped the test for !fault then end could have wrapped around
which gets us to 4GB.

--

To: Christoph Lameter <clameter@...>
Cc: Linus Torvalds <torvalds@...>, Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>
Date: Wednesday, April 23, 2008 - 3:19 pm

Aah, looks like it's just a silly bug in slab_fix(). If this looks ok to
Christoph, can you re-test with this patch applied Zdenek? That way we'll
actually know where SLUB expected to see POISON_INUSE.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---

diff --git a/mm/slub.c b/mm/slub.c
index 7f8aaa2..dac50e3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -456,6 +456,15 @@ static void print_page_info(struct page *page)

}

+static void __slab_bug(struct kmem_cache *s, char *buf)
+{
+ printk(KERN_ERR "========================================"
+ "=====================================\n");
+ printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
+ printk(KERN_ERR "----------------------------------------"
+ "-------------------------------------\n\n");
+}
+
static void slab_bug(struct kmem_cache *s, char *fmt, ...)
{
va_list args;
@@ -464,11 +473,7 @@ static void slab_bug(struct kmem_cache *s, char *fmt, ...)
va_start(args, fmt);
vsnprintf(buf, sizeof(buf), fmt, args);
va_end(args);
- printk(KERN_ERR "========================================"
- "=====================================\n");
- printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
- printk(KERN_ERR "----------------------------------------"
- "-------------------------------------\n\n");
+ __slab_bug(s, buf);
}

static void slab_fix(struct kmem_cache *s, char *fmt, ...)
@@ -533,7 +538,7 @@ static void slab_err(struct kmem_cache *s, struct page *page, char *fmt, ...)
va_start(args, fmt);
vsnprintf(buf, sizeof(buf), fmt, args);
va_end(args);
- slab_bug(s, fmt);
+ __slab_bug(s, buf);
print_page_info(page);
dump_stack();
}
--

To: Pekka J Enberg <penberg@...>
Cc: Linus Torvalds <torvalds@...>, Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>
Date: Wednesday, April 23, 2008 - 3:28 pm

Or simpler (catching yet another case):

Subject: slab_err: Pass parameters correctly to slab_bug

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
mm/slub.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-04-23 12:24:02.000000000 -0700
+++ linux-2.6/mm/slub.c 2008-04-23 12:27:03.000000000 -0700
@@ -521,7 +521,7 @@ static void print_trailer(struct kmem_ca
static void object_err(struct kmem_cache *s, struct page *page,
u8 *object, char *reason)
{
- slab_bug(s, reason);
+ slab_bug(s, "%s", reason);
print_trailer(s, page, object);
}

@@ -533,7 +533,7 @@ static void slab_err(struct kmem_cache *
va_start(args, fmt);
vsnprintf(buf, sizeof(buf), fmt, args);
va_end(args);
- slab_bug(s, fmt);
+ slab_bug(s, "%s", buf);
print_page_info(page);
dump_stack();
}
--

To: Pekka J Enberg <penberg@...>
Cc: Christoph Lameter <clameter@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>
Date: Wednesday, April 23, 2008 - 4:27 pm

Unfortunately it won't be easy to retest - I just know it happened to
me with some wi-fi networking interaction after resume. I'll rebuild
kernel with these slab patches - but I have now idea how to trigger
the bug.

In the attachment is bzip-ed dmesg in case it would be still needed
for something.

Zdenek

To: Linus Torvalds <torvalds@...>
Cc: Zdenek Kabelac <zdenek.kabelac@...>, Jiri Slaby <jirislaby@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, <herbert@...>, Pekka Enberg <penberg@...>, Christoph Lameter <clameter@...>
Date: Wednesday, April 23, 2008 - 1:40 pm

hm, irqs already disabled isnt bad in itself and it happens all the
time. The irq enabling in unlock_ipi_call_lock() should be OK.

Any race with irqs there should at most result in a hung or crashed
bootup, not in any memory corruption i believe.

Ingo
--

To: Linus Torvalds <torvalds@...>
Cc: Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, <paulmck@...>, David Miller <davem@...>, <linux-kernel@...>, <akpm@...>, <herbert@...>, Alan Stern <stern@...>, pm list <linux-pm@...>
Date: Tuesday, April 22, 2008 - 4:34 pm

More or less as stated in the changelog. If we register a child of a sleeping
device, the child ends up on dpm_active before the parent, so the ordering will
be wrong during the next suspend.

There is a bug in device_add() that IMO can be fixed this way:

Index: linux-2.6/drivers/base/core.c
===================================================================
--- linux-2.6.orig/drivers/base/core.c
+++ linux-2.6/drivers/base/core.c
@@ -820,11 +820,11 @@ int device_add(struct device *dev)
error = bus_add_device(dev);
if (error)
goto BusError;
+ bus_attach_device(dev);
error = device_pm_add(dev);
if (error)
goto PMError;
kobject_uevent(&dev->kobj, KOBJ_ADD);
- bus_attach_device(dev);
if (parent)
klist_add_tail(&dev->knode_parent, &parent->klist_children);

The problem is that bus_remove_device() assumes bus_attach_device() to have

Are drivers supposed to register children of suspended devices? That doesn't
make much sense IMO ...

Thanks,
Rafael

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
drivers/base/power/main.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -76,12 +76,10 @@ int device_pm_add(struct device *dev)
else
dev_warn(dev, "devices are sleeping, will not add\n");
WARN_ON(true);
- error = -EBUSY;
- } else {
- error = dpm_sysfs_add(dev);
- if (!error)
- list_add_tail(&dev->power.entry, &dpm_active);
}
+ error = dpm_sysfs_add(dev);
+ if (!error)
+ list_add_tail(&dev->power.entry, &dpm_active);
mutex_unlock(&dpm_list_mtx);
return error;
}
--

To: Rafael J. Wysocki <rjw@...>, Greg KH <greg@...>
Cc: Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <herbert@...>, Alan Stern <stern@...>, pm list <linux-pm@...>
Date: Tuesday, April 22, 2008 - 4:58 pm

Well, that's why I think the warning itself makes sense - and then we can
decide whether it makes sense for that particular case or not. Clearly it
happens (since it triggered), now we need to figure out _why_ it happened.

But I don't think debugging messages should change behaviour.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Greg KH <greg@...>, Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <herbert@...>, Alan Stern <stern@...>, pm list <linux-pm@...>
Date: Tuesday, April 22, 2008 - 8:50 pm

[Sorry for resending, but it looks like this message didn't reach the lists.]

Okay, below is a more complete patch that changes device_pm_add() so that
it doesn't refuse to add children of suspended devices.  Note, however, that
even if such a registration succeeds, it will probably lead to some future
problems.

Thanks,
Rafael

---
Do not refuse to actually register children of suspended devices,
but still warn about attempts to do that.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
drivers/base/power/main.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)

Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -62,7 +62,7 @@ static bool all_sleeping;
*/
int device_pm_add(struct device *dev)
{
- int error = 0;
+ int error;

pr_debug("PM: Adding info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus",
@@ -70,18 +70,15 @@ int device_pm_add(struct device *dev)
mutex_lock(&dpm_list_mtx);
if ((dev->parent && dev->parent->power.sleeping) || all_sleeping) {
if (dev->parent->power.sleeping)
- dev_warn(dev,
- "parent %s is sleeping, will not add\n",
+ dev_warn(dev, "parent %s is sleeping\n",
dev->parent->bus_id);
else
- dev_warn(dev, "devices are sleeping, will not add\n");
+ dev_warn(dev, "all devices are sleeping\n");
WARN_ON(true);
- error = -EBUSY;
- } else {
- error = dpm_sysfs_add(dev);
- if (!error)
- list_add_tail(&dev->power.entry, &dpm_active);
}
+ error = dpm_sysfs_add(dev);
+ if (!error)
+ list_add_tail(&dev->power.entry, &dpm_active);
mutex_unlock(&dpm_list_mtx);
return error;
}
--

To: Rafael J. Wysocki <rjw@...>
Cc: Linus Torvalds <torvalds@...>, Greg KH <greg@...>, Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <herbert@...>, pm list <linux-pm@...>
Date: Wednesday, April 23, 2008 - 10:56 am

I think the reason it happened is clear enough. Call it a race if you
want, but the window is so large that it hardly qualifies.

It's a result of the way the MMC core is written. There's an
upper-level controller device, and below that is a host device, and
below that is the card itself. The code that adds and removes children
of the host device runs as part of the controller driver.

Hence the problem: The driver adds children below the _host_ as soon as
the _controller_ is resumed, even though the host is still suspended.
It's not as big an error as it sounds -- the host was originally a
class_device and then got converted over to a regular device. It
doesn't have a driver of its own.

This is one of the things that needs to be fixed up as part of the
reworking of the system-sleep API. I simply haven't had any time to
work on it (and I'm not likely to in the near future).

You ought to be able to provoke this more or less at will, depending on
the order in which your PCI devices are probed, by inserting or
removing an MMC card during the disk spin-up interval while the system
is waking up.

Alan Stern

--

To: Linus Torvalds <torvalds@...>
Cc: Greg KH <greg@...>, Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <herbert@...>, Alan Stern <stern@...>, pm list <linux-pm@...>
Date: Tuesday, April 22, 2008 - 6:48 pm

Well, this particular case looks like a race to me, but I'm waiting for the

Okay, below is a more complete patch that changes device_pm_add() so that
it doesn't refuse to add children of suspended devices. Note, however, that
even if such a registration succeeds, it will probably lead to some future
problems.

Thanks,
Rafael

---
Do not refuse to actually register children of suspended devices,
but still warn about attempts to do that.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
drivers/base/power/main.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)

Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -62,7 +62,7 @@ static bool all_sleeping;
*/
int device_pm_add(struct device *dev)
{
- int error = 0;
+ int error;

pr_debug("PM: Adding info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus",
@@ -70,18 +70,15 @@ int device_pm_add(struct device *dev)
mutex_lock(&dpm_list_mtx);
if ((dev->parent && dev->parent->power.sleeping) || all_sleeping) {
if (dev->parent->power.sleeping)
- dev_warn(dev,
- "parent %s is sleeping, will not add\n",
+ dev_warn(dev, "parent %s is sleeping\n",
dev->parent->bus_id);
else
- dev_warn(dev, "devices are sleeping, will not add\n");
+ dev_warn(dev, "all devices are sleeping\n");
WARN_ON(true);
- error = -EBUSY;
- } else {
- error = dpm_sysfs_add(dev);
- if (!error)
- list_add_tail(&dev->power.entry, &dpm_active);
}
+ error = dpm_sysfs_add(dev);
+ if (!error)
+ list_add_tail(&dev->power.entry, &dpm_active);
mutex_unlock(&dpm_list_mtx);
return error;
}

--

To: Linus Torvalds <torvalds@...>
Cc: Rafael J. Wysocki <rjw@...>, Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, <paulmck@...>, David Miller <davem@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, <herbert@...>, Alan Stern <stern@...>, pm list <linux-pm@...>
Date: Tuesday, April 22, 2008 - 6:12 pm

No, the other patch from Rafael looks "more correct", I'd prefer that
one to go in. I'm about to be away from email for probably 24 hours or
so, so if you could commit it, I'd appreciate it.

thanks,

greg k-h
--

To: Linus Torvalds <torvalds@...>
Cc: Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, <paulmck@...>, David Miller <davem@...>, <linux-kernel@...>, <akpm@...>, <herbert@...>, Alan Stern <stern@...>, pm list <linux-pm@...>, Greg KH <greg@...>
Date: Tuesday, April 22, 2008 - 4:57 pm

Hm, actually it's better to do this instead IMHO:

---
Prevent bus_remove_device() from crashing if dev->knode_bus has not been
initialized before it's called.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
drivers/base/bus.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/base/bus.c
===================================================================
--- linux-2.6.orig/drivers/base/bus.c
+++ linux-2.6/drivers/base/bus.c
@@ -530,7 +530,8 @@ void bus_remove_device(struct device *de
sysfs_remove_link(&dev->bus->p->devices_kset->kobj,
dev->bus_id);
device_remove_attrs(dev->bus, dev);
- klist_del(&dev->knode_bus);
+ if (klist_node_attached(&dev->knode_bus))
+ klist_del(&dev->knode_bus);

pr_debug("bus: '%s': remove device %s\n",
dev->bus->name, dev->bus_id);
--

To: Rafael J. Wysocki <rjw@...>
Cc: Linus Torvalds <torvalds@...>, Zdenek Kabelac <zdenek.kabelac@...>, Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, <paulmck@...>, David Miller <davem@...>, <linux-kernel@...>, <akpm@...>, <herbert@...>, Alan Stern <stern@...>, pm list <linux-pm@...>
Date: Tuesday, April 22, 2008 - 6:11 pm

This fix looks like the correct one to me, but then again, I'm jet
lagged, and don't know all of the context here. Moving where we call
bus_attach_device() does not seem correct, as you are changing the logic
of when we export the kobject, which might not be a good idea.

--

To: Zdenek Kabelac <zdenek.kabelac@...>
Cc: Ingo Molnar <mingo@...>, Jiri Slaby <jirislaby@...>, Linus Torvalds <torvalds@...>, <paulmck@...>, David Miller <davem@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>
Date: Tuesday, April 22, 2008 - 5:46 pm

Zdenek, can you please send me the full dmesg containing this?

Thanks,
Rafael
--

To: Jiri Slaby <jirislaby@...>
Cc: Linus Torvalds <torvalds@...>, Rafael J. Wysocki <rjw@...>, <paulmck@...>, David Miller <davem@...>, <linux-kernel@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, Zdenek Kabelac <zdenek.kabelac@...>, H. Peter Anvin <hpa@...>
Date: Tuesday, April 22, 2008 - 3:09 pm

note that full PAT (where in essence Linux takes over control of the
cache attributes via PTEs, instead of relying on the BIOS initialized
MTRRs alone) you should only get with -mm or with x86.git applied.

I.e. x86 PAT might explain any -mm issue but not the upstream -git
issue.

In upstream -git we dont have the second wave of the PAT changes applied
yet (the /dev/mem bits) so CONFIG_X86_PAT is not yet activated. (it's
only safe to enable if we have all the changes together and perfectly
control all cache attributes in the system)

i.e. PAT complications here would not happen in form of real cache
attribute conflicts [i.e. the lockups and corruptions cannot be due to
that] - but as side-effects to other code it changes.

and most of the PAT failures we ever saw had different patterns anyway:
the leading failure was API rejections and hence non-working Xorg or
non-working ioremap() in certain drivers. The worst-case scenario, early
in the PAT code's cycle, was a spontaneous triple fault - months ago.

the basis for the PAT changes was the hardening of the CPA code and its
general use for everything (such as DEBUG_PAGEALLOC). And much of that
happened and was finished in v2.6.25. Nothing conceptually new really
happened there - and even where we touched the code in .26 it happened
long ago and would have surfaced by now.

... but ... nothing can be excluded.

Ingo
--

To: Linus Torvalds <torvalds@...>
Cc: Jiri Slaby <jirislaby@...>, <paulmck@...>, David Miller <davem@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Monday, April 21, 2008 - 9:30 pm

Well, I thought about that too. However, I had a hang before 2.6.25-git2 that
I suspect was related (I couldn't get any information from the box, as it just

I'm not sure.

Thanks,
Rafael
--

To: <paulmck@...>
Cc: Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <torvalds@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Monday, April 21, 2008 - 9:15 pm

My kernel is only voluntarily preemptible (ie. CONFIG_PREEMPT_VOLUNTARY=y).

It is an SMP one, however.

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Jiri Slaby <jirislaby@...>, David Miller <davem@...>, <torvalds@...>, <linux-kernel@...>, <mingo@...>, <akpm@...>, <linux-ext4@...>, <herbert@...>, Zdenek Kabelac <zdenek.kabelac@...>
Date: Monday, April 21, 2008 - 9:25 pm

Then this patch won't help you. :-/ I submitted separately anyway.

Thanx, Paul
--

To: Linus Torvalds <torvalds@...>
Cc: Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>, Paul E. McKenney <paulmck@...>
Date: Sunday, April 20, 2008 - 9:18 pm

Hi Linus:

Semantically there should be no difference between the two versions.
The purpose of rcu_dereference is really similar to smp_rmb, i.e.,
it adds a (conditional) read barrier between what has been read so
far (including its argument), and what will be read subsequently.

So if we expand out the current code it would look like

fetch (head)->next
store into pos
again:
smp_read_barrier_depends()
prefetch(pos->next)
pos != (head)

...loop body...

fetch pos->next
store into pos
goto again

Yours looks like

fetch (head)->next
smp_read_barrier_depends()
store into pos
again:
prefetch(pos->next)
pos != (head)

...loop body...

fetch pos->next
smp_read_barrier_depends()
store into pos
goto again

As the objective here is to insert a barrier before dereferencing
pos (e.g., reading pos->next or using it in the loop body), these
two should be identical.

But I do concede that your version looks clearer, and has the
benefit that should prefetch ever be optimised out with no side-
effects, yours would still be correct while the current one will
lose the barrier completely.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

To: Herbert Xu <herbert@...>
Cc: Linus Torvalds <torvalds@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>
Date: Sunday, April 20, 2008 - 10:08 pm

Agreed as well -- compilers would also be within their right to bypass
the rcu_dereference() around the test/prefetch, which would allow
them to refetch. For example, with __list_for_each_rcu(), the original
implementation allows the compiler to treat a use of "pos" within the body
of the loop as if it was a use of (head)->next, refetching if convenient.
Not so good.

So good catch, Linus!!!

Could we also eliminate the (both unused in 2.6.25 and useless as
well) list_for_each_safe_rcu()? After all, if you use list_del_rcu()
and call_rcu(), all the RCU list-traversal primitives are "safe" in
this sense. Patch attached (testing in progress), based on Linus's
earlier patch.

Signed_off_by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

list.h | 47 +++++++++++++++--------------------------------
1 file changed, 15 insertions(+), 32 deletions(-)

diff -urpNa linux-2.6.25/include/linux/list.h linux-2.6.25-rcu-list/include/linux/list.h
--- linux-2.6.25/include/linux/list.h 2008-04-16 19:49:44.000000000 -0700
+++ linux-2.6.25-rcu-list/include/linux/list.h 2008-04-20 18:44:55.000000000 -0700
@@ -631,31 +631,14 @@ static inline void list_splice_init_rcu(
* as long as the traversal is guarded by rcu_read_lock().
*/
#define list_for_each_rcu(pos, head) \
- for (pos = (head)->next; \
- prefetch(rcu_dereference(pos)->next), pos != (head); \
- pos = pos->next)
+ for (pos = rcu_dereference((head)->next); \
+ prefetch(pos->next), pos != (head); \
+ pos = rcu_dereference(pos->next))

#define __list_for_each_rcu(pos, head) \
- for (pos = (head)->next; \
- rcu_dereference(pos) != (head); \
- pos = pos->next)
-
-/**
- * list_for_each_safe_rcu
- * @pos: the &struct list_head to use as a loop cursor.
- * @n: another &struct list_head to use as temporary storage
- * @head: the head for your list.
- *
- * Iterate over an rcu-protected list, safe against removal of list entry.
- *
- * This list-traversa...

To: Paul E. McKenney <paulmck@...>
Cc: Herbert Xu <herbert@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 11:49 am

That is *not* the main problem.

If you use "rcu_dereference()" on the wrong access, it not only loses the
"smp_read_barrier_depends()" (which is a no-op on all sane architectures
anyway), but it loses the ACCESS_ONCE() thing *entirely*.

Accessign a local automatic variable through a volatile pointer has
absolutely no effect - it's a total no-op apart from possibly generating
slightly worse code (although if I were a compiler, I'd just ignore it),
since the compiler is totally free to spill and reload the local variable
to its memory location - the stack - anyway!

So the important part (for sane architectures) of rcu_dereference() is
that ACCESS_ONCE() hack, and it _only_ works if you actually do it on the
value as it gets loaded from the RCU-protected data structure, not later.

So forget about the prefetch, and forget about the barrier. They had
nothing to do with the bug. The bug existed even without the prefetch,
even in the versions that didn't have it at all. For example, look at the
"__list_for_each_rcu()" thing - the bug is there too, because it did just

pos = (head)->next ... pos = pos->next

where both of those assignments to pos were done without rcu_derefence, so
the compiler could happily decide to use the value once, forget it, and
then re-load it later (when it might have changed).

In other words, the thing I objected to was something much more
fundamental than any barriers. It was the fact that "rcu_dereference()"
simply *fundamentally* doesn't make sense when done on a local variable,
it can only make sense when actually loading the value from the data
structure.

In short:

pos = ..

rcu_dereference(pos)

is crazy and senseless, but

pos = rcu_dereference(pos->next)

actually has some logical meaning.

Now, all this said, I seriously doubt this was the source of the bug
itself. I do not actually really believe that the compiler had much room
for reloading things with or without any rcu_dereference(), and I d...

To: Linus Torvalds <torvalds@...>
Cc: Paul E. McKenney <paulmck@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 9:03 pm

Actually rcu_dereference didn't have ACCESS_ONCE when I did this.
That only appearaed later with the preemptible RCU work.

The original purpose of rcu_dereference was exactly to replace the
explicit barriers that people were using for RCU, nothing more,
nothing less.

Oh and I totally agree that the compiler is going to generate insane
code whenever ACCESS_ONCE is used. In this case we may have avoided
it by rearranging the code, but in general the introduction of ACCESS_ONCE
in rcu_dereference is likely to have a negative impact on the code
generated.

Remember that "volatile" discussion? I think this is where it all came
from.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

To: Herbert Xu <herbert@...>
Cc: Linus Torvalds <torvalds@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>
Date: Tuesday, April 22, 2008 - 9:36 am

Yep, ACCESS_ONCE() is quite recent -- within the last year. So I should

And I still have the bug in to gcc:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33102

Interesting, currently in status "unconfirmed"... I guess I should
supply a test case.

Thanx, Paul
--

To: Linus Torvalds <torvalds@...>
Cc: Herbert Xu <herbert@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 1:05 pm

Yep, "compilers would also be within their right to bypass the

Agreed. The only reasons I can think of for doing rcu_dereference()
on a local variable are as follows:

1. The local variable is passed into a called function that is
also invoked on shared storage. In this case, the use of
rcu_dereference() on a local variable is the cost of common
code.

2. The address of the local variable is published globally
so that other CPUs can access it under RCU protection. Yes,
this is generally insane -- the last time I did this sort of
thing was in the early 1980s on a PDP-11, where it was necessary
due to that machine's 64K address space. (No, I didn't use
RCU on this UP machine, but I did publish locals -- malloc()

Agreed -- you would have to have an uncommonly aggressive compiler to get
this to happen. Seems like it would be worth trying the patch, though.

I did take a quick look for improperly freeing dentries -- unhashed
dentries are freed directly, so if there is a code path that somehow
unhashes dentries and then d_free()s them without a grace period, we
have a problem.

Hmmmm... This could happen if someone called the final dput() on
a dentry that was hashed. If this can really happen, the crude and

Rafael, does the following (crude, untested, probably does not even
compile) patch help?

Thanx, Paul

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

dcache.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff -urpNa -X dontdiff linux-2.6.25/fs/dcache.c linux-2.6.25-d_free/fs/dcache.c
--- linux-2.6.25/fs/dcache.c 2008-04-16 19:49:44.000000000 -0700
+++ linux-2.6.25-d_free/fs/dcache.c 2008-04-21 09:57:53.000000000 -0700
@@ -88,11 +88,7 @@ static void d_free(struct dentry *dentry
{
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
- /* if dentry was never inserted into hash, immediate free is OK */
- if (hlist_unhashed(&dentry->d_hash))
...

To: Paul E. McKenney <paulmck@...>
Cc: Herbert Xu <herbert@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 1:30 pm

No, not even then.

We *always* unhash the dentries before freeing them, but we very
consciously use "hlist_del_rcu()" on them, not "hlist_del_init()".

That, in turn, will mean that the "pprev" pointer will still be set, so
the "hlist_unhashed()" thing will *not* trigger.

IOW, when we do that direct-free with:

if (hlist_unhashed(&dentry->d_hash))
__d_free(dentry);

the "hlist_unhashed()" will literally guarantee that i has *never* been on
a hash-list at all!

(If you want to test whether it is currently unhashed or not, you actually
have to use "d_unhashed()" on the dentry under the dentry lock, which
tests the DCACHE_UNHASHED bit).

Of course, there could be some bug in there, but the thing is, none of
this has even changed in a long time, certainly not since 2.6.25. Which is
why I think the dcache code is all fine, and the bug comes from somewhere
else corrupting the data structures.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Herbert Xu <herbert@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 1:43 pm

Got it, hlist_del_rcu() sets ->pprev to LIST_POISON2, which is non-NULL,

And as it looks like you guessed, I was misreading the hlist_unhashed()
above as d_unhashed(). :-/

--

To: Herbert Xu <herbert@...>
Cc: Linus Torvalds <torvalds@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 12:59 am

And here is an update with one bug fixed -- testing continues.
This is an update of http://lkml.org/lkml/2008/4/20/217, deleting
list_for_each_safe_rcu() and fixing hlist_for_each_entry_rcu().
Testing continues...

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

list.h | 48 +++++++++++++++---------------------------------
1 file changed, 15 insertions(+), 33 deletions(-)

diff -urpNa -X dontdiff linux-2.6.25/include/linux/list.h linux-2.6.25-rcu-list/include/linux/list.h
--- linux-2.6.25/include/linux/list.h 2008-04-16 19:49:44.000000000 -0700
+++ linux-2.6.25-rcu-list/include/linux/list.h 2008-04-20 21:48:29.000000000 -0700
@@ -631,31 +631,14 @@ static inline void list_splice_init_rcu(
* as long as the traversal is guarded by rcu_read_lock().
*/
#define list_for_each_rcu(pos, head) \
- for (pos = (head)->next; \
- prefetch(rcu_dereference(pos)->next), pos != (head); \
- pos = pos->next)
+ for (pos = rcu_dereference((head)->next); \
+ prefetch(pos->next), pos != (head); \
+ pos = rcu_dereference(pos->next))

#define __list_for_each_rcu(pos, head) \
- for (pos = (head)->next; \
- rcu_dereference(pos) != (head); \
- pos = pos->next)
-
-/**
- * list_for_each_safe_rcu
- * @pos: the &struct list_head to use as a loop cursor.
- * @n: another &struct list_head to use as temporary storage
- * @head: the head for your list.
- *
- * Iterate over an rcu-protected list, safe against removal of list entry.
- *
- * This list-traversal primitive may safely run concurrently with
- * the _rcu list-mutation primitives such as list_add_rcu()
- * as long as the traversal is guarded by rcu_read_lock().
- */
-#define list_for_each_safe_rcu(pos, n, head) \
- for (pos = (head)->next; \
- n = rcu_dereference(pos)->next, pos != (head); \
- pos = n)
+ for (pos = rcu_dereference((head)->next); \
+ pos != (head); \
+ pos = rcu_dereference(pos->next))

/**
* list_for_each...

To: Herbert Xu <herbert@...>
Cc: Linus Torvalds <torvalds@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 1:47 am

And it passes.

--

To: Paul E. McKenney <paulmck@...>
Cc: Herbert Xu <herbert@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 12:06 pm

Ok, I applied it, with hopefully an understandable commit message.

That said, now we just need to figure out what actually caused the bug in
question.

Rafael: if it's a too-early free of the dentry (which could be because
somebody didn't do a proper rcu read-lock, or maybe the rcu grace period
logic itself got broken?), then enabling SLUB/SLAB debugging should catch
it much more quickly (and hopefully we'd see the signature of a
use-after-free - the poisoning byte pattern rather than the -1).

The other alternative is simply memory corruption. Ie the -1 may well be
somebody *else* overwritin the ->next pointer because they did a
use-after-free and maybe the dentry_cache is shared with some other
allocation of the same size (SLUB does that, no?)

Rafael: your last oops does seem to imply that there is some strange
memory corruption going on, because in that case the invalid pointer is
different: instead of being all-ones, it is "fff0810023444c98", which is
not a possible pointer. It very much looks like a single nybble got
cleared (because ffff810023444c98 _would_ be a valid pointer, notice the
"fff0" vs "ffff" prefix).

So I do suspect it's *some* kind of use-after-free thing. But nothing in
fs/ has changed, so it's not a dentry bug, I think. Which is why my
"preferred" suspect is that "somebody else also does allocations of the
same size as the dentry code, and shares the same SLUB alloc space, and
does something bad".

So Rafael - are you using SLUB, and if you are, can you enable SLUB_DEBUG,
and then use the "slub_debug" kernel command line to enable it?

Linus

--

To: Linus Torvalds <torvalds@...>
Cc: Paul E. McKenney <paulmck@...>, Herbert Xu <herbert@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 12:24 pm

Sure, I have SLUB_DEBUG on already, rebooting with "slub_debug".

Thanks,
Rafael
--

To: Paul E. McKenney <paulmck@...>
Cc: Herbert Xu <herbert@...>, Linus Torvalds <torvalds@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 9:00 am

i have queued up your patch in its form below. (but Linus might beat me
at applying it)

Ingo

-------------------->
Subject: RCU, list.h: fix list iterators
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Date: Sun, 20 Apr 2008 21:59:13 -0700

RCU list iterators: should prefetch ever be optimised out with no
side-effects, the current version will lose the barrier completely.

Pointed-out-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/list.h | 48 +++++++++++++++---------------------------------
1 file changed, 15 insertions(+), 33 deletions(-)

Index: linux/include/linux/list.h
===================================================================
--- linux.orig/include/linux/list.h
+++ linux/include/linux/list.h
@@ -631,31 +631,14 @@ static inline void list_splice_init_rcu(
* as long as the traversal is guarded by rcu_read_lock().
*/
#define list_for_each_rcu(pos, head) \
- for (pos = (head)->next; \
- prefetch(rcu_dereference(pos)->next), pos != (head); \
- pos = pos->next)
+ for (pos = rcu_dereference((head)->next); \
+ prefetch(pos->next), pos != (head); \
+ pos = rcu_dereference(pos->next))

#define __list_for_each_rcu(pos, head) \
- for (pos = (head)->next; \
- rcu_dereference(pos) != (head); \
- pos = pos->next)
-
-/**
- * list_for_each_safe_rcu
- * @pos: the &struct list_head to use as a loop cursor.
- * @n: another &struct list_head to use as temporary storage
- * @head: the head for your list.
- *
- * Iterate over an rcu-protected list, safe against removal of list entry.
- *
- * This list-traversal primitive may safely run concurrently with
- * the _rcu list-mutation primitives such as list_add_rcu()
- * as long as the traversal is guarded by rcu_read_lock().
- */
-#define list_for_each_safe_rcu(pos, n, head) \
- for (po...

To: Rafael J. Wysocki <rjw@...>
Cc: LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 9:17 am

so you've got ext3. Nothing changed in the VFS or in ext3 in -git yet.

the instruction pattern:

Code: f6 43 04 10 75 06 f0 ff 03 48 89 d8 fe 43 08 eb 31 fe 43 08 48 8b
45 d0 48 8b 00 48 89 45 d0 48 8b 45 d0 48 85 c0 74 18 48 89 c2 <48> 8b
00 48 8d 5a e8 44 39 73 30 0f 18 08 75 d9 e9 6a ff ff ff
========

shows that you've got "prefetchnta (%esi)" indirect:

0f 18 00 prefetcht0 (%eax)

so the prefetch instructions are patched in, neither the compiler nor
the CPU should ignore them.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 9:35 am

Well, I don't really know what that means ...

Besides, that's 64-bit code, but I guess that doesn't matter here.

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <linux-ext4@...>
Date: Monday, April 21, 2008 - 2:56 pm

correct, for 64-bit code that's prefetcht0 (%rax) - a non-destructive
'prefetch stuff from there into the cache' x86 instruction. So real
prefetches are done so i'd exclude any true SMP related barrier race.
(not that it's likely on x86 hardware anyway - memory barriers usually
only matter on Alpha and similar weakly-ordered architectures.)

Ingo
--

To: LKML <linux-kernel@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <linux-ext4@...>
Date: Sunday, April 20, 2008 - 3:14 pm

That is line 1250 in fs/dcache.c, btw:

(gdb) l *__d_lookup+0xf1
0xffffffff802a7b3c is in __d_lookup (/home/rafael/src/linux-2.6/fs/dcache.c:1250).
1245 struct hlist_node *node;
1246 struct dentry *dentry;
1247
1248 rcu_read_lock();
1249
1250 hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
1251 struct qstr *qstr;
1252
1253 if (dentry->d_name.hash != hash)
1254 continue;

--

Previous thread: [2.6.25 REGRESSION] Lenovo 3000 N100 does not wake up from ACPI S3 state by Thomas Bächler on Saturday, April 19, 2008 - 8:49 am. (1 message)

Next thread: [GIT PULL] sh updates for 2.6.26-rc1 by Paul Mundt on Saturday, April 19, 2008 - 9:25 am. (1 message)