[PATCH] [29/58] x86_64: fiuxp pt_reqs leftovers

Previous thread: [PATCH][RFC] Remove R/W semaphore content from generic semaphore.h headers. by Robert P. J. Day on Thursday, July 19, 2007 - 2:37 am. (3 messages)

Next thread: [ANNOUNCE] RSBAC 1.3.5 released by Amon Ott on Thursday, July 19, 2007 - 2:37 am. (1 message)
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

- Some more improvements for AMD family 10
- Some help for gcc 4.3
- Out of line string functions for i386 (saves >20k text) 
- x86-64 vDSO
- improved fake numa node code from David Rientjes
- various machine checking handling improvements from Tim H.
- timer cleanups and fixes from Thomas Gleixner
- various other cleanup and fixes

Please review. I plan to send them off relatively quickly
because I'm very late with this merge.

-Andi
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

Fix a bug introduced with the CLFLUSH changes: we must always flush pages
changed in cpa(), not just when they are reverted.

Reenable CLFLUSH usage with that now (it was temporarily disabled
for .22) 

Add some BUG_ONs

Contains fixes from  Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/i386/mm/pageattr.c   |   20 +++++++++++++++++---
 arch/x86_64/mm/pageattr.c |   23 ++++++++++++++---------
 2 files changed, 31 insertions(+), 12 deletions(-)

Index: linux/arch/x86_64/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86_64/mm/pageattr.c
+++ linux/arch/x86_64/mm/pageattr.c
@@ -74,14 +74,12 @@ static void flush_kernel_map(void *arg)
 	struct page *pg;
 
 	/* When clflush is available always use it because it is
-	   much cheaper than WBINVD. Disable clflush for now because
-	   the high level code is not ready yet */
-	if (1 || !cpu_has_clflush)
+	   much cheaper than WBINVD. */
+	if (!cpu_has_clflush)
 		asm volatile("wbinvd" ::: "memory");
 	else list_for_each_entry(pg, l, lru) {
 		void *adr = page_address(pg);
-		if (cpu_has_clflush)
-			cache_flush_page(adr);
+		cache_flush_page(adr);
 	}
 	__flush_tlb_all();
 }
@@ -95,7 +93,8 @@ static LIST_HEAD(deferred_pages); /* pro
 
 static inline void save_page(struct page *fpage)
 {
-	list_add(&fpage->lru, &deferred_pages);
+	if (!test_and_set_bit(PG_arch_1, &fpage->flags))
+		list_add(&fpage->lru, &deferred_pages);
 }
 
 /* 
@@ -129,9 +128,12 @@ __change_page_attr(unsigned long address
 	pte_t *kpte; 
 	struct page *kpte_page;
 	pgprot_t ref_prot2;
+
 	kpte = lookup_address(address);
 	if (!kpte) return 0;
 	kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
+	BUG_ON(PageLRU(kpte_page));
+	BUG_ON(PageCompound(kpte_page));
 	if (pgprot_val(prot) != pgprot_val(ref_prot)) { 
 		if (!pte_huge(*kpte)) {
 			set_pte(kpte, pfn_pte(pfn, prot));
@@ -159,10 +161,9 @@ __change_page_attr(unsigned ...
From: Jan Beulich
Date: Monday, August 6, 2007 - 3:15 am

But that is still wrong - you're again flushing the page table page rather than
the data one. Fixing this was the purpose of the patch I had sent, plus the

Fix a bug introduced with the CLFLUSH changes: we must always flush pages
changed in cpa(), not just when they are reverted.

Reenable CLFLUSH usage with that now (it was temporarily disabled
for .22) 

Add some BUG_ONs

Contains fixes from  Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/i386/mm/pageattr.c   |   20 +++++++++++++++++---
 arch/x86_64/mm/pageattr.c |   23 ++++++++++++++---------
 2 files changed, 31 insertions(+), 12 deletions(-)

Index: linux/arch/x86_64/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86_64/mm/pageattr.c
+++ linux/arch/x86_64/mm/pageattr.c
@@ -74,14 +74,12 @@ static void flush_kernel_map(void *arg)
 	struct page *pg;
 
 	/* When clflush is available always use it because it is
-	   much cheaper than WBINVD. Disable clflush for now because
-	   the high level code is not ready yet */
-	if (1 || !cpu_has_clflush)
+	   much cheaper than WBINVD. */
+	if (!cpu_has_clflush)
 		asm volatile("wbinvd" ::: "memory");
 	else list_for_each_entry(pg, l, lru) {
 		void *adr = page_address(pg);
-		if (cpu_has_clflush)
-			cache_flush_page(adr);
+		cache_flush_page(adr);
 	}
 	__flush_tlb_all();
 }
@@ -95,7 +93,8 @@ static LIST_HEAD(deferred_pages); /* pro
 
 static inline void save_page(struct page *fpage)
 {
-	list_add(&fpage->lru, &deferred_pages);
+	if (!test_and_set_bit(PG_arch_1, &fpage->flags))
+		list_add(&fpage->lru, &deferred_pages);
 }
 
 /* 
@@ -129,9 +128,12 @@ __change_page_attr(unsigned long address
 	pte_t *kpte; 
 	struct page *kpte_page;
 	pgprot_t ref_prot2;
+
 	kpte = lookup_address(address);
 	if (!kpte) return 0;
 	kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
+	BUG_ON(PageLRU(kpte_page));
+	BUG_ON(PageCompound(kpte_page));
 	if ...
From: Andi Kleen
Date: Monday, August 6, 2007 - 3:36 am

True. The problem is that we can't necessarily use the LRU list_head of the data
pages though; e.g. when the page is mapped to user space.

I guess we might need to go back to wbinvd again.

What was the remaining problem of the reference counting? 

-Andi

-

From: Jan Beulich
Date: Monday, August 6, 2007 - 3:49 am

That was what my patch did, plus an attempt to avoid the wbinvd if all accumulated

The counter gets adjusted regardless of the current attribute in effect, e.g. if you
change a page to PAGE_KERNEL that already happens to be PAGE_KERNEL, the
counter still gets decremented, which in turn may result in reverting the containing
2M/4M page prematurely. Likewise if a not-PAGE_KERNEL gets changed to another
non-PAGE_KERNEL attribute, the counter would get incremented, likely preventing
reverting the large page forever.

Jan

-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

Don't need 16 byte alignment because kernel doesn't use SSE2

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86_64/Makefile |    1 +
 1 file changed, 1 insertion(+)

Index: linux/arch/x86_64/Makefile
===================================================================
--- linux.orig/arch/x86_64/Makefile
+++ linux/arch/x86_64/Makefile
@@ -55,6 +55,7 @@ cflags-y += $(call cc-option,-mno-sse -m
 # this works around some issues with generating unwind tables in older gccs
 # newer gccs do it by default
 cflags-y += -maccumulate-outgoing-args
+cflags-y += -mpreferred-stack-boundary=4
 
 # do binutils support CFI?
 cflags-y += $(call as-instr,.cfi_startproc\n.cfi_endproc,-DCONFIG_AS_CFI=1,)
-

From: Chuck Ebbert
Date: Thursday, July 19, 2007 - 7:42 am

Should be:

-

From: Serge Belyshev
Date: Thursday, July 19, 2007 - 4:50 am

-mpreferred-stack-boundary=num
           Attempt to keep the stack boundary aligned to a 2 raised to num
           byte boundary.  If -mpreferred-stack-boundary is not specified, the
           default is 4 (16 bytes or 128 bits), except when optimizing for
           code size (-Os), in which case the default is the minimum correct
           alignment (4 bytes for x86, and 8 bytes for x86-64).

So -mpreferred-stack-boundary=4 is the default and to align stack
to 8 bytes you want -mpreferred-stack-boundary=3, not 4, IIUC.
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 5:06 am

Fixed thanks

-Andi


-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

From: Jean Delvare <khali@linux-fr.org>
On x86_64, <asm/ptrace.h> uses __user but doesn't include
<linux/compiler.h>. This could lead to build failures.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/asm-x86_64/ptrace.h |    1 +
 1 file changed, 1 insertion(+)

Index: linux/include/asm-x86_64/ptrace.h
===================================================================
--- linux.orig/include/asm-x86_64/ptrace.h
+++ linux/include/asm-x86_64/ptrace.h
@@ -1,6 +1,7 @@
 #ifndef _X86_64_PTRACE_H
 #define _X86_64_PTRACE_H
 
+#include <linux/compiler.h>	/* For __user */
 #include <asm/ptrace-abi.h>
 
 #ifndef __ASSEMBLY__
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

Linux 64bit only uses the IO-APIC ID as an internal cookie. In the future
there could be some cases where the IO-APIC IDs are not unique because
they share an 8 bit space with CPUs and if there are enough CPUs 
it is difficult to get them that. But Linux needs the io apic ID
internally for its data structures. Assign unique IO APIC ids on 
table parsing.

TBD do for 32bit too

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86_64/kernel/mpparse.c |   20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

Index: linux/arch/x86_64/kernel/mpparse.c
===================================================================
--- linux.orig/arch/x86_64/kernel/mpparse.c
+++ linux/arch/x86_64/kernel/mpparse.c
@@ -649,6 +649,20 @@ static int mp_find_ioapic(int gsi)
 	return -1;
 }
 
+static u8 uniq_ioapic_id(u8 id)
+{
+	int i;
+	DECLARE_BITMAP(used, 256);
+	bitmap_zero(used, 256);
+	for (i = 0; i < nr_ioapics; i++) {
+		struct mpc_config_ioapic *ia = &mp_ioapics[i];
+		__set_bit(ia->mpc_apicid, used);
+	}
+	if (!test_bit(id, used))
+		return id;
+	return find_first_zero_bit(used, 256);
+}
+
 void __init mp_register_ioapic(u8 id, u32 address, u32 gsi_base)
 {
 	int idx = 0;
@@ -656,14 +670,14 @@ void __init mp_register_ioapic(u8 id, u3
 	if (bad_ioapic(address))
 		return;
 
-	idx = nr_ioapics++;
+	idx = nr_ioapics;
 
 	mp_ioapics[idx].mpc_type = MP_IOAPIC;
 	mp_ioapics[idx].mpc_flags = MPC_APIC_USABLE;
 	mp_ioapics[idx].mpc_apicaddr = address;
 
 	set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
-	mp_ioapics[idx].mpc_apicid = id;
+	mp_ioapics[idx].mpc_apicid = uniq_ioapic_id(id);
 	mp_ioapics[idx].mpc_apicver = 0;
 	
 	/* 
@@ -680,6 +694,8 @@ void __init mp_register_ioapic(u8 id, u3
 		mp_ioapics[idx].mpc_apicaddr,
 		mp_ioapic_routing[idx].gsi_start,
 		mp_ioapic_routing[idx].gsi_end);
+
+	nr_ioapics++;
 }
 
 void __init
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

Otherwise smp_affinity would only update after the next interrupt
on x86 systems.

Cc: tglx@linutronix.de
Cc: mingo@elte.hu

Signed-off-by: Andi Kleen <ak@suse.de>

---
 kernel/irq/proc.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: linux/kernel/irq/proc.c
===================================================================
--- linux.orig/kernel/irq/proc.c
+++ linux/kernel/irq/proc.c
@@ -19,7 +19,14 @@ static struct proc_dir_entry *root_irq_d
 static int irq_affinity_read_proc(char *page, char **start, off_t off,
 				  int count, int *eof, void *data)
 {
-	int len = cpumask_scnprintf(page, count, irq_desc[(long)data].affinity);
+	struct irq_desc *desc = irq_desc + (long)data;
+	cpumask_t *mask = &desc->affinity;
+	int len;
+#ifdef CONFIG_GENERIC_PENDING_IRQ
+	if (desc->status & IRQ_MOVE_PENDING)
+		mask = &desc->pending_mask;
+#endif
+	len = cpumask_scnprintf(page, count, *mask);
 
 	if (count - len < 2)
 		return -EINVAL;
-

From: Ingo Molnar
Date: Thursday, July 19, 2007 - 3:23 am

small style nit: please put a newline after the type definitions.

	Ingo
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

From: David Rientjes <rientjes@google.com>

Use LOCAL_DISTANCE and  REMOTE_DISTANCE in x86_64 ACPI code

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86_64/mm/srat.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux/arch/x86_64/mm/srat.c
===================================================================
--- linux.orig/arch/x86_64/mm/srat.c
+++ linux/arch/x86_64/mm/srat.c
@@ -106,9 +106,9 @@ static __init int slit_valid(struct acpi
 		for (j = 0; j < d; j++)  {
 			u8 val = slit->entry[d*i + j];
 			if (i == j) {
-				if (val != 10)
+				if (val != LOCAL_DISTANCE)
 					return 0;
-			} else if (val <= 10)
+			} else if (val <= LOCAL_DISTANCE)
 				return 0;
 		}
 	}
@@ -464,7 +464,7 @@ int __node_distance(int a, int b)
 	int index;
 
 	if (!acpi_slit)
-		return a == b ? 10 : 20;
+		return a == b ? LOCAL_DISTANCE : REMOTE_DISTANCE;
 	index = acpi_slit->locality_count * node_to_pxm(a);
 	return acpi_slit->entry[index + node_to_pxm(b)];
 }
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

From: David Rientjes <rientjes@google.com>
In acpi_scan_nodes(), we immediately return -1 if acpi_numa <= 0, meaning
we haven't detected any underlying ACPI topology or we have explicitly
disabled its use from the command-line with numa=noacpi.

acpi_table_print_srat_entry() and acpi_table_parse_srat() are only
referenced within drivers/acpi/numa.c, so we can mark them as static and
remove their prototypes from the header file.

Likewise, pxm_to_node_map[] and node_to_pxm_map[] are only used within
drivers/acpi/numa.c, so we mark them as static and remove their externs
from the header file.

The automatic 'result' variable is unused in acpi_numa_init(), so it's
removed.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86_64/mm/srat.c |    6 +++---
 drivers/acpi/numa.c   |   20 ++++++++++----------
 include/linux/acpi.h  |    2 --
 3 files changed, 13 insertions(+), 15 deletions(-)

Index: linux/arch/x86_64/mm/srat.c
===================================================================
--- linux.orig/arch/x86_64/mm/srat.c
+++ linux/arch/x86_64/mm/srat.c
@@ -394,6 +394,9 @@ int __init acpi_scan_nodes(unsigned long
 {
 	int i;
 
+	if (acpi_numa <= 0)
+		return -1;
+
 	/* First clean up the node list */
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		cutoff_node(i, start, end);
@@ -403,9 +406,6 @@ int __init acpi_scan_nodes(unsigned long
 		}
 	}
 
-	if (acpi_numa <= 0)
-		return -1;
-
 	if (!nodes_cover_memory()) {
 		bad_srat();
 		return -1;
Index: linux/drivers/acpi/numa.c
===================================================================
--- linux.orig/drivers/acpi/numa.c
+++ linux/drivers/acpi/numa.c
@@ -40,9 +40,9 @@ static nodemask_t nodes_found_map = NODE
 #define NID_INVAL	-1
 
 /* maps to convert between proximity domain and logical node ID */
-static int pxm_to_node_map[MAX_PXM_DOMAINS]
+static int __cpuinitdata pxm_to_node_map[MAX_PXM_DOMAINS]
 				= { [0 ... MAX_PXM_DOMAINS - 1] = NID_INVAL ...
From: Yinghai Lu
Date: Thursday, July 19, 2007 - 10:15 am

do we need to put __initdata just before =?

YH
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 10:21 am

AFAIK gcc __attribute__ syntax allows both. It certainly seems to compile.

-Andi
-

From: Yinghai Lu
Date: Thursday, July 19, 2007 - 10:38 am

in include/linux/init.h

it said

"
 * For initialized data:
 * You should insert __initdata between the variable name and equal
 * sign followed by value, e.g.:
 *
 * static int init_variable __initdata = 0;
 * static char linux_logo[] __initdata = { 0x32, 0x36, ... };
"

or we need to update these lines?

YH
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 1:00 pm

This might date back to old compilers that are now dropped (like 2.95) 

But recommending it this way is probably not bad, it's just not a catastrophe
to not follow the recommendation.

-Andi
-

From: David Rientjes
Date: Thursday, July 19, 2007 - 2:01 pm

You mangled the quoting of this patch: the deltas above are actually in 
drivers/acpi/numa.c and not arch/x86_64/mm/srat.c.

The placement of __cpuinitdata as shown above is permitted by gcc for a 
section attribute.  I've been corrected by akpm before when I've written 
function declarations such as "static __init int foo()" in preference of 
using the attribute syntax following all type qualifiers, but it is also 
proper syntax.  It's simply a matter of coding style, the semantics of the 
construct are identical.

		David
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86_64/kernel/setup.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux/arch/x86_64/kernel/setup.c
===================================================================
--- linux.orig/arch/x86_64/kernel/setup.c
+++ linux/arch/x86_64/kernel/setup.c
@@ -575,6 +575,8 @@ static void __cpuinit init_amd(struct cp
 	level = cpuid_eax(1);
 	if (c->x86 == 15 && ((level >= 0x0f48 && level < 0x0f50) || level >= 0x0f58))
 		set_bit(X86_FEATURE_REP_GOOD, &c->x86_capability);
+	if (c->x86 == 0x10)
+		set_bit(X86_FEATURE_REP_GOOD, &c->x86_capability);
 
 	/* Enable workaround for FXSAVE leak */
 	if (c->x86 >= 6)
-

From: Jan Engelhardt
Date: Thursday, July 19, 2007 - 9:43 am

What processors carry 0x10?



	Jan
-- 
-

From: Yinghai Lu
Date: Thursday, July 19, 2007 - 10:00 am

Quad core Opteron.

YH
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

Jan asked to always use the builtin memcpy on gcc 4.3 mainline because
it should generate better code than the old macro. Let's try it.

Cc: jh@suse.cz

Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/asm-x86_64/string.h |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux/include/asm-x86_64/string.h
===================================================================
--- linux.orig/include/asm-x86_64/string.h
+++ linux/include/asm-x86_64/string.h
@@ -29,6 +29,9 @@ return (to);
    function. */
 
 #define __HAVE_ARCH_MEMCPY 1
+#if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
+extern void *memcpy(void *to, const void *from, size_t len);
+#else
 extern void *__memcpy(void *to, const void *from, size_t len); 
 #define memcpy(dst,src,len) \
 	({ size_t __len = (len);				\
@@ -38,7 +41,7 @@ extern void *__memcpy(void *to, const vo
 	   else							\
 		 __ret = __builtin_memcpy((dst),(src),__len);	\
 	   __ret; }) 
-
+#endif
 
 #define __HAVE_ARCH_MEMSET
 void *memset(void *s, int c, size_t n);
-

From: Oleg Verych
Date: Saturday, July 21, 2007 - 4:16 pm

* From: Andi Kleen <ak@suse.de>

Unfortunately such info is hard to find. The discuss@x86-64 list is
empty. So, let me ask how this memcpy relates to recently submitted
for glibc one [0]?

[0] <http://permalink.gmane.org/gmane.comp.lib.glibc.alpha/12217>

Also you are enabling rep. string operations for 10h family. Yet manual
says, that while they were improved, there are still various other
preferred optimization cases

____
-

From: Andi Kleen
Date: Saturday, July 21, 2007 - 4:27 pm

It doesn't relate at all. The kernel still uses its own memcpy.

Note that a lot of the traditional memcpy optimizations (like WC copies) 
are pointless in kernel space because the kernel rarely deals with continuous 
memory areas larger than a 4K page.

The only difference from the patch is that instead of using an own
heuristic when to use an out of line memcpy trust gcc's heuristic.

-Andi

-

From: Denis Vlasenko
Date: Saturday, July 21, 2007 - 5:29 pm

Am I stupid or the files attached to that post demonstrate than "new"
code isn't much better and sometimes worse (aligned 4096 byte memcpy
went from 558 to 648 for Core 2)?

Beware that text files in test-memcpy.tar.bz2 seem to have
simple_memcpy / builtin_memcpy / memcpy columns swapped
(-old and -new files have them in different order).
--
vda
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

The compiler generally generates reasonable inline code for the simple
cases and for the rest it's better for code size for them to be out of line.
Also there they can be potentially optimized more in the future.

In fact they probably should be in a .S file because they're all pure
assembly, but that's for another day.

Also some code style cleanup on them while I was on it (this seems
to be the last untouched really early Linux code) 

This saves ~12k text for a defconfig kernel with gcc 4.1.


Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/i386/lib/Makefile    |    2 
 arch/i386/lib/string.c    |  233 ++++++++++++++++++++++++++++++++++++++++++++
 include/asm-i386/string.h |  243 ++--------------------------------------------
 3 files changed, 247 insertions(+), 231 deletions(-)

Index: linux/include/asm-i386/string.h
===================================================================
--- linux.orig/include/asm-i386/string.h
+++ linux/include/asm-i386/string.h
@@ -2,203 +2,35 @@
 #define _I386_STRING_H_
 
 #ifdef __KERNEL__
-/*
- * On a 486 or Pentium, we are better off not using the
- * byte string operations. But on a 386 or a PPro the
- * byte string ops are faster than doing it by hand
- * (MUCH faster on a Pentium).
- */
-
-/*
- * This string-include defines all string functions as inline
- * functions. Use gcc. It also assumes ds=es=data space, this should be
- * normal. Most of the string-functions are rather heavily hand-optimized,
- * see especially strsep,strstr,str[c]spn. They should work, but are not
- * very easy to understand. Everything is done entirely within the register
- * set, making the functions fast and clean. String instructions have been
- * used through-out, making for "slightly" unclear code :-)
- *
- *		NO Copyright (C) 1991, 1992 Linus Torvalds,
- *		consider these trivial functions to be PD.
- */
 
-/* AK: in fact I bet it would be better to move this stuff all out of line.
- */
+/* Let gcc decide wether to inline or use ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

gcc 4.3 supports a new __attribute__((__cold__)) to mark functions cold. Any 
path directly leading to a call of this function will be unlikely. And gcc
will try to generate smaller code for the function itself.

Please use with care. The code generation advantage isn't large and in most
cases it is not worth uglifying code with this.

This patch marks some common error functions like panic(), printk(), BUG()
as cold.  This will longer term make many unlikely()s unnecessary, although
we can keep them for now for older compilers.

Also all __init and __exit functions are marked cold. With a non -Os
build this will tell the compiler to generate slightly smaller code
for them. I think it currently only uses less alignments for labels,
but that might change in the future.

One disadvantage over *likely() is that they cannot be easily instrumented 
to verify them.

Another drawback is that only the latest gcc 4.3 snapshots support this. 
Unfortunately we cannot detect this using the preprocessor. This means older 
snapshots will fail now. I don't think that's a problem because they are 
unreleased compilers that nobody should be using.

gcc also has a __hot__ attribute, but I don't see any sense in using
this in the kernel right now. But someday I hope gcc will be able
to use more aggressive optimizing for hot functions even in -Os,
if that happens it should be added.

Includes compile fix from Thomas Gleixner.

TBD wait for COLD()

Cc: jh@suse.cz
Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/asm-generic/bug.h     |    1 +
 include/asm-i386/bug.h        |    4 ++++
 include/asm-x86_64/bug.h      |    8 ++++++--
 include/linux/compiler-gcc4.h |   23 +++++++++++++++++++++++
 include/linux/compiler.h      |   12 ++++++++++++
 include/linux/init.h          |    8 ++++----
 include/linux/kernel.h        |    8 ++++----
 7 files changed, 54 insertions(+), 10 deletions(-)

Index: ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

This implements new vDSO for x86-64.  The concept is similar
to the existing vDSOs on i386 and PPC.  x86-64 has had static
vsyscalls before,  but these are not flexible enough anymore.

A vDSO is a ELF shared library supplied by the kernel that is mapped into 
user address space.  The vDSO mapping is randomized for each process
for security reasons.

Doing this was needed for clock_gettime, because clock_gettime
always needs a syscall fallback and having one at a fixed
address would have made buffer overflow exploits too easy to write.

The vdso can be disabled with vdso=0

It currently includes a new gettimeofday implemention and optimized
clock_gettime(). The gettimeofday implementation is slightly faster
than the one in the old vsyscall.  clock_gettime is significantly faster 
than the syscall for CLOCK_MONOTONIC and CLOCK_REALTIME.

The new calls are generally faster than the old vsyscall. 

Advantages over the old x86-64 vsyscalls:
- Extensible
- Randomized
- Cleaner
- Easier to virtualize (the old static address range previously causes
overhead e.g. for Xen because it has to create special page tables for it) 

Weak points: 
- glibc support still to be written

The VM interface is partly based on Ingo Molnar's i386 version.

Includes compile fix from Joachim Deguara

Signed-off-by: Andi Kleen <ak@suse.de>

---
 Documentation/kernel-parameters.txt |    2 
 arch/x86_64/Makefile                |    3 
 arch/x86_64/ia32/ia32_binfmt.c      |    1 
 arch/x86_64/kernel/time.c           |    1 
 arch/x86_64/kernel/vmlinux.lds.S    |    9 ++
 arch/x86_64/kernel/vsyscall.c       |   22 +----
 arch/x86_64/mm/init.c               |    9 ++
 arch/x86_64/vdso/Makefile           |   49 ++++++++++++
 arch/x86_64/vdso/vclock_gettime.c   |  120 +++++++++++++++++++++++++++++++
 arch/x86_64/vdso/vdso-note.S        |   12 +++
 arch/x86_64/vdso/vdso-start.S       |    2 
 arch/x86_64/vdso/vdso.S             |    2 
 arch/x86_64/vdso/vdso.lds.S         |   77 ...
From: Daniel Walker
Date: Tuesday, August 21, 2007 - 9:25 am

Some thoughts ,

In the -mm kernel there is some debugging that gets injected into the
likely/unlikely macros .. If they get called from userspace it causes a
hang .. We might want to add some new set of macros to specifically
denote that they are called from userspace, not just likely/unlikely but
all the macros so we don't get mixed usage ..

Daniel

-

From: Andi Kleen
Date: Tuesday, August 21, 2007 - 11:45 am

They should likely define a __likely()/__unlikely() then that doesn't

and add a hunk to change the vDSO code. Note that i386 is not the 
only architecture that has such code.

-Andi
-

From: Andrew Morton
Date: Tuesday, August 21, 2007 - 11:40 am

Yes, the simplest fix would be to remove all the troublesome likely/unlikely
calls within that debug patch.  I'll take a look at that, see if it fixes
the compile.
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

Preparationary patch for the new sched/printk_clock()

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/i386/kernel/tsc.c   |    9 ++++-----
 arch/x86_64/kernel/tsc.c |    5 +----
 2 files changed, 5 insertions(+), 9 deletions(-)

Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -331,7 +331,7 @@ static struct dmi_system_id __initdata b
  */
 __cpuinit int unsynchronized_tsc(void)
 {
-	if (!cpu_has_tsc || tsc_unstable)
+	if (!cpu_has_tsc)
 		return 1;
 	/*
 	 * Intel systems are normally all synchronized.
@@ -340,9 +340,9 @@ __cpuinit int unsynchronized_tsc(void)
 	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) {
 		/* assume multi socket systems are not synchronized: */
 		if (num_possible_cpus() > 1)
-			tsc_unstable = 1;
+			return 1;
 	}
-	return tsc_unstable;
+	return 0;
 }
 
 /*
@@ -386,13 +386,12 @@ void __init tsc_init(void)
 	/* Check and install the TSC clocksource */
 	dmi_check_system(bad_tsc_dmi_table);
 
-	unsynchronized_tsc();
 	check_geode_tsc_reliable();
 	current_tsc_khz = tsc_khz;
 	clocksource_tsc.mult = clocksource_khz2mult(current_tsc_khz,
 							clocksource_tsc.shift);
 	/* lower the rating if we already know its unstable: */
-	if (check_tsc_unstable()) {
+	if (check_tsc_unstable() || unsynchronized_tsc()) {
 		clocksource_tsc.rating = 0;
 		clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
 	} else
Index: linux/arch/x86_64/kernel/tsc.c
===================================================================
--- linux.orig/arch/x86_64/kernel/tsc.c
+++ linux/arch/x86_64/kernel/tsc.c
@@ -144,9 +144,6 @@ static int tsc_unstable = 0;
  */
 __cpuinit int unsynchronized_tsc(void)
 {
-	if (tsc_unstable)
-		return 1;
-
 #ifdef CONFIG_SMP
 	if (apic_is_clustered_box())
 		return 1;
@@ -218,7 +215,7 @@ void __init init_tsc_clocksource(void)
 	if (!notsc) {
 		clocksource_tsc.mult = ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

Call a function on a target CPU but do the right thing when 
we're already on that CPU. That's the main difference from smp_call_function_single
which does the wrong thing in this case (erroring out)

Another advantage is that it is also defined for the UP case, avoiding
some ifdefs.

I also dropped retry (which never did anything) and wait (because the on
current cpu case will always wait)
Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/linux/smp.h |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Index: linux/include/linux/smp.h
===================================================================
--- linux.orig/include/linux/smp.h
+++ linux/include/linux/smp.h
@@ -138,4 +138,26 @@ static inline void smp_send_reschedule(i
 
 void smp_setup_processor_id(void);
 
+#ifdef CONFIG_SMP
+/* Similar to smp_call_function_single, but DTRT when we're already
+   on the right CPU. */
+static inline void on_cpu_single(int cpu, void (*func)(void *), void *info)
+{
+	int me = get_cpu();
+	if (cpu == me) {
+		func(info);
+		put_cpu();
+	} else {
+		put_cpu();
+		/* wait is forced on because the me==cpu case above will always wait */
+		smp_call_function_single(cpu, func, info, 0, 1);
+	}
+}
+#else
+static inline void on_cpu_single(int cpu, void (*func)(void *), void *info)
+{
+	func(info);
+}
+#endif
+
 #endif /* __LINUX_SMP_H */
-

From: Satyam Sharma
Date: Thursday, July 19, 2007 - 4:09 am

Hi Andi,


I think this is no longer the case, is it? With KVM updates already
merged in latest mainline -git, that modified smp_call_function_single()

In any case, this is unsafe. smp_call_function_single() -- with the old
semantics, which is what this patch assumes, obviously -- is quite
pointless without its _caller_ disabling preemption around it. So the
put_cpu() must come after the smp_call_function_single, otherwise
you won't even detect the error that might happen, seeing you're

WARN_ON(irqs_disabled());


... for the sake of API / behaviour consistency.


But probably you should just drop this ... with smp_call_function_single's
new semantics, I don't see this function growing any users.

Satyam
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 5:07 am

The new sched-clock uses it, but i'll update it to use smp_call_function_single

Thanks

-Andi
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:54 am

Move it into an own file for easy sharing.
Do everything per CPU. This avoids problems with TSCs that
tick at different frequencies per CPU.
Resync properly on cpufreq changes. CPU frequency is instable
around cpu frequency changing, so fall back during a backing
clock during this period.
Hopefully TSC will work now on all systems except when there isn't a
physical TSC. 

And

+From: Jeremy Fitzhardinge <jeremy@goop.org>
Three cleanups there:
 - change "instable" -> "unstable"
 - it's better to use get_cpu_var for getting this cpu's variables
 - change cycles_2_ns to do the full computation rather than just the
   tsc->ns scaling.  It's a simpler interface, and it makes the function

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/i386/kernel/Makefile      |    3 
 arch/i386/kernel/sched-clock.c |  265 +++++++++++++++++++++++++++++++++++++++++
 arch/i386/kernel/tsc.c         |   74 -----------
 include/asm-i386/timer.h       |   32 ----
 include/asm-i386/tsc.h         |    1 
 5 files changed, 269 insertions(+), 106 deletions(-)

Index: linux/arch/i386/kernel/sched-clock.c
===================================================================
--- /dev/null
+++ linux/arch/i386/kernel/sched-clock.c
@@ -0,0 +1,265 @@
+/* A fast clock for the scheduler.
+ * Copyright 2007 Andi Kleen SUSE Labs
+ * Subject to the GNU Public License, version 2 only.
+ */
+#include <linux/init.h>
+#include <linux/cpu.h>
+#include <linux/cpufreq.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/ktime.h>
+#include <linux/hrtimer.h>
+#include <linux/smp.h>
+#include <linux/notifier.h>
+#include <linux/init.h>
+#include <asm/tsc.h>
+#include <asm/cpufeature.h>
+#include <asm/timer.h>
+
+/*
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ *  basic equation:
+ *		ns = cycles / (freq / ns_per_sec)
+ *		ns = cycles * (ns_per_sec / freq)
+ *		ns = cycles * (10^9 / (cpu_khz * 10^3))
+ *		ns = cycles * (10^6 / cpu_khz)
+ *
+ *	Then we use scaling math ...
From: Daniel Walker
Date: Thursday, July 19, 2007 - 9:51 am

What about using the cycles2ns() clocksource helpers, it would eliminate
The comment still says something about interrupts off, but that was

Looking further down you aren't using this unstable path when the tsc is
just outright unstable (i.e. some Cyrix systems IIRC)? An improvement
over the original code would be to catch the systems that change
frequencies without cpufreq (like the ones that gave Thomas so much

-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 10:13 am

They are completely different from what clocksource provides.

-Andi
-

From: Daniel Walker
Date: Thursday, July 19, 2007 - 10:15 am

How so?

Daniel

-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 10:22 am

The new sched_clock's works CPU local and relative to the last sync point. 

-Andi
-

From: Daniel Walker
Date: Thursday, July 19, 2007 - 10:31 am

Right, I guess I'm speaking more low-level than that .. Both function do
shift-multiply style math .. So between the two the cycles to
nanoseconds conversion code is duplicated, and the code to calculate the
duplicated per architecture .. I think it would be a win to use the
generic functions if it's possible..

Daniel

-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 10:38 am

They can't be used because they're not cpu local. The whole basic 
concept behind the new sched_clock is to be cpu local. 

-Andi
 
-

From: Daniel Walker
Date: Thursday, July 19, 2007 - 10:43 am

Your not following me .. The cpu localness is retained in the multiply
value, which is component of the math .. It's got nothing to do with the
conversion code itself.

You do the same operation to convert from cycles to nanosecond
regardless of the values you use. Example,

+static inline u64 __cycles_2_ns(struct sc_data *sc, u64 cyc)
+{
+       u64 ns;
+
+       cyc -= sc->sync_base;
+       ns = (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+       ns += sc->ns_base;
+
+       return ns;
+}

Above the line "(cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;" is
part of the duplication that I'm referring to, not the surrounding code.

Which looks very much like this,

static inline s64 cyc2ns(struct clocksource *cs, cycle_t cycles)
{
        u64 ret = (u64)cycles;
        ret = (ret * cs->mult) >> cs->shift;
        return ret;
}

Daniel

-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 11:00 am

Because you don't make much sense. You're really asking me to factor
an single multiplication and a shift (two CPU instructions!) out to share?

-Andi
-

From: Daniel Walker
Date: Thursday, July 19, 2007 - 11:00 am

Yes, but I also said that was only part of the duplication. Your already
doing a re-write, there is no reason not to add this..

Daniel

-

From: Mathieu Desnoyers
Date: Thursday, July 19, 2007 - 8:11 pm

I noticed the same thing about interrupts off when going through the
code. Andi, since you are already playing with per cpu variables, you
could leverage asm/local.h there by declaring last_val as local_t and
use either local_cmpxchg or local_add_return (depending on your needs)
to get both better performances than cli/sti _and_ be really atomic.

See this thread for performance tests:
http://www.ussg.iu.edu/hypermail/linux/kernel/0707.1/0832.html


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-

From: Mathieu Desnoyers
Date: Thursday, July 19, 2007 - 8:47 pm

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:

I just want to rectify a detail: local_t uses type "long", which is 32
bits on x86_32 and 64 bits on x86_64.

Using a cmpxchg8b on i386 seems to require the LOCK prefix to be taken,
so it may degrate performances too much. Therefore, you may prefer to
stay with cli/sti on i386, but using a local cmpxchg would make sense on
x86_64.

A side-note: I really dislike the new cmpxchg behavior when a too
large value is passed to it. If we pass a uint64_t * as first argument
to cmpxchg or cmpxchg_local on i386, it just fails silently. Before, a
linker error was produced, which required the kernel to be compiled with
-O2 as side-effect, but at least there wasn't any silent failure...

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-

From: Mathieu Desnoyers
Date: Thursday, July 19, 2007 - 9:18 pm

Actually, about the i386 case, I am trying to get my head around the
locked cmpxchg8b issue.

I just implemented what would look like something that could be added to
the standard cmpxchg on i386:

union pack64 {
        u64 value;
        struct {
                u32 low, high;
        };
};

static inline u64 cmpxchg8b(u64 * ptr, u64 old, u64 new)
{
        union pack64 oldu;
        union pack64 newu;
        union pack64 prevu;

        oldu.value = old;
        newu.value = new;

        __asm__ __volatile__("cmpxchg8b (%6)\n\t"
                             : "=d"(prevu.high), "=a" (prevu.low)
                             : "0" (oldu.high), "1" (oldu.low),
                               "c" (newu.high), "b" (newu.low),
                               "m"(*__xg(ptr))
                             : "memory");
        return prevu.value;
}

I tried it with and without the LOCK prefix on my Pentium 4.

Locked cmpxchg8b : 90 cycles
Non locked cmpxchg8b: 30 cycles
sti: 166 cycles
cli: 159 cycles

So, hrm, even if we use the locked version, it is still much faster than
the sti/cli. I am thoughtful about the comment in asm-i386/system.h:

/*
 * The semantics of XCHGCMP8B are a bit strange, this is why
 * there is a loop and the loading of %%eax and %%edx has to
 * be inside. This inlines well in most cases, the cached
 * cost is around ~38 cycles. (in the future we might want
 * to do an SIMD/3DNOW!/MMX/FPU 64-bit store here, but that
 * might have an implicit FPU-save as a cost, so it's not
 * clear which path to go.)
 *
 * cmpxchg8b must be used with the lock prefix here to allow 
 * the instruction to be executed atomically, see page 3-102
 * of the instruction set reference 24319102.pdf. We need
 * the reader side to see the coherent 64bit value.
 */

I just re-read 24319102.pdf page 3-102. Here is what seems to be the
reference used:

"This instruction can be used with a LOCK prefix to allow the
instruction to be executed atomi cally. To simplify ...
From: Nick Piggin
Date: Thursday, July 19, 2007 - 10:07 pm

Curious: what does it look like if the memory is not in cache? I
found that cmpxchg is relatively slower than other rmw instructions
in that case.

-- 
SUSE Labs, Novell Inc.
-

From: Mathieu Desnoyers
Date: Thursday, July 19, 2007 - 10:47 pm

Actually, I have just seen that cmpxchg64 and cmpxchg64_local are
doing exactly this and they are already implemented in asm-i386/system.h.

A quick test: I am doing clflush in a loop (substracting its time from the
following loops) to have a memory hit when I do cmpxchg. This is the
result of just the cmpxchg8b:

non locked cmpxchg8b: 583.37 cycles
locked cmpxchg8b: 650.48 cycles
rmw in 3 operations: 581.43 cycles

So the locked cmpxchg is 67 cycles slower than the non locked cmpxchg,
which fits with my 30 vs 90 cycles. rmw is a tiny bit faster than
cmpxchg8b (2 cycles), but nothing to call home about.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-

From: Andi Kleen
Date: Friday, July 20, 2007 - 1:27 am

That's only on a slow path during cpu frequency changing while the TSC is instable.
Shouldn't be that common.

-Andi
-

From: Mathieu Desnoyers
Date: Friday, July 20, 2007 - 7:12 am

Hrm, I don't see why you can get away without disabling interrupts in
the fast path:

+unsigned long long tsc_sched_clock(void)
+{
+       unsigned long long r;
+       struct sc_data *sc = &get_cpu_var(sc_data);
+       
+       if (unlikely(sc->unstable)) {
+               r = (jiffies_64 - sc->sync_base) * (1000000000 / HZ);
+               r += sc->ns_base;
+               /*
+                * last_val is used to avoid non monotonity on a
+                * stable->unstable transition. Make sure the time
+                * never goes to before the last value returned by the
+                * TSC clock.
+                */
+               while (r <= sc->last_val) {
+                       rmb();
+                       r = sc->last_val + 1;
+                       rmb();
+               }
+               sc->last_val = r;

Here, slow path, we update last_val (64 bits value). Must be protected.

+       } else {
+               rdtscll(r);
+               r = __cycles_2_ns(sc, r);
+               sc->last_val = r;

Here, fast path, we update last_val too so it is ready to be read when
the tsc will become unstable.

If we don't disable interrupts around its update, we could have: (LSB vs
MSB update order is arbitrary)

update sc->last_val 32MSB
  interrupt comes
    update sc->last_val 32MSB
    update sc->last_val 32LSB
  iret
update sc->last_val 32LSB

So if, after this, we run tsc_sched_clock() with an unstable TSC, we
read a last_val containing the interrupt's MSB and the last_val LSB. It
can particularity hurt if we are around a 32 bits overflow, because time
could "jump" forward of about 1.43 seconds on a 3 GHz system.

So I guess we need synchronization on the fast path, and therefore using
cmpxchg_local on x86_64 and cmpxchg64_local on i386 makes sense.

Mathieu

+       }
+ 
+       put_cpu_var(sc_data);
+       
+       return r;
+}



-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de ...
From: Mathieu Desnoyers
Date: Friday, July 20, 2007 - 7:39 am

The case above explained the issue for i386. For x86_64, the race goes
like this:

read tsc
  interrupt
  read tsc
  update sc->last_val
  iret
update sc->last_val

Here, last_val is not at its highest value anymore. This is why a
cmpxchg is useful on x86_64.


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-

From: Andi Kleen
Date: Friday, July 20, 2007 - 8:14 am

On x86-64 the 64bit write is atomic against interrupts.

You're right 32bit has a problem though. I'm not too happy about 
cmpxchg though because that wouldn't work on some CPUs.

I wonder if we can just get away with using a 32bit value on i386.
Just for the purpose of keeping the value monotonic it should be good
enough. Will think about it.

Thanks for the review.

-Andi

-

From: Mathieu Desnoyers
Date: Friday, July 20, 2007 - 8:22 am

Which CPUs ? 386 ? do they even have a cycle counter ?

Please have a look at my reply to this email for the x86_64 problematic
case. This problematic case also applies to i386 (non atomicity of tsc

Yes, it could work. In this case you have to be aware that the 32 LSBs
of the TSC will overflow every ~ 1s and, in order to be sure to be able
to detect the overflow, you have to do at least 1 TSC read per second

YW,


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-

From: Mathieu Desnoyers
Date: Friday, July 20, 2007 - 9:49 am

Fall back on interrupt disable in cmpxchg8b on 80386 and 80486

Actually, on 386, cmpxchg and cmpxchg_local fall back on
cmpxchg_386_u8/16/32: it disables interruptions around non atomic
updates to mimic the cmpxchg behavior.

The comment:
/* Poor man's cmpxchg for 386. Unsuitable for SMP */

in the implementation tells much about how this cmpxchg implementation
should not be used in a SMP context. However, the cmpxchg_local can
perfectly use this fallback, since it only needs to be atomic wrt the
local cpu.

This patch adds a cmpxchg_386_u64 and uses it as a fallback for cmpxchg64
and cmpxchg64_local on 80386 and 80486.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 arch/i386/kernel/cpu/intel.c |   17 +++++++
 include/asm-i386/cmpxchg.h   |   99 ++++++++++++++++++++++++++++---------------
 2 files changed, 82 insertions(+), 34 deletions(-)

Index: linux-2.6-lttng/arch/i386/kernel/cpu/intel.c
===================================================================
--- linux-2.6-lttng.orig/arch/i386/kernel/cpu/intel.c	2007-07-20 12:41:24.000000000 -0400
+++ linux-2.6-lttng/arch/i386/kernel/cpu/intel.c	2007-07-20 12:43:56.000000000 -0400
@@ -329,5 +329,22 @@ unsigned long cmpxchg_386_u32(volatile v
 EXPORT_SYMBOL(cmpxchg_386_u32);
 #endif
 
+#ifndef CONFIG_X86_CMPXCHG64
+unsigned long long cmpxchg_486_u64(volatile void *ptr, u64 old, u64 new)
+{
+	u64 prev;
+	unsigned long flags;
+
+	/* Poor man's cmpxchg8b for 386 and 486. Unsuitable for SMP */
+	local_irq_save(flags);
+	prev = *(u64 *)ptr;
+	if (prev == old)
+		*(u64 *)ptr = new;
+	local_irq_restore(flags);
+	return prev;
+}
+EXPORT_SYMBOL(cmpxchg_486_u64);
+#endif
+
 // arch_initcall(intel_cpu_init);
 
Index: linux-2.6-lttng/include/asm-i386/cmpxchg.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-i386/cmpxchg.h	2007-07-20 12:41:24.000000000 -0400
+++ linux-2.6-lttng/include/asm-i386/cmpxchg.h	2007-07-20 12:42:12.000000000 ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86_64/kernel/Makefile |    3 ++-
 arch/x86_64/kernel/time.c   |    1 -
 arch/x86_64/kernel/tsc.c    |   29 +----------------------------
 include/asm-x86_64/timer.h  |    5 +++++
 include/asm-x86_64/timex.h  |    1 -
 5 files changed, 8 insertions(+), 31 deletions(-)

Index: linux/arch/x86_64/kernel/tsc.c
===================================================================
--- linux.orig/arch/x86_64/kernel/tsc.c
+++ linux/arch/x86_64/kernel/tsc.c
@@ -8,6 +8,7 @@
 #include <linux/cpufreq.h>
 
 #include <asm/timex.h>
+#include <asm/tsc.h>
 
 static int notsc __initdata = 0;
 
@@ -16,32 +17,6 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
-static unsigned int cyc2ns_scale __read_mostly;
-
-void set_cyc2ns_scale(unsigned long khz)
-{
-	cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
-
-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-	return (cyc * cyc2ns_scale) >> NS_SCALE;
-}
-
-unsigned long long sched_clock(void)
-{
-	unsigned long a = 0;
-
-	/* Could do CPU core sync here. Opteron can execute rdtsc speculatively,
-	 * which means it is not completely exact and may not be monotonous
-	 * between CPUs. But the errors should be too small to matter for
-	 * scheduling purposes.
-	 */
-
-	rdtscll(a);
-	return cycles_2_ns(a);
-}
-
 static int tsc_unstable;
 
 static inline int check_tsc_unstable(void)
@@ -114,8 +89,6 @@ static int time_cpufreq_notifier(struct 
 			mark_tsc_unstable("cpufreq changes");
 	}
 
-	set_cyc2ns_scale(tsc_khz_ref);
-
 	return 0;
 }
 
Index: linux/arch/x86_64/kernel/time.c
===================================================================
--- linux.orig/arch/x86_64/kernel/time.c
+++ linux/arch/x86_64/kernel/time.c
@@ -408,7 +408,6 @@ void __init time_init(void)
 	else
 		vgetcpu_mode = VGETCPU_LSL;
 
-	set_cyc2ns_scale(tsc_khz);
 	printk(KERN_INFO "time.c: Detected %d.%03d MHz processor.\n",
 		cpu_khz / 1000, ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

With that an L3 cache is correctly reported in the cache information in /sys

With fixes from Andreas Herrmann and Dean Gaudet

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/i386/kernel/cpu/intel_cacheinfo.c |   74 ++++++++++++++++++++++++---------
 arch/x86_64/kernel/setup.c             |    7 ++-
 2 files changed, 60 insertions(+), 21 deletions(-)

Index: linux/arch/i386/kernel/cpu/intel_cacheinfo.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/intel_cacheinfo.c
+++ linux/arch/i386/kernel/cpu/intel_cacheinfo.c
@@ -4,7 +4,7 @@
  *      Changes:
  *      Venkatesh Pallipadi	: Adding cache identification through cpuid(4)
  *		Ashok Raj <ashok.raj@intel.com>: Work with CPU hotplug infrastructure.
- *	Andi Kleen		: CPUID4 emulation on AMD.
+ *	Andi Kleen / Andreas Herrmann	: CPUID4 emulation on AMD.
  */
 
 #include <linux/init.h>
@@ -135,7 +135,7 @@ unsigned short			num_cache_leaves;
 
 /* AMD doesn't have CPUID4. Emulate it here to report the same
    information to the user.  This makes some assumptions about the machine:
-   No L3, L2 not shared, no SMT etc. that is currently true on AMD CPUs.
+   L2 not shared, no SMT etc. that is currently true on AMD CPUs.
 
    In theory the TLBs could be reported as fake type (they are in "dummy").
    Maybe later */
@@ -159,13 +159,26 @@ union l2_cache {
 	unsigned val;
 };
 
+union l3_cache {
+	struct {
+		unsigned line_size : 8;
+		unsigned lines_per_tag : 4;
+		unsigned assoc : 4;
+		unsigned res : 2;
+		unsigned size_encoded : 14;
+	};
+	unsigned val;
+};
+
 static const unsigned short assocs[] = {
 	[1] = 1, [2] = 2, [4] = 4, [6] = 8,
-	[8] = 16,
+	[8] = 16, [0xa] = 32, [0xb] = 48,
+	[0xc] = 64,
 	[0xf] = 0xffff // ??
-	};
-static const unsigned char levels[] = { 1, 1, 2 };
-static const unsigned char types[] = { 1, 2, 3 };
+};
+
+static const unsigned char levels[] = { 1, 1, 2, 3 };
+static const unsigned char types[] = { 1, 2, 3, 3 };
 
 ...
From: Andreas Herrmann
Date: Friday, July 20, 2007 - 10:00 am

Reporting of L3 cache information should also be enabled in 32bit mode.


Regards,

Andreas
--

Enable reporting of L3 cache info in 32 bit mode for family 0x10.

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>

diff --git a/arch/i386/kernel/cpu/amd.c b/arch/i386/kernel/cpu/amd.c
index 6f47eee..815a5f0 100644
--- a/arch/i386/kernel/cpu/amd.c
+++ b/arch/i386/kernel/cpu/amd.c
@@ -272,8 +272,12 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	}
 #endif
 
-	if (cpuid_eax(0x80000000) >= 0x80000006)
-		num_cache_leaves = 3;
+	if (cpuid_eax(0x80000000) >= 0x80000006) {
+		if ((c->x86 == 0x10) && (cpuid_edx(0x80000006) & 0xf000))
+			num_cache_leaves = 4;
+		else
+			num_cache_leaves = 3;
+	}
 
 	if (amd_apic_timer_broken())
 		set_bit(X86_FEATURE_LAPIC_TIMER_BROKEN, c->x86_capability);



-

From: Andreas Herrmann
Date: Friday, July 20, 2007 - 10:15 am

I think, Joachim's patch (sent to patches@x86-64.org on June 14) should
be added as well. I have attached his patch below.


Regards,

Andreas

-- 
Operating | AMD Saxony Limited Liability Company & Co. KG,
  System  | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 Research | Register Court Dresden: HRA 4896, General Partner authorized
  Center  | to represent: AMD Saxony LLC (Wilmington, Delaware, US)
  (OSRC)  | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
--

This will allow the size field to be reported for all values instead of a 
handful and also fills the shard_cpu_map with meaning full value.

Signed-off-by: Joachim Deguara <joachim.deguara@amd.com>
Index: kernel/arch/i386/kernel/cpu/intel_cacheinfo.c
===================================================================
--- kernel.orig/arch/i386/kernel/cpu/intel_cacheinfo.c
+++ kernel/arch/i386/kernel/cpu/intel_cacheinfo.c
@@ -224,12 +224,7 @@ static void __cpuinit amd_cpuid4(int lea
 		assoc = l3.assoc;
 		line_size = l3.line_size;
 		lines_per_tag = l3.lines_per_tag;
-		switch (l3.size_encoded) {
-		case 4:  size_in_kb = 2 * 1024; break;
-		case 8:  size_in_kb = 4 * 1024; break;
-		case 12: size_in_kb = 6 * 1024; break;
-		default: size_in_kb = 0; break;
-		}
+		size_in_kb = l3.size_encoded * 512;
 		break;
 	default:
 		return;
@@ -238,7 +233,10 @@ static void __cpuinit amd_cpuid4(int lea
 	eax->split.is_self_initializing = 1;
 	eax->split.type = types[leaf];
 	eax->split.level = levels[leaf];
-	eax->split.num_threads_sharing = 0;
+	if (leaf == 3)
+		eax->split.num_threads_sharing = current_cpu_data.x86_max_cores - 1;
+	else
+		eax->split.num_threads_sharing = 0;
 	eax->split.num_cores_on_die = current_cpu_data.x86_max_cores - 1;
 
 

_______________________________________________
patches mailing list
patches@x86-64.org
https://www.x86-64.org/mailman/listinfo/patches






-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: "Yinghai Lu" <yhlu.kernel@gmail.com>

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/asm-x86_64/dmi.h |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Index: linux/include/asm-x86_64/dmi.h
===================================================================
--- linux.orig/include/asm-x86_64/dmi.h
+++ linux/include/asm-x86_64/dmi.h
@@ -3,15 +3,12 @@
 
 #include <asm/io.h>
 
-extern void *dmi_ioremap(unsigned long addr, unsigned long size);
-extern void dmi_iounmap(void *addr, unsigned long size);
-
 #define DMI_MAX_DATA 2048
 
 extern int dmi_alloc_index;
 extern char dmi_alloc_data[DMI_MAX_DATA];
 
-/* This is so early that there is no good way to allocate dynamic memory. 
+/* This is so early that there is no good way to allocate dynamic memory.
    Allocate data in an BSS array. */
 static inline void *dmi_alloc(unsigned len)
 {
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

It is not fully softirq safe anyways.

Can't do a WARN_ON unfortunately because it could trigger in the 
panic case.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86_64/kernel/smp.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/x86_64/kernel/smp.c
===================================================================
--- linux.orig/arch/x86_64/kernel/smp.c
+++ linux/arch/x86_64/kernel/smp.c
@@ -386,9 +386,9 @@ int smp_call_function_single (int cpu, v
 		return 0;
 	}
 
-	spin_lock_bh(&call_lock);
+	spin_lock(&call_lock);
 	__smp_call_function_single(cpu, func, info, nonatomic, wait);
-	spin_unlock_bh(&call_lock);
+	spin_unlock(&call_lock);
 	put_cpu();
 	return 0;
 }
-

From: Satyam Sharma
Date: Thursday, July 19, 2007 - 5:16 am

Ack

[ sorry, I remember having promised to send such a patch myself

But this is not true at all. This function doesn't come anywhere

So I'd say we do need a:

WARN_ON(irqs_disabled() || in_interrupt());


And oh, by the way, you can safely go ahead and put that warning
in smp_call_function() *also*.

Note that panic() -> smp_send_stop() -> calls into the lower-level
__smp_call_function() directly.

So neither smp_call_function() nor smp_call_function_single() come
in the panic codepath -- the warnings there would be okay.

Satyam
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 5:19 am

You're wrong.

-Andi


-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: [** iso-8859-1 charset **] Bj
From: Björn
Date: Thursday, July 19, 2007 - 3:24 am

> From: [** iso-8859-1 charset **] Bj
From: Andi Kleen
Date: Thursday, July 19, 2007 - 3:42 am

> > From: [** iso-8859-1 charset **] Bj
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: [** iso-8859-1 charset **] Bj

> From: [** iso-8859-1 charset **] Bj
From: Andi Kleen
Date: Thursday, July 19, 2007 - 3:45 am

> > From: [** iso-8859-1 charset **] Bj
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Thomas Gleixner <tglx@linutronix.de>
The current SMI detection logic in read_hpet_tsc() makes sure,
that when a SMI happens between the read of the HPET counter and
the read of the TSC, this wrong value is used for TSC calibration.

This is not the intention of the function. The comparison must ensure,
that we do _NOT_ use such a value.

Fix the check to use calibration values where delta of the two TSC reads
is smaller than a reasonable threshold.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86_64/kernel/hpet.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/x86_64/kernel/hpet.c
===================================================================
--- linux.orig/arch/x86_64/kernel/hpet.c
+++ linux/arch/x86_64/kernel/hpet.c
@@ -190,7 +190,7 @@ int hpet_reenable(void)
  */
 
 #define TICK_COUNT 100000000
-#define TICK_MIN   5000
+#define SMI_THRESHOLD 50000
 #define MAX_TRIES  5
 
 /*
@@ -205,7 +205,7 @@ static void __init read_hpet_tsc(int *hp
 		tsc1 = get_cycles_sync();
 		hpet1 = hpet_readl(HPET_COUNTER);
 		tsc2 = get_cycles_sync();
-		if (tsc2 - tsc1 > TICK_MIN)
+		if ((tsc2 - tsc1) < SMI_THRESHOLD)
 			break;
 	}
 	*hpet = hpet1;
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Chris Wright <chrisw@sous-sol.org>

Remove pit_interrupt_hook as it adds just an extra layer.

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>

---

 include/asm-i386/i8253.h                 |   11 -----------
 include/asm-i386/mach-default/do_timer.h |    2 +-
 include/asm-i386/mach-voyager/do_timer.h |    2 +-
 3 files changed, 2 insertions(+), 13 deletions(-)

Index: linux/include/asm-i386/i8253.h
===================================================================
--- linux.orig/include/asm-i386/i8253.h
+++ linux/include/asm-i386/i8253.h
@@ -7,15 +7,4 @@ extern spinlock_t i8253_lock;
 
 extern struct clock_event_device *global_clock_event;
 
-/**
- * pit_interrupt_hook - hook into timer tick
- * @regs:	standard registers from interrupt
- *
- * Call the global clock event handler.
- **/
-static inline void pit_interrupt_hook(void)
-{
-	global_clock_event->event_handler(global_clock_event);
-}
-
 #endif	/* __ASM_I8253_H__ */
Index: linux/include/asm-i386/mach-default/do_timer.h
===================================================================
--- linux.orig/include/asm-i386/mach-default/do_timer.h
+++ linux/include/asm-i386/mach-default/do_timer.h
@@ -12,5 +12,5 @@
 
 static inline void do_timer_interrupt_hook(void)
 {
-	pit_interrupt_hook();
+	global_clock_event->event_handler(global_clock_event);
 }
Index: linux/include/asm-i386/mach-voyager/do_timer.h
===================================================================
--- linux.orig/include/asm-i386/mach-voyager/do_timer.h
+++ linux/include/asm-i386/mach-voyager/do_timer.h
@@ -12,7 +12,7 @@
  **/
 static inline void do_timer_interrupt_hook(void)
 {
-	pit_interrupt_hook();
+	global_clock_event->event_handler(global_clock_event);
 	voyager_timer_interrupt();
 }
 
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Chris Wright <chrisw@sous-sol.org>

When making changes to x86_64 timers, I noticed that touching hpet.h triggered
an unreasonably large rebuild.  Untangling it from timex.h quiets the extra
rebuild quite a bit.

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: john stultz <johnstul@us.ibm.com>

---

 drivers/char/rtc.c         |    2 +-
 include/asm-x86_64/apic.h  |    2 ++
 include/asm-x86_64/hpet.h  |    1 -
 include/asm-x86_64/timex.h |    1 -
 4 files changed, 3 insertions(+), 3 deletions(-)

Index: linux/drivers/char/rtc.c
===================================================================
--- linux.orig/drivers/char/rtc.c
+++ linux/drivers/char/rtc.c
@@ -82,7 +82,7 @@
 #include <asm/uaccess.h>
 #include <asm/system.h>
 
-#if defined(__i386__)
+#ifdef CONFIG_X86
 #include <asm/hpet.h>
 #endif
 
Index: linux/include/asm-x86_64/apic.h
===================================================================
--- linux.orig/include/asm-x86_64/apic.h
+++ linux/include/asm-x86_64/apic.h
@@ -86,6 +86,8 @@ extern void setup_apic_routing(void);
 extern void setup_APIC_extened_lvt(unsigned char lvt_off, unsigned char vector,
 				   unsigned char msg_type, unsigned char mask);
 
+extern int apic_is_clustered_box(void);
+
 #define K8_APIC_EXT_LVT_BASE    0x500
 #define K8_APIC_EXT_INT_MSG_FIX 0x0
 #define K8_APIC_EXT_INT_MSG_SMI 0x2
Index: linux/include/asm-x86_64/hpet.h
===================================================================
--- linux.orig/include/asm-x86_64/hpet.h
+++ linux/include/asm-x86_64/hpet.h
@@ -55,7 +55,6 @@
 
 extern int is_hpet_enabled(void);
 extern int hpet_rtc_timer_init(void);
-extern int apic_is_clustered_box(void);
 extern int hpet_arch_init(void);
 extern int hpet_timer_stop_set_go(unsigned long tick);
 extern int hpet_reenable(void);
Index: ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Thomas Gleixner <tglx@linutronix.de>
Use the generic cmos update function in kernel/time/ntp.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: john stultz <johnstul@us.ibm.com>
---

 arch/x86_64/Kconfig       |    4 ++++
 arch/x86_64/kernel/time.c |   25 +++++++++----------------
 2 files changed, 13 insertions(+), 16 deletions(-)

Index: linux/arch/x86_64/Kconfig
===================================================================
--- linux.orig/arch/x86_64/Kconfig
+++ linux/arch/x86_64/Kconfig
@@ -32,6 +32,10 @@ config GENERIC_TIME_VSYSCALL
 	bool
 	default y
 
+config GENERIC_CMOS_UPDATE
+	bool
+	default y
+
 config ZONE_DMA32
 	bool
 	default y
Index: linux/arch/x86_64/kernel/time.c
===================================================================
--- linux.orig/arch/x86_64/kernel/time.c
+++ linux/arch/x86_64/kernel/time.c
@@ -80,8 +80,9 @@ EXPORT_SYMBOL(profile_pc);
  * sheet for details.
  */
 
-static void set_rtc_mmss(unsigned long nowtime)
+static int set_rtc_mmss(unsigned long nowtime)
 {
+	int retval = 0;
 	int real_seconds, real_minutes, cmos_minutes;
 	unsigned char control, freq_select;
 
@@ -121,6 +122,7 @@ static void set_rtc_mmss(unsigned long n
 	if (abs(real_minutes - cmos_minutes) >= 30) {
 		printk(KERN_WARNING "time.c: can't update CMOS clock "
 		       "from %d to %d\n", cmos_minutes, real_minutes);
+		retval = -1;
 	} else {
 		BIN_TO_BCD(real_seconds);
 		BIN_TO_BCD(real_minutes);
@@ -140,12 +142,17 @@ static void set_rtc_mmss(unsigned long n
 	CMOS_WRITE(freq_select, RTC_FREQ_SELECT);
 
 	spin_unlock(&rtc_lock);
+
+	return retval;
 }
 
+int update_persistent_clock(struct timespec now)
+{
+	return set_rtc_mmss(now.tv_sec);
+}
 
 void main_timer_handler(void)
 {
-	static unsigned long rtc_update = 0;
 /*
  * Here we are in the timer irq handler. We have irqs locally ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Thomas Gleixner <tglx@linutronix.de>
xtime can be initialized including the cmos update from the generic
timekeeping code. Remove the arch specific implementation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86_64/kernel/time.c |   40 +---------------------------------------
 1 file changed, 1 insertion(+), 39 deletions(-)

Index: linux/arch/x86_64/kernel/time.c
===================================================================
--- linux.orig/arch/x86_64/kernel/time.c
+++ linux/arch/x86_64/kernel/time.c
@@ -193,7 +193,7 @@ static irqreturn_t timer_interrupt(int i
 	return IRQ_HANDLED;
 }
 
-static unsigned long get_cmos_time(void)
+unsigned long read_persistent_clock(void)
 {
 	unsigned int year, mon, day, hour, min, sec;
 	unsigned long flags;
@@ -367,11 +367,6 @@ void __init time_init(void)
 {
 	if (nohpet)
 		hpet_address = 0;
-	xtime.tv_sec = get_cmos_time();
-	xtime.tv_nsec = 0;
-
-	set_normalized_timespec(&wall_to_monotonic,
-	                        -xtime.tv_sec, -xtime.tv_nsec);
 
 	if (hpet_arch_init())
 		hpet_address = 0;
@@ -408,54 +403,21 @@ void __init time_init(void)
 	setup_irq(0, &irq0);
 }
 
-
-static long clock_cmos_diff;
-static unsigned long sleep_start;
-
 /*
  * sysfs support for the timer.
  */
 
 static int timer_suspend(struct sys_device *dev, pm_message_t state)
 {
-	/*
-	 * Estimate time zone so that set_time can update the clock
-	 */
-	long cmos_time =  get_cmos_time();
-
-	clock_cmos_diff = -cmos_time;
-	clock_cmos_diff += get_seconds();
-	sleep_start = cmos_time;
 	return 0;
 }
 
 static int timer_resume(struct sys_device *dev)
 {
-	unsigned long flags;
-	unsigned long sec;
-	unsigned long ctime = get_cmos_time();
-	long sleep_length = (ctime - sleep_start) * HZ;
-
-	if (sleep_length < 0) {
-		printk(KERN_WARNING "Time skew detected in timer ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Thomas Gleixner <tglx@linutronix.de>
Remove unused code and variables and do some codingstyle / whitespace
cleanups while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: john stultz <johnstul@us.ibm.com>
---

 arch/x86_64/kernel/tsc.c |   39 +++++++++++----------------------------
 1 file changed, 11 insertions(+), 28 deletions(-)

Index: linux/arch/x86_64/kernel/tsc.c
===================================================================
--- linux.orig/arch/x86_64/kernel/tsc.c
+++ linux/arch/x86_64/kernel/tsc.c
@@ -36,25 +36,9 @@ static inline int check_tsc_unstable(voi
  * first tick after the change will be slightly wrong.
  */
 
-#include <linux/workqueue.h>
-
-static unsigned int cpufreq_delayed_issched = 0;
-static unsigned int cpufreq_init = 0;
-static struct work_struct cpufreq_delayed_get_work;
-
-static void handle_cpufreq_delayed_get(struct work_struct *v)
-{
-	unsigned int cpu;
-	for_each_online_cpu(cpu) {
-		cpufreq_get(cpu);
-	}
-	cpufreq_delayed_issched = 0;
-}
-
-static unsigned int  ref_freq = 0;
-static unsigned long loops_per_jiffy_ref = 0;
-
-static unsigned long tsc_khz_ref = 0;
+static unsigned int  ref_freq;
+static unsigned long loops_per_jiffy_ref;
+static unsigned long tsc_khz_ref;
 
 static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
 				 void *data)
@@ -98,10 +82,8 @@ static struct notifier_block time_cpufre
 
 static int __init cpufreq_tsc(void)
 {
-	INIT_WORK(&cpufreq_delayed_get_work, handle_cpufreq_delayed_get);
-	if (!cpufreq_register_notifier(&time_cpufreq_notifier_block,
-				       CPUFREQ_TRANSITION_NOTIFIER))
-		cpufreq_init = 1;
+	cpufreq_register_notifier(&time_cpufreq_notifier_block,
+				  CPUFREQ_TRANSITION_NOTIFIER);
 	return 0;
 }
 
@@ -123,17 +105,18 @@ __cpuinit int unsynchronized_tsc(void)
 #endif
 	/* Most intel systems ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>

---

 arch/x86_64/kernel/apic.c    |    4 ++--
 arch/x86_64/kernel/mce_amd.c |    6 +++---
 include/asm-x86_64/apic.h    |    4 ++--
 3 files changed, 7 insertions(+), 7 deletions(-)

Index: linux/arch/x86_64/kernel/apic.c
===================================================================
--- linux.orig/arch/x86_64/kernel/apic.c
+++ linux/arch/x86_64/kernel/apic.c
@@ -990,8 +990,8 @@ int setup_profiling_timer(unsigned int m
 	return -EINVAL;
 }
 
-void setup_APIC_extened_lvt(unsigned char lvt_off, unsigned char vector,
-			    unsigned char msg_type, unsigned char mask)
+void setup_APIC_extended_lvt(unsigned char lvt_off, unsigned char vector,
+			     unsigned char msg_type, unsigned char mask)
 {
 	unsigned long reg = (lvt_off << 4) + K8_APIC_EXT_LVT_BASE;
 	unsigned int  v   = (mask << 16) | (msg_type << 8) | vector;
Index: linux/arch/x86_64/kernel/mce_amd.c
===================================================================
--- linux.orig/arch/x86_64/kernel/mce_amd.c
+++ linux/arch/x86_64/kernel/mce_amd.c
@@ -157,9 +157,9 @@ void __cpuinit mce_amd_feature_init(stru
 			high |= K8_APIC_EXT_LVT_ENTRY_THRESHOLD << 20;
 			wrmsr(address, low, high);
 
-			setup_APIC_extened_lvt(K8_APIC_EXT_LVT_ENTRY_THRESHOLD,
-					       THRESHOLD_APIC_VECTOR,
-					       K8_APIC_EXT_INT_MSG_FIX, 0);
+			setup_APIC_extended_lvt(K8_APIC_EXT_LVT_ENTRY_THRESHOLD,
+						THRESHOLD_APIC_VECTOR,
+						K8_APIC_EXT_INT_MSG_FIX, 0);
 
 			threshold_defaults.address = address;
 			threshold_restart_bank(&threshold_defaults, 0, 0);
Index: linux/include/asm-x86_64/apic.h
===================================================================
--- linux.orig/include/asm-x86_64/apic.h
+++ linux/include/asm-x86_64/apic.h
@@ -83,8 +83,8 @@ extern void ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Thomas Gleixner <tglx@linutronix.de>
The hpet_rtc_interrupt handler still uses pt_regs. Fix it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86_64/kernel/hpet.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/x86_64/kernel/hpet.c
===================================================================
--- linux.orig/arch/x86_64/kernel/hpet.c
+++ linux/arch/x86_64/kernel/hpet.c
@@ -439,7 +439,7 @@ int hpet_rtc_dropped_irq(void)
 	return 1;
 }
 
-irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs)
+irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id)
 {
 	struct rtc_time curr_time;
 	unsigned long rtc_int_flag = 0;
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Thomas Gleixner <tglx@linutronix.de>
hpet.h in asm-i386 and asm-x86_64 contain tons of duplicated stuff.
Consolidate into one shared header file.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/asm-i386/hpet.h   |  124 +++++++++++++++++-----------------------------
 include/asm-x86_64/hpet.h |   61 ----------------------
 2 files changed, 48 insertions(+), 137 deletions(-)

Index: linux/include/asm-i386/hpet.h
===================================================================
--- linux.orig/include/asm-i386/hpet.h
+++ linux/include/asm-i386/hpet.h
@@ -4,112 +4,82 @@
 
 #ifdef CONFIG_HPET_TIMER
 
-#include <linux/errno.h>
-#include <linux/module.h>
-#include <linux/sched.h>
-#include <linux/kernel.h>
-#include <linux/param.h>
-#include <linux/string.h>
-#include <linux/mm.h>
-#include <linux/interrupt.h>
-#include <linux/time.h>
-#include <linux/delay.h>
-#include <linux/init.h>
-#include <linux/smp.h>
-
-#include <asm/io.h>
-#include <asm/smp.h>
-#include <asm/irq.h>
-#include <asm/msr.h>
-#include <asm/delay.h>
-#include <asm/mpspec.h>
-#include <asm/uaccess.h>
-#include <asm/processor.h>
-
-#include <linux/timex.h>
-
 /*
  * Documentation on HPET can be found at:
  *      http://www.intel.com/ial/home/sp/pcmmspec.htm
  *      ftp://download.intel.com/ial/home/sp/mmts098.pdf
  */
 
-#define HPET_MMAP_SIZE	1024
+#define HPET_MMAP_SIZE		1024
 
-#define HPET_ID		0x000
-#define HPET_PERIOD	0x004
-#define HPET_CFG	0x010
-#define HPET_STATUS	0x020
-#define HPET_COUNTER	0x0f0
-#define HPET_T0_CFG	0x100
-#define HPET_T0_CMP	0x108
-#define HPET_T0_ROUTE	0x110
-#define HPET_T1_CFG	0x120
-#define HPET_T1_CMP	0x128
-#define HPET_T1_ROUTE	0x130
-#define HPET_T2_CFG	0x140
-#define HPET_T2_CMP	0x148
-#define HPET_T2_ROUTE	0x150
-
-#define HPET_ID_LEGSUP	0x00008000
-#define ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Thomas Gleixner <tglx@linutronix.de>
Fix coding style, white space wreckage and remove unused code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86_64/kernel/apic.c |   73 ++++++++++++++++++----------------------------
 1 file changed, 30 insertions(+), 43 deletions(-)

Index: linux/arch/x86_64/kernel/apic.c
===================================================================
--- linux.orig/arch/x86_64/kernel/apic.c
+++ linux/arch/x86_64/kernel/apic.c
@@ -92,8 +92,9 @@ unsigned int safe_apic_wait_icr_idle(voi
 void enable_NMI_through_LVT0 (void * dummy)
 {
 	unsigned int v;
-	
-	v = APIC_DM_NMI;                        /* unmask and set to NMI */
+
+	/* unmask and set to NMI */
+	v = APIC_DM_NMI;
 	apic_write(APIC_LVT0, v);
 }
 
@@ -120,7 +121,7 @@ void ack_bad_irq(unsigned int irq)
 	 * holds up an irq slot - in excessive cases (when multiple
 	 * unexpected vectors occur) that might lock up the APIC
 	 * completely.
-  	 * But don't ack when the APIC is disabled. -AK
+	 * But don't ack when the APIC is disabled. -AK
 	 */
 	if (!disable_apic)
 		ack_APIC_irq();
@@ -616,7 +617,7 @@ early_param("apic", apic_set_verbosity);
  * Detect and enable local APICs on non-SMP boards.
  * Original code written by Keir Fraser.
  * On AMD64 we trust the BIOS - if it says no APIC it is likely
- * not correctly set up (usually the APIC timer won't work etc.) 
+ * not correctly set up (usually the APIC timer won't work etc.)
  */
 
 static int __init detect_init_APIC (void)
@@ -789,13 +790,13 @@ static void setup_APIC_timer(unsigned in
 	local_irq_save(flags);
 
 	/* wait for irq slice */
- 	if (hpet_address && hpet_use_timer) {
- 		int trigger = hpet_readl(HPET_T0_CMP);
- 		while (hpet_readl(HPET_COUNTER) >= trigger)
- 			/* do nothing */ ;
- 		while (hpet_readl(HPET_COUNTER) <  trigger)
- 			/* do nothing */ ;
- ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86_64/kernel/time.c |   88 +++++++++++++++++++++++-----------------------
 1 file changed, 44 insertions(+), 44 deletions(-)

Index: linux/arch/x86_64/kernel/time.c
===================================================================
--- linux.orig/arch/x86_64/kernel/time.c
+++ linux/arch/x86_64/kernel/time.c
@@ -220,7 +220,7 @@ unsigned long read_persistent_clock(void
 	/*
 	 * We know that x86-64 always uses BCD format, no need to check the
 	 * config register.
- 	 */
+	 */
 
 	BCD_TO_BIN(sec);
 	BCD_TO_BIN(min);
@@ -233,11 +233,11 @@ unsigned long read_persistent_clock(void
 		BCD_TO_BIN(century);
 		year += century * 100;
 		printk(KERN_INFO "Extended CMOS year: %d\n", century * 100);
-	} else { 
+	} else {
 		/*
 		 * x86-64 systems only exists since 2002.
 		 * This will work up to Dec 31, 2100
-	 	 */
+		 */
 		year += 2000;
 	}
 
@@ -249,45 +249,45 @@ unsigned long read_persistent_clock(void
 #define TICK_COUNT 100000000
 static unsigned int __init tsc_calibrate_cpu_khz(void)
 {
-       int tsc_start, tsc_now;
-       int i, no_ctr_free;
-       unsigned long evntsel3 = 0, pmc3 = 0, pmc_now = 0;
-       unsigned long flags;
-
-       for (i = 0; i < 4; i++)
-               if (avail_to_resrv_perfctr_nmi_bit(i))
-                       break;
-       no_ctr_free = (i == 4);
-       if (no_ctr_free) {
-               i = 3;
-               rdmsrl(MSR_K7_EVNTSEL3, evntsel3);
-               wrmsrl(MSR_K7_EVNTSEL3, 0);
-               rdmsrl(MSR_K7_PERFCTR3, pmc3);
-       } else {
-               reserve_perfctr_nmi(MSR_K7_PERFCTR0 + i);
-               reserve_evntsel_nmi(MSR_K7_EVNTSEL0 + i);
-       }
-       local_irq_save(flags);
-       /* start meauring cycles, incrementing from 0 */
-       ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Ravikiran G Thirumalai <kiran@scalex86.org>
Too many remote cpu references due to /proc/stat.

On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
On every call to kstat_irqs, the process brings in per-cpu data from all
online cpus.  Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
/proc/stat is parsed by common commands like top, who etc, causing
lots of cacheline transfers

This statistic seems useless. Other 'big iron' arches disable this.
Can we disable computing/reporting this statistic?  This piece of
statistic is not human readable on x86_64 anymore,

If not, can we optimize computing this statistic so as to avoid
too many remote references (patch to follow)

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>

---
 fs/proc/proc_misc.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux/fs/proc/proc_misc.c
===================================================================
--- linux.orig/fs/proc/proc_misc.c
+++ linux/fs/proc/proc_misc.c
@@ -499,7 +499,8 @@ static int show_stat(struct seq_file *p,
 	}
 	seq_printf(p, "intr %llu", (unsigned long long)sum);
 
-#if !defined(CONFIG_PPC64) && !defined(CONFIG_ALPHA) && !defined(CONFIG_IA64)
+#if !defined(CONFIG_PPC64) && !defined(CONFIG_ALPHA) && !defined(CONFIG_IA64) \
+					&& !defined(CONFIG_X86_64)
 	for (i = 0; i < NR_IRQS; i++)
 		seq_printf(p, " %u", kstat_irqs(i));
 #endif
-

From: Christoph Hellwig
Date: Thursday, July 19, 2007 - 3:21 am

From: Andi Kleen
Date: Thursday, July 19, 2007 - 3:41 am

I guess it's fine on UP only architectures.  I will change it to !CONFIG_SMP
unless someone complains.

-Andi

-

From: Adrian Bunk
Date: Thursday, July 19, 2007 - 3:55 am

Making it depending on the kernel configuration will only cause 
surprises for users. And if you really need the data you can get 

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: "Jan Beulich" <jbeulich@novell.com>
Consolidate the three 32-bit system call entry points so that they all
treat registers in similar ways.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>

 arch/x86_64/ia32/ia32entry.S |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux.orig/arch/x86_64/ia32/ia32entry.S
+++ linux/arch/x86_64/ia32/ia32entry.S
@@ -104,7 +104,7 @@ ENTRY(ia32_sysenter_target)
 	pushq	%rax
 	CFI_ADJUST_CFA_OFFSET 8
 	cld
-	SAVE_ARGS 0,0,0
+	SAVE_ARGS 0,0,1
  	/* no need to do an access_ok check here because rbp has been
  	   32bit zero extended */ 
 1:	movl	(%rbp),%r9d
@@ -294,7 +294,7 @@ ia32_badarg:
  */ 				
 
 ENTRY(ia32_syscall)
-	CFI_STARTPROC	simple
+	CFI_STARTPROC32	simple
 	CFI_SIGNAL_FRAME
 	CFI_DEF_CFA	rsp,SS+8-RIP
 	/*CFI_REL_OFFSET	ss,SS-RIP*/
@@ -330,6 +330,7 @@ ia32_sysret:
 
 ia32_tracesys:			 
 	SAVE_REST
+	CLEAR_RREGS
 	movq $-ENOSYS,RAX(%rsp)	/* really needed? */
 	movq %rsp,%rdi        /* &pt_regs -> arg1 */
 	call syscall_trace_enter
-

From: Jeff Garzik
Date: Thursday, July 19, 2007 - 7:46 am

More comments and/or a less vague patch description would be nice.

What registers?  What behavior is being made common?  Why?

	Jeff


-

From: Jan Beulich
Date: Monday, August 6, 2007 - 3:43 am

I think the description says this quite well - which registers are being saved/
cleared is being made consistent (not common).

Jan

-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: "Jan Beulich" <jbeulich@novell.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>

 arch/i386/kernel/sysenter.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/arch/i386/kernel/sysenter.c
===================================================================
--- linux.orig/arch/i386/kernel/sysenter.c
+++ linux/arch/i386/kernel/sysenter.c
@@ -336,7 +336,9 @@ struct vm_area_struct *get_gate_vma(stru
 
 int in_gate_area(struct task_struct *task, unsigned long addr)
 {
-	return 0;
+	const struct vm_area_struct *vma = get_gate_vma(task);
+
+	return vma && addr >= vma->vm_start && addr < vma->vm_end;
 }
 
 int in_gate_area_no_task(unsigned long addr)
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: "Jan Beulich" <jbeulich@novell.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>

 arch/x86_64/mm/fault.c |    2 +-
 arch/x86_64/mm/init.c  |    2 --
 2 files changed, 1 insertion(+), 3 deletions(-)

Index: linux/arch/x86_64/mm/fault.c
===================================================================
--- linux.orig/arch/x86_64/mm/fault.c
+++ linux/arch/x86_64/mm/fault.c
@@ -301,7 +301,7 @@ static int vmalloc_fault(unsigned long a
 	return 0;
 }
 
-int page_fault_trace = 0;
+static int page_fault_trace;
 int exception_trace = 1;
 
 /*
Index: linux/arch/x86_64/mm/init.c
===================================================================
--- linux.orig/arch/x86_64/mm/init.c
+++ linux/arch/x86_64/mm/init.c
@@ -700,8 +700,6 @@ int kern_addr_valid(unsigned long addr) 
 #ifdef CONFIG_SYSCTL
 #include <linux/sysctl.h>
 
-extern int exception_trace, page_fault_trace;
-
 static ctl_table debug_table2[] = {
 	{
 		.ctl_name	= 99,
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: "Jan Beulich" <jbeulich@novell.com>
.. and adjust documentation to properly reflect options that are
x86-64 specific.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>

 Documentation/x86_64/boot-options.txt |    6 ------
 arch/x86_64/kernel/mpparse.c          |    1 -
 2 files changed, 7 deletions(-)

Index: linux/Documentation/x86_64/boot-options.txt
===================================================================
--- linux.orig/Documentation/x86_64/boot-options.txt
+++ linux/Documentation/x86_64/boot-options.txt
@@ -134,12 +134,6 @@ Non Executable Mappings
 
 SMP
 
-  nosmp	Only use a single CPU
-
-  maxcpus=NUMBER only use upto NUMBER CPUs
-
-  cpumask=MASK   only use cpus with bits set in mask
-
   additional_cpus=NUM Allow NUM more CPUs for hotplug
 		 (defaults are specified by the BIOS, see Documentation/x86_64/cpu-hotplug-spec)
 
Index: linux/arch/x86_64/kernel/mpparse.c
===================================================================
--- linux.orig/arch/x86_64/kernel/mpparse.c
+++ linux/arch/x86_64/kernel/mpparse.c
@@ -32,7 +32,6 @@
 
 /* Have we found an MP table */
 int smp_found_config;
-unsigned int __initdata maxcpus = NR_CPUS;
 
 /*
  * Various Linux-internal data structures created from the
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: "Jan Beulich" <jbeulich@novell.com>
Hence remove its handling in the opposite case.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>

 arch/i386/kernel/alternative.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux/arch/i386/kernel/alternative.c
===================================================================
--- linux.orig/arch/i386/kernel/alternative.c
+++ linux/arch/i386/kernel/alternative.c
@@ -5,9 +5,8 @@
 #include <asm/alternative.h>
 #include <asm/sections.h>
 
-static int noreplace_smp     = 0;
-static int smp_alt_once      = 0;
-static int debug_alternative = 0;
+#ifdef CONFIG_HOTPLUG_CPU
+static int smp_alt_once;
 
 static int __init bootonly(char *str)
 {
@@ -15,6 +14,11 @@ static int __init bootonly(char *str)
 	return 1;
 }
 __setup("smp-alt-boot", bootonly);
+#else
+#define smp_alt_once 1
+#endif
+
+static int debug_alternative;
 
 static int __init debug_alt(char *str)
 {
@@ -23,6 +27,8 @@ static int __init debug_alt(char *str)
 }
 __setup("debug-alternative", debug_alt);
 
+static int noreplace_smp;
+
 static int __init setup_noreplace_smp(char *str)
 {
 	noreplace_smp = 1;
@@ -376,8 +382,6 @@ void __init alternative_instructions(voi
 #ifdef CONFIG_HOTPLUG_CPU
 	if (num_possible_cpus() < 2)
 		smp_alt_once = 1;
-#else
-	smp_alt_once = 1;
 #endif
 
 #ifdef CONFIG_SMP
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: "Jan Beulich" <jbeulich@novell.com>
Constrain __supported_pte_mask and NX handling to just the PAE kernel.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>

 arch/i386/mm/init.c     |    7 ++++---
 include/asm-i386/page.h |    1 -
 2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux/arch/i386/mm/init.c
===================================================================
--- linux.orig/arch/i386/mm/init.c
+++ linux/arch/i386/mm/init.c
@@ -471,6 +471,10 @@ void zap_low_mappings (void)
 	flush_tlb_all();
 }
 
+int nx_enabled = 0;
+
+#ifdef CONFIG_X86_PAE
+
 static int disable_nx __initdata = 0;
 u64 __supported_pte_mask __read_mostly = ~_PAGE_NX;
 EXPORT_SYMBOL_GPL(__supported_pte_mask);
@@ -500,9 +504,6 @@ static int __init noexec_setup(char *str
 }
 early_param("noexec", noexec_setup);
 
-int nx_enabled = 0;
-#ifdef CONFIG_X86_PAE
-
 static void __init set_nx(void)
 {
 	unsigned int v[4], l, h;
Index: linux/include/asm-i386/page.h
===================================================================
--- linux.orig/include/asm-i386/page.h
+++ linux/include/asm-i386/page.h
@@ -44,7 +44,6 @@
 extern int nx_enabled;
 
 #ifdef CONFIG_X86_PAE
-extern unsigned long long __supported_pte_mask;
 typedef struct { unsigned long pte_low, pte_high; } pte_t;
 typedef struct { unsigned long long pmd; } pmd_t;
 typedef struct { unsigned long long pgd; } pgd_t;
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Adrian Bunk <bunk@stusta.de>

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/i386/kernel/setup.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/i386/kernel/setup.c
===================================================================
--- linux.orig/arch/i386/kernel/setup.c
+++ linux/arch/i386/kernel/setup.c
@@ -466,7 +466,7 @@ void __init setup_bootmem_allocator(void
  *
  * This should all compile down to nothing when NUMA is off.
  */
-void __init remapped_pgdat_init(void)
+static void __init remapped_pgdat_init(void)
 {
 	int nid;
 
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Adrian Bunk <bunk@stusta.de>

Every file should include the headers containing the prototypes for its
global functions.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>

---

 arch/i386/kernel/i8253.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux/arch/i386/kernel/i8253.c
===================================================================
--- linux.orig/arch/i386/kernel/i8253.c
+++ linux/arch/i386/kernel/i8253.c
@@ -13,6 +13,7 @@
 #include <asm/delay.h>
 #include <asm/i8253.h>
 #include <asm/io.h>
+#include <asm/timer.h>
 
 #include "io_ports.h"
 
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Adrian Bunk <bunk@stusta.de>

timer_irq_works() needlessly became global.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>

---

 arch/i386/kernel/io_apic.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/i386/kernel/io_apic.c
===================================================================
--- linux.orig/arch/i386/kernel/io_apic.c
+++ linux/arch/i386/kernel/io_apic.c
@@ -1902,7 +1902,7 @@ __setup("no_timer_check", notimercheck);
  *	- if this function detects that timer IRQs are defunct, then we fall
  *	  back to ISA timer IRQs
  */
-int __init timer_irq_works(void)
+static int __init timer_irq_works(void)
 {
 	unsigned long t1 = jiffies;
 
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Christoph Lameter <clameter@sgi.com>

This adds caching of pgds and puds, pmds, pte.  That way we can avoid costly
zeroing and initialization of special mappings in the pgd.

A second quicklist is useful to separate out PGD handling.  We can carry the
initialized pgds over to the next process needing them.

Also clean up the pgd_list handling to use regular list macros.  There is no
need anymore to avoid the lru field.

Move the add/removal of the pgds to the pgdlist into the constructor /
destructor.  That way the implementation is congruent with i386.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Acked-by: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86_64/Kconfig          |    8 ++++
 arch/x86_64/kernel/process.c |    1 
 arch/x86_64/kernel/smp.c     |    2 -
 include/asm-x86_64/pgalloc.h |   73 ++++++++++++++++++++++++++++---------------
 include/asm-x86_64/pgtable.h |    1 
 5 files changed, 59 insertions(+), 26 deletions(-)

Index: linux/arch/x86_64/Kconfig
===================================================================
--- linux.orig/arch/x86_64/Kconfig
+++ linux/arch/x86_64/Kconfig
@@ -60,6 +60,14 @@ config ZONE_DMA
 	bool
 	default y
 
+config QUICKLIST
+	bool
+	default y
+
+config NR_QUICK
+	int
+	default 2
+
 config ISA
 	bool
 
Index: linux/arch/x86_64/kernel/process.c
===================================================================
--- linux.orig/arch/x86_64/kernel/process.c
+++ linux/arch/x86_64/kernel/process.c
@@ -207,6 +207,7 @@ void cpu_idle (void)
 			if (__get_cpu_var(cpu_idle_state))
 				__get_cpu_var(cpu_idle_state) = 0;
 
+			check_pgt_cache();
 			rmb();
 			idle = pm_idle;
 			if (!idle)
Index: ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: David Rientjes <rientjes@google.com>

The logic in e820_find_active_regions() for determining the true active
regions for an e820 entry given a range of PFN's is needed for
e820_hole_size() as well.

e820_hole_size() is called from the NUMA emulation code to determine the
reserved area within an address range on a per-node basis.  Its logic should
duplicate that of finding active regions in an e820 entry because these are
the only true ranges we may register anyway.

[akpm@linux-foundation.org: cleanup]
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86_64/kernel/e820.c |   82 ++++++++++++++++++++++++++--------------------
 1 file changed, 48 insertions(+), 34 deletions(-)

Index: linux/arch/x86_64/kernel/e820.c
===================================================================
--- linux.orig/arch/x86_64/kernel/e820.c
+++ linux/arch/x86_64/kernel/e820.c
@@ -289,47 +289,61 @@ void __init e820_mark_nosave_regions(voi
 	}
 }
 
+/*
+ * Finds an active region in the address range from start_pfn to end_pfn and
+ * returns its range in ei_startpfn and ei_endpfn for the e820 entry.
+ */
+static int __init e820_find_active_region(const struct e820entry *ei,
+					  unsigned long start_pfn,
+					  unsigned long end_pfn,
+					  unsigned long *ei_startpfn,
+					  unsigned long *ei_endpfn)
+{
+	*ei_startpfn = round_up(ei->addr, PAGE_SIZE) >> PAGE_SHIFT;
+	*ei_endpfn = round_down(ei->addr + ei->size, PAGE_SIZE) >> PAGE_SHIFT;
+
+	/* Skip map entries smaller than a page */
+	if (*ei_startpfn >= *ei_endpfn)
+		return 0;
+
+	/* Check if end_pfn_map should be updated */
+	if (ei->type != E820_RAM && *ei_endpfn > end_pfn_map)
+		end_pfn_map = *ei_endpfn;
+
+	/* Skip if map is outside the node */
+	if (ei->type != E820_RAM || *ei_endpfn <= start_pfn ||
+				    *ei_startpfn >= end_pfn)
+		return ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: David Rientjes <rientjes@google.com>

For NUMA emulation, our SLIT should represent the true NUMA topology of the
system but our proximity domain to node ID mapping needs to reflect the
emulated state.

When NUMA emulation has successfully setup fake nodes on the system, a new
function, acpi_fake_nodes() is called.  This function determines the proximity
domain (_PXM) for each true node found on the system.  It then finds which
emulated nodes have been allocated on this true node as determined by its
starting address.  The node ID to PXM mapping is changed so that each fake
node ID points to the PXM of the true node that it is located on.

If the machine failed to register a SLIT, then we assume there is no special
requirement for emulated node affinity so we use the default LOCAL_DISTANCE,
which is newly exported to this code, as our measurement if the emulated nodes
appear in the same PXM.  Otherwise, we use REMOTE_DISTANCE.

PXM_INVAL and NID_INVAL are also exported to the ACPI header file so that we
can compare node_to_pxm() results in generic code (in this case, the SRAT
code).

Cc: Andi Kleen <ak@suse.de>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>

---

 arch/x86_64/mm/numa.c     |    1 
 arch/x86_64/mm/srat.c     |   76 ++++++++++++++++++++++++++++++++++++++++++++--
 drivers/acpi/numa.c       |   11 ++++--
 include/acpi/acpi_numa.h  |    1 
 include/asm-x86_64/acpi.h |   11 ++++++
 include/linux/acpi.h      |    3 +
 6 files changed, 96 insertions(+), 7 deletions(-)

Index: linux/arch/x86_64/mm/numa.c
===================================================================
--- linux.orig/arch/x86_64/mm/numa.c
+++ linux/arch/x86_64/mm/numa.c
@@ -484,6 +484,7 @@ out:
 						nodes[i].end >> PAGE_SHIFT);
  		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
 	}
+	acpi_fake_nodes(nodes, num_nodes);
  ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: David Rientjes <rientjes@google.com>

When we are in the emulated NUMA case, we need to make sure that all existing
apicid_to_node mappings that point to real node ID's now point to the
equivalent fake node ID's.

If we simply iterate over all apicid_to_node[] members for each node, we risk
remapping an entry if it shares a node ID with a real node.  Since apicid's
may not be consecutive, we're forced to create an automatic array of
apicid_to_node mappings and then copy it over once we have finished remapping
fake to real nodes.

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>

---

 arch/x86_64/mm/srat.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Index: linux/arch/x86_64/mm/srat.c
===================================================================
--- linux.orig/arch/x86_64/mm/srat.c
+++ linux/arch/x86_64/mm/srat.c
@@ -470,10 +470,13 @@ static int __init find_node_by_addr(unsi
  */
 void __init acpi_fake_nodes(const struct bootnode *fake_nodes, int num_nodes)
 {
-	int i;
+	int i, j;
 	int fake_node_to_pxm_map[MAX_NUMNODES] = {
 		[0 ... MAX_NUMNODES-1] = PXM_INVAL
 	};
+	unsigned char fake_apicid_to_node[MAX_LOCAL_APIC] = {
+		[0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
+	};
 
 	printk(KERN_INFO "Faking PXM affinity for fake nodes on real "
 			 "topology.\n");
@@ -487,9 +490,17 @@ void __init acpi_fake_nodes(const struct
 		if (pxm == PXM_INVAL)
 			continue;
 		fake_node_to_pxm_map[i] = pxm;
+		/*
+		 * For each apicid_to_node mapping that exists for this real
+		 * node, it must now point to the fake node ID.
+		 */
+		for (j = 0; j < MAX_LOCAL_APIC; j++)
+			if (apicid_to_node[j] == nid)
+				fake_apicid_to_node[j] = i;
 	}
 	for (i = 0; i < num_nodes; i++)
 		__acpi_map_pxm_to_node(fake_node_to_pxm_map[i], i);
+	memcpy(apicid_to_node, fake_apicid_to_node, sizeof(apicid_to_node));
 
 ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Aaron Durbin <adurbin@google.com>

Insert the unclaimed MMCONFIG resources into the resource tree without the
IORESOURCE_BUSY flag during late initialization.  This allows the MMCONFIG
regions to be visible in the iomem resource tree without interfering with
other system resources that were discovered during PCI initialization.

[akpm@linux-foundation.org: nanofixes]
Signed-off-by: Aaron Durbin <adurbin@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/i386/pci/mmconfig-shared.c |   48 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 45 insertions(+), 3 deletions(-)

Index: linux/arch/i386/pci/mmconfig-shared.c
===================================================================
--- linux.orig/arch/i386/pci/mmconfig-shared.c
+++ linux/arch/i386/pci/mmconfig-shared.c
@@ -24,6 +24,9 @@
 
 DECLARE_BITMAP(pci_mmcfg_fallback_slots, 32*PCI_MMCFG_MAX_CHECK_BUS);
 
+/* Indicate if the mmcfg resources have been placed into the resource table. */
+static int __initdata pci_mmcfg_resources_inserted;
+
 /* K8 systems have some devices (typically in the builtin northbridge)
    that are only accessible using type1
    Normally this can be expressed in the MCFG by not listing them
@@ -170,7 +173,7 @@ static int __init pci_mmcfg_check_hostbr
 	return name != NULL;
 }
 
-static void __init pci_mmcfg_insert_resources(void)
+static void __init pci_mmcfg_insert_resources(unsigned long resource_flags)
 {
 #define PCI_MMCFG_RESOURCE_NAME_LEN 19
 	int i;
@@ -194,10 +197,13 @@ static void __init pci_mmcfg_insert_reso
 			 cfg->pci_segment);
 		res->start = cfg->address;
 		res->end = res->start + (num_buses << 20) - 1;
-		res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+		res->flags = IORESOURCE_MEM | resource_flags;
 		insert_resource(&iomem_resource, res);
 		names += PCI_MMCFG_RESOURCE_NAME_LEN;
 	}
+
+	/* Mark that the ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Tim Hockin <thockin@google.com>

Background:
 /dev/mcelog is a clear-on-read interface.  It is currently possible for
 multiple users to open and read() the device.  Users are protected from
 each other during any one read, but not across reads.

Description:
 This patch adds support for O_EXCL to /dev/mcelog.  If a user opens the
 device with O_EXCL, no other user may open the device (EBUSY).  Likewise,
 any user that tries to open the device with O_EXCL while another user has
 the device will fail (EBUSY).

Result:
 Applications can get exclusive access to /dev/mcelog.  Applications that
 do not care will be unchanged.

Alternatives:
 A simpler choice would be to only allow one open() at all, regardless of
 O_EXCL.

Testing:
 I wrote an application that opens /dev/mcelog with O_EXCL and observed
 that any other app that tried to open /dev/mcelog would fail until the
 exclusive app had closed the device.

Caveats:
 None.

Signed-off-by: Tim Hockin <thockin@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86_64/kernel/mce.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Index: linux/arch/x86_64/kernel/mce.c
===================================================================
--- linux.orig/arch/x86_64/kernel/mce.c
+++ linux/arch/x86_64/kernel/mce.c
@@ -465,6 +465,40 @@ void __cpuinit mcheck_init(struct cpuinf
  * Character device to read and clear the MCE log.
  */
 
+static DEFINE_SPINLOCK(mce_state_lock);
+static int open_count;	/* #times opened */
+static int open_exclu;	/* already open exclusive? */
+
+static int mce_open(struct inode *inode, struct file *file)
+{
+	spin_lock(&mce_state_lock);
+
+	if (open_exclu || (open_count && (file->f_flags & O_EXCL))) {
+		spin_unlock(&mce_state_lock);
+		return -EBUSY;
+	}
+
+	if (file->f_flags & O_EXCL)
+		open_exclu = ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Tim Hockin <thockin@google.com>

Background:
 /dev/mcelog is typically polled manually.  This is less than optimal for
 situations where accurate accounting of MCEs is important.  Calling
 poll() on /dev/mcelog does not work.

Description:
 This patch adds support for poll() to /dev/mcelog.  This results in
 immediate wakeup of user apps whenever the poller finds MCEs.  Because
 the exception handler can not take any locks, it can not call the wakeup
 itself.  Instead, it uses a thread_info flag (TIF_MCE_NOTIFY) which is
 caught at the next return from interrupt or exit from idle, calling the
 mce_user_notify() routine.  This patch also disables the "fake panic"
 path of the mce_panic(), because it results in printk()s in the exception
 handler and crashy systems.

 This patch also does some small cleanup for essentially unused variables,
 and moves the user notification into the body of the poller, so it is
 only called once per poll, rather than once per CPU.

Result:
 Applications can now poll() on /dev/mcelog.  When an error is logged
 (whether through the poller or through an exception) the applications are
 woken up promptly.  This should not affect any previous behaviors.  If no
 MCEs are being logged, there is no overhead.

Alternatives:
 I considered simply supporting poll() through the poller and not using
 TIF_MCE_NOTIFY at all.  However, the time between an uncorrectable error
 happening and the user application being notified is *the*most* critical
 window for us.  Many uncorrectable errors can be logged to the network if
 given a chance.

 I also considered doing the MCE poll directly from the idle notifier, but
 decided that was overkill.

Testing:
 I used an error-injecting DIMM to create lots of correctable DRAM errors
 and verified that my user app is woken up in sync with the polling interval.
 I also used the northbridge to inject uncorrectable ECC errors, and
 verified (printk() to the rescue) that the notify routine is called and the
 ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Tim Hockin <thockin@google.com>

Background:
 The MCE handler has several paths that it can take, depending on various
 conditions of the MCE status and the value of the 'tolerant' knob.  The
 exact semantics are not well defined and the code is a bit twisty.

Description:
 This patch makes the MCE handler's behavior more clear by documenting the
 behavior for various 'tolerant' levels.  It also fixes or enhances
 several small things in the handler.  Specifically:
     * If RIPV is set it is not safe to restart, so set the 'no way out'
       flag rather than the 'kill it' flag.
     * Don't panic() on correctable MCEs.
     * If the _OVER bit is set *and* the _UC bit is set (meaning possibly
       dropped uncorrected errors), set the 'no way out' flag.
     * Use EIPV for testing whether an app can be killed (SIGBUS) rather
       than RIPV.  According to docs, EIPV indicates that the error is
       related to the IP, while RIPV simply means the IP is valid to
       restart from.
     * Don't clear the MCi_STATUS registers until after the panic() path.
       This leaves the status bits set after the panic() so clever BIOSes
       can find them (and dumb BIOSes can do nothing).

 This patch also calls nonseekable_open() in mce_open (as suggested by akpm).

Result:
 Tolerant levels behave almost identically to how they always have, but
 not it's well defined.  There's a slightly higher chance of panic()ing
 when multiple errors happen (a good thing, IMHO).  If you take an MBE and
 panic(), the error status bits are not cleared.

Alternatives:
 None.

Testing:
 I used software to inject correctable and uncorrectable errors.  With
 tolerant = 3, the system usually survives.  With tolerant = 2, the system
 usually panic()s (PCC) but not always.  With tolerant = 1, the system
 always panic()s.  When the system panic()s, the BIOS is able to detect
 that the cause of death was an MC4.  I was not able to reproduce the
 case of a non-PCC error in userspace, with ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Truxton Fulton <trux@truxton.com>

59f4e7d572980a521b7bdba74ab71b21f5995538 fixed machine rebooting on Truxton's
machine (when no keyboard was present).  But it broke it on Lee's machine.

The patch reinstates the old (pre-59f4e7d572980a521b7bdba74ab71b21f5995538)
code and if that doesn't work out, try the new,
post-59f4e7d572980a521b7bdba74ab71b21f5995538 code instead.

Cc: Lee Garrett <lee-in-berlin@web.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>

---

 include/asm-i386/mach-default/mach_reboot.h |   25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

Index: linux/include/asm-i386/mach-default/mach_reboot.h
===================================================================
--- linux.orig/include/asm-i386/mach-default/mach_reboot.h
+++ linux/include/asm-i386/mach-default/mach_reboot.h
@@ -19,14 +19,37 @@ static inline void kb_wait(void)
 static inline void mach_reboot(void)
 {
 	int i;
+
+	/* old method, works on most machines */
 	for (i = 0; i < 10; i++) {
 		kb_wait();
 		udelay(50);
+		outb(0xfe, 0x64);	/* pulse reset low */
+		udelay(50);
+	}
+
+	/* New method: sets the "System flag" which, when set, indicates
+	 * successful completion of the keyboard controller self-test (Basic
+	 * Assurance Test, BAT).  This is needed for some machines with no
+	 * keyboard plugged in.  This read-modify-write sequence sets only the
+	 * system flag
+	 */
+	for (i = 0; i < 10; i++) {
+		int cmd;
+
+		outb(0x20, 0x64);	/* read Controller Command Byte */
+		udelay(50);
+		kb_wait();
+		udelay(50);
+		cmd = inb(0x60);
+		udelay(50);
+		kb_wait();
+		udelay(50);
 		outb(0x60, 0x64);	/* write Controller Command Byte */
 		udelay(50);
 		kb_wait();
 		udelay(50);
-		outb(0x14, 0x60);	/* set "System flag" */
+		outb(cmd | 0x04, 0x60);	/* set "System flag" */
 		udelay(50);
 		kb_wait();
 		udelay(50);
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Sam Ravnborg <sam@ravnborg.org>

Following section mismatch warnings were reported by Andrey Borzenkov:

WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text:amd_init_mtrr from .text between 'mtrr_bp_init' (at offset 0x967a) and 'mtrr_attrib_to_str'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text:cyrix_init_mtrr from .text between 'mtrr_bp_init' (at offset 0x967f) and 'mtrr_attrib_to_str'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text:centaur_init_mtrr from .text between 'mtrr_bp_init' (at offset 0x9684) and 'mtrr_attrib_to_str'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text: from .text between 'get_mtrr_state' (at offset 0xa735) and 'generic_get_mtrr'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text: from .text between 'get_mtrr_state' (at offset 0xa749) and 'generic_get_mtrr'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text: from .text between 'get_mtrr_state' (at offset 0xa770) and 'generic_get_mtrr'

It was tracked down to a few functions missing __init tag.
Compile tested only.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>

---

 arch/i386/kernel/cpu/mtrr/generic.c |    2 +-
 arch/i386/kernel/cpu/mtrr/main.c    |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/i386/kernel/cpu/mtrr/generic.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/mtrr/generic.c
+++ linux/arch/i386/kernel/cpu/mtrr/generic.c
@@ -79,7 +79,7 @@ static void print_fixed(unsigned base, u
 }
 
 /*  Grab all of the MTRR state for this CPU into *state  */
-void get_mtrr_state(void)
+void __init get_mtrr_state(void)
 {
 	unsigned int i;
 	struct mtrr_var_range *vrs;
Index: ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Nigel Cunningham <nigel@nigel.suspend2.net>

Signed-off-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Andi Kleen <ak@suse.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86_64/kernel/vmlinux.lds.S  |    7 +++++++
 drivers/base/power/trace.c        |    5 ++++-
 include/asm-i386/resume-trace.h   |   13 +++++++++++++
 include/asm-x86_64/resume-trace.h |   13 +++++++++++++
 include/linux/resume-trace.h      |   19 +++++--------------
 kernel/power/Kconfig              |    2 +-
 6 files changed, 43 insertions(+), 16 deletions(-)

Index: linux/arch/x86_64/kernel/vmlinux.lds.S
===================================================================
--- linux.orig/arch/x86_64/kernel/vmlinux.lds.S
+++ linux/arch/x86_64/kernel/vmlinux.lds.S
@@ -52,6 +52,13 @@ SECTIONS
 
   RODATA
 
+  . = ALIGN(4);
+  .tracedata : AT(ADDR(.tracedata) - LOAD_OFFSET) {
+  	__tracedata_start = .;
+	*(.tracedata)
+  	__tracedata_end = .;
+  }
+
   . = ALIGN(PAGE_SIZE);        /* Align data segment to page size boundary */
 				/* Data */
   .data : AT(ADDR(.data) - LOAD_OFFSET) {
Index: linux/drivers/base/power/trace.c
===================================================================
--- linux.orig/drivers/base/power/trace.c
+++ linux/drivers/base/power/trace.c
@@ -142,6 +142,7 @@ void set_trace_device(struct device *dev
 {
 	dev_hash_value = hash_string(DEVSEED, dev->bus_id, DEVHASH);
 }
+EXPORT_SYMBOL(set_trace_device);
 
 /*
  * We could just take the "tracedata" index into the .tracedata
@@ -162,6 +163,7 @@ void generate_resume_trace(void *traceda
 	file_hash_value = hash_string(lineno, file, FILEHASH);
 	set_magic_time(user_hash_value, file_hash_value, dev_hash_value);
 }
+EXPORT_SYMBOL(generate_resume_trace);
 
 extern char ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Alan Stern <stern@rowland.harvard.edu>

This patch (as921) adds code to the show_regs() routine in i386 and x86_64
to print the contents of the debug registers along with all the others.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/i386/kernel/process.c   |   12 ++++++++++++
 arch/x86_64/kernel/process.c |   10 ++++++++++
 2 files changed, 22 insertions(+)

Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -300,6 +300,7 @@ early_param("idle", idle_setup);
 void show_regs(struct pt_regs * regs)
 {
 	unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L;
+	unsigned long d0, d1, d2, d3, d6, d7;
 
 	printk("\n");
 	printk("Pid: %d, comm: %20s\n", current->pid, current->comm);
@@ -324,6 +325,17 @@ void show_regs(struct pt_regs * regs)
 	cr3 = read_cr3();
 	cr4 = read_cr4_safe();
 	printk("CR0: %08lx CR2: %08lx CR3: %08lx CR4: %08lx\n", cr0, cr2, cr3, cr4);
+
+	get_debugreg(d0, 0);
+	get_debugreg(d1, 1);
+	get_debugreg(d2, 2);
+	get_debugreg(d3, 3);
+	printk("DR0: %08lx DR1: %08lx DR2: %08lx DR3: %08lx\n",
+			d0, d1, d2, d3);
+	get_debugreg(d6, 6);
+	get_debugreg(d7, 7);
+	printk("DR6: %08lx DR7: %08lx\n", d6, d7);
+
 	show_trace(NULL, regs, &regs->esp);
 }
 
Index: linux/arch/x86_64/kernel/process.c
===================================================================
--- linux.orig/arch/x86_64/kernel/process.c
+++ linux/arch/x86_64/kernel/process.c
@@ -306,6 +306,7 @@ early_param("idle", idle_setup);
 void __show_regs(struct pt_regs * regs)
 {
 	unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L, fs, gs, shadowgs;
+	unsigned long d0, d1, d2, d3, d6, d7;
 	unsigned int fsindex,gsindex;
 	unsigned int ds,cs,es; 
 
@@ ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Andrew Morton <akpm@linux-foundation.org>

Prevent stuff like this:

mm/vmalloc.c: In function 'unmap_kernel_range':
mm/vmalloc.c:75: warning: unused variable 'start'

Cc: Andi Kleen <ak@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>

---

 include/asm-i386/tlbflush.h |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux/include/asm-i386/tlbflush.h
===================================================================
--- linux.orig/include/asm-i386/tlbflush.h
+++ linux/include/asm-i386/tlbflush.h
@@ -160,7 +160,11 @@ DECLARE_PER_CPU(struct tlb_state, cpu_tl
 	native_flush_tlb_others(&mask, mm, va)
 #endif
 
-#define flush_tlb_kernel_range(start, end) flush_tlb_all()
+static inline void flush_tlb_kernel_range(unsigned long start,
+					unsigned long end)
+{
+	flush_tlb_all();
+}
 
 static inline void flush_tlb_pgtables(struct mm_struct *mm,
 				      unsigned long start, unsigned long end)
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Venki Pallipadi <venkatesh.pallipadi@intel.com>

This helps to reduce the frequency at which the CPU must be taken out of a
lower-power state.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Tim Hockin <thockin@hockin.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/i386/kernel/cpu/mcheck/non-fatal.c |    4 ++--
 arch/x86_64/kernel/mce.c                |    9 ++++++---
 2 files changed, 8 insertions(+), 5 deletions(-)

Index: linux/arch/i386/kernel/cpu/mcheck/non-fatal.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/mcheck/non-fatal.c
+++ linux/arch/i386/kernel/cpu/mcheck/non-fatal.c
@@ -57,7 +57,7 @@ static DECLARE_DELAYED_WORK(mce_work, mc
 static void mce_work_fn(struct work_struct *work)
 { 
 	on_each_cpu(mce_checkregs, NULL, 1, 1);
-	schedule_delayed_work(&mce_work, MCE_RATE);
+	schedule_delayed_work(&mce_work, round_jiffies_relative(MCE_RATE));
 } 
 
 static int __init init_nonfatal_mce_checker(void)
@@ -82,7 +82,7 @@ static int __init init_nonfatal_mce_chec
 	/*
 	 * Check for non-fatal errors every MCE_RATE s
 	 */
-	schedule_delayed_work(&mce_work, MCE_RATE);
+	schedule_delayed_work(&mce_work, round_jiffies_relative(MCE_RATE));
 	printk(KERN_INFO "Machine check exception polling timer started.\n");
 	return 0;
 }
Index: linux/arch/x86_64/kernel/mce.c
===================================================================
--- linux.orig/arch/x86_64/kernel/mce.c
+++ linux/arch/x86_64/kernel/mce.c
@@ -375,7 +375,8 @@ static void mcheck_timer(struct work_str
 	if (mce_notify_user()) {
 		next_interval = max(next_interval/2, HZ/100);
 	} else {
-		next_interval = min(next_interval*2, check_interval*HZ);
+		next_interval = min(next_interval*2,
+				(int)round_jiffies_relative(check_interval*HZ));
 	}
 
 	schedule_delayed_work(&mcheck_work, next_interval);
@@ -428,7 +429,8 @@ ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Eric W. Biederman <ebiederm@xmission.com>

On x86_64 kernel, level triggered irq migration gets initiated in the
context of that interrupt(after executing the irq handler) and following
steps are followed to do the irq migration.

1. mask IOAPIC RTE entry;     // write to IOAPIC RTE
2. EOI;                       // processor EOI write
3. reprogram IOAPIC RTE entry // write to IOAPIC RTE with new destination and
                              // and interrupt vector due to per cpu vector
                              // allocation.
4. unmask IOAPIC RTE entry;   // write to IOAPIC RTE

Because of the per cpu vector allocation in x86_64 kernels, when the irq
migrates to a different cpu, new vector(corresponding to the new cpu) will
get allocated.

An EOI write to local APIC has a side effect of generating an EOI write for
level trigger interrupts (normally this is a broadcast to all IOAPICs). 
The EOI broadcast generated as a side effect of EOI write to processor may
be delayed while the other IOAPIC writes (step 3 and 4) can go through.

Normally, the EOI generated by local APIC for level trigger interrupt
contains vector number.  The IOAPIC will take this vector number and search
the IOAPIC RTE entries for an entry with matching vector number and clear
the remote IRR bit (indicate EOI).  However, if the vector number is
changed (as in step 3) the IOAPIC will not find the RTE entry when the EOI
is received later.  This will cause the remote IRR to get stuck causing the
interrupt hang (no more interrupt from this RTE).

Current x86_64 kernel assumes that remote IRR bit is cleared by the time
IOAPIC RTE is reprogrammed.  Fix this assumption by checking for remote IRR
bit and if it still set, delay the irq migration to the next interrupt
arrival event(hopefully, next time remote IRR bit will get cleared before
the IOAPIC RTE is reprogrammed).

Initial analysis and patch from Nanhai.

Clean up patch from Suresh.

Rewritten to be less intrusive, and to contain a big fat ...
From: Andi Kleen
Date: Thursday, July 19, 2007 - 2:55 am

From: Adrian Bunk <bunk@stusta.de>

The Rise CPUs were only very short-lived, and there are no reports of
anyone both owning one and running Linux on it.

Googling for the printk string "CPU: Rise iDragon" didn't find any dmesg
available online.

If it turns out that against all expectations there are actually users
reverting this patch would be easy.

This patch will make the kernel images smaller by a few bytes for all
i386 users.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Dave Jones <davej@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/i386/kernel/cpu/Makefile  |    1 
 arch/i386/kernel/cpu/common.c  |    2 -
 arch/i386/kernel/cpu/rise.c    |   52 -----------------------------------------
 include/asm-i386/processor.h   |    1 
 include/asm-x86_64/processor.h |    1 
 5 files changed, 57 deletions(-)

Index: linux/arch/i386/kernel/cpu/Makefile
===================================================================
--- linux.orig/arch/i386/kernel/cpu/Makefile
+++ linux/arch/i386/kernel/cpu/Makefile
@@ -9,7 +9,6 @@ obj-y	+=	cyrix.o
 obj-y	+=	centaur.o
 obj-y	+=	transmeta.o
 obj-y	+=	intel.o intel_cacheinfo.o addon_cpuid_features.o
-obj-y	+=	rise.o
 obj-y	+=	nexgen.o
 obj-y	+=	umc.o
 
Index: linux/arch/i386/kernel/cpu/common.c
===================================================================
--- linux.orig/arch/i386/kernel/cpu/common.c
+++ linux/arch/i386/kernel/cpu/common.c
@@ -606,7 +606,6 @@ extern int nsc_init_cpu(void);
 extern int amd_init_cpu(void);
 extern int centaur_init_cpu(void);
 extern int transmeta_init_cpu(void);
-extern int rise_init_cpu(void);
 extern int nexgen_init_cpu(void);
 extern int umc_init_cpu(void);
 
@@ -618,7 +617,6 @@ void __init early_cpu_init(void)
 	amd_init_cpu();
 	centaur_init_cpu();
 	transmeta_init_cpu();
-	rise_init_cpu();
 	nexgen_init_cpu();
 	umc_init_cpu();
 	early_cpu_detect();
Index: ...
From: Alan Cox
Date: Thursday, July 19, 2007 - 3:45 am

Why bother. Its a tiny tiny amount of code and it requires no maintenance
so it achieves nothing by leaving it alone and risks (slight I admit)
breaking someones box.

Alan

-

From: Adrian Bunk
Date: Thursday, July 19, 2007 - 3:48 am

- It's not only code, it also bloats everyone's kernel image.
- All it did was to fiddle with capabilities - if any computer with
  a Rise cpu running Linux actually exists it should still work.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

-

From: Alan Cox
Date: Thursday, July 19, 2007 - 4:13 am

> - It's not only code, it also bloats everyone's kernel image.

Its a miniscule piece of code that is discarded on boot. Yes it might
make the image 100 bytes longer, but have you priced a 160GB disk
recently. I don't think 100 bytes of disk and 0 of memory really is worth
saving for any risk at all. Its not even worth the time to apply the

But you've no idea if this is true. Probably some of our other drivers
are bigger and have less users.

Alan
-

From: Andi Kleen
Date: Thursday, July 19, 2007 - 5:03 am

The patch is already applied.

Besides the CPU will likely boot even without special handling.

-Andi
-

From: Jeff Garzik
Date: Thursday, July 19, 2007 - 7:56 am

You don't know this.  Why risk it?  Just leave the CPU magic as-is.

	Jeff


-

Previous thread: [PATCH][RFC] Remove R/W semaphore content from generic semaphore.h headers. by Robert P. J. Day on Thursday, July 19, 2007 - 2:37 am. (3 messages)

Next thread: [ANNOUNCE] RSBAC 1.3.5 released by Amon Ott on Thursday, July 19, 2007 - 2:37 am. (1 message)