Re: [PATCH 1/2] MN10300: Move asm-arm/cnt32_to_63.h to include/linux/

Previous thread: [patch 0/3] [x86 -tip/master] apic cleanup by Cyrill Gorcunov on Wednesday, September 24, 2008 - 9:46 am. (2 messages)

Next thread: ACPI errors in 2.6.26.5 and 2.6.27-rc by Paul Eddington on Wednesday, September 24, 2008 - 9:55 am. (1 message)
From: David Howells
Date: Wednesday, September 24, 2008 - 9:48 am

Move asm-arm/cnt32_to_63.h to include/linux/ so that MN10300 can make use of it
too.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/arm/mach-pxa/time.c       |    2 +
 arch/arm/mach-sa1100/generic.c |    2 +
 arch/arm/mach-versatile/core.c |    2 +
 include/linux/cnt32_to_63.h    |   80 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 83 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/cnt32_to_63.h


diff --git a/arch/arm/mach-pxa/time.c b/arch/arm/mach-pxa/time.c
index 67e1850..b0d6b32 100644
--- a/arch/arm/mach-pxa/time.c
+++ b/arch/arm/mach-pxa/time.c
@@ -17,9 +17,9 @@
 #include <linux/interrupt.h>
 #include <linux/clockchips.h>
 #include <linux/sched.h>
+#include <linux/cnt32_to_63.h>
 
 #include <asm/div64.h>
-#include <asm/cnt32_to_63.h>
 #include <asm/mach/irq.h>
 #include <asm/mach/time.h>
 #include <mach/pxa-regs.h>
diff --git a/arch/arm/mach-sa1100/generic.c b/arch/arm/mach-sa1100/generic.c
index 1362994..b422526 100644
--- a/arch/arm/mach-sa1100/generic.c
+++ b/arch/arm/mach-sa1100/generic.c
@@ -18,9 +18,9 @@
 #include <linux/ioport.h>
 #include <linux/sched.h>	/* just for sched_clock() - funny that */
 #include <linux/platform_device.h>
+#include <linux/cnt32_to_63.h>
 
 #include <asm/div64.h>
-#include <asm/cnt32_to_63.h>
 #include <mach/hardware.h>
 #include <asm/system.h>
 #include <asm/pgtable.h>
diff --git a/arch/arm/mach-versatile/core.c b/arch/arm/mach-versatile/core.c
index d75e795..b638f10 100644
--- a/arch/arm/mach-versatile/core.c
+++ b/arch/arm/mach-versatile/core.c
@@ -28,8 +28,8 @@
 #include <linux/amba/clcd.h>
 #include <linux/clocksource.h>
 #include <linux/clockchips.h>
+#include <linux/cnt32_to_63.h>
 
-#include <asm/cnt32_to_63.h>
 #include <asm/system.h>
 #include <mach/hardware.h>
 #include <asm/io.h>
diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
new file mode 100644
index 0000000..8c0f950
--- /dev/null
+++ b/include/linux/cnt32_to_63.h
@@ -0,0 ...
From: David Howells
Date: Wednesday, September 24, 2008 - 9:48 am

Make sched_clock() report time since boot rather than time since last timer
interrupt.

Make sched_clock() expand and scale the 32-bit TSC value running at IOCLK
speed (~33MHz) to a 64-bit nanosecond counter, using cnt32_to_63() acquired
from the ARM arch and without using slow DIVU instructions every call.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/mn10300/kernel/time.c |   52 +++++++++++++++++++++++++++++++++++---------
 1 files changed, 41 insertions(+), 11 deletions(-)


diff --git a/arch/mn10300/kernel/time.c b/arch/mn10300/kernel/time.c
index babb7c2..e460658 100644
--- a/arch/mn10300/kernel/time.c
+++ b/arch/mn10300/kernel/time.c
@@ -1,6 +1,6 @@
 /* MN10300 Low level time management
  *
- * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Copyright (C) 2007-2008 Red Hat, Inc. All Rights Reserved.
  * Written by David Howells (dhowells@redhat.com)
  * - Derived from arch/i386/kernel/time.c
  *
@@ -16,6 +16,7 @@
 #include <linux/init.h>
 #include <linux/smp.h>
 #include <linux/profile.h>
+#include <linux/cnt32_to_63.h>
 #include <asm/irq.h>
 #include <asm/div64.h>
 #include <asm/processor.h>
@@ -40,27 +41,54 @@ static struct irqaction timer_irq = {
 	.name		= "timer",
 };
 
+static unsigned long sched_clock_multiplier;
+
 /*
  * scheduler clock - returns current time in nanosec units.
  */
 unsigned long long sched_clock(void)
 {
 	union {
-		unsigned long long l;
-		u32 w[2];
-	} quot;
+		unsigned long long ll;
+		unsigned l[2];
+	} tsc64, result;
+	unsigned long tsc, tmp;
+	unsigned product[3]; /* 96-bit intermediate value */
+
+	/* read the TSC value
+	 */
+	tsc = 0 - get_cycles(); /* get_cycles() counts down */
 
-	quot.w[0] = mn10300_last_tsc - get_cycles();
-	quot.w[1] = 1000000000;
+	/* expand to 64-bits.
+	 * - sched_clock() must be called once a minute or better or the
+	 *   following will go horribly wrong - see cnt32_to_63()
+	 */
+	tsc64.ll = cnt32_to_63(tsc) & 0x7fffffffffffffffULL;
 
-	asm("mulu ...
From: Peter Zijlstra
Date: Friday, September 26, 2008 - 3:58 am

Right, so since you have to up-scale the 63 bit counter obtained using
the smarty pants cnt32_to_63 code, you actually end up with a genuine
64bit counter.. sweet.

I'm not going to pretend to understand the NM10300 details, but the
generic idea makes sense.


--

From: Peter Zijlstra
Date: Friday, September 26, 2008 - 3:56 am

That code is way to smart :-)

Better make sure that non of its users are SMP capable though.


--

From: Nicolas Pitre
Date: Friday, September 26, 2008 - 5:03 am

Why would that matter?


Nicolas
--

From: Peter Zijlstra
Date: Friday, September 26, 2008 - 5:08 am

It has a local static variable and no synchronization. Also, if the
counters differ per cpu the value of __m_cnt_hi ought to be different
per cpu.



--

From: Nicolas Pitre
Date: Sunday, September 28, 2008 - 6:21 pm

[Empty message]
From: Peter Zijlstra
Date: Friday, September 26, 2008 - 5:20 am

#define cnt32_to_63(cnt_lo)
({
	static DEFINE_PER_CPU(u32, __m_cnt_hi);

       union cnt32_to_63 __x;
	u32 *__m_cnt_hi_ptr = &get_cpu_var(__m_cnt_hi);

	__x.hi = *__m_cnt_hi_ptr;
	__x.lo = (cnt_lo);

	if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
		*__m_cnt_hi_ptr = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);

	put_cpu_var(__m_cnt_hi);

	__x.val;
})

And I'm not quite sure why you had volatile in there.. but that should
be replaced by a barrier().

Also, we could probably use __get_cpu_var() and put BUG_ON(!in_atomic())
in there since if this stuff is per cpu, it'd better be atomic anyway -
you don't want to read a different cpu's counter than the one you're
using to upgrade.

--

From: Nicolas Pitre
Date: Sunday, September 28, 2008 - 6:42 pm

Nope.  That disables preemption which is against the purpose of this 

That's to ensure that __x.hi is always (re)read before __x.lo, and not 
factorized out of a possible outer loop if gcc felt like doing such an 
optimization.  However I didn't want other unrelated variables to be 
affected by an undiscriminating barrier().  But that's a weak argument 
as I suspect in most case inserting a barrier() between the 2 reads will 

But again, the original code had no such restriction.  It works with 
mixed contexts.

If the base counter is per CPU, then the static variable has to be per 
CPU of course.  But if the base counter is global then the static 
variable should be shared by all CPUs unless you can guarantee that all 
CPUs will always call this at least once per half period of the base 
counter.


Nicolas
--

From: David Howells
Date: Wednesday, October 1, 2008 - 7:34 am

I think you may be wrong on that.  MEI came up with the following point:

	I think either disable preemption or disable interrupt is really
	necessary for cnt32_to_63 macro, because there seems to be assumption
	that the series of the code (1)-(4) must be executed within a half
	period of the 32-bit counter.

	-------------------------------------------------
	#define cnt32_to_63(cnt_lo) 
	({ 
	       static volatile u32 __m_cnt_hi = 0; 
	       cnt32_to_63_t __x; 
	(1)    __x.hi = __m_cnt_hi; 
	(2)    __x.lo = (cnt_lo); 
	(3)    if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
	(4)            __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); 
	       __x.val; 
	})
	-------------------------------------------------

	If a task is preempted while executing the series of the code and
	scheduled again after the half period of the 32-bit counter, the task
	may destroy __m_cnt_hi.

Their suggested remedy is:

	So I think it's better to disable interrupt the cnt32_to_63 and to
	ensure that the series of the code are executed within a short period.

I think this is excessive...  If we're sat there with interrupts disabled for
more than a half period (65s) then we've got other troubles.  I think
disabling preemption for the duration ought to be enough.  What do you think?

Now, I'm happy to put these in sched_clock() rather then cnt32_to_63() for my
purposes (see attached patch).

David
---
MN10300: Prevent cnt32_to_63() from being preempted in sched_clock()

From: David Howells <dhowells@redhat.com>

Prevent cnt32_to_63() from being preempted in sched_clock() because it may
read its internal counter, get preempted, get delayed for more than the half
period of the 'TSC' and then write the internal counter, thus corrupting it.

Whilst some callers of sched_clock() have interrupts disabled or hold
spinlocks, not all do, and so preemption must be held here.

Note that sched_clock() is called from lockdep, but that shouldn't be a problem
because although ...
From: Nicolas Pitre
Date: Wednesday, October 1, 2008 - 8:36 am

This is still wrong.  There is no need for any kind of locking what so 
ever, period.  That's the _point_ of the whole algorithm.  The only 
constraint is that the code has to be executed at least once per half 
period of the 32 bit counter, which is typically once every couple 

Oh, that's a possibility.  In that case __m_cnt_hi will be reverted to a 
previous state just like if cnt32_to_63() has not been called yet since 

I think this is excessive too.  If you're preempted away for that long 
you still have a problem.  And if that's a real concern because you run 
RT and have tasks frequently blocked for that long then don't use this 
interface for critical counters for which absolute correctness is 
essential, which is not the case for sched_clock() anyway.  A 
sched_clock() implementation is meant to be as lightweight as possible 
even if it lacks in absolute precision, and the worst that could happen 
if you actually manage to cause a glitch in the returned value from 
sched_clock() is possibly the scheduling of the wrong task in a non RT 
scheduler, and we determined that a RT scheduler is needed to cause this 

If you still feel paranoid about this then I can't stop you from adding 
extra locking in your own code.  On machines I've created cnt32_to_63() 
for, the critical half period delay can be like 9 minutes and worrying 
about races that could last that long is really about ignoring a much 
worse problem.


Nicolas
--

From: David Howells
Date: Thursday, October 2, 2008 - 3:23 pm

Surely that's not good enough.  There's a 50% chance that reversion to a
previous state will result in an extra increment of the __m_cnt_hi on the next
call.

David
--

From: Nicolas Pitre
Date: Friday, October 3, 2008 - 12:15 pm

If so then you're using this interface in an inappropriate way.

The _absolute_ minimum frequency with which this should be fully 
executed is once per half period of the base counter.  I hope that in 
practice it happens far more often than that.


Nicolas
--

From: David Howells
Date: Monday, October 6, 2008 - 3:44 am

I think you're misunderstanding my contention.

If preemption is enabled, cnt32_to_63() can be called with greater than
minimum frequency and yet reversions can still happen.

The problem is that a process that's half way through executing cnt32_to_63()
can be preempted for a period of time sufficient that when it is rescheduled
and writes __m_cnt_hi, it corrupts it with an out of date value.

David

--

From: Nicolas Pitre
Date: Monday, October 6, 2008 - 8:06 am

I think you're misunderstanding my objection.

if a process that's half way through executing cnt32_to_63()
is preempted for a period of time in the same ball park as the 
half period of the base counter then my point is that you have 
a much bigger problem.

For __m_cnt_hi to be written with an out-of-date value, you'd need 
this sequence of events:

(1)	__x.hi = __m_cnt_hi;
(2)	__x.lo = (cnt_lo);
(3)	if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
(4)		__m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);

cnt_lo		cnt_hi		instance A	instance B
---------------------------------------------------------------------
0x7fffffff	0x00000000	...
0x80000000	0x00000000	...
[the code isn't called for a _long_ time during which __m_cnt_hi
 is not in synch with the low part]
0xff000000	0x00000000	(1) __x.hi = 0
0xff000001	0x00000000	(2) __x.lo = 0xff000001
0xff000002	0x00000000	(3) condition is true
0xff000003	0x00000000	*** preemption ***
0xff000004	0x00000000			...
[a little while later]
0xffff0000	0x00000000			(1) __x.hi = 0
0xffff0001	0x00000000			(2) __x.lo = 0xffff0001
0xffff0002	0x00000000			(3) condition is true
0xffff0003	0x00000000			(4) update __m_cnt_hi
0xffff0004	0x80000000			...
[a little while later]
0xffffffff	0x80000000			...
0x00000000	0x80000000			...
0x00000001	0x80000000			(1) __x.hi = 0x80000000
0x00000002	0x80000000			(2) __x.lo = 0x00000002
0x00000003	0x80000000			(3) condition is true
0x00000004	0x80000000			(4) update __m_cnt_hi
0x00000005	0x00000001			...
0x00000006	0x00000001			*** preemption ***
0x00000007	0x00000001	(4) update update __m_cnt_hi
0x00000008	0x80000000	...

At this point, it is true that __m_cnt_hi is not up to date while it was 
a moment before.  But that has no consequence because it was reverted 
to the same state before the update at 0x00000004, meaning that the 
update will be redone as soon as the code is executed again.

Sure this could be defeated if 1) the execution in instance B didn't 
come soon enough to notice ...
From: David Howells
Date: Friday, September 26, 2008 - 6:27 am

Ah...  I did remove the old one... it's just that someone also moved the old
one to arch/arm/include and GIT's merge tool didn't notice.

David
--

Previous thread: [patch 0/3] [x86 -tip/master] apic cleanup by Cyrill Gorcunov on Wednesday, September 24, 2008 - 9:46 am. (2 messages)

Next thread: ACPI errors in 2.6.26.5 and 2.6.27-rc by Paul Eddington on Wednesday, September 24, 2008 - 9:55 am. (1 message)