Re: [PATCH 1/1] SGI X86 UV: Provide a System Activity Indicator driver

Previous thread: Thinkpad T60p does not recover from Suspend when docked. by Chris Rankin on Friday, September 19, 2008 - 7:30 am. (1 message)

Next thread: [PATCH 0/1] SGI X86 UV: Provide a System Activity Indicator driver by Mike Travis on Friday, September 19, 2008 - 7:37 am. (1 message)
From: Mike Travis
Date: Friday, September 19, 2008 - 7:37 am

The SGI UV system has no LEDS but uses one of the system controller
regs to indicate the online internal state of the cpu.  There is a
heartbeat bit indicating that the cpu is responding to interrupts,
and an idle bit indicating whether the cpu has been more or less than
50% idle each heartbeat period.  The current period is one second.

When a cpu panics, an error code is written by BIOS to this same reg.

So the reg has been renamed the "System Controller Interface Reg".

This patchset provides the following:

  * x86_64: Add base functionality for writing to the specific SCIR's
    for each cpu.

  * idle: Add an idle callback to measure the idle "on" and "off" times.

  * heartbeat: Invert "heartbeat" bit to indicate the cpu is "active".

  * if hotplug enabled, all bits are set (0xff) when the cpu is disabled.

Based on linux-2.6.tip/master.

Signed-off-by: Mike Travis <travis@sgi.com>
---
 arch/x86/kernel/genx2apic_uv_x.c |  138 +++++++++++++++++++++++++++++++++++++++
 include/asm-x86/uv/uv_hub.h      |   62 +++++++++++++++++
 2 files changed, 200 insertions(+)

--- linux-2.6.tip.orig/arch/x86/kernel/genx2apic_uv_x.c
+++ linux-2.6.tip/arch/x86/kernel/genx2apic_uv_x.c
@@ -10,6 +10,7 @@
 
 #include <linux/kernel.h>
 #include <linux/threads.h>
+#include <linux/cpu.h>
 #include <linux/cpumask.h>
 #include <linux/string.h>
 #include <linux/ctype.h>
@@ -18,6 +19,8 @@
 #include <linux/bootmem.h>
 #include <linux/module.h>
 #include <linux/hardirq.h>
+#include <linux/timer.h>
+#include <asm/idle.h>
 #include <asm/smp.h>
 #include <asm/ipi.h>
 #include <asm/genapic.h>
@@ -358,6 +361,139 @@ static __init void uv_rtc_init(void)
 		sn_rtc_cycles_per_second = ticks_per_sec;
 }
 
+/*
+ * percpu heartbeat timer
+ */
+static void uv_heartbeat(unsigned long ignored)
+{
+	struct timer_list *timer = &uv_hub_info->scir.timer;
+	unsigned char bits = uv_hub_info->scir.state;
+
+	/* flip heartbeat bit */
+	bits ^= SCIR_CPU_HEARTBEAT;
+
+	/* determine if we were ...
From: Ingo Molnar
Date: Monday, September 22, 2008 - 4:48 am

no, we really dont want such overhead in an idle notifier. (those 
interfaces will go away)

As i suggested in my previous mail about this topic, a low-frequency 
sampling method should be used instead, to indicate system status. I 
thought the leds drivers have all that in place already.

	Ingo
--

From: Mike Travis
Date: Monday, September 22, 2008 - 7:42 am

Hi Ingo,

The overhead is very low and I've not been able to figure out how to 
"estimate" per cpu usage without tons more code (duplicating "top" in
the kernel.)  The actual I/O (well O ;-) is once per second.  The other
important point is this is only on an SGI UV system, so no other systems
are affected.  Because of this, I'm a bit confused by your objection.

I tried a couple of other approaches, but because the timer wakeup
causes the idle state to be exited, it's difficult to determine if
the cpu was idle before the timer interrupt.  (I even tried putting a
"wasidle" in the pda to check the idle state prior to the last exit
from idle, but this did not appear to be reliable.)  The idle callback
solves this rather nicely.

Here's the same code that's been in ia64 for quite a while:

arch/ia64/sn/kernel/setup.c:
	void __init sn_setup(char **cmdline_p)
	{
		...
		ia64_mark_idle = &snidle;

arch/ia64/kernel/process.c:
	cpu_idle (void)
	{
        	void (*mark_idle)(int) = ia64_mark_idle;
		...

                if (mark_idle)
                        (*mark_idle)(1);

(The x86 callback approach is much cleaner.)

Another relevant point is that I will be adding a bit more functionality
to disable the timer interrupt on truly "idle" cpus (like have been idle
for some amount of seconds).  We would then use the "exit from idle"
callback to reestablish the timer interrupt.  [This would allow them to

It is low frequency (once per second), this is just setting what's to
be sampled.

As I mentioned, this is not for LED displays (human readable), it's for the
system controller to monitor how all parts of the system are running, and
this one is just the cpu parts.  The LED driver approach would have me
registering 4096 led devices, with all their callbacks, 4096 strings saying
"LED0001", etc., and I still cannot associate a specific register bit
(AKA LED if that's what it was), with a specific cpu using the LED driver.

The LED driver is fine for a couple of blinking ...
From: Pavel Machek
Date: Sunday, September 28, 2008 - 12:42 pm

So overhead from led driver is not okay, while overhead from messing
with idle loop is okay? Interesting...
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Mike Travis
Date: Wednesday, October 1, 2008 - 11:15 am

Thanks, I did look at it.  Quite complex.  Maybe I'm missing something
but I don't see how it fits in?  Are you saying I should be using data
in the percpu tick_sched to gather the idle information for the once
per second per cpu status update interrupt?  I see the @idle_active
entry but wouldn't this always be false during the timer interrupt?
Using any other entries would appear to be more complex than a simple
store byte and subtracting two longs.

Or perhaps I should somehow hook into the sched_timer interrupt instead
of having a separate once per second per cpu interrupt?  (Does this

The overhead is mainly the registration of descriptor blocks for the
4096 registers representing the 4096 cpus all at separate addresses.
The overhead in this patch for maintaining the "idle" state (prior to the
timer interrupt causing "exit_idle") is storing a byte and subtracting the
current jiffies from the jiffies at the last one second timer interrupt.
(Even this subtraction can be removed, the only *important* item is
whether the cpu is currently idle or not.)

This data is written to node local memory that's highly likely to be in
the cache, as the same memory block is used for all UV hub operations.

Unfortunately, I am experiencing a simulator problem at the moment or
I'd be able to quantify the exact amount of time added to the exit_idle()
function, but it's basically noise in the overall resumption of a thread.

One other factor, this overhead is *only* for UV systems, no other x86_64
systems or architectures are affected, so again I'm not understanding the
objection.  This request was made from our hardware and RAS engineers,
and is identical to what's been in the ia64 kernel for a few years now.

Perhaps the confusion is it's near relationship to real "LED" lights?
The original name "LED" is historical.  The bits are read by a system
controller that has the job of monitoring the entire system, including
both soft and hard errors and determining faulty [or near faulty]
system ...
From: Mike Travis
Date: Wednesday, October 1, 2008 - 12:41 pm

Actually, this comparing idle time vs. not idle time during the last
second is what gets around the problem that the system goes to not
idle servicing the timer interrupt, which hides the real idle state.
If anyone has a suggestion on how to get a once per second per cpu
timer callback which does not call exit_idle, (or any other means of
indicating whether the cpu is idle), I'd be more than happy to remove
the idle callback function.

In discussions with SGI's RAS engineering it's felt that this status
is very important for their current RAS analysis programs, making the
system overhead for UV more than worthwhile.

Thanks,

--

From: Pavel Machek
Date: Thursday, October 2, 2008 - 1:37 am

I seen your remark above about disabling timer interrupt on idle
cpus. That's exactly nohz functionality, right? Maybe I misunderstood
you and you meant "my monitoring interrupt" and not "generic system
timer interrupt"?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Mike Travis
Date: Thursday, October 2, 2008 - 7:33 am

Ahh, yes, now I see... ;-)

And thanks for the pointer, I hadn't looked at this before.

In regards to "powering down" a UV system, it's not clear yet that it will
help much unless we can power down a whole lot more of the system than just
the cpus... ;-)  [but in reality, reducing power use any time you can is 
real necessity.]

Btw, are you ok with the remainder of the patch?

Thanks,
Mike
--

From: Mike Travis
Date: Monday, September 22, 2008 - 7:47 am

This keeps it in the uv_hub percpu area, a chunk of memory that is accessed
quite often for many of the basic UV operations.  This should increase the
probability that the data is already located in the cpu's cache.

Thanks,
Mike

--

Previous thread: Thinkpad T60p does not recover from Suspend when docked. by Chris Rankin on Friday, September 19, 2008 - 7:30 am. (1 message)

Next thread: [PATCH 0/1] SGI X86 UV: Provide a System Activity Indicator driver by Mike Travis on Friday, September 19, 2008 - 7:37 am. (1 message)