The SGI UV system has no LEDS but uses one of the system controller
regs to indicate the online internal state of the cpu. There is a
heartbeat bit indicating that the cpu is responding to interrupts,
and an idle bit indicating whether the cpu has been more or less than
50% idle each heartbeat period. The current period is one second.
When a cpu panics, an error code is written by BIOS to this same reg.
So the reg has been renamed the "System Controller Interface Reg".
This patchset provides the following:
* x86_64: Add base functionality for writing to the specific SCIR's
for each cpu.
* idle: Add an idle callback to measure the idle "on" and "off" times.
* heartbeat: Invert "heartbeat" bit to indicate the cpu is "active".
* if hotplug enabled, all bits are set (0xff) when the cpu is disabled.
Based on linux-2.6.tip/master.
Signed-off-by: Mike Travis <travis@sgi.com>
---
arch/x86/kernel/genx2apic_uv_x.c | 138 +++++++++++++++++++++++++++++++++++++++
include/asm-x86/uv/uv_hub.h | 62 +++++++++++++++++
2 files changed, 200 insertions(+)
--- linux-2.6.tip.orig/arch/x86/kernel/genx2apic_uv_x.c
+++ linux-2.6.tip/arch/x86/kernel/genx2apic_uv_x.c
@@ -10,6 +10,7 @@
#include <linux/kernel.h>
#include <linux/threads.h>
+#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/string.h>
#include <linux/ctype.h>
@@ -18,6 +19,8 @@
#include <linux/bootmem.h>
#include <linux/module.h>
#include <linux/hardirq.h>
+#include <linux/timer.h>
+#include <asm/idle.h>
#include <asm/smp.h>
#include <asm/ipi.h>
#include <asm/genapic.h>
@@ -358,6 +361,139 @@ static __init void uv_rtc_init(void)
sn_rtc_cycles_per_second = ticks_per_sec;
}
+/*
+ * percpu heartbeat timer
+ */
+static void uv_heartbeat(unsigned long ignored)
+{
+ struct timer_list *timer = &uv_hub_info->scir.timer;
+ unsigned char bits = uv_hub_info->scir.state;
+
+ /* flip heartbeat bit */
+ bits ^= SCIR_CPU_HEARTBEAT;
+
+ /* determine if we were ...no, we really dont want such overhead in an idle notifier. (those interfaces will go away) As i suggested in my previous mail about this topic, a low-frequency sampling method should be used instead, to indicate system status. I thought the leds drivers have all that in place already. Ingo --
Hi Ingo,
The overhead is very low and I've not been able to figure out how to
"estimate" per cpu usage without tons more code (duplicating "top" in
the kernel.) The actual I/O (well O ;-) is once per second. The other
important point is this is only on an SGI UV system, so no other systems
are affected. Because of this, I'm a bit confused by your objection.
I tried a couple of other approaches, but because the timer wakeup
causes the idle state to be exited, it's difficult to determine if
the cpu was idle before the timer interrupt. (I even tried putting a
"wasidle" in the pda to check the idle state prior to the last exit
from idle, but this did not appear to be reliable.) The idle callback
solves this rather nicely.
Here's the same code that's been in ia64 for quite a while:
arch/ia64/sn/kernel/setup.c:
void __init sn_setup(char **cmdline_p)
{
...
ia64_mark_idle = &snidle;
arch/ia64/kernel/process.c:
cpu_idle (void)
{
void (*mark_idle)(int) = ia64_mark_idle;
...
if (mark_idle)
(*mark_idle)(1);
(The x86 callback approach is much cleaner.)
Another relevant point is that I will be adding a bit more functionality
to disable the timer interrupt on truly "idle" cpus (like have been idle
for some amount of seconds). We would then use the "exit from idle"
callback to reestablish the timer interrupt. [This would allow them to
It is low frequency (once per second), this is just setting what's to
be sampled.
As I mentioned, this is not for LED displays (human readable), it's for the
system controller to monitor how all parts of the system are running, and
this one is just the cpu parts. The LED driver approach would have me
registering 4096 led devices, with all their callbacks, 4096 strings saying
"LED0001", etc., and I still cannot associate a specific register bit
(AKA LED if that's what it was), with a specific cpu using the LED driver.
The LED driver is fine for a couple of blinking ...So overhead from led driver is not okay, while overhead from messing with idle loop is okay? Interesting... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Thanks, I did look at it. Quite complex. Maybe I'm missing something but I don't see how it fits in? Are you saying I should be using data in the percpu tick_sched to gather the idle information for the once per second per cpu status update interrupt? I see the @idle_active entry but wouldn't this always be false during the timer interrupt? Using any other entries would appear to be more complex than a simple store byte and subtracting two longs. Or perhaps I should somehow hook into the sched_timer interrupt instead of having a separate once per second per cpu interrupt? (Does this The overhead is mainly the registration of descriptor blocks for the 4096 registers representing the 4096 cpus all at separate addresses. The overhead in this patch for maintaining the "idle" state (prior to the timer interrupt causing "exit_idle") is storing a byte and subtracting the current jiffies from the jiffies at the last one second timer interrupt. (Even this subtraction can be removed, the only *important* item is whether the cpu is currently idle or not.) This data is written to node local memory that's highly likely to be in the cache, as the same memory block is used for all UV hub operations. Unfortunately, I am experiencing a simulator problem at the moment or I'd be able to quantify the exact amount of time added to the exit_idle() function, but it's basically noise in the overall resumption of a thread. One other factor, this overhead is *only* for UV systems, no other x86_64 systems or architectures are affected, so again I'm not understanding the objection. This request was made from our hardware and RAS engineers, and is identical to what's been in the ia64 kernel for a few years now. Perhaps the confusion is it's near relationship to real "LED" lights? The original name "LED" is historical. The bits are read by a system controller that has the job of monitoring the entire system, including both soft and hard errors and determining faulty [or near faulty] system ...
Actually, this comparing idle time vs. not idle time during the last second is what gets around the problem that the system goes to not idle servicing the timer interrupt, which hides the real idle state. If anyone has a suggestion on how to get a once per second per cpu timer callback which does not call exit_idle, (or any other means of indicating whether the cpu is idle), I'd be more than happy to remove the idle callback function. In discussions with SGI's RAS engineering it's felt that this status is very important for their current RAS analysis programs, making the system overhead for UV more than worthwhile. Thanks, --
I seen your remark above about disabling timer interrupt on idle cpus. That's exactly nohz functionality, right? Maybe I misunderstood you and you meant "my monitoring interrupt" and not "generic system timer interrupt"? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Ahh, yes, now I see... ;-) And thanks for the pointer, I hadn't looked at this before. In regards to "powering down" a UV system, it's not clear yet that it will help much unless we can power down a whole lot more of the system than just the cpus... ;-) [but in reality, reducing power use any time you can is real necessity.] Btw, are you ok with the remainder of the patch? Thanks, Mike --
This keeps it in the uv_hub percpu area, a chunk of memory that is accessed quite often for many of the basic UV operations. This should increase the probability that the data is already located in the cpu's cache. Thanks, Mike --
