Reducing the Schedstat Memory Footprint

Submitted by Jeremy on October 17, 2007 - 2:49pm.
Linux news

Ken Chen submitted a patch to reduce the memory footprint of schedstat in a thread titled "schedstat needs a diet". He explained, "schedstat is useful in investigating CPU scheduler behavior. Ideally, I think it is beneficial to have it on all the time. However, the cost of turning it on in production system is quite high, largely due to number of events it collects and also due to its large memory footprint." His patch converted numerous unsigned long variables to unsigned int, which halves each converted field on 64-bit Linux, where an unsigned long is 8 bytes and an unsigned int is 4: "most of the fields probably don't need to be [a] full 64-bits on 64-bit [architectures]. Rolling over 4 billion events will most likely take a long time and user space tools can be made to accommodate that."
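As a rough illustration of the saving, the per-domain counters touched by the patch can be modeled as two plain structs and compared with sizeof. This is only a sketch: the struct names and the helper sd_stats_saving() are hypothetical, the fields are collapsed into arrays, and CPU_MAX_IDLE_TYPES is assumed to be 3, which the patch itself does not state.

```c
#include <stddef.h>

/* Assumption: three idle types; the patch does not show this value. */
#define CPU_MAX_IDLE_TYPES 3

/* Hypothetical model of the sched_domain counters before the patch. */
struct sd_stats_long {
	unsigned long lb[8][CPU_MAX_IDLE_TYPES]; /* lb_count ... lb_nobusyq */
	unsigned long other[12];                 /* alb_*, sbe_*, sbf_*, ttwu_* */
};

/* The same counters after the patch. */
struct sd_stats_int {
	unsigned int lb[8][CPU_MAX_IDLE_TYPES];
	unsigned int other[12];
};

/*
 * Bytes saved per domain for these 36 counters. Where long is 8 bytes
 * and int is 4 (LP64), this is 36 * 4 = 144; on 32-bit systems it is 0,
 * matching the note that the patch does not affect them.
 */
static size_t sd_stats_saving(void)
{
	return sizeof(struct sd_stats_long) - sizeof(struct sd_stats_int);
}
```

The saving applies to every sched_domain instance, so it scales with the number of CPUs and domain levels on the machine.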

Ingo Molnar merged the patch into his scheduler development tree, suggesting there were further conversions that could be made, "note that current -git has a whole bunch of new schedstats fields in /proc/<PID>/sched which can be used to track the exact balancing behavior of tasks. It can be cleared via echoing 0 to the file - so overflow is not an issue. Most of those new fields should probably be unsigned int too. (they are u64 right now.)"


From: Ken Chen <kenchen@...>
Subject: [patch] sched: schedstat needs a diet
Date: Oct 16, 4:37 pm 2007

schedstat is useful in investigating CPU scheduler behavior.  Ideally,
I think it is beneficial to have it on all the time.  However, the
cost of turning it on in production system is quite high, largely due
to number of events it collects and also due to its large memory
footprint.

Most of the fields probably don't need to be full 64-bit on 64-bit
arch.  Rolling over 4 billion events will most like take a long time
and user space tool can be made to accommodate that.  I'm proposing
kernel to cut back most of variable width on 64-bit system.  (note,
the following patch doesn't affect 32-bit system).


Signed-off-by: Ken Chen <kenchen@google.com>


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 592e3a5..311a8bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -562,7 +562,7 @@ struct sched_info {
 			   last_queued;	/* when we were last queued to run */
 #ifdef CONFIG_SCHEDSTATS
 	/* BKL stats */
-	unsigned long bkl_count;
+	unsigned int bkl_count;
 #endif
 };
 #endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */
@@ -698,34 +698,34 @@ struct sched_domain {

 #ifdef CONFIG_SCHEDSTATS
 	/* load_balance() stats */
-	unsigned long lb_count[CPU_MAX_IDLE_TYPES];
-	unsigned long lb_failed[CPU_MAX_IDLE_TYPES];
-	unsigned long lb_balanced[CPU_MAX_IDLE_TYPES];
-	unsigned long lb_imbalance[CPU_MAX_IDLE_TYPES];
-	unsigned long lb_gained[CPU_MAX_IDLE_TYPES];
-	unsigned long lb_hot_gained[CPU_MAX_IDLE_TYPES];
-	unsigned long lb_nobusyg[CPU_MAX_IDLE_TYPES];
-	unsigned long lb_nobusyq[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_nobusyq[CPU_MAX_IDLE_TYPES];

 	/* Active load balancing */
-	unsigned long alb_count;
-	unsigned long alb_failed;
-	unsigned long alb_pushed;
+	unsigned int alb_count;
+	unsigned int alb_failed;
+	unsigned int alb_pushed;

 	/* SD_BALANCE_EXEC stats */
-	unsigned long sbe_count;
-	unsigned long sbe_balanced;
-	unsigned long sbe_pushed;
+	unsigned int sbe_count;
+	unsigned int sbe_balanced;
+	unsigned int sbe_pushed;

 	/* SD_BALANCE_FORK stats */
-	unsigned long sbf_count;
-	unsigned long sbf_balanced;
-	unsigned long sbf_pushed;
+	unsigned int sbf_count;
+	unsigned int sbf_balanced;
+	unsigned int sbf_pushed;

 	/* try_to_wake_up() stats */
-	unsigned long ttwu_wake_remote;
-	unsigned long ttwu_move_affine;
-	unsigned long ttwu_move_balance;
+	unsigned int ttwu_wake_remote;
+	unsigned int ttwu_move_affine;
+	unsigned int ttwu_move_balance;
 #endif
 };

diff --git a/kernel/sched.c b/kernel/sched.c
index 0da2b26..5e7fce9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -328,22 +328,22 @@ struct rq {
 	struct sched_info rq_sched_info;

 	/* sys_sched_yield() stats */
-	unsigned long yld_exp_empty;
-	unsigned long yld_act_empty;
-	unsigned long yld_both_empty;
-	unsigned long yld_count;
+	unsigned int yld_exp_empty;
+	unsigned int yld_act_empty;
+	unsigned int yld_both_empty;
+	unsigned int yld_count;

 	/* schedule() stats */
-	unsigned long sched_switch;
-	unsigned long sched_count;
-	unsigned long sched_goidle;
+	unsigned int sched_switch;
+	unsigned int sched_count;
+	unsigned int sched_goidle;

 	/* try_to_wake_up() stats */
-	unsigned long ttwu_count;
-	unsigned long ttwu_local;
+	unsigned int ttwu_count;
+	unsigned int ttwu_local;

 	/* BKL stats */
-	unsigned long bkl_count;
+	unsigned int bkl_count;
 #endif
 	struct lock_class_key rq_lock_key;
 };
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index a5e517e..e6fb392 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -137,7 +137,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SCHEDSTATS
-	SEQ_printf(m, "  .%-30s: %ld\n", "bkl_count",
+	SEQ_printf(m, "  .%-30s: %d\n", "bkl_count",
 			rq->bkl_count);
 #endif
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_spread_over",
diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
index 1c08484..ef1a7df 100644
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -21,7 +21,7 @@ static int show_schedstat(struct seq_file *seq, void *v)

 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %llu %llu %lu",
+		    "cpu%d %u %u %u %u %u %u %u %u %u %llu %llu %lu",
 		    cpu, rq->yld_both_empty,
 		    rq->yld_act_empty, rq->yld_exp_empty, rq->yld_count,
 		    rq->sched_switch, rq->sched_count, rq->sched_goidle,
@@ -42,8 +42,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 			seq_printf(seq, "domain%d %s", dcount++, mask_str);
 			for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
 					itype++) {
-				seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu "
-						"%lu",
+				seq_printf(seq, " %u %u %u %u %u %u %u %u",
 				    sd->lb_count[itype],
 				    sd->lb_balanced[itype],
 				    sd->lb_failed[itype],
@@ -53,8 +52,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 				    sd->lb_nobusyq[itype],
 				    sd->lb_nobusyg[itype]);
 			}
-			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu"
-			    " %lu %lu %lu\n",
+			seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u %u\n",
 			    sd->alb_count, sd->alb_failed, sd->alb_pushed,
 			    sd->sbe_count, sd->sbe_balanced, sd->sbe_pushed,
 			    sd->sbf_count, sd->sbf_balanced, sd->sbf_pushed,

From: Ingo Molnar <mingo@...>
Subject: Re: [patch] sched: schedstat needs a diet
Date: Oct 17, 3:23 am 2007

* Ken Chen <kenchen@google.com> wrote:

> schedstat is useful in investigating CPU scheduler behavior.  Ideally,
> I think it is beneficial to have it on all the time.  However, the
> cost of turning it on in production system is quite high, largely due
> to number of events it collects and also due to its large memory
> footprint.
>
> Most of the fields probably don't need to be full 64-bit on 64-bit
> arch.  Rolling over 4 billion events will most like take a long time
> and user space tool can be made to accommodate that.  I'm proposing
> kernel to cut back most of variable width on 64-bit system.  (note,
> the following patch doesn't affect 32-bit system).

thanks, applied.

note that current -git has a whole bunch of new schedstats fields in
/proc/<PID>/sched which can be used to track the exact balancing
behavior of tasks. It can be cleared via echoing 0 to the file - so
overflow is not an issue. Most of those new fields should probably be
unsigned int too. (they are u64 right now.)

	Ingo


Documentation

October 18, 2007 - 4:37am
Anonymous (not verified)

OK,

And who is going to document this? I mean all the new stuff in /proc/sched. Procfs is a very useful filesystem, but most of the information that resides in it is undocumented.

This _should_ be done by the people developing the kernel, because digging into the code and trying to figure out what a number means is not an option, IMHO.

Documentation was always a

October 18, 2007 - 3:34pm
Anonymous (not verified)

Documentation was always a problem...

and I think it will be in future too

This is not a Mickey Mouse thing - the kernel is a big thing.

But this is what I liked a lot about Linux - on Windows you'll have a lot of (useless) documentation, and it's quite... different to find a solution as you cannot dig into the heart of the system.

In Linux you've got the find and try procedure...

But - if you got documentation, do you think it will be easier? My friend G**g*e finds everything much faster... And this is another benefit of open source - there may be a lot of other people who were trying ;)

To avoid overflow problems: shift right EVERY one bit.

October 18, 2007 - 7:40pm
Anonymous (not verified)

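unsigned int xxx[N];	/* the N counters */

/* ... on each 10-minute poll ... */
int i, big = 0;
for (i = 0; i < N; i++)
	big |= (xxx[i] >= 1073741824U);	/* has any counter reached 2^30? */
if (big)
	for (i = 0; i < N; i++)
		xxx[i] >>= 1;	/* halve every counter */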

Why is this useful? Why not

October 18, 2007 - 10:04pm

Why is this useful? Why not just let it overflow?

Usually programs will be using these counters to get the difference between the counter's value a few seconds ago and its value now. Let's suppose the earlier value is (1<<32)-10000 = 4294957296 and the current is 10000 (i.e. the counter has just overflowed).

When you subtract 4294957296 from 10000, you get -4294947296, which wraps around again in 32-bit unsigned arithmetic, and you'll get the correct result: the value changed by 20000 between the two polls. It's dirt simple and intuitive to any low-level programmer.

So overflowing is actually preferable, and the Right Way to implement this. You only begin to lose information when the two samples are taken so far apart that the counter has advanced by a full 2^32 events or more between them.
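A minimal sketch of that subtraction in C, assuming the user-space tool stores each sample as a 32-bit unsigned value. The function name counter_delta() is hypothetical and not taken from any existing tool.

```c
#include <stdint.h>

/*
 * Number of events between two samples of a 32-bit counter.
 * The result is reduced modulo 2^32, so a single wrap between the
 * samples cancels out on its own.
 */
static uint32_t counter_delta(uint32_t earlier, uint32_t now)
{
	return now - earlier;
}
```

With the numbers from the comment above, counter_delta(4294957296u, 10000u) yields 20000. The result is only wrong if the counter advanced by 2^32 or more between the two samples.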

Depends what you want

October 21, 2007 - 5:53am

If you want to wander to some system that's been up for several months and see some long term relative averages of recent behavior, the shift method will do that.

If you want something like top, then just let things roll over, and detect roll-over in user space. Alternately, provide a reset mechanism from user space that allows resetting all the counts to 0. User-space can then clear the counts on every poll.

--
Program Intellivision and play Space Patrol!
