Published on KernelTrap (http://kerneltrap.org)

Feature: Robert Love Explains Variable HZ

By Jeremy
Created Oct 15 2002 - 23:17

Robert Love [interview [0]] recently backported the jiffies_to_clock_t() code from the 2.5 development kernel to the 2.4 stable kernel. His patch makes the frequency of the timer interrupt configurable; the standard 2.4 kernel fixes it at HZ=100, while 2.5 has raised it to HZ=1000 on x86.

I wrote to Robert asking if he could explain the usefulness of his patch, and he replied with a lengthy and very interesting email detailing what the patch is, how it works, and why it's useful. He explains, "The timer interrupt is at the heart of the system. Everything lives and dies based on it. Its period is basically the granularity of the system: timers hit on 10ms intervals, timeslices come due at 10ms intervals, etc."

Read on to learn what effect changing this value will have on your Linux server, and to see the actual patch...

Robert explains:

"I will start with what HZ is and then the ramifications of changing it... HZ is the frequency of the timer interrupt. In all versions of Linux until somewhere in the middle of 2.5, it was equal to 100 on x86 (most other architectures were also 100 - other are different, e.g alpha is 1024). So if HZ=100, the timer pops every 1/100 of a second or every 10ms.

"Recently in 2.5 we went to HZ=1000 on x86. Also, RedHat shipped their 8.0 kernel at HZ=512. So it is time to experiment...

"The timer interrupt is at the heart of the system. Everything lives and dies based on it. Its period is basically the granularity of the system: timers hit on 10ms intervals, timeslices come due at 10ms intervals, etc.

"There are important performance reasons for improving it: poll() and select() are timer-based, and you can get better performance by increasing HZ. I believe this was RedHat's rationale.

"There are also process response (i.e. latency) benefits. One good example would be process preemption: consider a process with 81ms remaining of its timeslice. It runs for 80ms, and now it has 1ms left. The timer interrupt hits. Now it continues running. In 1ms, it should be preempted, but it will continue running for 10ms, until the next timer interrupt hits.

"So there are a couple of reasons for changing HZ: better timer (and time in general) granularity, improved system latency, and improved poll()/select() performance. In general, the granularity of timing on the system improves whatever-fold you increase HZ by.

"There are downsides, however. The major one is the increased timer overhead. Going from HZ=100 to HZ=1000 you have 10x more timer interrupts and thus 10x overhead. Now, on a fast system, timer overhead is probably negligible to begin with. Ten times nothing is still nothing. But on a slower system (by slow I mean 386 or 486 slow) it may be an issue. The second issue is jiffie wraparound: the system uptime clock wraps around at about 497 days (32-bits holding 10 jiffies per second, 60 seconds in a minute, ...). If we go to HZ=1000, the uptime clock wraps in 49.7 days. We can fix this by backporting the 64-bit jiffies we have in 2.5 to 2.4, however.

"So with HZ=1000, the period of the timer interrupt is 1ms. System timing is 10x finer and things will have a granularity of 1ms. It seems to be showing an improvement in 2.5, and at least RedHat supposedly benched better poll/select performance.

"Without this patch, you cannot just change HZ, since the system exports various values via system call or /proc that are measured in ticks. In other words, things that are in the units of "ticks per seconds". And that depends on HZ. So we need a mechanism to scale the system HZ to the HZ user-space assumes, which is the traditional HZ=100. That is what jiffies_to_clock_t() does in 2.5 and what I backported."


From: Robert Love
To: linux-kernel mailing list
Subject: [PATCH] 2.4: variable HZ
Date: 	15 Oct 2002 02:03:02 -0400

I backported the jiffies_to_clock_t() code from 2.5 to 2.4, mostly just
for fun.

It works fine, and I have successfully used HZ=1000 on my machines.  It
is the same API as 2.5, used in the same places - we export a static
HZ=100 to user-space and convert from the real HZ as needed.  The only
difference is I added a CONFIG_HZ to allow the user to set the value.

There are probably some HZ values you cannot use due to NTP issues.  I
suggest HZ=1000, since I know that works, and 2.5 is using it.  RedHat
is supposedly shipping their kernel with HZ=512 but I do not know if
they did anything else special.

Oh, and I did not backport 64-bit jiffies yet.  So HZ=1000 will wrap
around in just under 50 days.  If you just cannot have that, stick with
a lower value.

Patch is against 2.4.20-pre10-ac2 but applies OK to all 2.4.20-pre and
2.4.19 kernels.

Enjoy,

	Robert Love

diff -urN linux-2.4.20-pre10-ac2/arch/i386/config.in linux/arch/i386/config.in
--- linux-2.4.20-pre10-ac2/arch/i386/config.in	2002-10-14 01:43:05.000000000 -0400
+++ linux/arch/i386/config.in	2002-10-14 18:24:48.000000000 -0400
@@ -244,6 +244,7 @@
 mainmenu_option next_comment
 comment 'General setup'
 
+int 'Timer frequency (HZ) (100)' CONFIG_HZ 100
 bool 'Networking support' CONFIG_NET
 
 # Visual Workstation support is utterly broken.
diff -urN linux-2.4.20-pre10-ac2/Documentation/Configure.help linux/Documentation/Configure.help
--- linux-2.4.20-pre10-ac2/Documentation/Configure.help	2002-10-14 01:43:06.000000000 -0400
+++ linux/Documentation/Configure.help	2002-10-14 18:32:38.000000000 -0400
@@ -2411,6 +2411,18 @@
   behaviour is platform-dependent, but normally the flash frequency is
   a hyperbolic function of the 5-minute load average.
 
+Timer frequency
+CONFIG_HZ
+  The frequency the system timer interrupt pops.  Higher tick values provide
+  improved granularity of timers, improved select() and poll() performance,
+  and lower scheduling latency.  Higher values, however, increase interrupt
+  overhead and will allow jiffie wraparound sooner.  For compatibility, the
+  tick count is always exported as if HZ=100.
+
+  The default value, which was the value for all of eternity, is 100.  If
+  you are looking to provide better timer granularity or increased desktop
+  performance, try 500 or 1000.  In unsure, go with the default of 100.
+
 Networking support
 CONFIG_NET
   Unless you really know what you are doing, you should say Y here.
diff -urN linux-2.4.20-pre10-ac2/fs/proc/array.c linux/fs/proc/array.c
--- linux-2.4.20-pre10-ac2/fs/proc/array.c	2002-10-14 01:43:10.000000000 -0400
+++ linux/fs/proc/array.c	2002-10-14 18:24:48.000000000 -0400
@@ -360,15 +360,15 @@
 		task->cmin_flt,
 		task->maj_flt,
 		task->cmaj_flt,
-		task->times.tms_utime,
-		task->times.tms_stime,
-		task->times.tms_cutime,
-		task->times.tms_cstime,
+		jiffies_to_clock_t(task->times.tms_utime),
+		jiffies_to_clock_t(task->times.tms_stime),
+		jiffies_to_clock_t(task->times.tms_cutime),
+		jiffies_to_clock_t(task->times.tms_cstime),
 		priority,
 		nice,
 		0UL /* removed */,
-		task->it_real_value,
-		task->start_time,
+		jiffies_to_clock_t(task->it_real_value),
+		jiffies_to_clock_t(task->start_time),
 		vsize,
 		mm ? mm->rss : 0, /* you might want to shift this left 3 */
 		task->rlim[RLIMIT_RSS].rlim_cur,
@@ -686,14 +686,14 @@
 
 	len = sprintf(buffer,
 		"cpu  %lu %lu\n",
-		task->times.tms_utime,
-		task->times.tms_stime);
+		jiffies_to_clock_t(task->times.tms_utime),
+		jiffies_to_clock_t(task->times.tms_stime));
 		
 	for (i = 0 ; i < smp_num_cpus; i++)
 		len += sprintf(buffer + len, "cpu%d %lu %lu\n",
 			i,
-			task->per_cpu_utime[cpu_logical_map(i)],
-			task->per_cpu_stime[cpu_logical_map(i)]);
+			jiffies_to_clock_t(task->per_cpu_utime[cpu_logical_map(i)]),
+			jiffies_to_clock_t(task->per_cpu_stime[cpu_logical_map(i)]));
 
 	return len;
 }
diff -urN linux-2.4.20-pre10-ac2/fs/proc/proc_misc.c linux/fs/proc/proc_misc.c
--- linux-2.4.20-pre10-ac2/fs/proc/proc_misc.c	2002-10-14 01:43:10.000000000 -0400
+++ linux/fs/proc/proc_misc.c	2002-10-14 18:40:08.000000000 -0400
@@ -317,7 +317,7 @@
 {
 	int i, len = 0;
 	extern unsigned long total_forks;
-	unsigned long jif = jiffies;
+	unsigned long jif = jiffies_to_clock_t(jiffies);
 	unsigned int sum = 0, user = 0, nice = 0, system = 0;
 	int major, disk;
 
@@ -334,16 +334,19 @@
 	}
 
 	proc_sprintf(page, &off, &len,
-		      "cpu  %u %u %u %lu\n", user, nice, system,
+		      "cpu  %u %u %u %lu\n",
+		      jiffies_to_clock_t(user),
+		      jiffies_to_clock_t(nice),
+		      jiffies_to_clock_t(system),
 		      jif * smp_num_cpus - (user + nice + system));
 	for (i = 0 ; i 
+
+#ifdef __KERNEL__
+# define HZ		CONFIG_HZ	/* internal kernel timer frequency */
+# define USER_HZ	100		/* some user interfaces are in ticks */
+# define CLOCKS_PER_SEC	(USER_HZ)	/* like times() */
+# define jiffies_to_clock_t(x)	((x) / ((HZ) / (USER_HZ)))
+#endif
+
 #ifndef HZ
-#define HZ 100
+#define HZ 100				/* if userspace cheats, give them 100 */
 #endif
 
 #define EXEC_PAGESIZE	4096
@@ -17,8 +26,4 @@
 
 #define MAXHOSTNAMELEN	64	/* max length of hostname */
 
-#ifdef __KERNEL__
-# define CLOCKS_PER_SEC	100	/* frequency at which times() counts */
-#endif
-
 #endif
diff -urN linux-2.4.20-pre10-ac2/kernel/signal.c linux/kernel/signal.c
--- linux-2.4.20-pre10-ac2/kernel/signal.c	2002-10-14 01:43:11.000000000 -0400
+++ linux/kernel/signal.c	2002-10-14 18:24:49.000000000 -0400
@@ -13,7 +13,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 
 /*
@@ -775,8 +775,8 @@
 	info.si_uid = tsk->uid;
 
 	/* FIXME: find out whether or not this is supposed to be c*time. */
-	info.si_utime = tsk->times.tms_utime;
-	info.si_stime = tsk->times.tms_stime;
+	info.si_utime = jiffies_to_clock_t(tsk->times.tms_utime);
+	info.si_stime = jiffies_to_clock_t(tsk->times.tms_stime);
 
 	status = tsk->exit_code & 0x7f;
 	why = SI_KERNEL;	/* shouldn't happen */
diff -urN linux-2.4.20-pre10-ac2/kernel/sys.c linux/kernel/sys.c
--- linux-2.4.20-pre10-ac2/kernel/sys.c	2002-10-14 01:43:11.000000000 -0400
+++ linux/kernel/sys.c	2002-10-14 18:24:49.000000000 -0400
@@ -15,7 +15,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 #include 
 
@@ -792,16 +792,23 @@
 
 asmlinkage long sys_times(struct tms * tbuf)
 {
+	struct tms temp;
+
 	/*
 	 *	In the SMP world we might just be unlucky and have one of
 	 *	the times increment as we use it. Since the value is an
 	 *	atomically safe type this is just fine. Conceptually its
 	 *	as if the syscall took an instant longer to occur.
 	 */
-	if (tbuf)
-		if (copy_to_user(tbuf, &current->times, sizeof(struct tms)))
+	if (tbuf) {
+		temp.tms_utime = jiffies_to_clock_t(current->times.tms_utime);
+		temp.tms_stime = jiffies_to_clock_t(current->times.tms_stime);
+		temp.tms_cutime = jiffies_to_clock_t(current->times.tms_cutime);
+		temp.tms_cstime = jiffies_to_clock_t(current->times.tms_cstime);
+		if (copy_to_user(tbuf, &temp, sizeof(struct tms)))
 			return -EFAULT;
-	return jiffies;
+	}
+	return jiffies_to_clock_t(jiffies);
 }
 
 /*


From: Robert Love
Subject: Re: [PATCH] 2.4: variable HZ
Date: 	15 Oct 2002 02:50:56 -0400

On Tue, 2002-10-15 at 02:03, Robert Love wrote:

> It works fine, and I have successfully used HZ=1000 on my machines.

Except processor usage output was screwy.

Attached is an updated patch which fixes the problem.

	Robert Love

diff -urN linux-2.4.20-pre10-ac2/arch/i386/config.in linux/arch/i386/config.in
--- linux-2.4.20-pre10-ac2/arch/i386/config.in	2002-10-14 01:43:05.000000000 -0400
+++ linux/arch/i386/config.in	2002-10-15 02:41:39.000000000 -0400
@@ -244,6 +244,7 @@
 mainmenu_option next_comment
 comment 'General setup'
 
+int 'Timer frequency (HZ) (100)' CONFIG_HZ 100
 bool 'Networking support' CONFIG_NET
 
 # Visual Workstation support is utterly broken.
diff -urN linux-2.4.20-pre10-ac2/Documentation/Configure.help linux/Documentation/Configure.help
--- linux-2.4.20-pre10-ac2/Documentation/Configure.help	2002-10-14 01:43:06.000000000 -0400
+++ linux/Documentation/Configure.help	2002-10-15 02:41:40.000000000 -0400
@@ -2411,6 +2411,18 @@
   behaviour is platform-dependent, but normally the flash frequency is
   a hyperbolic function of the 5-minute load average.
 
+Timer frequency
+CONFIG_HZ
+  The frequency the system timer interrupt pops.  Higher tick values provide
+  improved granularity of timers, improved select() and poll() performance,
+  and lower scheduling latency.  Higher values, however, increase interrupt
+  overhead and will allow jiffie wraparound sooner.  For compatibility, the
+  tick count is always exported as if HZ=100.
+
+  The default value, which was the value for all of eternity, is 100.  If
+  you are looking to provide better timer granularity or increased desktop
+  performance, try 500 or 1000.  In unsure, go with the default of 100.
+
 Networking support
 CONFIG_NET
   Unless you really know what you are doing, you should say Y here.
diff -urN linux-2.4.20-pre10-ac2/fs/proc/array.c linux/fs/proc/array.c
--- linux-2.4.20-pre10-ac2/fs/proc/array.c	2002-10-14 01:43:10.000000000 -0400
+++ linux/fs/proc/array.c	2002-10-14 18:24:48.000000000 -0400
@@ -360,15 +360,15 @@
 		task->cmin_flt,
 		task->maj_flt,
 		task->cmaj_flt,
-		task->times.tms_utime,
-		task->times.tms_stime,
-		task->times.tms_cutime,
-		task->times.tms_cstime,
+		jiffies_to_clock_t(task->times.tms_utime),
+		jiffies_to_clock_t(task->times.tms_stime),
+		jiffies_to_clock_t(task->times.tms_cutime),
+		jiffies_to_clock_t(task->times.tms_cstime),
 		priority,
 		nice,
 		0UL /* removed */,
-		task->it_real_value,
-		task->start_time,
+		jiffies_to_clock_t(task->it_real_value),
+		jiffies_to_clock_t(task->start_time),
 		vsize,
 		mm ? mm->rss : 0, /* you might want to shift this left 3 */
 		task->rlim[RLIMIT_RSS].rlim_cur,
@@ -686,14 +686,14 @@
 
 	len = sprintf(buffer,
 		"cpu  %lu %lu\n",
-		task->times.tms_utime,
-		task->times.tms_stime);
+		jiffies_to_clock_t(task->times.tms_utime),
+		jiffies_to_clock_t(task->times.tms_stime));
 		
 	for (i = 0 ; i < smp_num_cpus; i++)
 		len += sprintf(buffer + len, "cpu%d %lu %lu\n",
 			i,
-			task->per_cpu_utime[cpu_logical_map(i)],
-			task->per_cpu_stime[cpu_logical_map(i)]);
+			jiffies_to_clock_t(task->per_cpu_utime[cpu_logical_map(i)]),
+			jiffies_to_clock_t(task->per_cpu_stime[cpu_logical_map(i)]));
 
 	return len;
 }
diff -urN linux-2.4.20-pre10-ac2/fs/proc/proc_misc.c linux/fs/proc/proc_misc.c
--- linux-2.4.20-pre10-ac2/fs/proc/proc_misc.c	2002-10-14 01:43:10.000000000 -0400
+++ linux/fs/proc/proc_misc.c	2002-10-15 02:29:21.000000000 -0400
@@ -317,16 +317,16 @@
 {
 	int i, len = 0;
 	extern unsigned long total_forks;
-	unsigned long jif = jiffies;
+	unsigned long jif = jiffies_to_clock_t(jiffies);
 	unsigned int sum = 0, user = 0, nice = 0, system = 0;
 	int major, disk;
 
 	for (i = 0 ; i 
+
+#ifdef __KERNEL__
+# define HZ		CONFIG_HZ	/* internal kernel timer frequency */
+# define USER_HZ	100		/* some user interfaces are in ticks */
+# define CLOCKS_PER_SEC	(USER_HZ)	/* like times() */
+# define jiffies_to_clock_t(x)	((x) / ((HZ) / (USER_HZ)))
+#endif
+
 #ifndef HZ
-#define HZ 100
+#define HZ 100				/* if userspace cheats, give them 100 */
 #endif
 
 #define EXEC_PAGESIZE	4096
@@ -17,8 +26,4 @@
 
 #define MAXHOSTNAMELEN	64	/* max length of hostname */
 
-#ifdef __KERNEL__
-# define CLOCKS_PER_SEC	100	/* frequency at which times() counts */
-#endif
-
 #endif
diff -urN linux-2.4.20-pre10-ac2/kernel/signal.c linux/kernel/signal.c
--- linux-2.4.20-pre10-ac2/kernel/signal.c	2002-10-14 01:43:11.000000000 -0400
+++ linux/kernel/signal.c	2002-10-14 18:24:49.000000000 -0400
@@ -13,7 +13,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 
 /*
@@ -775,8 +775,8 @@
 	info.si_uid = tsk->uid;
 
 	/* FIXME: find out whether or not this is supposed to be c*time. */
-	info.si_utime = tsk->times.tms_utime;
-	info.si_stime = tsk->times.tms_stime;
+	info.si_utime = jiffies_to_clock_t(tsk->times.tms_utime);
+	info.si_stime = jiffies_to_clock_t(tsk->times.tms_stime);
 
 	status = tsk->exit_code & 0x7f;
 	why = SI_KERNEL;	/* shouldn't happen */
diff -urN linux-2.4.20-pre10-ac2/kernel/sys.c linux/kernel/sys.c
--- linux-2.4.20-pre10-ac2/kernel/sys.c	2002-10-14 01:43:11.000000000 -0400
+++ linux/kernel/sys.c	2002-10-14 18:24:49.000000000 -0400
@@ -15,7 +15,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 #include 
 
@@ -792,16 +792,23 @@
 
 asmlinkage long sys_times(struct tms * tbuf)
 {
+	struct tms temp;
+
 	/*
 	 *	In the SMP world we might just be unlucky and have one of
 	 *	the times increment as we use it. Since the value is an
 	 *	atomically safe type this is just fine. Conceptually its
 	 *	as if the syscall took an instant longer to occur.
 	 */
-	if (tbuf)
-		if (copy_to_user(tbuf, &current->times, sizeof(struct tms)))
+	if (tbuf) {
+		temp.tms_utime = jiffies_to_clock_t(current->times.tms_utime);
+		temp.tms_stime = jiffies_to_clock_t(current->times.tms_stime);
+		temp.tms_cutime = jiffies_to_clock_t(current->times.tms_cutime);
+		temp.tms_cstime = jiffies_to_clock_t(current->times.tms_cstime);
+		if (copy_to_user(tbuf, &temp, sizeof(struct tms)))
 			return -EFAULT;
-	return jiffies;
+	}
+	return jiffies_to_clock_t(jiffies);
 }
 
 /*



Related links:

  • KernelTrap Interview With Robert Love [1]
  • Google Archive Of Above Thread [2]

  • Source URL:
    http://kerneltrap.org/node/464