Re: [PATCH] Make for_each_cpu_mask a bit smaller

Previous thread: Re: performance "regression" in cfq compared to anticipatory, deadline and noop by Daniel J Blueman on Sunday, May 11, 2008 - 6:14 am. (7 messages)

Next thread: Microblaze toolchain - libc by Michal Simek on Sunday, May 11, 2008 - 7:05 am. (3 messages)
From: Alexander van Heukelum
Date: Sunday, May 11, 2008 - 6:50 am

The for_each_cpu_mask loop is used quite often in the kernel. It
makes use of two functions: first_cpu and next_cpu. This patch
changes for_each_cpu_mask to use only one function: a newly
introduced find_next_cpu. Each use of the for_each_cpu_mask
constuct then becomes a few bytes smaller. An x86_64 defconfig
kernel is about 2000 bytes smaller with this patch applied:

   text	   data	    bss	    dec	    hex	filename
5395732	 976736	 734280	7106748	 6c70bc	vmlinux.orig
5393639	 976736	 734280	7104655	 6c688f	vmlinux

Runs fine on qemu UP/SMP x86_64/i386.

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>

---

Hello Andrew,

Could you add this patch to -mm?

Greetings,
	Alexander

 include/linux/cpumask.h |   14 ++++++++------
 lib/cpumask.c           |    6 ++++++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 9650806..a760e29 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -221,9 +221,11 @@ int __first_cpu(const cpumask_t *srcp);
 #define first_cpu(src) __first_cpu(&(src))
 int __next_cpu(int n, const cpumask_t *srcp);
 #define next_cpu(n, src) __next_cpu((n), &(src))
+int find_next_cpu_mask(int n, const cpumask_t *srcp);
 #else
-#define first_cpu(src)		({ (void)(src); 0; })
-#define next_cpu(n, src)	({ (void)(src); 1; })
+#define first_cpu(src)			({ (void)(src); 0; })
+#define next_cpu(n, src)		({ (void)(src); 1; })
+#define find_next_cpu_mask(n, src)	({ (void)(src); n; })
 #endif
 
 #ifdef CONFIG_HAVE_CPUMASK_OF_CPU_MAP
@@ -351,10 +353,10 @@ static inline void __cpus_fold(cpumask_t *dstp, const cpumask_t *origp,
 }
 
 #if NR_CPUS > 1
-#define for_each_cpu_mask(cpu, mask)		\
-	for ((cpu) = first_cpu(mask);		\
-		(cpu) < NR_CPUS;		\
-		(cpu) = next_cpu((cpu), (mask)))
+#define for_each_cpu_mask(cpu, mask)				\
+	for ((cpu) = 0;						\
+		(cpu) = find_next_cpu_mask((cpu), &(mask)),	\
+		(cpu) < NR_CPUS; (cpu)++)
 #else /* NR_CPUS == 1 */
 ...
From: Paul Jackson
Date: Sunday, May 11, 2008 - 6:57 am

Looks good, at first glance.

Could you include Mike Travis <travis@sgi.com> on
the CC list for this work?  He's been having
fun lately with cpumasks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Jackson
Date: Sunday, May 11, 2008 - 7:14 am

I believe that it's for_each_cpu_mask which is newly introduced,
not find_next_cpu ... just a typo, granted.

Any chance that you could make the same change to nodemask.h?
Where practical, I like to keep cpumask.h and nodemask.h the same.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Alexander van Heukelum
Date: Sunday, May 11, 2008 - 9:06 am

The for_each_node_mask loop makes use of two inlined functions:
first_node and next_node. This patch changes for_each_node_mask
to use only one out-of-line function: find_next_node_mask. An
x86_64 defconfig kernel is about 1500 bytes smaller with this
patch applied:

   text	   data	    bss	    dec	    hex	filename
5395732	 976736	 734280	7106748	 6c70bc	vmlinux.orig
5394174	 976736	 734280	7105190	 6c6aa6	vmlinux

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>

---


I meant find_next_cpu_mask. That was a last-minute change of mind

Sure. This patch introduces lib/nodemask.c, but I'm not quite sure
if building it should depend on CONFIG_SMP or something else (NUMA?).
When is MAX_NUMNODES 1?

It seems to work on qemu x86_64-smp and i386-up.

I'ld be happy to take a stab at aligning the cpumask and nodemask
code even more by uninlining some more functions and using stubs
for the MAX_NUMNODES=1 case.

Greetings,
	Alexander

 include/linux/nodemask.h |   11 +++++++----
 lib/Makefile             |    2 +-
 lib/nodemask.c           |   10 ++++++++++
 3 files changed, 18 insertions(+), 5 deletions(-)
 create mode 100644 lib/nodemask.c

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 848025c..dbc80f0 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -239,6 +239,8 @@ static inline int __next_node(int n, const nodemask_t *srcp)
 	return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
 }
 
+int find_next_node_mask(int n, const nodemask_t *srcp);
+
 #define nodemask_of_node(node)						\
 ({									\
 	typeof(_unused_nodemask_arg_) m;				\
@@ -347,10 +349,11 @@ static inline void __nodes_fold(nodemask_t *dstp, const nodemask_t *origp,
 }
 
 #if MAX_NUMNODES > 1
-#define for_each_node_mask(node, mask)			\
-	for ((node) = first_node(mask);			\
-		(node) < MAX_NUMNODES;			\
-		(node) = next_node((node), (mask)))
+#define for_each_node_mask(node, mask)				\
+	for ((node) = ...
From: Paul Jackson
Date: Sunday, May 11, 2008 - 2:01 pm

Well ... I'm pretty sure it made sense to depend on SMP, back when
it was first added.  However that might have changed.  I recall
vaguely that there has been discussion of this CONFIG_SMP dependency
every year or so, but I don't have the time right now to dig through
the archives and code to figure it out.


That could be good ... though could you co-ordinate with Mike Travis
first, to minimize the risks of merge conflicts with what he's doing?

You kernel text space saving in the first patch seemed worth going
ahead with even if it did conflict a little, and I liked the matching
nodemask patch, just to keep things in sync.  Other nodemask cleanup
is a little lower priority in my book, so should make a modest effort
to co-ordinate with more critical patches, to minimize conflict.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Alexander van Heukelum
Date: Monday, May 12, 2008 - 5:04 am

I believe the x86#testing tree includes Mike's work? The two patches

Thanks for your guidance.

Greetings,
-- 
  Alexander van Heukelum
  heukelum@fastmail.fm

-- 
http://www.fastmail.fm - Accessible with your email software
                          or over the web

--

From: Mike Travis
Date: Monday, May 12, 2008 - 9:45 am

Yes, my only change right now is to use nr_cpu_ids instead of NR_CPUS
which short cuts a huge chunk out of the back end of the loop.  And my
changes are in sched/latest (which includes x86/latest).

Thanks,

--

From: Alexander van Heukelum
Date: Monday, May 12, 2008 - 12:00 pm

The for_each_cpu_mask loop is used quite often in the kernel. It
makes use of two functions: first_cpu and next_cpu. This patch
changes for_each_cpu_mask to use only the latter. Because next_cpu
finds the next eligible cpu _after_ the given one, the iteration
variable has to be initialized to -1 and next_cpu has to be
called with this value before the first iteration. An x86_64
defconfig kernel (from sched/latest) is about 2500 bytes smaller
with this patch applied:

   text	   data	    bss	    dec	    hex	filename
6222517	 917952	 749932	7890401	 7865e1	vmlinux.orig
6219922	 917952	 749932	7887806	 785bbe	vmlinux

The same size reduction is seen for defconfig+MAXSMP

   text	   data	    bss	    dec	    hex	filename
6241772	2563968	1492716	10298456	 9d2458	vmlinux.orig
6239211	2563968	1492716	10295895	 9d1a57	vmlinux

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>

---

Hello Mike, Ingo,

Here is a version of the for_each_cpu_mask change that applies on
top of sched/latest. I think it is non-intrusive enough to put
it in sched/latest?

This version reuses the already-existing api next_cpu instead
of inventing a new one; initializing the iteration counter to
-1 was suggested by Matthew Wilcox.

Greetings,
	Alexander

 include/linux/cpumask.h |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 73434e5..74b748b 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -377,10 +377,10 @@ int __any_online_cpu(const cpumask_t *mask);
 #define first_cpu(src)		__first_cpu(&(src))
 #define next_cpu(n, src)	__next_cpu((n), &(src))
 #define any_online_cpu(mask) __any_online_cpu(&(mask))
-#define for_each_cpu_mask(cpu, mask)		\
-	for ((cpu) = first_cpu(mask);		\
-		(cpu) < NR_CPUS;		\
-		(cpu) = next_cpu((cpu), (mask)))
+#define for_each_cpu_mask(cpu, mask)			\
+	for ((cpu) = ~((typeof(cpu))0);			\
+		(cpu) = next_cpu((cpu), (mask)),	\
+		(cpu) < ...
From: Andreas Schwab
Date: Monday, May 12, 2008 - 2:45 pm

There is no need for such a complicated expression, -1 will work for
every (arithmetic) type.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
--

From: Alexander van Heukelum
Date: Tuesday, May 13, 2008 - 2:28 am

The for_each_cpu_mask loop is used quite often in the kernel. It
makes use of two functions: first_cpu and next_cpu. This patch
changes for_each_cpu_mask to use only the latter. Because next_cpu
finds the next eligible cpu _after_ the given one, the iteration
variable has to be initialized to -1 and next_cpu has to be
called with this value before the first iteration. An x86_64
defconfig kernel (from sched/latest) is about 2500 bytes smaller
with this patch applied:

   text	   data	    bss	    dec	    hex	filename
6222517	 917952	 749932	7890401	 7865e1	vmlinux.orig
6219922	 917952	 749932	7887806	 785bbe	vmlinux

The same size reduction is seen for defconfig+MAXSMP

   text	   data	    bss	    dec	    hex	filename
6241772	2563968	1492716	10298456	 9d2458	vmlinux.orig
6239211	2563968	1492716	10295895	 9d1a57	vmlinux

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>

---


Indeed, thanks.

This version applies on top of sched/latest.

This version reuses the already-existing api next_cpu instead
of inventing a new one; initializing the iteration counter to
-1 was suggested by Matthew Wilcox. Now with a -1 instead of
an overly carefull ~((typeof(cpu))0). "-1" is properly sign-
extended even if cpu is u64 in a 32-bit environment.

Greetings,
	Alexander

 include/linux/cpumask.h |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 73434e5..c24a556 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -377,10 +377,10 @@ int __any_online_cpu(const cpumask_t *mask);
 #define first_cpu(src)		__first_cpu(&(src))
 #define next_cpu(n, src)	__next_cpu((n), &(src))
 #define any_online_cpu(mask) __any_online_cpu(&(mask))
-#define for_each_cpu_mask(cpu, mask)		\
-	for ((cpu) = first_cpu(mask);		\
-		(cpu) < NR_CPUS;		\
-		(cpu) = next_cpu((cpu), (mask)))
+#define for_each_cpu_mask(cpu, mask)			\
+	for ((cpu) = -1;				\
+		(cpu) = next_cpu((cpu), ...
From: Ingo Molnar
Date: Tuesday, May 13, 2008 - 5:02 am

applied for testing, thanks Alexander.

	Ingo
--

From: Matthew Wilcox
Date: Sunday, May 11, 2008 - 8:24 am

For anyone else having similar cognitive dissonance while reading this

Maybe a better name for this function would help.  I can't think of a
good one right now though.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Alexander van Heukelum
Date: Sunday, May 11, 2008 - 9:19 am

On Sun, 11 May 2008 09:24:40 -0600, "Matthew Wilcox" <matthew@wil.cx>


I can't think of a better name, and there is find_next_bit of which
find_next_cpu_mask is just a wrapper. I think the name is good enough.

Greetings,
    Alexander
-- 
  Alexander van Heukelum
  heukelum@fastmail.fm

-- 
http://www.fastmail.fm - Access all of your messages and folders
                          wherever you are

--

From: Matthew Wilcox
Date: Sunday, May 11, 2008 - 3:01 pm

How about doing it this way?

#define for_each_cpu_mask(cpu, mask)				\
	for ((cpu) = -1;					\
	     (cpu) < NR_CPUS;					\
	     (cpu) = find_next_cpu_mask((cpu), &(mask)))

int find_next_cpu_mask(int n, const cpumask_t *srcp)
{
	return find_next_bit(srcp->bits, NR_CPUS, ++n);
}

That actually behaves the way I'd expect a function called
'find_next_cpu_mask' to work.  It also abuses the 'for' condtion
less and might take a little less text space.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Alexander van Heukelum
Date: Monday, May 12, 2008 - 4:04 am

On Sun, 11 May 2008 16:01:12 -0600, "Matthew Wilcox" <matthew@wil.cx>

But it does not work.

It introduces a stray cpu=-1 iteration if cpu happens to be
(replaced by) a signed variable.

It skips the entire loop if cpu happens to be unsigned.

I don't think that using 'for' in a less conventional way
is bad if it is hidden in a macro, as long as the name of
the macro makes the intention sufficiently clear.

I think of find_next_cpu_mask(cpu, mask) as: "find next
cpu-index in mask, starting at index cpu". And similar
with find_next_bit.

As for the text-space argument, I think you might be right.
Just not on i386/x86_64 where initialising a register to -1
can be done in three bytes, initialising to 0 in two bytes
and an increment in one byte :-).

Greetings,
-- 
  Alexander van Heukelum
  heukelum@fastmail.fm

-- 
http://www.fastmail.fm - mmm... Fastmail...

--

From: Matthew Wilcox
Date: Monday, May 12, 2008 - 4:56 am

You're right, of course.  Either of these should work:

#define for_each_cpu_mask(cpu, mask)				\
	for ((cpu) = find_next_cpu_mask(-1, &(mask));		\
	     (cpu) < NR_CPUS;					\
	     (cpu) = find_next_cpu_mask((cpu), &(mask)))

#define for_each_cpu_mask(cpu, mask)				\
	for ((cpu) = -1;					\
	     (cpu) = find_next_cpu_mask((cpu), &(mask)), (cpu) < NR_CPUS; \
	     )

int find_next_cpu_mask(int n, const cpumask_t *srcp)
{
	return find_next_bit(srcp->bits, NR_CPUS, ++n);

If you're determined to stick to your original formulation, renaming it

While the initialisation may take one extra byte, the increment is
done in the function, and so takes at least one byte out of the loop.
Of course on other instruction sets, that's probably 4 bytes out of the
loop and they can initialise to -1 as cheaply as init to 0, so it's a
win on non-x86 and neutral on x86.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

Previous thread: Re: performance "regression" in cfq compared to anticipatory, deadline and noop by Daniel J Blueman on Sunday, May 11, 2008 - 6:14 am. (7 messages)

Next thread: Microblaze toolchain - libc by Michal Simek on Sunday, May 11, 2008 - 7:05 am. (3 messages)