The for_each_cpu_mask loop is used quite often in the kernel. It
makes use of two functions: first_cpu and next_cpu. This patch
changes for_each_cpu_mask to use only one function: a newly
introduced find_next_cpu. Each use of the for_each_cpu_mask
constuct then becomes a few bytes smaller. An x86_64 defconfig
kernel is about 2000 bytes smaller with this patch applied:
text data bss dec hex filename
5395732 976736 734280 7106748 6c70bc vmlinux.orig
5393639 976736 734280 7104655 6c688f vmlinux
Runs fine on qemu UP/SMP x86_64/i386.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---
Hello Andrew,
Could you add this patch to -mm?
Greetings,
Alexander
include/linux/cpumask.h | 14 ++++++++------
lib/cpumask.c | 6 ++++++
2 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 9650806..a760e29 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -221,9 +221,11 @@ int __first_cpu(const cpumask_t *srcp);
#define first_cpu(src) __first_cpu(&(src))
int __next_cpu(int n, const cpumask_t *srcp);
#define next_cpu(n, src) __next_cpu((n), &(src))
+int find_next_cpu_mask(int n, const cpumask_t *srcp);
#else
-#define first_cpu(src) ({ (void)(src); 0; })
-#define next_cpu(n, src) ({ (void)(src); 1; })
+#define first_cpu(src) ({ (void)(src); 0; })
+#define next_cpu(n, src) ({ (void)(src); 1; })
+#define find_next_cpu_mask(n, src) ({ (void)(src); n; })
#endif
#ifdef CONFIG_HAVE_CPUMASK_OF_CPU_MAP
@@ -351,10 +353,10 @@ static inline void __cpus_fold(cpumask_t *dstp, const cpumask_t *origp,
}
#if NR_CPUS > 1
-#define for_each_cpu_mask(cpu, mask) \
- for ((cpu) = first_cpu(mask); \
- (cpu) < NR_CPUS; \
- (cpu) = next_cpu((cpu), (mask)))
+#define for_each_cpu_mask(cpu, mask) \
+ for ((cpu) = 0; \
+ (cpu) = find_next_cpu_mask((cpu), &(mask)), \
+ (cpu) < NR_CPUS; (cpu)++)
#else /* NR_CPUS == 1 */
...Looks good, at first glance.
Could you include Mike Travis <travis@sgi.com> on
the CC list for this work? He's been having
fun lately with cpumasks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
I believe that it's for_each_cpu_mask which is newly introduced,
not find_next_cpu ... just a typo, granted.
Any chance that you could make the same change to nodemask.h?
Where practical, I like to keep cpumask.h and nodemask.h the same.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
The for_each_node_mask loop makes use of two inlined functions:
first_node and next_node. This patch changes for_each_node_mask
to use only one out-of-line function: find_next_node_mask. An
x86_64 defconfig kernel is about 1500 bytes smaller with this
patch applied:
text data bss dec hex filename
5395732 976736 734280 7106748 6c70bc vmlinux.orig
5394174 976736 734280 7105190 6c6aa6 vmlinux
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---
I meant find_next_cpu_mask. That was a last-minute change of mind
Sure. This patch introduces lib/nodemask.c, but I'm not quite sure
if building it should depend on CONFIG_SMP or something else (NUMA?).
When is MAX_NUMNODES 1?
It seems to work on qemu x86_64-smp and i386-up.
I'ld be happy to take a stab at aligning the cpumask and nodemask
code even more by uninlining some more functions and using stubs
for the MAX_NUMNODES=1 case.
Greetings,
Alexander
include/linux/nodemask.h | 11 +++++++----
lib/Makefile | 2 +-
lib/nodemask.c | 10 ++++++++++
3 files changed, 18 insertions(+), 5 deletions(-)
create mode 100644 lib/nodemask.c
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 848025c..dbc80f0 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -239,6 +239,8 @@ static inline int __next_node(int n, const nodemask_t *srcp)
return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
}
+int find_next_node_mask(int n, const nodemask_t *srcp);
+
#define nodemask_of_node(node) \
({ \
typeof(_unused_nodemask_arg_) m; \
@@ -347,10 +349,11 @@ static inline void __nodes_fold(nodemask_t *dstp, const nodemask_t *origp,
}
#if MAX_NUMNODES > 1
-#define for_each_node_mask(node, mask) \
- for ((node) = first_node(mask); \
- (node) < MAX_NUMNODES; \
- (node) = next_node((node), (mask)))
+#define for_each_node_mask(node, mask) \
+ for ((node) = ...Well ... I'm pretty sure it made sense to depend on SMP, back when
it was first added. However that might have changed. I recall
vaguely that there has been discussion of this CONFIG_SMP dependency
every year or so, but I don't have the time right now to dig through
the archives and code to figure it out.
That could be good ... though could you co-ordinate with Mike Travis
first, to minimize the risks of merge conflicts with what he's doing?
You kernel text space saving in the first patch seemed worth going
ahead with even if it did conflict a little, and I liked the matching
nodemask patch, just to keep things in sync. Other nodemask cleanup
is a little lower priority in my book, so should make a modest effort
to co-ordinate with more critical patches, to minimize conflict.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
I believe the x86#testing tree includes Mike's work? The two patches Thanks for your guidance. Greetings, -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - Accessible with your email software or over the web --
Yes, my only change right now is to use nr_cpu_ids instead of NR_CPUS which short cuts a huge chunk out of the back end of the loop. And my changes are in sched/latest (which includes x86/latest). Thanks, --
The for_each_cpu_mask loop is used quite often in the kernel. It makes use of two functions: first_cpu and next_cpu. This patch changes for_each_cpu_mask to use only the latter. Because next_cpu finds the next eligible cpu _after_ the given one, the iteration variable has to be initialized to -1 and next_cpu has to be called with this value before the first iteration. An x86_64 defconfig kernel (from sched/latest) is about 2500 bytes smaller with this patch applied: text data bss dec hex filename 6222517 917952 749932 7890401 7865e1 vmlinux.orig 6219922 917952 749932 7887806 785bbe vmlinux The same size reduction is seen for defconfig+MAXSMP text data bss dec hex filename 6241772 2563968 1492716 10298456 9d2458 vmlinux.orig 6239211 2563968 1492716 10295895 9d1a57 vmlinux Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm> --- Hello Mike, Ingo, Here is a version of the for_each_cpu_mask change that applies on top of sched/latest. I think it is non-intrusive enough to put it in sched/latest? This version reuses the already-existing api next_cpu instead of inventing a new one; initializing the iteration counter to -1 was suggested by Matthew Wilcox. Greetings, Alexander include/linux/cpumask.h | 16 ++++++++-------- 1 files changed, 8 insertions(+), 8 deletions(-) diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index 73434e5..74b748b 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -377,10 +377,10 @@ int __any_online_cpu(const cpumask_t *mask); #define first_cpu(src) __first_cpu(&(src)) #define next_cpu(n, src) __next_cpu((n), &(src)) #define any_online_cpu(mask) __any_online_cpu(&(mask)) -#define for_each_cpu_mask(cpu, mask) \ - for ((cpu) = first_cpu(mask); \ - (cpu) < NR_CPUS; \ - (cpu) = next_cpu((cpu), (mask))) +#define for_each_cpu_mask(cpu, mask) \ + for ((cpu) = ~((typeof(cpu))0); \ + (cpu) = next_cpu((cpu), (mask)), \ + (cpu) < ...
There is no need for such a complicated expression, -1 will work for every (arithmetic) type. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." --
The for_each_cpu_mask loop is used quite often in the kernel. It makes use of two functions: first_cpu and next_cpu. This patch changes for_each_cpu_mask to use only the latter. Because next_cpu finds the next eligible cpu _after_ the given one, the iteration variable has to be initialized to -1 and next_cpu has to be called with this value before the first iteration. An x86_64 defconfig kernel (from sched/latest) is about 2500 bytes smaller with this patch applied: text data bss dec hex filename 6222517 917952 749932 7890401 7865e1 vmlinux.orig 6219922 917952 749932 7887806 785bbe vmlinux The same size reduction is seen for defconfig+MAXSMP text data bss dec hex filename 6241772 2563968 1492716 10298456 9d2458 vmlinux.orig 6239211 2563968 1492716 10295895 9d1a57 vmlinux Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm> --- Indeed, thanks. This version applies on top of sched/latest. This version reuses the already-existing api next_cpu instead of inventing a new one; initializing the iteration counter to -1 was suggested by Matthew Wilcox. Now with a -1 instead of an overly carefull ~((typeof(cpu))0). "-1" is properly sign- extended even if cpu is u64 in a 32-bit environment. Greetings, Alexander include/linux/cpumask.h | 16 ++++++++-------- 1 files changed, 8 insertions(+), 8 deletions(-) diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index 73434e5..c24a556 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -377,10 +377,10 @@ int __any_online_cpu(const cpumask_t *mask); #define first_cpu(src) __first_cpu(&(src)) #define next_cpu(n, src) __next_cpu((n), &(src)) #define any_online_cpu(mask) __any_online_cpu(&(mask)) -#define for_each_cpu_mask(cpu, mask) \ - for ((cpu) = first_cpu(mask); \ - (cpu) < NR_CPUS; \ - (cpu) = next_cpu((cpu), (mask))) +#define for_each_cpu_mask(cpu, mask) \ + for ((cpu) = -1; \ + (cpu) = next_cpu((cpu), ...
applied for testing, thanks Alexander. Ingo --
For anyone else having similar cognitive dissonance while reading this Maybe a better name for this function would help. I can't think of a good one right now though. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
On Sun, 11 May 2008 09:24:40 -0600, "Matthew Wilcox" <matthew@wil.cx>
I can't think of a better name, and there is find_next_bit of which
find_next_cpu_mask is just a wrapper. I think the name is good enough.
Greetings,
Alexander
--
Alexander van Heukelum
heukelum@fastmail.fm
--
http://www.fastmail.fm - Access all of your messages and folders
wherever you are
--
How about doing it this way?
#define for_each_cpu_mask(cpu, mask) \
for ((cpu) = -1; \
(cpu) < NR_CPUS; \
(cpu) = find_next_cpu_mask((cpu), &(mask)))
int find_next_cpu_mask(int n, const cpumask_t *srcp)
{
return find_next_bit(srcp->bits, NR_CPUS, ++n);
}
That actually behaves the way I'd expect a function called
'find_next_cpu_mask' to work. It also abuses the 'for' condtion
less and might take a little less text space.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
On Sun, 11 May 2008 16:01:12 -0600, "Matthew Wilcox" <matthew@wil.cx> But it does not work. It introduces a stray cpu=-1 iteration if cpu happens to be (replaced by) a signed variable. It skips the entire loop if cpu happens to be unsigned. I don't think that using 'for' in a less conventional way is bad if it is hidden in a macro, as long as the name of the macro makes the intention sufficiently clear. I think of find_next_cpu_mask(cpu, mask) as: "find next cpu-index in mask, starting at index cpu". And similar with find_next_bit. As for the text-space argument, I think you might be right. Just not on i386/x86_64 where initialising a register to -1 can be done in three bytes, initialising to 0 in two bytes and an increment in one byte :-). Greetings, -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - mmm... Fastmail... --
You're right, of course. Either of these should work:
#define for_each_cpu_mask(cpu, mask) \
for ((cpu) = find_next_cpu_mask(-1, &(mask)); \
(cpu) < NR_CPUS; \
(cpu) = find_next_cpu_mask((cpu), &(mask)))
#define for_each_cpu_mask(cpu, mask) \
for ((cpu) = -1; \
(cpu) = find_next_cpu_mask((cpu), &(mask)), (cpu) < NR_CPUS; \
)
int find_next_cpu_mask(int n, const cpumask_t *srcp)
{
return find_next_bit(srcp->bits, NR_CPUS, ++n);
If you're determined to stick to your original formulation, renaming it
While the initialisation may take one extra byte, the increment is
done in the function, and so takes at least one byte out of the loop.
Of course on other instruction sets, that's probably 4 bytes out of the
loop and they can initialise to -1 as cheaply as init to 0, so it's a
win on non-x86 and neutral on x86.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
