Signed-of-by: Mike Travis <travis@sgi.com>
---
cpumask.txt | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
--- /dev/null
+++ struct-cpumasks/cpumask.txt
@@ -0,0 +1,73 @@
+
+ CPUMASKS
+ --------
+
+ [PRELIMINARY]
+
+Introduction
+
+Cpumask variables are used to represent (with a single bit) all the
+CPU's in a system. With the prolific increase in the number of CPU's
+in a single system image (SSI), managing this large number of cpus has
+become an ordeal. When the limit of CPU's in a system was small, the
+cpumask could fit within an integer. Even with the increase to 128, a
+bit pattern was well within a managable size. Now with systems available
+with 2048 cores, with hyperthreading, there are 4096 cpu threads. And the
+number of cpus in an SSI is growing, 16,384 is right around the corner.
+Even desktop systems with only 2 or 4 sockets, a new generation of Intel
+processors that have 128 cores per socket will increase the count of CPU
+threads tremendously.
+
+Thus the handling of the cpumask that represents this 4096 limit needs
+512 bytes, putting pressure on the amount of stack space needed to
+accomodate temporary cpumask variables.
+
+The primary goal of the cpumasks API is to accomodate these large number
+of cpus without effecting the compactness of Linux for small, compact
+systems.
+
+
+The Changes
+
+Provide new cpumask interface API. The relevant change is basically
+cpumask_t becomes an opaque object. This should result in the minimum
+amount of modifications while still allowing the inline cpumask functions,
+and the ability to declare static cpumask objects.
+
+
+ /* raw declaration */
+ struct __cpumask_data_s { DECLARE_BITMAP(bits, NR_CPUS); };
+
+ /* cpumask_map_t used for declaring static cpumask maps */
+ typedef struct __cpumask_data_s cpumask_map_t[1];
+
+ /* cpumask_t used for function args and return pointers */
+ typedef struct __cpumask_data_s ...Hi Mike,
I have several problems with this patch series. First, it's a flag day
change, which means it isn't bisectable and can't go through linux-next.
Secondly, we still can't hide the definition of the cpumask struct as long as
they're passed as cpumask_t, so it's going to be hard to find assignments
(illegal once we allocate nr_cpu_ids bits rather than NR_CPUS), and on-stack
users.
Finally, we end up with code which is slightly more opaque than the
current code, with two new typedefs. And that's an ongoing problem.
I took a slightly divergent line with my patch series, and introduced a
parallel cpumask system which always passes and returns masks by pointer:
cpumask_t -> struct cpumask
on-stack cpumask_t -> cpumask_var_t (same as your patch)
cpus_and(dst, src1, src2) etc -> cpumask_and(&dst, &src1, &src2)
cpumask_t cpumask_of_cpu(cpu) -> const struct cpumask *cpumask_of(cpu)
cpumask_t cpu_online_map etc -> const struct cpumask *cpu_online_mask etc.
The old ops are expressed in terms of the new ops, and can be phased out over
time.
In addition, I added some new twists:
static cpumasks and cpumasks in structures
-> DECLARE_BITMAP(foo, NR_CPUS) and to_cpumask()
This means we can eventually obscure the actual definition of struct cpumask,
to catch abuse.
cpus_and(tmp, mask, online_mask); for_each_cpu(i, tmp)
-> for_each_cpu_both(i, mask, online_mask)
This helper saves numerous on-stack temporaries.
NR_CPUS -> CONFIG_NR_CPUS
The config option is now valid for UP as well. This cleanup allows us to
audit users of NR_CPUS (which might be used incorrectly now cpumask_
iterators only go to nr_cpu_ids).
The patches are fairly uninteresting, but here is the summary:
x86: remove noop cpus_and() with CPU_MASK_ALL.
x86: clean up speedctep-centrino and reduce cpumask_t usage
cpumask: remove min from first_cpu/next_cpu
cpumask: introduce struct cpumask.
cpumask: change cpumask_scnprintf, cpumask_parse_user, ...that looks very sane to me. IMHO, an infrastructure change of this magnitude should absolutely be done via the Git space. This needs a ton of testing and needs bisection, a real Git track record, etc. Ingo --
Thanks, it's reasonably nice. The task of hitting all those cpumask_t users Not yet. Committing untested patches into git is the enemy of bisection; if one of my patches breaks an architecture, they lose the ability to bisect until its fixed. If it's a series of patches, we can go back and fix it. Now, once it's been tested a little, it's better for you to git-ize it and I'll send you patches instead. But I want some more people banging on it, and a run through linux-next first... If Mike's happy to work on these as a basis, we should be able to get there soon; the patches are sitting in my tree at http://ozlabs.org/~rusty/kernel/ (see rr-latest symlink). Thanks, Rusty. PS. To emphasize, I haven't actually *booted* this kernel. My test machines are still in transit as I move (and ADSL not connected yet... Grr...) --
while the initial series might be rebased once or twice, beyond the 1-2 days of initial integration and testing i dont think that's true, and i'm doing up to 3-4 bisections a day just fine, on an append-mostly tree. if you have trouble turning a Git tree into a bisectable tree then your testing-fu is not strong enough ;-) [ the only plausible danger is to architectures that are not used by testers all that much (so that breakages can linger a lot longer unnoticed) - but why should the other 99% of Linux users be put at a disadvantage for them. ] Ingo --
Absolutely! I may have my own concerns and preferences but the end goal is far more important. I'll take a look at it today. [My only other pressing matter is convincing Ingo to accept the SCIR driver (or tell me how I need Since our approaches are not different in concept, I can assure you that it works... ;-) And as Ingo and others have noted, the infrastructure is easy to verify, it's the allocation of the temporary cpumasks that will be more difficult to test. Cheers, Mike --
it's getting off topic, but i really dont get it why you cannot go via the standard LEDS framework, and why you have to hook into the x86 idle notifiers. (which we are hoping to get rid of) RAS does not need that precise accounting. It just needs a heartbeat timer that tells it how to do the pretty lights and to report whether the CPU is still alive. Something that seems to be fully within the scope of LEDS. What am i missing? Ingo --
could you please bring these arguments up in the public thread, with
LEDS people Cc:-ed?
Ingo
[Changed the Cc list to whom I think may be interested, particularly
Richard Purdie <rpurdie@rpsys.net> for comments on the LED system,
and Thomas Gleixner <tglx@linutronix.de> for comments on using
the hi-res timer to interrupt each cpu every second.]
Hi Ingo,
The LED framework is fine for monitoring system activity with a few
LED's. It can quantify system activity to provide a variably lit LED
and display disk activity. Each LED requires registration similar to:
/* For the leds-gpio driver */
struct gpio_led {
const char *name;
char *default_trigger;
unsigned gpio;
u8 active_low;
};
struct gpio_led_platform_data {
int num_leds;
struct gpio_led *leds;
int (*gpio_blink_set)(unsigned gpio,
unsigned long *delay_on,
unsigned long *delay_off);
};
I would need an array of up to 4096 of the led_info structs allocated
on Node 0 at bootup time based on the number of cpus. Registration of
these 4096 leds will allocate another (up to) 4096 array similar
to this struct on Node 0:
struct gpio_led_data {
struct led_classdev cdev;
unsigned gpio;
struct work_struct work;
u8 new_level;
u8 can_sleep;
u8 active_low;
int (*platform_gpio_blink_set)(unsigned gpio,
unsigned long *delay_on, unsigned long *delay_off);
};
After registration there will be (up to) 4096 nodes in /sys/class/leds/
using the naming convention: "devicename:colour:function". I'm not sure
of the total number of sysfs leaves but there's at least a brightness
and a trigger leaf under each. This would add up to 12288 new entries
created in the sysfs filesystem. (And none of these are useful.)
...