Re: [PATCH 01/31] cpumask: Documentation

Previous thread: [PATCH 03/31] cpumask: remove min from first_cpu/next_cpu by Mike Travis on Monday, September 29, 2008 - 11:02 am. (1 message)

Next thread: [PATCH 04/31] cpumask: move cpu_alloc to separate file by Mike Travis on Monday, September 29, 2008 - 11:02 am. (1 message)
From: Mike Travis
Date: Monday, September 29, 2008 - 11:02 am

Signed-of-by: Mike Travis <travis@sgi.com>
---
 cpumask.txt |   73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

--- /dev/null
+++ struct-cpumasks/cpumask.txt
@@ -0,0 +1,73 @@
+
+		CPUMASKS
+		--------
+
+	      [PRELIMINARY]
+
+Introduction
+
+Cpumask variables are used to represent (with a single bit) all the
+CPU's in a system.  With the prolific increase in the number of CPU's
+in a single system image (SSI), managing this large number of cpus has
+become an ordeal.  When the limit of CPU's in a system was small, the
+cpumask could fit within an integer.  Even with the increase to 128, a
+bit pattern was well within a managable size.  Now with systems available
+with 2048 cores, with hyperthreading, there are 4096 cpu threads.  And the
+number of cpus in an SSI is growing, 16,384 is right around the corner.
+Even desktop systems with only 2 or 4 sockets, a new generation of Intel
+processors that have 128 cores per socket will increase the count of CPU
+threads tremendously.
+
+Thus the handling of the cpumask that represents this 4096 limit needs
+512 bytes, putting pressure on the amount of stack space needed to
+accomodate temporary cpumask variables.
+
+The primary goal of the cpumasks API is to accomodate these large number
+of cpus without effecting the compactness of Linux for small, compact
+systems.
+
+
+The Changes
+
+Provide new cpumask interface API.  The relevant change is basically
+cpumask_t becomes an opaque object.  This should result in the minimum
+amount of modifications while still allowing the inline cpumask functions,
+and the ability to declare static cpumask objects.
+
+
+    /* raw declaration */
+    struct __cpumask_data_s { DECLARE_BITMAP(bits, NR_CPUS); };
+
+    /* cpumask_map_t used for declaring static cpumask maps */
+    typedef struct __cpumask_data_s cpumask_map_t[1];
+
+    /* cpumask_t used for function args and return pointers */
+    typedef struct __cpumask_data_s ...
From: Rusty Russell
Date: Tuesday, September 30, 2008 - 3:49 pm

Hi Mike,

    I have several problems with this patch series.  First, it's a flag day 
change, which means it isn't bisectable and can't go through linux-next.  
Secondly, we still can't hide the definition of the cpumask struct as long as 
they're passed as cpumask_t, so it's going to be hard to find assignments 
(illegal once we allocate nr_cpu_ids bits rather than NR_CPUS), and on-stack 
users.

    Finally, we end up with code which is slightly more opaque than the 
current code, with two new typedefs.  And that's an ongoing problem.

    I took a slightly divergent line with my patch series, and introduced a 
parallel cpumask system which always passes and returns masks by pointer:

	cpumask_t -> struct cpumask
	on-stack cpumask_t -> cpumask_var_t (same as your patch)
	cpus_and(dst, src1, src2) etc -> cpumask_and(&dst, &src1, &src2)
	cpumask_t cpumask_of_cpu(cpu) -> const struct cpumask *cpumask_of(cpu)
	cpumask_t cpu_online_map etc -> const struct cpumask *cpu_online_mask etc.

The old ops are expressed in terms of the new ops, and can be phased out over 
time.

In addition, I added some new twists:

	static cpumasks and cpumasks in structures
		-> DECLARE_BITMAP(foo, NR_CPUS) and to_cpumask()

This means we can eventually obscure the actual definition of struct cpumask, 
to catch abuse.

	cpus_and(tmp, mask, online_mask); for_each_cpu(i, tmp)
		-> for_each_cpu_both(i, mask, online_mask)

This helper saves numerous on-stack temporaries.

	NR_CPUS -> CONFIG_NR_CPUS

The config option is now valid for UP as well.  This cleanup allows us to 
audit users of NR_CPUS (which might be used incorrectly now cpumask_ 
iterators only go to nr_cpu_ids).

The patches are fairly uninteresting, but here is the summary:

x86: remove noop cpus_and() with CPU_MASK_ALL.
x86: clean up speedctep-centrino and reduce cpumask_t usage
cpumask: remove min from first_cpu/next_cpu
cpumask: introduce struct cpumask.
cpumask: change cpumask_scnprintf, cpumask_parse_user, ...
From: Ingo Molnar
Date: Wednesday, October 1, 2008 - 2:13 am

that looks very sane to me.


IMHO, an infrastructure change of this magnitude should absolutely be 
done via the Git space. This needs a ton of testing and needs bisection, 
a real Git track record, etc.

	Ingo
--

From: Rusty Russell
Date: Wednesday, October 1, 2008 - 5:36 pm

Thanks, it's reasonably nice.  The task of hitting all those cpumask_t users 

Not yet.  Committing untested patches into git is the enemy of bisection; if 
one of my patches breaks an architecture, they lose the ability to bisect 
until its fixed.  If it's a series of patches, we can go back and fix it.

Now, once it's been tested a little, it's better for you to git-ize it and 
I'll send you patches instead.  But I want some more people banging on it, 
and a run through linux-next first...

If Mike's happy to work on these as a basis, we should be able to get there 
soon; the patches are sitting in my tree at http://ozlabs.org/~rusty/kernel/ 
(see rr-latest symlink).

Thanks,
Rusty.
PS.  To emphasize, I haven't actually *booted* this kernel.  My test machines 
are still in transit as I move (and ADSL not connected yet... Grr...)
--

From: Ingo Molnar
Date: Thursday, October 2, 2008 - 2:32 am

while the initial series might be rebased once or twice, beyond the 1-2 
days of initial integration and testing i dont think that's true, and 
i'm doing up to 3-4 bisections a day just fine, on an append-mostly 
tree.

if you have trouble turning a Git tree into a bisectable tree then your 
testing-fu is not strong enough ;-)

[ the only plausible danger is to architectures that are not used by 
  testers all that much (so that breakages can linger a lot longer 
  unnoticed) - but why should the other 99% of Linux users be put at a 
  disadvantage for them. ]

	Ingo
--

From: Mike Travis
Date: Thursday, October 2, 2008 - 5:54 am

Absolutely!  I may have my own concerns and preferences but the end goal is
far more important.  I'll take a look at it today.  [My only other pressing
matter is convincing Ingo to accept the SCIR driver (or tell me how I need

Since our approaches are not different in concept, I can assure you that it
works... ;-)  And as Ingo and others have noted, the infrastructure is easy
to verify, it's the allocation of the temporary cpumasks that will be more
difficult to test.

Cheers,
Mike

--

From: Ingo Molnar
Date: Friday, October 3, 2008 - 2:04 am

it's getting off topic, but i really dont get it why you cannot go via 
the standard LEDS framework, and why you have to hook into the x86 idle 
notifiers. (which we are hoping to get rid of)

RAS does not need that precise accounting. It just needs a heartbeat 
timer that tells it how to do the pretty lights and to report whether 
the CPU is still alive. Something that seems to be fully within the 
scope of LEDS. What am i missing?

	Ingo
--

From: Mike Travis
Date: Monday, October 6, 2008 - 8:02 am

could you please bring these arguments up in the public thread, with 
LEDS people Cc:-ed?

	Ingo

[Changed the Cc list to whom I think may be interested, particularly
Richard Purdie <rpurdie@rpsys.net> for comments on the LED system,
and Thomas Gleixner <tglx@linutronix.de> for comments on using
the hi-res timer to interrupt each cpu every second.]


Hi Ingo,

The LED framework is fine for monitoring system activity with a few
LED's.  It can quantify system activity to provide a variably lit LED
and display disk activity.  Each LED requires registration similar to:

/* For the leds-gpio driver */
struct gpio_led {
        const char *name;
        char *default_trigger;
        unsigned        gpio;
        u8              active_low;
};

struct gpio_led_platform_data {
        int             num_leds;
        struct gpio_led *leds;
        int             (*gpio_blink_set)(unsigned gpio,
                                        unsigned long *delay_on,
                                        unsigned long *delay_off);
};

I would need an array of up to 4096 of the led_info structs allocated
on Node 0 at bootup time based on the number of cpus.  Registration of
these 4096 leds will allocate another (up to) 4096 array similar
to this struct on Node 0:

struct gpio_led_data {
        struct led_classdev cdev;
        unsigned gpio;
        struct work_struct work;
        u8 new_level;
        u8 can_sleep;
        u8 active_low;
        int (*platform_gpio_blink_set)(unsigned gpio,
                        unsigned long *delay_on, unsigned long *delay_off);
};

After registration there will be (up to) 4096 nodes in /sys/class/leds/
using the naming convention: "devicename:colour:function".  I'm not sure
of the total number of sysfs leaves but there's at least a brightness
and a trigger leaf under each.  This would add up to 12288 new entries
created in the sysfs filesystem.  (And none of these are useful.)
                                                     ...
Previous thread: [PATCH 03/31] cpumask: remove min from first_cpu/next_cpu by Mike Travis on Monday, September 29, 2008 - 11:02 am. (1 message)

Next thread: [PATCH 04/31] cpumask: move cpu_alloc to separate file by Mike Travis on Monday, September 29, 2008 - 11:02 am. (1 message)