Re: [rfc][patch] dynamic data structure switching

Previous thread: Re: "GPL weasels and the atheros stink" by Jonathan A. George on Sunday, September 2, 2007 - 1:46 pm. (6 messages)

Next thread: Hang in 2.6.23-rc5 by daryll q on Sunday, September 2, 2007 - 2:42 pm. (16 messages)
To: Linux Kernel Mailing List <linux-kernel@...>, David Miller <davem@...>, Paul McKenney <paulmck@...>
Date: Sunday, September 2, 2007 - 2:27 pm

Hi,

This is my "data structure switch" algorithm that can convert one data
structure into another, with just a single unlikely branch in fastpaths
and no locking or atomic operations (the branch is only triggered when
the data structure is in the process of being converted). A pointer
indirection is generally also needed when converting a global data
structure.

I posted this algorithm a while back and converted the dcache hash to
be dynamically resized at runtime[*], and David Miller started on a
patch to make one of the networking hashes dynamic too.

[*] The algorithm need not only convert between two different sized
hashes, but that's just the most obvious and useful thing to do.
It _could_ convert between a hash and a tree or something crazy
like that too :)

Anyway, the problem with this is that I hadn't found a really nice
way to abstract this functionality and package it in a nice API. It
is almost small enough that open-coding it would be reasonable...
but it would be much nicer to be able to abstract it.

Anyway, here is the algorithm nicely packaged behind a really crappy
API :P I think the idea might be useful enough not to lose it, so
I'm posting this WIP just as a request for comments.

The algorithm implementation is basically optimal, but the API means
that it requires and extra function call and indirect function call
to work, which probably makes it unusable for important implementations.
(C preprocessor magic to avoid the indirect function calls and allow
some level of templating while retaining the types might be the way
to do it).

Dumb pidhash conversion also provided to help people tinker.

Index: linux-2.6/lib/dyndata.c
===================================================================
--- /dev/null
+++ linux-2.6/lib/dyndata.c
@@ -0,0 +1,80 @@
+#include <linux/dyndata.h>
+#include <linux/mutex.h>
+#include <linux/rcupdate.h>
+#include <linux/seqlock.h>
+
+void dyn_data_init(struct dyn_data *dd, void *cur...

To: Nick Piggin <npiggin@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Miller <davem@...>, Paul McKenney <paulmck@...>, Ingo Molnar <mingo@...>
Date: Monday, September 10, 2007 - 7:55 am

With the seqlock protecting the whole xfer a lookup for a non-existing
item might spin here for a very long time.

Maybe we should create this sleeping seqlock we talked and provide both

So its up to the caller to take resize_mutex, but there is no mention of

-

To: Peter Zijlstra <peterz@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Miller <davem@...>, Paul McKenney <paulmck@...>, Ingo Molnar <mingo@...>
Date: Monday, September 10, 2007 - 9:39 am

Yeah, it isn't the best this way. One thing that could be done is to
fine grain the write-side seqlock (eg. take it once per object moved,
or have multiple seqlocks eg. per bucket). It would need some careful

Yeah, I didn't want to document the API yet because it is to horrible :)
In the case of the dynamic pid hash, this mutex serialises resizers, and
so after you take the lock, you can check whether the hash still needs
-

To: Linux Kernel Mailing List <linux-kernel@...>, David Miller <davem@...>, Paul McKenney <paulmck@...>
Date: Sunday, September 2, 2007 - 2:36 pm

Dumb, slightly incomplete, dynamic pidhash resizing. I'm just sending this
as a reference and testing tool. While the pid-hash might be the least
problematic one we have, RAM saving / collision minimisation aren't the
only reasons to size a hash optimally: in a really lookup intensive
workload, a small dense hash table will have between 4-32x better cache
efficiency than a very sparse one, depending on line size and pointer
size. So it is possible that resizing even reasonably small hashes can
be a good idea.

Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c
+++ linux-2.6/kernel/pid.c
@@ -28,13 +28,25 @@
#include <linux/hash.h>
#include <linux/pid_namespace.h>
#include <linux/init_task.h>
+#include <linux/dyndata.h>

-#define pid_hashfn(nr) hash_long((unsigned long)nr, pidhash_shift)
-static struct hlist_head *pid_hash;
-static int pidhash_shift;
+struct pid_hash {
+ struct hlist_head *table;
+ unsigned int shift;
+};
+
+static struct pid_hash pid_hashes[2];
+static unsigned int ph_cur_idx;
+static struct dyn_data dyn_pidhash;
+static unsigned int nr_pids;
+
+/* cur_pid_hash should be used under rcu_read_lock() */
+#define cur_pid_hash ((struct pid_hash *)dyn_data_current_dstruct(&dyn_pidhash))
+#define pid_hashfn(ph, nr) hash_long((unsigned long)nr, ph->shift)
static struct kmem_cache *pid_cachep;
struct pid init_struct_pid = INIT_STRUCT_PID;

+
int pid_max = PID_MAX_DEFAULT;

#define RESERVED_PIDS 300
@@ -190,21 +202,91 @@ static void delayed_put_pid(struct rcu_h
put_pid(pid);
}

+int dd_transfer_pids(void *old_ds, void *new_ds)
+{
+ struct pid_hash *old = old_ds;
+ struct pid_hash *new = new_ds;
+ unsigned int size, i;
+
+ size = 1UL << old->shift;
+
+ BUG_ON(old == new);
+
+ for (i = 0; i < size; i++) {
+ struct hlist_node *elem;
+ struct pid *pid;
+
+ spin_lock_irq(&pidmap_lock);
+again:
+ hlist_for_eac...

To: Nick Piggin <npiggin@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Miller <davem@...>, Paul McKenney <paulmck@...>
Date: Wednesday, September 5, 2007 - 7:05 pm

I see, that in many places all pre-checks are done in negative form
with resulting return or jump out. In this case, if function was called,

that one or this?

==
if (system_state == SYSTEM_RUNNING) {
old_shift = cur_pid_hash->shift;
new_shift = ilog2(nr_pids * 2 - 1);
if (new_shift != old_shift && mutex_trylock(&dyn_pidhash.resize_mutex)) {
==
> + old_shift = cur_pid_hash->shift;
> + new_shift = ilog2(nr_pids * 2 - 1);

/* hope this repetition is needed by design */

...

> + mutex_unlock(&dyn_pidhash.resize_mutex);
}

What is more efficient in general sense,

Thanks.
____
-

To: Oleg Verych <olecom@...>
Cc: Nick Piggin <npiggin@...>, Linux Kernel Mailing List <linux-kernel@...>, David Miller <davem@...>, Paul McKenney <paulmck@...>
Date: Thursday, September 6, 2007 - 7:35 am

In general it shouldn't matter at all on any reasonably
modern CPU (let's say less than 10 years old) unless you do
it tens of thousands of times in a loop. Also gcc has reasonable
default heuristics for this kind of stuff anyways
(e.g. return NULL or return negative value is predicted
unlikely by default)

Cache misses are far more important to care about because
they cost hundreds of cycles.

-Andi
-

To: Oleg Verych <olecom@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, David Miller <davem@...>, Paul McKenney <paulmck@...>
Date: Thursday, September 6, 2007 - 3:14 am

I'm not too sure, but I'd guess that most of the time the compiler will
be able to figure out they are the same.

resize_pid_hash() fortunately isn't a fastpath anyway -- it calls
dyn_data_replace which ends up calling synchronize_rcu() 3 times,
each of which is likely to take a long time!

-

Previous thread: Re: "GPL weasels and the atheros stink" by Jonathan A. George on Sunday, September 2, 2007 - 1:46 pm. (6 messages)

Next thread: Hang in 2.6.23-rc5 by daryll q on Sunday, September 2, 2007 - 2:42 pm. (16 messages)