Each time a software event triggers, we need to walk through
the entire list of events from the current cpu and task contexts
to retrieve a running perf event that matches.
We also need to check a matching perf event is actually counting.
This walk is wasteful and makes the event fast path scaling
down with a growing number of events running on the same
contexts.
To solve this, we store the running perf events in a hashlist to
get an immediate access to them against their type:event_id when
they trigger.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
---
include/linux/perf_event.h | 12 +++
kernel/perf_event.c | 226 +++++++++++++++++++++++++++++++------------
2 files changed, 175 insertions(+), 63 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6e96cc8..03e87c4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -589,6 +589,14 @@ enum perf_group_flag {
PERF_GROUP_SOFTWARE = 0x1,
};
+#define SWEVENT_HLIST_BITS 8
+#define SWEVENT_HLIST_SIZE ((1 << (SWEVENT_HLIST_BITS + 1)) - 1)
+
+struct swevent_hlist {
+ struct hlist_head heads[SWEVENT_HLIST_SIZE];
+ struct rcu_head rcu_head;
+};
+
/**
* struct perf_event - performance event kernel representation:
*/
@@ -597,6 +605,7 @@ struct perf_event {
struct list_head group_entry;
struct list_head event_entry;
struct list_head sibling_list;
+ struct hlist_node hlist_entry;
int nr_siblings;
int group_flags;
struct perf_event *group_leader;
@@ -744,6 +753,9 @@ struct perf_cpu_context {
int active_oncpu;
int max_pertask;
int exclusive;
+ struct swevent_hlist *swevent_hlist;
+ struct mutex hlist_mutex;
+ struct kref hlist_refcount;
/*
* Recursion avoidance:
diff --git a/kernel/perf_event.c ...So we have a hash-table per-cpu, each event takes a ref on the hash table, when the thing is empty we free it. When the event->cpu == -1 (all cpus) we take a ref on all possible cpu's hash-table (should be online I figure, but that requires adding a hotplug handler). Then on event enable/disable we actually add the event to the hash-table belonging to the cpu the event/task gets scheduled on, since each event can only ever be active on one cpu. Right? So looks good, altough I think we want to do that online/hotplug thing. --
Alternatively, you can simply but the hash table into the per-cpu structure and not allocate it, its only a single page (half a page if you use 32bit or actually use 8 bits. --
As you prefer. This would indeed make it more simple, but that would also make these pages unused most of the time. --
True I suppose,.. have a go at that hotplug stuff, the only thing I worried about was that refcount stuff getting rather ugly. --
Each time a software event triggers, we need to walk through
the entire list of events from the current cpu and task contexts
to retrieve a running perf event that matches.
We also need to check a matching perf event is actually counting.
This walk is wasteful and makes the event fast path scaling
down with a growing number of events running on the same
contexts.
To solve this, we store the running perf events in a hashlist to
get an immediate access to them against their type:event_id when
they trigger.
v2: - Fix SWEVENT_HLIST_SIZE definition (and re-learn some basic
maths along the way)
- Only allocate hlist for online cpus, but keep track of the
refcount on offline possible cpus too, so that we allocate it
if needed when it becomes online.
- Drop the kref use as it's not adapted to our tricks anymore.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
---
include/linux/perf_event.h | 12 ++
kernel/perf_event.c | 251 +++++++++++++++++++++++++++++++++-----------
2 files changed, 200 insertions(+), 63 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6e96cc8..bf896d0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -589,6 +589,14 @@ enum perf_group_flag {
PERF_GROUP_SOFTWARE = 0x1,
};
+#define SWEVENT_HLIST_BITS 8
+#define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS)
+
+struct swevent_hlist {
+ struct hlist_head heads[SWEVENT_HLIST_SIZE];
+ struct rcu_head rcu_head;
+};
+
/**
* struct perf_event - performance event kernel representation:
*/
@@ -597,6 +605,7 @@ struct perf_event {
struct list_head group_entry;
struct list_head event_entry;
struct list_head sibling_list;
+ struct hlist_node hlist_entry;
int nr_siblings;
int group_flags;
struct ...Le jeudi 08 avril 2010 à 21:39 +0200, Frederic Weisbecker a écrit : --
That would let us allocate on online cpus instead of possibles? Yeah right. --
Right, so if you want to go this route (and not simply embed it in the percpu data), the complication I thought of is that you want the refcount on offline cpus but not the allocation, since its very hard to reconstruct the number of events that had event->cpu == -1. --
That seems to result in 9 bits worth, doesn't it? --
Oh right, I was confused, will zapp the + 1. --
Also, what you're computing seems to be a mask, not a size So with BITS = 8, you want SIZE = 1<<BITS = 256 MASK = SIZE-1 = 255 right? --
Oh right... damn I suck in maths... --
