This is my end of year branch annotation cleanups. I compiled my
kernel with:
CONFIG_PROFILE_ANNOTATED_BRANCHES
And booted this on my main work station, as well as a couple of
test boxes. Letting it run for a while I then looked at the results.
cat /debug/tracing/trace_stat/branch_annotated
Which gives a result looking something like this:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
0 366631256 100 sched_info_switch sched_stats.h 269
0 211488512 100 sched_info_queued sched_stats.h 222
0 197083456 100 sched_info_dequeued sched_stats.h 177
0 130545377 100 syscall_trace_enter ptrace.c 1375
0 130545340 100 syscall_trace_leave ptrace.c 1403
0 326960 100 pre_schedule_rt sched_rt.c 1476
0 5214 100 signal_pending sched.h 2313
0 1305 100 trylock_page pagemap.h 316
It sorts the list with those that have the highest percentage of
incorrect uses and the largest number of them.
The sched_info_switch() is part of the sched_stats which falls into:
if (unlikely(1))
And gcc is smart enough to ignore it, but because it uses a static
inline function that returns 1, the branch profiler misses that it
is indeed a constant.
I also ignored those that only had a few hits, like the signal_pending
above.
Anyway, there are 10 areas that I found that really do not need to have
an annotated branch, or the annotation is simply the opposite of
what it should be.
If people are fine with these, I'm hoping to get acks and then push this
out as a real clean up for 2.6.38.
The following patches are in:
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
branch: unlikely/core
Steven ...From: Steven Rostedt <srostedt@redhat.com> The mapping_unevictable() has a likely() around the mapping parameter. This mapping parameter comes from page_mapping() which has an unlikely() that the page will be set as PAGE_MAPPING_ANON, and if so, it will return NULL. One would think that this unlikely() means that the mapping returned by page_mapping() would not be NULL, but where page_mapping() is used just above mapping_unevictable(), that unlikely() is incorrect most of the time. This means that the "likely(mapping)" in mapping_unevictable() is incorrect most of the time. Running the annotated branch profiler on my main box which runs firefox, evolution, xchat and is part of my distcc farm, I had this: correct incorrect % Function File Line ------- --------- - -------- ---- ---- 12872836 1269443893 98 mapping_unevictable pagemap.h 51 35935762 1270265395 97 page_mapping mm.h 659 1306198001 143659 0 page_mapping mm.h 657 203131478 121586 0 page_mapping mm.h 657 5415491 1116 0 page_mapping mm.h 657 74899487 1116 0 page_mapping mm.h 657 203132845 224 0 page_mapping mm.h 659 5415464 27 0 page_mapping mm.h 659 13552 0 0 page_mapping mm.h 657 13552 0 0 page_mapping mm.h 659 242630 0 0 page_mapping mm.h 657 242630 0 0 page_mapping mm.h 659 74899487 0 0 page_mapping mm.h 659 The page_mapping() is a static inline, which is why it shows up multiple times. The mapping_unevictable() is also a ...
[ Resending to Nick's real email ] From: Steven Rostedt <srostedt@redhat.com> The mapping_unevictable() has a likely() around the mapping parameter. This mapping parameter comes from page_mapping() which has an unlikely() that the page will be set as PAGE_MAPPING_ANON, and if so, it will return NULL. One would think that this unlikely() means that the mapping returned by page_mapping() would not be NULL, but where page_mapping() is used just above mapping_unevictable(), that unlikely() is incorrect most of the time. This means that the "likely(mapping)" in mapping_unevictable() is incorrect most of the time. Running the annotated branch profiler on my main box which runs firefox, evolution, xchat and is part of my distcc farm, I had this: correct incorrect % Function File Line ------- --------- - -------- ---- ---- 12872836 1269443893 98 mapping_unevictable pagemap.h 51 35935762 1270265395 97 page_mapping mm.h 659 1306198001 143659 0 page_mapping mm.h 657 203131478 121586 0 page_mapping mm.h 657 5415491 1116 0 page_mapping mm.h 657 74899487 1116 0 page_mapping mm.h 657 203132845 224 0 page_mapping mm.h 659 5415464 27 0 page_mapping mm.h 659 13552 0 0 page_mapping mm.h 657 13552 0 0 page_mapping mm.h 659 242630 0 0 page_mapping mm.h 657 242630 0 0 page_mapping mm.h 659 74899487 0 0 page_mapping mm.h 659 The page_mapping() is a static inline, which is why it shows up ...
I think so. We don't have good guidelines or empirical numbers on this, unfortunately, but I think it would be somewhere over 90%, given that it tends to add jumps to and from out of line icache in the incorrect case. Acked-by: Nick Piggin <npiggin@kernel.dk> --
OK, so I'll add the removal of the unlikely() in page_mapping in my next Thanks! -- Steve --
Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
I think you are right. Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --
It'd be better to use if (!mapping) return 0; return test_bit(AS_UNEVICTABLE, &mapping->flags); to avoid the unnecessary !! --
How about return mapping && test_bit(AS_UNEVICTABLE, &mapping->flags); instead...? -miles -- Yossarian was moved very deeply by the absolute simplicity of this clause of Catch-22 and let out a respectful whistle. "That's some catch, that Catch-22," he observed. "It's the best there is," Doc Daneeka agreed. --
I'll just add my current patch as a "likely()" clean up. Any other change should be separate. Thanks, -- Steve --
From: Steven Rostedt <srostedt@redhat.com>
There's an unlikely() in fget_light() that assumes the file ref count
will be 1. Running the annotate branch profiler on a desktop that is
performing daily tasks (running firefox, evolution, xchat and is also part
of a distcc farm), it shows that the ref count is not 1 that often.
correct incorrect % Function File Line
------- --------- - -------- ---- ----
1035099358 6209599193 85 fget_light file_table.c 315
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
fs/file_table.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/fs/file_table.c b/fs/file_table.c
index c3dee38..c3e89ad 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -311,7 +311,7 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
struct files_struct *files = current->files;
*fput_needed = 0;
- if (likely((atomic_read(&files->count) == 1))) {
+ if (atomic_read(&files->count) == 1) {
file = fcheck_files(files, fd);
} else {
rcu_read_lock();
--
1.7.2.3
--
From: Steven Rostedt <srostedt@redhat.com>
Running the annotate branch profiler on three boxes, including my
main box that runs firefox, evolution, xchat, and is part of the distcc farm,
showed this with the likelys in the workqueue code:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
96 996253 99 wq_worker_sleeping workqueue.c 703
96 996247 99 wq_worker_waking_up workqueue.c 677
The likely()s in this case were assuming that WORKER_NOT_RUNNING will
most likely be false. But this is not the case. The reason is
(and shown by adding trace_printks and testing it) that most of the time
WORKER_PREP is set.
In worker_thread() we have:
worker_clr_flags(worker, WORKER_PREP);
[ do work stuff ]
worker_set_flags(worker, WORKER_PREP, false);
(that 'false' means not to wake up an idle worker)
The wq_worker_sleeping() is called from schedule when a worker thread
is putting itself to sleep. Which happens most of the time outside
of that [ do work stuff ].
The wq_worker_waking_up is called by the wakeup worker code, which
is also callod outside that [ do work stuff ].
Thus, the likely and unlikely used by those two functions are actually
backwards.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
kernel/workqueue.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 90db1bd..b3afcce 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -661,7 +661,7 @@ void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
{
struct worker *worker = kthread_data(task);
- if (likely(!(worker->flags & WORKER_NOT_RUNNING)))
+ if (unlikely(!(worker->flags & WORKER_NOT_RUNNING)))
atomic_inc(get_gcwq_nr_running(cpu));
}
@@ -687,7 +687,7 @@ struct task_struct ...Hello, Steven. Yeah, I was lost thinking about the busiest case where workers are busy processing works consecutively. Usually workers are of course switching in and out of idle state all the time. How about just dropping likely/unlikely? Thanks. -- tejun --
I have no problem in dropping them too. I'll do that in my next version. Thanks, -- Steve --
OK, if I just remove it, can I add your Acked-by? Thanks, -- Steve --
Sure, then, I'll route it through my tree. Thanks. -- tejun --
OK, I'll post a new patch series again. Just pluck it from that, and let me know so I can remove it from my tree. Thanks, -- Steve --
From: Steven Rostedt <srostedt@redhat.com>
Running the annotated branch profiler on a box doing average work
(firefox, evolution, xchat, distcc farm), the likely() used in
grab_cache_page_write_begin() was incorrect most of the time:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
1924262 71332401 97 grab_cache_page_write_begin filemap.c 2206
Adding a trace_printk() and running the function tracer limited to
just this function I can see:
gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268947: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=8
gconfd-2-2696 [000] 4467.268959: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268960: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=9
gconfd-2-2696 [000] 4467.268972: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268973: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=10
gconfd-2-2696 [000] 4467.268991: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268992: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=11
gconfd-2-2696 [000] 4467.269005: grab_cache_page_write_begin <-ext3_write_begin
Which shows that a lot of calls from ext3_write_begin will result in
the page returned by "find_lock_page" will be NULL.
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
mm/filemap.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git ...[ Resending to Nick's real email address ]
From: Steven Rostedt <srostedt@redhat.com>
Running the annotated branch profiler on a box doing average work
(firefox, evolution, xchat, distcc farm), the likely() used in
grab_cache_page_write_begin() was incorrect most of the time:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
1924262 71332401 97 grab_cache_page_write_begin filemap.c 2206
Adding a trace_printk() and running the function tracer limited to
just this function I can see:
gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268947: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=8
gconfd-2-2696 [000] 4467.268959: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268960: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=9
gconfd-2-2696 [000] 4467.268972: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268973: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=10
gconfd-2-2696 [000] 4467.268991: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268992: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=11
gconfd-2-2696 [000] 4467.269005: grab_cache_page_write_begin <-ext3_write_begin
Which shows that a lot of calls from ext3_write_begin will result in
the page returned by "find_lock_page" will be NULL.
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
mm/filemap.c | 2 +-
1 files changed, 1 ...Looks good, Acked-by: Nick Piggin <npiggin@kernel.dk> --
From: Steven Rostedt <srostedt@redhat.com>
The unlikely() used in ttwu_post_activation() tests if the rq->idle_stamp
is set. But since this is for a wakeup, and wakeups happen when tasks
block on IO, and blocking tasks on IO may put the system into idle,
this can actually be a common occurence.
Running the annotated branch profiler on an average desktop running
firefox, evolution, xchat and distcc, the report shows:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
34884862 146110926 80 ttwu_post_activation sched.c 2309
80% of the time, this unlikely is incorrect. Best not to assume what the
result is, and just remove the branch annotation.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 269a045..6d24b2e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2458,7 +2458,7 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
if (p->sched_class->task_woken)
p->sched_class->task_woken(rq, p);
- if (unlikely(rq->idle_stamp)) {
+ if (rq->idle_stamp) {
u64 delta = rq->clock - rq->idle_stamp;
u64 max = 2*sysctl_sched_migration_cost;
--
1.7.2.3
--
From: Steven Rostedt <srostedt@redhat.com>
On a 64bit distro, the chances of having a process using segment registers
is very unlikely. But if the userspace is 32bit running on top of
a 64bit kernel (very common), then this will be very likely that
processes have segment registers in use.
Running on my main desktop (which is a 32bit userspace on top of
a 64bit kernel) the annotated branch profiler showed the following:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
25522442 304125815 92 __switch_to process_64.c 408
25522430 304123341 92 __switch_to process_64.c 412
25743877 303891250 92 __switch_to process_64.c 464
Instead of punishing 32bit userspace systems with an unlikely, just remove
the unlikely and let gcc optimize for what it thinks is good and let
the branch prediction (hopefully) work.
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
arch/x86/kernel/process_64.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b3d7a3a..22de90c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -405,11 +405,11 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
* This won't pick up thread selector changes, but I guess that is ok.
*/
savesegment(es, prev->es);
- if (unlikely(next->es | prev->es))
+ if (next->es | prev->es)
loadsegment(es, next->es);
savesegment(ds, prev->ds);
- if (unlikely(next->ds | prev->ds))
+ if (next->ds | prev->ds)
loadsegment(ds, next->ds);
@@ -461,7 +461,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
wrmsrl(MSR_FS_BASE, next->fs);
prev->fsindex = fsindex;
- if (unlikely(gsindex | next->gsindex | prev->gs)) {
+ if (gsindex | ...From: Steven Rostedt <srostedt@redhat.com> The if (unlikely(!rt_rq->rt_nr_running)) test in pick_next_task_rt() tests if there is another rt task ready to run. If so, then pick it. In most systems, only one RT task runs at a time most of the time. Running the branch unlikely annotator profiler on a system doing average work "running firefox, evolution, xchat, distcc builds, etc", it showed the following: correct incorrect % Function File Line ------- --------- - -------- ---- ---- 324344 135104992 99 _pick_next_task_rt sched_rt.c 1064 99% of the time the condition is true. When an RT task schedules out, it is unlikely that another RT task is waiting to run on that same run queue. Cc:Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> --- kernel/sched_rt.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c index 7a5c4db..a249ae3 100644 --- a/kernel/sched_rt.c +++ b/kernel/sched_rt.c @@ -1062,7 +1062,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq) rt_rq = &rq->rt; - if (unlikely(!rt_rq->rt_nr_running)) + if (likely(!rt_rq->rt_nr_running)) return NULL; if (rt_rq_throttled(rt_rq)) -- 1.7.2.3 --
>>> On 12/6/2010 at 08:58 PM, in message <20101207021329.185936860@goodmis.org>, My feeling is that generally speaking, if the branch is workload dependent, we should probably not annotate it at all and let the CPUs branch-predictor do its thing. I guess what I am not 100% clear on is how these annotations affect the BPU. I.e. is it a static decision point or can the BPU still "learn" if the annotation is particularly wrong for a given workload? For the former, I think we should just remove this particular annotation (and there are others that need review). For the latter, this is obviously the right annotation we should be using in this particular case. --
I'm fine with just removing the likely() here. It is a bit workload dependent. We can't assume that we are running a lot of RT tasks on a single CPU, but we also should not assume that we are not doing so. Considering that we are just coming out of an RT task, I doubt it will hurt anything to just keep the annotation off. I don't think gcc does anything special for x86 with regards to the BPU. But I do believe that gcc will add special ops for likely()'s on other archs. -- Steve --
You need to get a better email client ;-) OK, if I just remove the likely() do you Ack it? -- Steve --
From: Steven Rostedt <srostedt@redhat.com>
In fput_light(), there's an unlikely(fput_needed), which running on
my normal desktop doing firefox, xchat, evolution and part of my distcc farm,
and running the annotate branch profiler shows that the unlikely is not
very unlikely.
correct incorrect % Function File Line
------- --------- - -------- ---- ----
0 48 100 fput_light file.h 26
115828710 897415279 88 fput_light file.h 26
865271179 5286128445 85 fput_light file.h 26
19568539 8923664 31 fput_light file.h 26
12353677 3562279 22 fput_light file.h 26
267691 67062 20 fput_light file.h 26
15014853 348172 2 fput_light file.h 26
209258 205 0 fput_light file.h 26
1364164 0 0 fput_light file.h 26
Which gives 1032903812 times it was correct and 6203351846 times it was
incorrect, or 85% incorrect.
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
include/linux/file.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/include/linux/file.h b/include/linux/file.h
index b1e1297..e85baeb 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -23,7 +23,7 @@ extern struct file *alloc_file(struct path *, fmode_t mode,
static inline void fput_light(struct file *file, int fput_needed)
{
- if (unlikely(fput_needed))
+ if (fput_needed)
fput(file);
}
--
1.7.2.3
--
From: Steven Rostedt <srostedt@redhat.com>
As found with the branch annotation profiler, the unlikely(rt_task(prev))
in pre_schedule_rt() is always wrong. In fact it should be likely due to
the fact that we got to this function because we used prev's scheduling
class (which had to be rt).
Change the unlikely(rt_task(prev)) to likely(rt_task(prev))
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
kernel/sched_rt.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index bea7d79..7a5c4db 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1474,7 +1474,7 @@ skip:
static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
{
/* Try to pull RT tasks here if we lower this rq's prio */
- if (unlikely(rt_task(prev)) && rq->rt.highest_prio.curr > prev->prio)
+ if (likely(rt_task(prev)) && rq->rt.highest_prio.curr > prev->prio)
pull_rt_task(rq);
}
--
1.7.2.3
--
I have sent a more radical patch before but it get ignored. IMHO, we can delete the checking for rt_task(prev). Or am I missing something? Thanks, Yong --
I like yours better. I'll add that one to my queue instead. Thanks, -- Steve --
