Re: [Bugme-new] [Bug 15618] New: 2.6.18->2.6.32->2.6.33 huge regression in performance

From: Andrew Morton
Date: Tuesday, March 23, 2010 - 7:22 am

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).


lolz.  Catastrophic meltdown.  Thanks for doing all that work - at a
guess I'd say it's mmap_sem.  Perhaps with some assist from the CPU
scheduler.

If you change the config to set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
CONFIG_RWSEM_XCHGADD_ALGORITHM=y does it help?

Anyway, there's a testcase in bugzilla and it looks like we got us some
--

From: Ingo Molnar
Date: Tuesday, March 23, 2010 - 10:34 am

Looks like we don't need to guess, just look at the call graph profile (a'ka 

It shows a very brutal amount of page fault invoked mmap_sem spinning 

Doesn't look like it, the perf stat numbers show that the scheduler is only 
very lightly involved:

  > > 129875.554435 task-clock-msecs # 10.210 CPUs 
  > >          1883 context-switches # 0.000 M/sec 
 
a context switch only every ~68 milliseconds.

	Ingo
--
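The context-switch interval quoted above can be rechecked from the two perf stat lines. This is a minimal sketch; the variable names are illustrative, and the wall-clock estimate assumes that the "CPUs" figure is task-clock divided by elapsed time. Dividing task-clock by the context-switch count gives a little under 69 ms of CPU time per switch, in line with the roughly 68 ms mentioned.

```python
# Figures copied from the perf stat output quoted in the message above.
task_clock_ms = 129875.554435   # total CPU time consumed, in milliseconds
context_switches = 1883
cpus_utilized = 10.210          # the "CPUs" figure printed by perf stat

# CPU time consumed per context switch (the figure discussed in the thread).
cpu_ms_per_switch = task_clock_ms / context_switches

# Assumption: elapsed wall-clock time is task-clock divided by CPUs utilized.
wall_ms = task_clock_ms / cpus_utilized
wall_ms_per_switch = wall_ms / context_switches

print(f"CPU ms per context switch:  {cpu_ms_per_switch:.1f}")
print(f"wall ms per context switch: {wall_ms_per_switch:.1f}")
```

Either way the scheduler is switching tasks rarely compared with the time spent running, which is consistent with the lock contention being elsewhere.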

From: Linus Torvalds
Date: Tuesday, March 23, 2010 - 10:45 am

Isn't this already fixed? It's the same old "x86-64 rwsemaphores are using 
the shit-for-brains generic version" thing, and it's fixed by

	1838ef1 x86-64, rwsem: 64-bit xadd rwsem implementation
	5d0b723 x86: clean up rwsem type system
	59c33fa x86-32: clean up rwsem inline asm statements

NOTE! None of those are in 2.6.33 - they were merged afterwards. But they 
are in 2.6.34-rc1 (and obviously current -git). So Anton would have to 
compile his own kernel to test his load.

We could mark them as stable material if the load in question is a real 
load rather than just a test-case. On one of the random page-fault 
benchmarks the rwsem fix was something like a 400% performance 
improvement, and it was apparently visible in real life on some crazy SGI 
"initialize huge heap concurrently on lots of threads" load.

Side note: the reason the spinlock sucks is because of the fair ticket 
locks, it really does all the wrong things for the rwsem code. That's why 
old kernels don't show it - the old unfair locks didn't show the same kind 
of behavior.

			Linus
--

From: Anton Starikov
Date: Tuesday, March 23, 2010 - 10:57 am

It is not just a test-case, it is real-life code. With real-life problems on 2.6.32 and later :)


Anton.
--

From: Ingo Molnar
Date: Tuesday, March 23, 2010 - 11:00 am

another option is to run the rawhide kernel via something like:

	yum update --enablerepo=development kernel

this will give kernel-2.6.34-0.13.rc1.git1.fc14.x86_64, which has those 
changes included.

OTOH that kernel has debugging [lockdep] enabled so it might not be 

Yeah.

	Ingo
--

From: Anton Starikov
Date: Tuesday, March 23, 2010 - 11:03 am

I will apply these commits to 2.6.32; I'm afraid the current OFED (which I also need) will not work on 2.6.33+.

Anton.
--

From: Andrew Morton
Date: Tuesday, March 23, 2010 - 11:21 am

On Tue, 23 Mar 2010 19:03:36 +0100

You should be able to simply set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
CONFIG_RWSEM_XCHGADD_ALGORITHM=y by hand, as I mentioned earlier?
--

From: Anton Starikov
Date: Tuesday, March 23, 2010 - 11:25 am

Hm. I tried, but when I do "make oldconfig", it gets rewritten, so I assume that it conflicts with some other setting from the default Fedora kernel config. Trying to figure out which one exactly.

Anton.
--

From: Robin Holt
Date: Tuesday, March 23, 2010 - 12:22 pm

Have you tracked this down yet?  I just got the patches applied against
an older kernel and am running into the same issue.

Thanks,
Robin
--

From: Anton Starikov
Date: Tuesday, March 23, 2010 - 12:30 pm

I decided not to track down this issue and just applied the patches. I understood that with these patches there is no need to change these config options. Am I wrong?

Anton
--

From: Robin Holt
Date: Tuesday, March 23, 2010 - 12:49 pm

We might need to also apply:
bafaecd11df15ad5b1e598adc7736afcd38ee13d

Robin
--

From: Robin Holt
Date: Tuesday, March 23, 2010 - 12:57 pm

For the record, these are the patches I have applied to a 2.6.32 kernel from a vendor:

59c33fa7791e9948ba467c2b83e307a0d087ab49
5d0b7235d83eefdafda300656e97d368afcafc9a
1838ef1d782f7527e6defe87e180598622d2d071
0d1622d7f526311d87d7da2ee7dd14b73e45d3fc
bafaecd11df15ad5b1e598adc7736afcd38ee13d

A quick look at the disassembly makes it look like we are using the
rwsem_64, et al.

Robin
--

From: Anton Starikov
Date: Tuesday, March 23, 2010 - 12:50 pm

I think you can prevent these options from being overwritten if you set them in arch/x86/configs/x86_64_defconfig

Anton

--

From: Linus Torvalds
Date: Tuesday, March 23, 2010 - 12:52 pm

No. Doesn't work. The XADD code simply never worked on x86-64, which is 
why those three commits I pointed at are required.

Oh, and you need one more commit (at least) in addition to the three I 
already mentioned - the one that actually adds the x86-64 wrappers and 
Kconfig option:

	bafaecd x86-64: support native xadd rwsem implementation

so the minimal list of commits (on top of 2.6.33) is at least

	59c33fa x86-32: clean up rwsem inline asm statements
	5d0b723 x86: clean up rwsem type system
	bafaecd x86-64: support native xadd rwsem implementation
	1838ef1 x86-64, rwsem: 64-bit xadd rwsem implementation

and I just verified that they at least cherry-pick cleanly (in that 
order). I _think_ it would be good to also do

	0d1622d x86-64, rwsem: Avoid store forwarding hazard in __downgrade_write

but that one is a small detail, not anything fundamentally important.

			Linus
--

From: Roland Dreier
Date: Wednesday, March 24, 2010 - 9:40 am

> I will apply these commits to 2.6.32; I'm afraid the current OFED (which I
> also need) will not work on 2.6.33+.

What do you need from OFED that is not in 2.6.34-rc1?
-- 
Roland Dreier  <rolandd@cisco.com>
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--

From: Anton Starikov
Date: Thursday, March 25, 2010 - 8:24 pm

I didn't go to 2.6.34-rc1.
I tried 2.6.33; the mlx4 driver that comes with the kernel produces a panic on my hardware. And OFED-1.5 doesn't support this kernel (it can probably still be compiled, I didn't check).

Anton.

--

From: Anton Starikov
Date: Tuesday, March 23, 2010 - 12:14 pm

Applied mentioned patches. Things didn't improve too much.

before:
prog: Total exploration time 9.880 real 60.620 user 76.970 sys

after:
prog: Total exploration time 9.020 real 59.430 user 66.190 sys

perf report:

    38.58%             prog  [kernel]                                           [k] _spin_lock_irqsave
    37.42%             prog  ./prog                                             [.] DBSLLlookup_ret
     6.22%             prog  ./prog                                             [.] SuperFastHash
     3.65%             prog  /lib64/libc-2.11.1.so                              [.] __GI_memcpy
     2.09%             prog  ./anderson.6.dve2C                                 [.] get_successors
     1.75%             prog  [kernel]                                           [k] clear_page_c
     1.73%             prog  ./prog                                             [.] index_next_dfs
     0.71%             prog  [kernel]                                           [k] handle_mm_fault
     0.38%             prog  ./prog                                             [.] cb_hook
     0.33%             prog  ./prog                                             [.] get_local
     0.32%             prog  [kernel]                                           [k] page_fault

Anton.

--

From: Peter Zijlstra
Date: Tuesday, March 23, 2010 - 12:17 pm

Could you verify with a callgraph profile what that spin_lock_irqsave()
is? If those rwsem patches were successful, mmap_sem should no longer
have a spinlock to contend on, in which case it might be another lock.

If not, something went wrong with backporting those patches.
--

From: Anton Starikov
Date: Tuesday, March 23, 2010 - 12:42 pm

I have attached the callgraph.

I also checked the kernel source; the code that was compiled is exactly what it should be after the patches.

Am I missing something?
--

From: Linus Torvalds
Date: Tuesday, March 23, 2010 - 12:54 pm

Yeah, I missed at least one commit, namely

	bafaecd x86-64: support native xadd rwsem implementation

which is the one that actually makes x86-64 able to use the xadd version.

		Linus
--
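One way to confirm which rwsem implementation a built kernel actually selected is to read the two options named earlier in the thread out of its .config. The following is a minimal sketch in Python; the function name `rwsem_options` and the sample fragment are illustrative and do not come from the thread.

```python
def rwsem_options(config_text):
    """Return the CONFIG_RWSEM_* options found in a kernel .config as a dict."""
    options = {}
    suffix = " is not set"
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as "# CONFIG_FOO is not set".
        if line.startswith("# CONFIG_RWSEM_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_RWSEM_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


# Hypothetical .config fragment for a kernel with the xadd rwsem selected.
sample = """\
CONFIG_X86_64=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
"""

print(rwsem_options(sample))
```

In practice the text would be read from the .config in the build tree after the backported commits have been applied and the kernel reconfigured.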

From: Anton Starikov
Date: Tuesday, March 23, 2010 - 1:43 pm

I think we got a winner!

Problem seems to be fixed.

Just for the record, I used the following patches:

59c33fa7791e9948ba467c2b83e307a0d087ab49
5d0b7235d83eefdafda300656e97d368afcafc9a
1838ef1d782f7527e6defe87e180598622d2d071
4126faf0ab7417fbc6eb99fb0fd407e01e9e9dfe
bafaecd11df15ad5b1e598adc7736afcd38ee13d
0d1622d7f526311d87d7da2ee7dd14b73e45d3fc


Thanks,
Anton.


--

From: Linus Torvalds
Date: Tuesday, March 23, 2010 - 4:04 pm

Ok. If you have performance numbers for before/after these patches for 
your actual workload, I'd suggest posting them to stable@kernel.org, and 
maybe those rwsem fixes will get back-ported.

The patches are pretty small, and should be fairly safe. So they are 
certainly stable material.

		Linus
--

From: Anton Starikov
Date: Tuesday, March 23, 2010 - 4:19 pm

Tomorrow I will try to patch and check 2.6.33 to see whether these patches are enough to restore performance, because on the 2.6.33 kernel the performance issue also somehow involved cgroup business (and performance was terrible even compared to the broken 2.6.32). If it does not fix 2.6.33, I will ask to reopen the bug; otherwise I will post to stable@.

Thanks again for help,
Anton.


--

From: Ingo Molnar
Date: Tuesday, March 23, 2010 - 4:36 pm

We haven't had any stability problems with them, except one trivial build bug, 
so -stable would be nice.

	Ingo
--

From: Linus Torvalds
Date: Tuesday, March 23, 2010 - 4:55 pm

Oh, you're right. There was that UML build bug. But I think that was 
included in the list of commits Anton had - commit 4126faf0ab ("x86: Fix 
breakage of UML from the changes in the rwsem system").

		Linus
--

From: Anton Starikov
Date: Tuesday, March 23, 2010 - 5:03 pm

Yes, it is included in my list.
When I submit to stable, I will include it as well.

Anton


--

From: Andi Kleen
Date: Tuesday, March 23, 2010 - 7:15 pm

It would be also nice to get that change into 2.6.32 stable. That is
widely used on larger systems.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Linus Torvalds
Date: Tuesday, March 23, 2010 - 8:00 pm

Looking at the changes to the files in question, it looks like it should 
all apply cleanly to 2.6.32, so I don't see any reason not to backport 
further back.

Somebody should double-check, though.

		Linus
--

From: Greg KH
Date: Monday, April 19, 2010 - 11:19 am

I have queued them all up for .33 and .32-stable kernel releases now.

thanks,

greg k-h
--

From: Andrew Morton
Date: Tuesday, March 23, 2010 - 11:13 am

On Tue, 23 Mar 2010 18:34:09 +0100

Yes.  Note that we fall off a cliff at nine threads on a 16-way.  As
soon as a core gets two threads scheduled onto it?  Probably triggered
by an MM change, possibly triggered by a sched change which tickled a
preexisting MM shortcoming.  Who knows.

Anton, we have an executable binary in the bugzilla report but it would
be nice to also have at least a description of what that code is
actually doing.  A quick strace shows quite a lot of mprotect activity.
A pseudo-code walkthrough, perhaps?

Thanks.
--

From: Anton Starikov
Date: Tuesday, March 23, 2010 - 11:19 am

Right now I can't say much about the code (we just gave a neighboring group a chance to run their code on our cluster, so I'm totally unfamiliar with this code). I will forward your question to them.

But for now you can probably get more information (including sources) here: http://fmt.cs.utwente.nl/tools/ltsmin/

Anton
--

From: Ingo Molnar
Date: Tuesday, March 23, 2010 - 11:27 am

it's AMD Opterons so no SMT.

My (wild) guess would be that 8 cpus can still do cacheline ping-pong 
reasonably efficiently, but it starts breaking down very seriously with 9 or 
more cores bouncing the same single cache-line.

Breakdowns in scalability are usually very non-linear, for hardware and 
software reasons. '8 threads' sounds like a hw limit to me. From the scheduler 
POV there's no big difference between 8 or 9 CPUs used [this is non-HT] - with 
8 or 7 cores still idle.

	Ingo
--

From: Anton Starikov
Date: Tuesday, March 23, 2010 - 2:19 pm

Although the case is solved, I will post a description of the testcase program,
just in case someone wonders or would like to keep it for later tests.

------------------------------------------------------------------------
It is a parallel model checker. The command line you used does reachability
on the state space of model anderson.6, meaning that it searches through all
possible states (int vectors). Each thread gets a vector from the queue,
calculates its successor states and puts them in a lock-less static hash
table (pseudo BFS exploration because the threads each have their own
queue).

How did Ingo run the binary? Because the static table size should be chosen
to fit into memory. "-s 27" allocates 2^27 * (|vector| + 1 ) * sizeof(int)
bytes. |vector| is equal to 19 for anderson.6, ergo the table size is 10GB.
This could explain the huge number of page faults Ingo gets.

But anyway, you can imagine that the code is quite jumpy and has a big
memory footprint, so the page faults may also be normal.
------------------------------------------------------------------------
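The 10GB figure in the description above can be reproduced directly from the stated formula. This is a small sketch in Python; the variable names are illustrative, and the 4-byte int is an assumption that holds on x86-64 Linux.

```python
# Parameters taken from the description above.
table_bits = 27   # the "-s 27" command-line argument
vector_len = 19   # |vector| for the anderson.6 model
INT_SIZE = 4      # assumption: sizeof(int) == 4 on x86-64 Linux

# 2^27 * (|vector| + 1) * sizeof(int)
slots = 2 ** table_bits
table_bytes = slots * (vector_len + 1) * INT_SIZE
table_gib = table_bytes / 2 ** 30

print(f"{table_bytes} bytes = {table_gib:.0f} GiB")
```

Each extra bit passed to "-s" doubles the table, so "-s 28" would need 20 GiB under the same assumptions.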


--

From: Lee Schermerhorn
Date: Friday, April 2, 2010 - 11:57 am

<snip>

I had an "opportunity" to investigate page fault behavior on 2.6.18+
[RHEL5.4] on an 8-socket Istanbul system earlier this year.  When I saw
this mail, I collected up the data I had from that adventure and ran
additional tests on 2.6.33 and 2.6.34-rc1.  I have attached plots for
what I call "per node" and "system wide" page fault scalability.

The per node plot [#1] shows the page fault rate of 1 to 6
[nr_cores_per_socket] tasks [processes] and threads faulting in a fixed
GB/task at the same time on a single socket.  The system wide plot [#3]
shows 1 to 48 [nr_sockets * nr_cores_per_socket] tasks and threads again
faulting in a fixed GB/task...   For the latter test, I load one core
per socket at a time, then add the 2nd core per socket, ...  In all
cases, the individual tasks/threads are fork()ed/pthread_create()d by a
parent bound to the cpu where they'll run to obtain node-local kernel
data structures.  The tests run with SCHED_FIFO.

I plot both "faults per wall clock second"--the aggregate rate--and
"faults per cpu second" or normalized rate.  The per node scalability
doesn't look all that different across the 3 releases, especially the
faults per cpu seconds curves.  However, in the system wide
multi-threaded tests, 2.6.33 is an anomaly compared to both 2.6.18+ and
2.6.34-rc1.  The 2.6.18+ and 2.6.34-rc1 multi-threaded tests show a lot
of noise and, of course, a lot lower fault rate relative to the
multi-task tests.  I aborted the 2.6.33 system wide multi-threaded test
at 32 threads because it was just taking too long.
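The difference between the two plotted rates can be sketched as follows. This is only an illustration: the function name and all numbers are made up and are not taken from the posted data, and it assumes that "cpu seconds" means the CPU time summed over all tasks.

```python
def fault_rates(faults, wall_seconds, cpu_seconds):
    """Return (faults per wall-clock second, faults per CPU second)."""
    return faults / wall_seconds, faults / cpu_seconds


# Made-up numbers for illustration only; they are not from the posted data.
# 4 tasks each fault in 262144 pages; the run takes 2.0 s of wall time and
# consumes 8.0 s of CPU time in total (assumed to be summed over all tasks).
faults = 4 * 262144
aggregate, normalized = fault_rates(faults, wall_seconds=2.0, cpu_seconds=8.0)

print(f"aggregate:  {aggregate:.0f} faults per wall-clock second")
print(f"normalized: {normalized:.0f} faults per cpu second")
```

With perfect scaling the aggregate rate grows as tasks are added while the normalized rate stays flat; a falling normalized rate points to CPU time being lost to contention.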

Unfortunately, with this many curves, the legends obscure much of the
plot.  So, rather than bloat this message any more, I've packaged up the
raw data along with plots with and without legends and placed the
tarball here:

	http://free.linux.hp.com/~lts/Pft/

That directory also contains the source for the version of the pft test
used, along with the scripts used to run the tests and plot the results.
Note that some manual editing of the "plot ...