Following a complaint that the PIV had terrible syscall times in Linux compared to a much slower PIII (this is more an architectural property than a software problem), Linus implemented a "syscall vsyscall"! A vsyscall is basically a kernel-supplied page in user address space which a program can call into in order to perform some function. This particular vsyscall determines the fastest syscall mechanism (on P6+ this is SYSENTER, on i386 the traditional INT 80; AMD has SYSCALL, which could possibly be implemented) and runs it to enter kernel mode. Linus found that this triples the speed of the "NULL" syscall on a PIV. There is a great deal of interesting (100+ posts) implementation discussion following the linked post (somehow eventually degenerating into a BK flamewar again!).
Some people have been concerned about the feature freeze; however, it could be argued that this isn't a new feature. It isn't core code, and it has no complicated interdependencies. The old syscall method is left exactly as it was, which is possibly slightly slower than Windows XP using SYSENTER in some tests.
some benchmarks
Dave Jones has some benchmarks in his blog:
looks good.
--
:wq
maybe seconds are more useful than #cycles?
Why am I wrong and Dave Jones right? :-)
No
Cycles are probably more useful, as they are independent of processor MHz. Either way, I doubt an Athlon could do 1 million calls to getppid in 170 cycles. Maybe 170 million cycles!
Yeah, those are probably amortized averages
Yeah, those are probably average cycles per call, measured over 1 million calls.
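To make the "average over 1 million calls" reading concrete, here is a rough sketch of how such a per-call figure is usually obtained. It is not the benchmark from the blog: it is written in Python, so interpreter overhead dominates the result, and the function names and the clock rate used for the cycle conversion are assumptions for illustration only.

```python
import os
import time


def average_call_ns(func, iterations=1_000_000):
    """Mean wall-clock cost of one call to func, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func()
    elapsed = time.perf_counter_ns() - start
    # Dividing by the iteration count gives the amortized per-call cost.
    return elapsed / iterations


def ns_to_cycles(nanoseconds, clock_hz):
    """Convert a duration to CPU cycles at an assumed, fixed clock rate."""
    return nanoseconds * clock_hz / 1e9


if __name__ == "__main__":
    per_call = average_call_ns(os.getppid)
    # 1.4e9 Hz is a made-up clock rate, not a measured one.
    cycles = ns_to_cycles(per_call, 1.4e9)
    print(f"{per_call:.1f} ns per call, roughly {cycles:.0f} cycles at 1.4 GHz")
```

A real measurement of this kind would read the CPU's cycle counter directly from C or assembly; the point here is only that the reported number is total cost divided by the number of calls.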
As for seconds vs. cycles: Both are useful. Cycle count is an indication of architectural efficiency. Seconds are an indication of absolute cost. I'm on a team that makes architectural decisions for an embedded device. We make cycle vs. clock rate tradeoffs as part of our job. Clock rate floats all boats, but it can cause important tasks to take way too many cycles if we get too aggressive. Thus, it's important to balance between the two and to measure both.
For a given type of CPU (e.g. AMD Athlon XP), the cycle count is probably going to be pretty close regardless of clock rate. (Any variation is most likely due to the memory system, and doing 1 million repeated calls will generally minimize those effects since you stay in L1 cache.) Thus, you get a pretty good measure of how efficient you are on that architecture by measuring cycle counts.
To benchmark various CPUs against each other, you need to look at iterations per second. This lets you know the raw power of one CPU vs. another.
In the automotive world, this is similar to looking at torque vs. horsepower. The torque curve gives you an indication of how efficient the engine is over a wide operating range. Horsepower is the final performance of the system, taking into account the actual RPMs involved. (In this case, torque is like cycle count, RPMs is like clock rate, and horsepower is like elapsed time.)
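The relationship between the three quantities can be written out as a short calculation. This is only a sketch: the 170 cycles per call is the figure mentioned earlier in the thread, while the clock rates and the second CPU's cycle count are invented numbers, not measurements.

```python
def total_cycles(calls, cycles_per_call):
    """Cycles spent on a batch of calls, given the average cost of one call."""
    return calls * cycles_per_call


def elapsed_seconds(calls, cycles_per_call, clock_hz):
    """Absolute cost: total cycles divided by the clock rate."""
    return total_cycles(calls, cycles_per_call) / clock_hz


def calls_per_second(cycles_per_call, clock_hz):
    """Throughput figure for comparing different CPUs against each other."""
    return clock_hz / cycles_per_call


if __name__ == "__main__":
    # 1 million calls at 170 cycles each is 170 million cycles in total.
    print(total_cycles(1_000_000, 170))
    # Hypothetical CPUs: (name, average cycles per call, clock rate in Hz).
    for name, cycles, hz in [("cpu_a", 170, 1.4e9), ("cpu_b", 500, 2.4e9)]:
        rate = calls_per_second(cycles, hz)
        secs = elapsed_seconds(1_000_000, cycles, hz)
        print(f"{name}: {rate:.0f} calls/s, {secs:.4f} s per million calls")
```

With these made-up inputs the CPU with the lower clock rate still finishes first, because it needs far fewer cycles per call, which is why both numbers are worth measuring.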
> As for seconds vs. cycles: Both are useful. Cycle
> count is an indication of architectural efficiency.
If that is how you define "architectural efficiency", yes. I could define it as how many MHz it can make at a specific heat/voltage/blah.
> In the automotive world, this is similar to looking at
> torque vs. horsepower. The torque curve gives you an
> indication of how efficient the engine is over a wide
> operating range. Horsepower is the final performance of
> the system, taking into account the actual RPMs involved.
> (In this case, torque is like cycle count, RPMs is like
> clock rate, and horsepower is like elapsed time.)
Only because the mathematical equations are similar; otherwise it's meaningless. It could also be like "steps per second", "step length", "velocity", or anything else you can think of. It's stupid trying to complicate the matter more. If someone can't work out how computer performance works, they won't be able to understand it given a weak analogy to something they probably understand even less about.
> If that is how you define "architectural efficiency", yes. I could define it as how many MHz it can make at a specific heat/voltage/blah.
That's what I think of as microarchitectural efficiency. That is, how well the transistors are doing their jobs. I was referring more to the higher-level structures. In this case: How efficiently does the architecture recognize a software-triggered exception/interrupt that vectors into the operating system?
The two are no doubt related: Architectural decisions constrain the microarchitecture. But even a highly exposed architecture (such as a VLIW) can still have a lot hidden in the microarchitecture.
It may well be...
> That's what I think of as microarchitectural efficiency. That is,
> how well the transistors are doing their jobs. I was referring
> more to the higher-level structures.
But what if I said the PIV is very efficient because its architecture allows such a high clock speed while maintaining a decent IPC, and is therefore faster than any of AMD's less efficient processors, which cannot sustain an equivalent IPC * MHz?
Interesting, but not complete
Does he have benchmarks including the vsyscall overhead? I'm especially interested in the overhead on CPUs without sysenter support.
Does anyone have more benchmarks?
A related question: Are there any plans to benchmark the new syscall entry code on Pentium, PII, K6, and other older processors?
I feel dumber by the minute
So this means that Linux is even faster now? Good... but could someone explain to a dumbass like myself exactly what the problem was and how it was fixed?
Making system calls was too slow. Now it's a bit faster.
Recall Linux was originally designed for the 386. The 386 only has one way to enter kernel mode from user mode: the INT opcode (specifically, int 80).
As you can see in the earlier comments, INT takes way too long on P3s and P4s. But Intel created another instruction, SYSENTER, which does a similar job only faster. So it would be a good idea if Linux programs could use this opcode. But there are problems. Firstly, any chip older than a P3 doesn't have this instruction (and it didn't work properly in the first revision of the P3). Secondly, AMD implemented a different instruction, SYSCALL, which again is faster and similar but with different details.
So: how do you support all three? You can't compile all of them into the program, since they have different opcodes. Conditional branches on all syscalls would be too slow. Distributing a different binary for each CPU type would be a PITA.
The solution is vsyscalls. The kernel provides a page of read-only memory to user programs. In this page is code which does the right thing. Instead of having INT, SYSCALL, SYSENTER, whatever directly coded into the binaries, you have a call to this code. You have to take some extra cycles for the function call, but since all the operands are in CPU registers it's very fast, much faster than INT.
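The idea of "pick the mechanism once, then always call through one fixed entry point" can be modelled in a few lines. This is a toy model in Python, not kernel code: the three `enter_*` functions merely stand in for the real instructions, the flag names `sep` and `syscall` are the ones Linux shows in /proc/cpuinfo, and everything else (function and class names, the order of preference) is an assumption made for illustration.

```python
def enter_int80(number):
    """Stand-in for the traditional INT 0x80 entry path."""
    return ("int 0x80", number)


def enter_sysenter(number):
    """Stand-in for the Intel SYSENTER entry path."""
    return ("sysenter", number)


def enter_syscall(number):
    """Stand-in for the AMD SYSCALL entry path."""
    return ("syscall", number)


def build_vsyscall_entry(cpu_flags):
    """Choose the entry mechanism once, the way the kernel would at boot."""
    if "sep" in cpu_flags:
        return enter_sysenter
    if "syscall" in cpu_flags:
        return enter_syscall
    return enter_int80


class VsyscallPage:
    """Models the kernel-supplied page: one fixed entry, hidden mechanism."""

    def __init__(self, cpu_flags):
        self._entry = build_vsyscall_entry(cpu_flags)

    def syscall(self, number):
        # Callers never branch on CPU type; they just call the fixed entry.
        return self._entry(number)


if __name__ == "__main__":
    GETPPID = 64  # getppid's syscall number on i386
    for flags in ({"fpu", "sep"}, {"fpu", "syscall"}, {"fpu"}):
        print(sorted(flags), VsyscallPage(flags).syscall(GETPPID))
```

The same program text works on every CPU; only the contents of the page differ, which is what avoids both per-call conditional branches and per-CPU binaries.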
Thank you
Just as I expected... but thank you for clearing it up for the moron in me..
Could it be faster?
If there is an extra call for pre-PIII machines, does this mean that older hardware would run slower, while newer hardware runs faster?
If so, a specific libc for each architecture could simply issue the machine
instruction directly. Something like:
Since libc is built for each architecture anyway, most of this can be fixed
at compile time.
Maybe this is how it works anyhow?
sig = 0xda1e;
Re: Could it be faster?
First of all, forget PPC, that is a different ball game.
Second of all, the problem wasn't being able to implement SYSENTER (that was easy); it was being able to do it and work on P1, P2, P3, P4, K6, K6-2, Athlon, etc. WITHOUT RECOMPILING. They did that.
As for it running slower on older hardware... most of the time glibc is dynamically linked. So if you want your glibc to use the old INT 0x80 method, it'll probably be a compile option. So no, it doesn't mean older hardware is slower. It's just a new feature; the old way isn't going away for a LONG LONG LONG time.
Re: Could it be faster?
Thanks for your feedback.
So to use the feature a specific version of glibc is needed. Is runtime optimisation
possible? For example, the link/loader could use a processor-specific version if available, say trying i686, then i586 -> i486 -> i386.
I'm sure we will see the glibc people making some clever use of this.
sig = 0xda1e;
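The fallback order asked about above (most specific build first, plain i386 last) amounts to a first-match search. The sketch below only illustrates that search; it does not describe how any real loader or glibc actually does it, and the function name and the example set of installed builds are hypothetical.

```python
def pick_library_variant(preference_order, available):
    """Return the first architecture in preference order that is installed."""
    for arch in preference_order:
        if arch in available:
            return arch
    return None  # nothing suitable is installed


if __name__ == "__main__":
    # Most specific first, falling back to the lowest common denominator.
    order = ["i686", "i586", "i486", "i386"]
    installed = {"i586", "i386"}  # hypothetical set of libc builds on disk
    print(pick_library_variant(order, installed))
```

On a real system the preference list would also have to be cut off at what the running CPU supports, so that an i486 never gets handed an i686 build.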