The recent Hyper-Threading vulnerability announcement [story] was discussed on the Linux Kernel Mailing List. Reactions were mixed. Andi Kleen suggested, "disabling HT for this would the totally wrong approach, like throwing out the baby with the bath water." Alan Cox [interview] commented that it was a moot issue, "HT for most users is pretty irrelevant, its a neat idea but the benchmarks don't suggest its too big a hit."
Regarding the usefulness of Hyper-Threading, Linus Torvalds had a different opinion, "HT is _wonderful_ for latency reduction." He went on to add, "why people think 'performace' means 'throughput' is something I'll never understand. Throughput is _always_ secondary to latency, and really only becomes interesting when it becomes a latency number." He then went on to discuss the proposed vulnerability:
"As to the HT 'vulnerability', it really seems to be not a whole lot different than what people saw with early SMP and (small) direct-mapped caches. Thank God those days are gone.
"I'd be really surprised if somebody is actually able to get a real-world attack on a real-world pgp key usage or similar out of it (and as to the covert channel, nobody cares). It's a fairly interesting approach, but it's certainly neither new nor HT-specific, [nor does it] necessarily seem all that worrying in real life."
From: Gabor MICSKO [email blocked]
To: linux-kernel
Subject: Hyper-Threading Vulnerability
Date: Fri, 13 May 2005 07:51:20 +0200

Hi!

From http://kerneltrap.org/node/5103

``Hyper-Threading, as currently implemented on Intel Pentium Extreme Edition, Pentium 4, Mobile Pentium 4, and Xeon processors, suffers from a serious security flaw," Colin explains. "This flaw permits local information disclosure, including allowing an unprivileged user to steal an RSA private key being used on the same machine. Administrators of multi-user systems are strongly advised to take action to disable Hyper-Threading immediately."

``More'' info here: http://www.daemonology.net/hyperthreading-considered-harmful/

Is this flaw affects the current stable Linux kernels? Workaround? Patch?

Thanks.

- MG
From: Andi Kleen [email blocked]
Subject: Re: Hyper-Threading Vulnerability
Date: Fri, 13 May 2005 20:03:58 +0200

Gabor MICSKO [email blocked] writes:

> Hi!
>
> From http://kerneltrap.org/node/5103
>
> ``Hyper-Threading, as currently implemented on Intel Pentium Extreme
> Edition, Pentium 4, Mobile Pentium 4, and Xeon processors, suffers from
> a serious security flaw," Colin explains. "This flaw permits local
> information disclosure, including allowing an unprivileged user to steal
> an RSA private key being used on the same machine. Administrators of
> multi-user systems are strongly advised to take action to disable
> Hyper-Threading immediately."
>
> ``More'' info here:
> http://www.daemonology.net/hyperthreading-considered-harmful/
>
> Is this flaw affects the current stable Linux kernels? Workaround?
> Patch?

This is not a kernel problem, but a user space problem. The fix is to change the user space crypto code to need the same number of cache line accesses on all keys.

Disabling HT for this would the totally wrong approach, like throwing out the baby with the bath water.

-Andi
From: Alan Cox [email blocked]
Subject: Re: Hyper-Threading Vulnerability
Date: Fri, 13 May 2005 19:35:48 +0100

> This is not a kernel problem, but a user space problem. The fix
> is to change the user space crypto code to need the same number of cache line
> accesses on all keys.

You actually also need to hit the same cache line sequence on all keys if you take a bit more care about it.

> Disabling HT for this would the totally wrong approach, like throwing
> out the baby with the bath water.

HT for most users is pretty irrelevant, its a neat idea but the benchmarks don't suggest its too big a hit
From: Scott Robert Ladd [email blocked]
Subject: Re: Hyper-Threading Vulnerability
Date: Fri, 13 May 2005 14:49:25 -0400

Alan Cox wrote:

> HT for most users is pretty irrelevant, its a neat idea but the
> benchmarks don't suggest its too big a hit

On real-world applications, I haven't seen HT boost performance by more than 15% on a Pentium 4 -- and the usual gain is around 5%, if anything at all. HT is a nice idea, but I don't enable it on my systems.

..Scott
From: Linus Torvalds [email blocked]
Subject: Re: Hyper-Threading Vulnerability
Date: Mon, 16 May 2005 10:00:20 -0700 (PDT)

On Fri, 13 May 2005, Scott Robert Ladd wrote:
>
> Alan Cox wrote:
> > HT for most users is pretty irrelevant, its a neat idea but the
> > benchmarks don't suggest its too big a hit
>
> On real-world applications, I haven't seen HT boost performance by more
> than 15% on a Pentium 4 -- and the usual gain is around 5%, if anything
> at all. HT is a nice idea, but I don't enable it on my systems.

HT is _wonderful_ for latency reduction.

Why people think "performace" means "throughput" is something I'll never understand. Throughput is _always_ secondary to latency, and really only becomes interesting when it becomes a latency number (ie "I need higher throughput in order to process these jobs in 4 hours instead of 8" - notice how the real issue was again about _latency_).

Now, Linux tends to have pretty good CPU latency anyway, so it's not usually that big of a deal, but I definitely enjoyed having a HT machine over a regular UP one. I'm told the effect was even more pronounced on XP.

Of course, these days I enjoy having dual cores more, though, and with multiple cores, the latency advantages of HT become much less pronounced.

As to the HT "vulnerability", it really seems to be not a whole lot different than what people saw with early SMP and (small) direct-mapped caches. Thank God those days are gone.

I'd be really surprised if somebody is actually able to get a real-world attack on a real-world pgp key usage or similar out of it (and as to the covert channel, nobody cares). It's a fairly interesting approach, but it's certainly neither new nor HT-specific, or necessarily seem all that worrying in real life.

(HT and modern CPU speeds just means that the covert channel is _faster_ than it has been before, since you can test the L1 at core speeds. I doubt it helps the key attack much, though, since faster in that case cuts both ways: the speed of testing the cache eviction may have gone up, but so has the speed of the operation you're trying to follow, and you'd likely have a really hard time trying to catch things in real life).

It does show that if you want to hide key operations, you want to be careful. I don't think HT is at fault per se.

Linus
Linus is missing the point
Linus is completely missing the point. If he thinks that nobody cares about covert channels, he should talk to people involved in SELinux -- they very much care about covert channels.
But even returning to the sort of security which concerns most of us, Linus simply doesn't get it. He is right that the covert channel is "only" faster than earlier covert channels, but he fails to realize that making this channel 1000x faster is precisely why it can also function as a cryptographic side channel to break RSA. In this way, the difference becomes qualitative, rather than merely quantitative.
It is at times like this that Linux really suffers from having a single dictator in charge; when Linus doesn't understand a problem, he won't fix it, even if all the cryptographers in the world are standing against him.
From bits I have heard from p
From bits I have heard from people who have implemented secure systems, SELinux probably won't stand up as a truly secure system. That type of system is designed from the ground up. SELinux goes a long way toward making Linux better for many lower levels of users... But it probably isn't tight enough to stop many forms of covert channels and the like. Worrying about fixing one small hole in a system with bigger ones may not be the best use of time.
SELinux is not a secure syste
SELinux is not a secure system, nor is it SUPPOSED to be a secure system. It is an add-on for the kernel designed to enhance kernel security, and in that respect it does an awesome job. Although I personally have had limited experience in setting SELinux up, from what I have heard and read, it is effective if the attack/attacker goes after kernel-level weaknesses, such as system calls that don't take privilege levels into account and so forth.
Although admittedly I am NOT a security expert, from what I read in the paper, there are several levels where this (pseudo)attack could be stopped, including several hardware levels (involving reengineering), the kernel-level system calls (which would need improvements in SELinux, PAX, or the general SMP code in the kernel), and the application level (meaning rewriting/recompiling applications, mostly crypto libraries).
I think the "best" way of going about fixing this from a technical standpoint is at the hardware level (and I am also not a hardware engineer, so this may be simplistic or not feasible), possibly splitting the data cache tables on a per-process basis, so that one process cannot see the other's data and/or perform a "timing" attack, as described in the paper, to see the status of any given bit.
Would a patch at the kernel level, through SELinux, PAX, or in mainline code be more effective?
What d'yall think?
That's possibly because there
That's possibly because there is no such thing as a "truly secure system". Security is about layers. If you can add an additional layer consisting of SELinux then that is never a bad thing.
There is truly no secure system
and never will be - can there be one?
Of course there can be.
And there likely will be.
What it comes down to is designing a system, and then proving using formal mathematical methods that it behaves as expected for all inputs. It's hard, yes, but it's definitely possible, given enough work.
This will likely happen years and years from now, when computer science and software development tools have advanced far past today's standards, but I believe we will see it.
Theoretically it is not possi
Theoretically it is not possible to prove that software is "correct" according to the specifications. I doubt therefore you can prove it behaves as expected for all inputs.
Sorry, but he's far from a di
Sorry, but he's far from a dictator. If anything he is someone people trust. Linux's development model doesn't really allow a dictator.
There are many, many trees, and each distro may have features that are not in Linus's tree, or lack features that are. If distros think it's important to fix this, they can, and so can other people who build trees from vanilla.
Remember, Linux isn't an OS, so a feature missing from a particular kernel doesn't mean the OS using that kernel lacks a patch that fixes it.
I agree with Linus
I agree with Linus wrt covert channels. As the author of the paper about HT points out, you can get a good covert channel using the cache on a non-HT processor, being only limited by context switches. It turns out that using sched_yield(), you can make context switches happen pretty fast.
I personally believe it's the responsibility of crypto libraries to not have side effects (e.g. on cache) that depend on the keys, that's it. You may want to disable HT if it's really important to you (or even patch the OS), but fundamentally, it's a crypto library issue.
Colin,
So then, just in case there are blue aliens on Mars, we should attack and set up camps now?
I believe Linus is saying it is not a high priority issue right now. I am sure he cares. To outright blame him for not caring (.."even if all the cryptographers in the world are standing against him.") is a misdirected attack against him for your own personal advantage.
So let's keep this clean and in the correct context.
Linus isn't a dictator, the users just don't care
Linus is not a dictator. Everyone follows his lead willingly. If Linux suffers because the HT problem is not taken seriously it is because the users of Linux don't care. If they really did care and Linus didn't do something about it they would fork the tree and solve the problem themselves. That's the beauty of FOSS.
Is Linus missing the point?
"It is at times like this that Linux really suffers from having a single dictator in charge; when Linus doesn't understand a problem, he won't fix it, even if all the cryptographers in the world are standing against him."
Sorry, I miss the point. Is _he_ supposed to do that?
*YOU'RE* missing the point
Linux doesn't suffer from having a single dictator. In case you haven't quite noticed, what goes into the Linux kernel is what people build and submit to the process. If "all the cryptographers in the world" are standing around complaining about this problem, then perhaps one of them - presumably one with a vested interest in the proceedings - should examine the vulnerability, and submit a patch or at least document what has to get fixed to resolve the vulnerability. Seems to me that these are the folks who are best placed to get their hands dirty and fix the problem.
If anything, Linux suffers from people who complain loudly and offer nothing in the way of solutions. The smart ones have been those who contributed - look at how Linux works on so many hardware platforms. And for that matter, ever wonder why support for IBM, 3ware or Adaptec hardware tends to be pretty robust? It's robust because these companies saw the opportunities and committed to creating and maintaining their drivers within the framework of Linux kernel development. Intel had to learn this the hard way - remember how the best Intel NIC drivers were written by Donald Becker until Intel finally smartened up? Promise's higher level hardware RAID card drivers used to be a nightmare because they didn't do that. They probably still don't, and that is one reason why I will not use their quote-unquote enterprise level RAID cards again in any environment I manage until I'm satisfied that they've acquired sufficient clue to get involved in the process.
Not good enough for you? Okay, here's an idea, why don't you and your buddies get together and start working on your own operating system which doesn't have that problem?
Don't like that idea either? Use something else. FreeBSD? Oh wait, Theo is kinda like Torvalds. Mac? It has problems too. Windows? Yeah, right. Etch-A-Sketch? Paper and pencil? How about stone and clay tablet?
Stop complaining already and start working. Sheesh.
BSD
Theo has very little to do with the development of FreeBSD as he's the OpenBSD-guy.
Theo is kinda like Torvalds
You are wrong, Theo does care about security.
Excellent.
This post is fantastic! Great post :)
you hit the point.
Linus is both right and wrong
Linus gets right: This is the job of crypto libraries to solve. Colin doesn't understand that this is possible so dismisses it.
Linus gets wrong: This attack is easy to implement and proven to work in "real world" situations. He has no idea about probabilistic analysis of multiple traces.
Typo
It's not a "mute issue" but a "moot issue".
Mute means silent, moot means irrelevant.
Cheers
I don't agree with Linus. _So
I don't agree with Linus. _Sometimes_, latency is second to throughput.
I don't agree with Colin either, though.
I actually pretty much agree with what Intel said. This exploit is real,
if someone wants to, they can probably use it to get your RSA key.
However, it'd take a _lot_ more effort than other, already well-known methods.
And, well. Linus said "no one cares".
It really appears so. See the first FAQ on Colin's page, which explains why they shouldn't care. The only thing I don't understand, then, is why he said Linus doesn't get it.
Eh?
This exploit is real,
if someone wants to, they can probably use it to get your RSA key.
However, it'd take a _lot_ more effort than other, already well-known methods.
Would you care to tell us what these "well-known methods" for moving from "able to login to a box over ssh" to "able to steal the box's ssh host key" are?
Heck, would you care to tell Theo? I'm sure that he'd like to know about such a major vulnerability in openssh.
Sure, if you only have one user on your system, you don't need to worry about this. But many systems have more than one user.
Any local root hole would do.
Any local root hole would do. Linux obviously has many in the kernel itself (four have been fixed in the last few weeks), and then there are all the application programs running as root.
What an interesting attitude
Any local root hole would do.
Do you really mean to suggest that Linux has so many local root holes that it's not worth trying to fix them?
Have you already forgotten your question he was answering?
He obviously doesn't mean that.
He means that it would be easier for an attacker to use such a privilege escalation to steal the ssh host key.
Fixing those local root holes is a far higher priority anyway, because they are bigger holes.
they could just ask you "nice
they could just ask you "nicely" ;)
Realistic attack vectors
Do you have any statistics on how long it takes to perform this attack? Presumably your attack code has to be running at a higher-than-normal priority in order to help ensure that it is running on one processor thread at the time that the server ssh daemon is running, in order to do the snoop inferencing. And only root can give processes elevated privileges in this way, right?
Can this attack be performed realistically by a user without root privileges, and how long does it take?
Can this attack be performed
Can this attack be performed realistically by a user without root privileges
Yes, that's how I obtained the data shown in my paper.
and how long does it take?
Less than one second on the machine, plus a few cpu-minutes of computation which can be done offline (and can be distributed).
Okay, but which second?
Right, but you ran the attack code and the victim code simultaneously, right? How long would it take for a program sitting and spinning on the processor to come into random contact with the victim process on a separate thread, and does the attack program have enough information to recognize that it is running at the right time to start sampling data?
Unless the program is running at an elevated privilege level (and leaving enough execution units fallow for another thread to be scheduled), it seems like there would be no guarantee that it would encounter the proper circumstances for some time to come.
I suppose you could help things out by spawning your attack and then initiating a remote ssh to the box at close to the same moment, but it seems the timing would be tricky..
Has this attack been demonstrated under such real world conditions?
Not tricky at all
I suppose you could help things out by spawning your attack and then initiating a remote ssh to the box at close to the same moment, but it seems the timing would be tricky..
Not at all. My attack code collects data and stores it in a buffer... all you have to do is start running my attack code, ssh into the box, and then stop my code once you've sshed in. Assuming that the two threads were sharing a processor core, you'll have the data you need.
does the attack program have enough information to recognize that it is running at the right time to start sampling data?
You don't need to start at the right time -- you can just record everything and then look for the right pattern later. OpenSSL leaves a very obvious trace in the cache.
I'd like to try this attack for myself
I've got access to a variety of P4 boxes at work, and I can put on my choice of OS. Where can I get this attack code, so that I can try it for myself, and see just how badly affected each of my platforms are?
Exploit root hole first!
So you have to break into the machine to install your code, which collects the data. Why not steal the ssh key right away?
Got to do some catching up on your reading list
Even I, a poor mortal, have heard of this sort of attack. Let me give you a tip: does the name Bruce Schneider ring a bell?
Read more carefully
No. But the name Bruce Schneier does.
You think you do...
You think you disagree with Linus, however, it's more likely that you're just missing the point.
What he's saying is that issues that people view as throughput issues actually are based at a lower level on a latency issue. Can you come up with a case where latency is second to throughput, considering this?
Latency defined
Huh. Latency is the time from request inception to the first bit of information processed. Throughput is how fast things get processed beyond that first bit of information (whatever bit means).
Observed performance is latency + throughput.
I dunno what Linus is talking about here.
Sort of....
It's possible in many cases that throughput is defined as a function of latency, though.
Say you have rapidly occurring events that take 0.01s to occur, and have 0.09s of overhead latency. (Things in the kernel happen faster, but this is just an example.) This would leave you with a maximum throughput of 10 events per second. If you cut the latency to 0.04s though, you get 20 events per second - higher throughput. Throughput and latency are often mathematically linked.
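The arithmetic in that example can be checked with a few lines of Python. This is only a sketch of the comment's model, assuming events are handled strictly one after another; the function name `max_events_per_second` is made up for illustration.

```python
from fractions import Fraction

def max_events_per_second(work_seconds, overhead_seconds):
    """Upper bound on throughput when each event costs work + overhead
    and events are processed strictly one at a time (assumed model)."""
    per_event = Fraction(work_seconds) + Fraction(overhead_seconds)
    return 1 / per_event

# Exact decimal strings avoid floating-point rounding noise.
before = max_events_per_second("0.01", "0.09")  # 0.10s per event
after = max_events_per_second("0.01", "0.04")   # 0.05s per event
print(before, after)  # prints: 10 20
```

Halving the per-event time doubles the ceiling on throughput, which is the link between the two numbers the comment describes.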
Latency v Throughput with HT
I am not quite sure what Linus is talking about either. Hyperthreading (Intel speak for SMT) does not affect the latency of any single process. If anything, it hurts it a little bit. The whole point of SMT is to use the processor's resources for a different thread when one is stalled, either due to a cache miss or a long-latency operation. Also, since any single thread has only a certain amount of Instruction Level Parallelism in it, having multiple threads keeps more of the execution units in the processor busy. So the latency of any one thread is not improved with SMT (unless your other process happens to prefetch all of the data you need into the cache for you); however, since you are getting more work done in a period of time, the throughput of the processor has increased.
These terms are way too overloaded
IMHO, the latency aspect Linus seems to be talking about here is the response time for a sleeping process that becomes ready to go from "waiting" to "executing and producing useful results."
The cache pollution that results from HT will clearly slow down the throughput on individual threads while they compute, especially if they're all computing large data sets. Where HT provides a benefit is that threads can wake up and respond to synchronization events as well as asynchronous inputs much more readily.
I don't think this property is unique to HT. Any sort of SMP setup should have the same advantage.
The whole point of SMT is to
The whole point of SMT is to use the processor's resources for a different thread when one is stalled, either due to a cache miss or a long-latency operation.
That's right. So when your process awakes it doesn't have to wait for that one that is blocked. Hence, you have reduced latency :-)
me thinks...
I was under the impression th
I was under the impression throughput is the amount of data moved, and latency was how long it took to move it. Silly me...
You are right, SMT doesn't mo
You are right, SMT doesn't move the same unit of data any faster, it just moves more units...
No, throughput is the amount
No, throughput is the amount of data moved per time interval, and latency is the time required to move a given (usually minimal) chunk of data. Thus by varying the design of the transport mechanism, it is possible to alter the tradeoffs and have more throughput by making latency horrid, or vice-versa.
No
Not true. Latency is, as OP says, time before work gets started (which may or may not mean moving data)
Makes complete sense to me. l
Makes complete sense to me. latency is the time it takes to get the work started and throughput is the rate at which the work is being done after initial start.
Kinda... but not quite.
Latency is roughly the time until the first useful result. Throughput is the rate at which subsequent results start arriving.
Linus isn't totally right on this. If we rewind to the early 1900s when Henry Ford made his mark, recall that cars were generally built one at a time. The time to build a single car wasn't enormous, but it wasn't a scalable process.
Ford envisioned the assembly line, turning the one-at-a-time process into a pipelined process. Typically when you do that, the time it takes to make one car (from start of the pipeline to rolling off the other end) goes up. But the throughput also goes up--a second car rolls off the line very shortly after the first one.
So let's recast Torvalds' statement above. He gave the example: "I need higher throughput in order to process these jobs in 4 hours instead of 8." Really, the whole thing is a function of both throughput and latency. Let's say it takes 1 hour to produce a single car (latency), but a new car rolls off the line every 20 minutes (throughput). In 8 hours, you can produce 21 cars. (One hour until the first car. 3 cars an hour for the next 7 hours.)
Now suppose you want to produce those 21 cars in 4 hours. If you hold your latency constant, you can do it if you increase your throughput to 1 car off the line every 8.57 minutes or so (7 cars/hr). (One hour until the first car. 7 cars an hour for the next 3 hours.) In other words, you have to just-over-double your throughput to get a 2x speedup of the overall process.
Could you have gotten there by addressing latency instead? Hardly! If you made the latency until the first car exactly zero (an infinite speedup), you would've gone from 8 hours to 7 hours. You still have to shave 3 more hours somewhere.
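For anyone who wants to replay those numbers, here is a small sketch in Python. It assumes the same simplified model as the comment (nothing is counted during the initial latency, then cars arrive at a steady rate); the function names are invented for illustration.

```python
from fractions import Fraction

def hours_to_finish(cars, latency_hours, cars_per_hour):
    """Total time under the assumed model: wait out the pipeline
    latency, then receive cars at a constant rate."""
    return Fraction(latency_hours) + Fraction(cars, cars_per_hour)

def rate_needed(cars, latency_hours, deadline_hours):
    """Cars per hour required to hit a deadline with latency held fixed."""
    return Fraction(cars) / (Fraction(deadline_hours) - Fraction(latency_hours))

print(hours_to_finish(21, 1, 3))   # prints: 8  (the original line)
print(rate_needed(21, 1, 4))       # prints: 7  (cars/hr needed for 4 hours)
print(round(float(60 / rate_needed(21, 1, 4)), 2))  # prints: 8.57 (minutes per car)
print(hours_to_finish(21, 0, 3))   # prints: 7  (zero latency, same rate)
```

With the rate unchanged, removing the latency entirely only saves the one hour it cost in the first place, which is why the comment looks to throughput for the rest.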
But, see, that sort of analysis can be misleading. The process I described above is a fully pipelinable process. What I do to build car N+1 does not depend at all on car N or earlier being finished.
The moment you start inserting serial dependencies, latency becomes a barrier to throughput. This is what Linus seems to be getting at. And, for many workloads Linux serves, such dependencies seem to be the dominating factor. That's why the Anticipatory Scheduler is such a win--adding latency to a series of dependent reads in the name of increasing overall disk throughput actually lowers the disk throughput, since it reduces the rate at which you generate requests.
Ah, such interconnected concepts...
I think you've missed his point
I read Linus as making a far simpler point; taking Henry Ford again, if you change your job size from "one car" to "21 cars", your time to produce 21 cars is all latency. It doesn't matter that you can produce 10 jobs in 8 hours, if all you want is 21 cars in 4 hours. If you reduce the job size back down to one car, then the time to produce 21 cars is related to both throughput (jobs per second) and latency (time to complete one job).
The important part of this is that benchmarks almost exclusively use small jobs ("get 1000 bytes from a webserver"), and measure throughput ("webserver can deliver 10MByte/s"), whereas people often have large jobs ("download and display the amazon.com homepage") and are worried about latency ("if it doesn't display in 5 seconds, I'll go elsewhere"). Hyperthreading has very little gain (and sometimes even a cost) in throughput benchmarks, but often shows large improvements in latency.
Questions
I'm not quite sure I understand the paper about this, but it seems to me that this is a *very* specific attack, for example:
o What happens if you don't know precisely which algorithms are being used to generate keys? You're not going to get anywhere from a pretty graph of an algorithm you aren't deeply familiar with already.
o What happens if you are in an SMP system and Spy gets scheduled to a different CPU to the key generating code?
o Even if you are pretty sure that Spy and the keygen are being kept on the same CPU, how can you be sure that they will ever be executing at the same time? A loaded system will be context switching all over the place, Spy may find itself executing with a thread that's not generating a key at all. How can it tell and defend against this?
From a quick read of the paper on daemonology it seems to me that this is one of those security flaws that is only useful if you already know everything about the system, at which point the whole exercise largely becomes academic. Even knowing everything about the system it still doesn't just spit out a key file, there would be considerable reconstruction work required.
I don't mean to be rude here, but I don't see any fun or profit in this unless you're the NSA and we all know they could just brute force your key if you were really that interesting to them anyway ;)
o What happens if you are in
o What happens if you are in an SMP system and Spy gets scheduled to a different CPU to the key generating code?
o Even if you are pretty sure that Spy and the keygen are being kept on the same CPU, how can you be sure that they will ever be executing at the same time? A loaded system will be context switching all over the place, Spy may find itself executing with a thread that's not generating a key at all. How can it tell and defend against this?
If you don't manage to Spy upon openssl on your first attempt, you can try again. And again. And again. Each attempt takes less than one second, and you only need one attempt to succeed.
If the server is sufficiently busy that you can never observe the entire crypto operation, you can splice together partial operations quite easily.
cache line accesses
This is not a kernel problem, but a user space problem. The fix
is to change the user space crypto code to need the same number of cache line accesses on all keys. (Andi Kleen).
I might be missing something, but when I'm writing a library in C or possibly a higher-level language, unaware of which exact version of a compiler will compile it, on which exact CPU it will run, etc., how can I make sure my code will use the CPU cache (the size of which will vary with the CPU used) in the very same way for every key? What if some weird optimization kicks in?
My question is, can this really be ensured from a user-space library, in source code?
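To make the idea in Andi Kleen's and Alan Cox's replies concrete, here is a rough sketch in Python of a table lookup whose sequence of reads does not depend on the secret index. It is an illustration of the access pattern only: Python offers no control over caches or timing, `TracedTable` and both lookup functions are made-up names, and whether a real compiler and CPU preserve this property for C code is exactly the open question above.

```python
class TracedTable:
    """A lookup table that records which indices were read.
    The trace is a stand-in for what a cache observer might see."""
    def __init__(self, values):
        self._values = list(values)
        self.trace = []

    def __len__(self):
        return len(self._values)

    def read(self, i):
        self.trace.append(i)
        return self._values[i]

def direct_lookup(table, secret_index):
    # Reads exactly one entry, so the trace reveals the secret index.
    return table.read(secret_index)

def uniform_lookup(table, secret_index):
    # Reads every entry in the same order and keeps only the wanted one.
    # Assumes non-negative integer entries.
    result = 0
    for i in range(len(table)):
        mask = -int(i == secret_index)  # -1 (all ones) on a match, else 0
        result |= table.read(i) & mask
    return result

def trace_of(lookup, secret_index):
    """Run one lookup on a fresh table and return the indices it touched."""
    table = TracedTable(range(100, 108))
    lookup(table, secret_index)
    return table.trace

print(trace_of(direct_lookup, 1), trace_of(direct_lookup, 6))
print(trace_of(uniform_lookup, 1) == trace_of(uniform_lookup, 6))
```

The second line prints True because the uniform version touches indices 0 through 7 in order for any secret, at the cost of reading the whole table on every lookup.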