Re: [patch] Add basic sanity checks to the syscall execution patch

From: Arjan van de Ven
Date: Wednesday, September 3, 2008 - 7:51 pm

Add basic sanity checks to the syscall execution patch

Several pieces of malware (rootkits etc) have the nasty habit
of putting their own pointers into the syscall table.
For example, the recently "hot in the news" phalanx rootkit does this.

The patch below, while obviously not perfect protection against malware,
adds some cheap sanity checks to the syscall path to verify that the
system call target is actually still in the kernel code region and not
somewhere outside that region, such as in a rootkit.

The overhead is very minimal; measured at 2 cycles or less.
(this is because the branches get predicted right and the rest of the
code is almost perfectly parallelizable... and an indirect function call
is a branch issue anyway)
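
As a rough illustration of the check described above, here is a minimal
user-space C sketch of the same range test. The names text_region and
validate_entry are hypothetical and do not appear in the patch; the struct
merely stands in for the kernel's _stext/_etext bounds, and -ENOSYS models
the jump to syscall_badsys.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel text bounds (_stext/_etext). */
struct text_region {
    uintptr_t start; /* first address inside the region */
    uintptr_t end;   /* first address past the region */
};

/*
 * Returns 0 if target lies in [start, end) and -ENOSYS otherwise,
 * mirroring the two compare-and-branch pairs added by the patch.
 */
int validate_entry(const struct text_region *text, uintptr_t target)
{
    if (target < text->start)   /* cmp $_stext, %eax; jb syscall_badsys */
        return -ENOSYS;
    if (target >= text->end)    /* cmp $_etext, %eax; jae syscall_badsys */
        return -ENOSYS;
    return 0;
}
```

The sketch only tests where a table entry points. It says nothing about
whether the code at that address has itself been modified.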

with eyes-on-the-code help from Peter
the idea is from Ben Herrenschmidt 

Signed-off-by: Arjan van de Ven

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 109792b..f25c0a1 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -347,7 +347,12 @@ sysenter_past_esp:
 sysenter_do_call:
 	cmpl $(nr_syscalls), %eax
 	jae syscall_badsys
-	call *sys_call_table(,%eax,4)
+	mov sys_call_table(,%eax,4), %eax
+	cmp $_stext, %eax
+	jb syscall_badsys
+	cmp $_etext, %eax
+	jae syscall_badsys
+	call *%eax
 	movl %eax,PT_EAX(%esp)
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_ANY)
@@ -426,7 +431,12 @@ ENTRY(system_call)
 	cmpl $(nr_syscalls), %eax
 	jae syscall_badsys
 syscall_call:
-	call *sys_call_table(,%eax,4)
+	mov sys_call_table(,%eax,4), %eax
+	cmp $_stext, %eax
+	jb syscall_badsys
+	cmp $_etext, %eax
+	jae syscall_badsys
+	call *%eax
 	movl %eax,PT_EAX(%esp)		# store the return value
 syscall_exit:
 	LOCKDEP_SYS_EXIT
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 89434d4..be42486 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -360,8 +360,13 @@ ENTRY(system_call_after_swapgs)
 system_call_fastpath:
 	cmpq $__NR_syscall_max,%rax
 	ja badsys
+	mov ...
From: Andi Kleen
Date: Thursday, September 4, 2008 - 5:01 am

This just means that the root kits will switch to patching
the first instruction of the entry points instead.

So the protection will be zero to minimal, but the overhead will
be there forever.

Now that I said this I expect it to go in yesterday.

-Andi

--

From: Alan Cox
Date: Thursday, September 4, 2008 - 5:34 am

On Thu, 04 Sep 2008 14:01:46 +0200

Agreed entirely. This is a waste of time and a game not worth playing.
The only place you can expect to make a difference here is in virtualised
environments by teaching KVM how to provide 'irrevocably read only' pages
to guests where the guest OS isn't permitted to change the rights back or
the virtual mapping of that page.

Alan
--

From: Andi Kleen
Date: Thursday, September 4, 2008 - 6:06 am

Even that can be circumvented by patching indirect pointers (or pointers
to objects with indirect pointers) in any writable object. Or in
a couple of other ways.

But yes, it would still seem like a reasonably useful improvement.

-Andi

-- 
ak@linux.intel.com
--

From: Arjan van de Ven
Date: Thursday, September 4, 2008 - 5:44 am

On Thu, 04 Sep 2008 14:01:46 +0200

I'd have considered taking your email seriously if you had left out the
uncalled-for and unneeded sarcasm line at the end.



-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: pageexec
Date: Friday, September 5, 2008 - 2:43 am

consider how your whole patch is based on one big self-contradiction.
you already assume that the attacker *can* modify arbitrary kernel memory
(even the otherwise *read-only* syscall table at that), but at the very
same time you're saying he *can't* use the same powers to patch out your
'protection' or do many other things to evade it. as it is, it's cargo cult
security at its best, reminding one of the Vista kernel's similar 'protection'
mechanism for the service descriptor tables...

--

From: Benjamin Herrenschmidt
Date: Friday, September 5, 2008 - 3:14 am

Well, I see it a different way ... it will once and for all screw up
binary modules that try to add syscalls :-)

Ben.

--

From: pageexec
Date: Friday, September 5, 2008 - 3:49 am

and that'd be because at the same time they patch the syscall table (remember,
they already have to go to some lengths to get around the read-only pages), they
can't also patch this 'protection'? sounds really plausible, right :).

[fixed hpa's address, .org bounces.]

--

From: Benjamin Herrenschmidt
Date: Friday, September 5, 2008 - 3:57 am

Sure, they can :-)

It's just an idea I had on irc but I tend to agree that it wouldn't have
much effect in practice... regarding security, it will break some
existing rootkits ... until updated ones show up.

Cheers,
Ben.

--

From: Ingo Molnar
Date: Friday, September 5, 2008 - 4:42 am

at which point we are left with a change that has no relevance to 
updated rootkits (they circumvent it just fine), while the kernel 
syscall entry path is left with 2 cycles (or more) overhead, forever.

Not a good deal.

We introduced the read-only syscall table because it has debugging and 
robustness advantages, with near zero cost. This change is not zero cost 
- it's ~1% of our null syscall latency. (which is ~100 nsecs, the cost 
of this check is ~1 nsec)

The other, more fundamental problem that nobody has mentioned so far is 
that the check returns -ENOSYS and thus makes rootkit attacks _more 
robust_ and hence more likely!

The far better solution would be to insert uncertainty into the picture: 
some sort of low-frequency watchdog [runs once a second or so] that 
tries to hide itself from the general kernel scope as much as possible, 
perhaps as ELF-PIC code at some randomized location, triggered by some 
frequently used and opaque kernel facility that an attacker can not 
afford to block or fully filter, and which would just check integrity 
periodically and with little cost.

When it finds a problem it immediately triggers a hard to block/filter 
vector of alert (which can be a silent alarm over the network or to the 
screen as well).
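
To make the periodic integrity check concrete, here is a minimal user-space
C sketch, assuming the checker records a hash of a memory region once and
later compares against it. The names watched_region, region_record and
region_intact are hypothetical. FNV-1a is used only because it is short; it
is not a cryptographic hash, and a real checker would need a stronger one.
The hiding, randomization and alerting parts of the proposal are not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over a byte range (non-cryptographic, illustration only). */
uint64_t fnv1a64(const unsigned char *p, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;            /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Hypothetical record of one region the checker watches. */
struct watched_region {
    const unsigned char *base;
    size_t len;
    uint64_t baseline;   /* hash taken when the region was known good */
};

/* Record the known-good hash of a region. */
void region_record(struct watched_region *w, const unsigned char *base,
                   size_t len)
{
    w->base = base;
    w->len = len;
    w->baseline = fnv1a64(base, len);
}

/* Returns 1 if the region still hashes to its baseline, 0 otherwise. */
int region_intact(const struct watched_region *w)
{
    return fnv1a64(w->base, w->len) == w->baseline;
}
```

A watchdog would call region_intact() on each pass and raise the alert when
it returns 0. The baseline itself also has to be protected from the attacker.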

that method does not prevent rootkits in general (nothing can), but sure 
makes their life more risky in practice - and a guaranteed livelihood 
and risk reduction is what typical criminals are interested in 
primarily, not whether they can break into a particular house.

If we implement it then it should not be present in distro .config's, 
etc. - it should be as invisible as possible - perhaps only be part of 
the kernel image .init.data section in some unremarkably generic manner. 

[ It would be nice to have a 'randomize instruction scheduling' option 
  for gcc, to make automated attacks that recognize specific instruction 
  patterns less reliable. ]

A good benchmark for such a silent alarm facility would be ...
From: Andi Kleen
Date: Friday, September 5, 2008 - 5:01 am

First as a minor pedantic correction (sorry!): the ro syscall table is not 
fully free.  It means you cannot use 2MB pages anymore to map it, which costs

One way to do that today is to feed gcc random data for profile feedback.

Game copy protections have been playing similar games for decades. While
I'm sure it was endless fun for both sides, afaik the crackers tended to
ultimately win. And all of these things also make the kernel more
fragile, which is not good. Likely a case of "the only way to win is not to play".

I liked Alan's proposal of using hypervisor support for truly ro pages,  
although even that is not fully hole proof because of indirect pointers. 
But at least it would make it generally harder to inject code.

-Andi

-- 
ak@linux.intel.com
--

From: pageexec
Date: Friday, September 5, 2008 - 5:00 am

there's that adage about history being repeated by those not knowing it ;)
for details see the series based around bypassing Vista's PatchGuard at:

  http://uninformed.org/?v=3
  http://uninformed.org/?v=6

i believe the above mentioned papers prove that it's not a good benchmark ;)

--

From: Ingo Molnar
Date: Friday, September 5, 2008 - 8:42 am

i think Linux is fundamentally different here as we have the source 

and every box where it matters we could have a _per box_ randomized 
kernel image in essence, with non-essential symbols thrown away, and 
with a few checks inserted in random locations - inlined and in essence 
unrecognizable from the general entropy of randomization.

Not that a randomizing compiler which inserts true, hard to eliminate 
entropy would be easy to implement. But once done, the cat and mouse 
game is over and the needle is hidden in the hay-stack. At least as long 
as transparent rootkits are involved.

a successful attack that wants to disable the checks reliably would have 
to patch the IDT and would have to emulate full kernel execution and 
would have to detect the pattern of an alert on the hardware API level - 
as that would be the only reliably observable output of the system. 
Besides being impractical at best, at minimum a huge slow-down would 
occur.

the only other option would be for a rootkit to transparently switch to 
another, new, non-checked kernel image on the fly, while keeping all 
user-space context safe. That's a feature Linux would like to have 
anyway ;-) [and this could be made really difficult as well if gcc 
inserted a modest amount of per kernel random noise in the layout of all 
data structures / field offsets.]

	Ingo
--

From: pageexec
Date: Friday, September 5, 2008 - 9:23 am

how's that supposed to work for the binary distros, i.e., the majority of

why do you assume that an attacker wants to do that? it's equally possible,
and there's even academic research on this in addition to the underground
cracking scene, that one simply hides the modifications from the checker.

from marking your patched code as unreadable to executing it from a different
place than what the checker checks, there're many ways to trick such checkers.
as far as reality goes, it's never been game over ;).

--

From: Ingo Molnar
Date: Friday, September 5, 2008 - 9:52 am

it takes less than 10 minutes to build a full kernel on recent hardware. 

yes, in this area debuggability is in straight conflict. Since we can 
assume that both attacker and owner have about the same level of access 
to the system, making the kernel less accessible to an attacker makes it 

well at least in the case of Linux we have a fairly good tally of what 
kernel code is supposed to be executable at some given moment after 
bootup, and can lock that list down permanently until the next reboot, 
and give the list to the checker to verify every now and then? Such a 
verification pass certainly wouldn't be cheap though: all kernel 
pagetables have to be scanned and verified, plus all known code (a few 
megabytes typically), and the key CPU data structures.

	Ingo
--

From: Andi Kleen
Date: Friday, September 5, 2008 - 10:26 am

First, such checkers already exist -- they are called root kit checkers.
Various ones are around.

Doing it in a hypervisor implicitly like Alan proposed would seem much 

The issue is that a lot of non key data structures all over the memory
have function pointers (or pointers to function pointers) too.
So if you protect the syscall table they are just going to patch some dentry
instead. Still, if it's reasonably clean it might still be useful to
raise the bar a bit, but I'm not sure a checker qualifies for that.

-Andi
--

From: pageexec
Date: Friday, September 5, 2008 - 12:42 pm

how trivial do you think it is for *kernel* code to evade *userland*
checking it? ;) otherwise agreed with rest.

--

From: Andi Kleen
Date: Friday, September 5, 2008 - 1:48 pm

It depends on where the userland runs. e.g. if it's under a hypervisor
and in a separate domain it should be reasonably safe.

And then I don't think there is much difference between Ingo's kernel
checker and a user land checker. Both can be disabled if you know
about them.

-Andi

-- 
ak@linux.intel.com
--

From: pageexec
Date: Friday, September 5, 2008 - 12:37 pm

provided the end user wants/needs to have the whole toolchain on his boxes

it's not only installation time (if you meant 'installing the box' itself),

in other words, it's a permanently unsolved problem ;). somehow i don't see
Red Hat selling RHEL for production boxes with the tag 'we do not debug crashes

so no module support? what about kprobes and/or whatever else that generates

so good-bye to large page support for kernel code? else there's likely
enough unused space left in the large pages for a rootkit to hide.

what if the rootkit finds unused pieces of actual code and replaces
that (bound to happen with those generic distro configs, especially
if you have to go with a non-modular kernel)?

last but not least, how would that 'lock that list down' work exactly?

what would you verify on the code? it's obfuscated so you can't really
analyze it (else you've just solved the attacker's problem), all you can
do is probably compute hashes but then you'll have to take care of kernel
self-patching and also protecting the hashes somehow.

--

From: Ingo Molnar
Date: Saturday, September 6, 2008 - 8:42 am

it's minimal and easy. It really works to operate on the source code - 
this 'open source' thing ;-) We just still tend to think in terms of 
binary software practices that have been established in the past few 

not a problem really, it is rather small compared to all the stuff that 
is in a typical distro install. I like the fundamental message as well: 
"If you want to be more secure, you've got to have the source code, and 

it's not an unsolvable problem. The debug info can be on a separate box, 
encrypted, etc. etc - depending on your level of paranoia. The need to 
debug kernel crashes is a relatively rare event - especially on a box 

why no module support? Once the system has booted up all necessary 
modules are loaded and the ability to load new ones is locked down as 
well. This also makes it harder to inject rootkits btw. (combined with 

you don't need that in general on a perimeter box. If you need it, you 
open that locked box with the debug info and make the system more 
patchable/debuggable - at the risk of exposing same information to 

are you now talking about the randomized kernel image? The whole point 
why i proposed it was to hide the checking functionality in it, not to 
make it harder for the attacker to place the rootkit.

Once the identity of the checking code is randomized reasonably, we can 
assume it will run every now and then, and would expose any 
modifications of 'unused' kernel functions. (which the attacker would 

best would be hardware support for mark-read-only-permanently, but once 
the checker functionality is reasonably randomized, its data structure 

yes, hashes. The point would be to make the true characteristics of the 
checker a random, per system property. True, it has many disadvantages 
such as the inevitable slowdown from a randomized kernel image, the 
restrictions on debuggability, etc. - but it can serve its purpose if 
someone is willing to pay that price.

best (and most practical) tactics would still be to allow ...
From: pageexec
Date: Saturday, September 6, 2008 - 5:17 pm

the question wasn't whether it was minimal or easy but whether end users
want to have the toolchain on their production boxes, especially on these

the point is not the size of the toolchain, i don't think anyone cares
about that in the days of TB disks. the more fundamental issue is that
the toolchain doesn't normally belong to production boxes and if the
sole reason to have it is this kernel image randomization feature, then
it may not be as easy a sell as you think as there're better alternatives

what does having the debug info available in whatever form help you in
the debugging process that doesn't at the same time help an attacker?

remember, the assumption is that the attacker is already on the box (and
as root at that), trying to get his kernel rootkit to work, so you'll
have to come up with a debugging procedure where he can't leverage that
local access to pry the debug info out of your hands as you're trying
to diagnose a problem. e.g., you can't just disconnect the box from the
network if you need remote access yourself or reproducing the problem

how are the security constraints of the box related to its kernel's

and this also makes it impossible to load newer versions of modules,
which will now require a full reboot. i'm sure management will like the

so all an attacker needs to do is induce some kernel problems (due to
the underlying assumption, he can easily do that), wait for you guys


and was pointing out that you don't actually have such a good tally unless
you're willing to give up large page support for kernel code, and even if
you go for 4k pages you'll be in trouble because a generic kernel like
those used in distros is bound to have unused regions of code. and i base
this on the assumption that your randomization cannot fundamentally change
function boundaries (i.e., randomizing code placement at the basic block
level) without killing the branch predictor for good. the short of it is
that your list of 'kernel code pages' is useless without ...
From: Willy Tarreau
Date: Friday, September 5, 2008 - 1:41 pm

Then they will simply proceed like this :
  - patch /boot/vmlinuz
  - sync
  - crash system

=> user says "oh crap" and presses the reset button. Patched kernel boots.
   Game over. Patching vmlinuz for known targeted distros is even easier
   because the attacker just has to embed binary changes for the most
   common distro kernels.

Clearly all this is a waste of developer time, CPU cycles, memory,
reliability and debugging time. All that time would be more efficiently
spent auditing and debugging existing code to reduce the attack surface,
and CPU cycles + memory would be better spent adding double checks to
the most sensitive functions' entry points and user data processing.

Regards,
Willy

--

From: Ingo Molnar
Date: Saturday, September 6, 2008 - 8:45 am

a reboot often raises attention. But yes, in terms of end user boxes, 
probably not. Anyway, my points were about transparent rootkits 
installed on a running system without anyone noticing - obviously if the 
attacker can modify the kernel image and the user does not mind a reboot 
it's game over.

	Ingo
--

From: Jeroen van Rijn
Date: Saturday, September 6, 2008 - 9:34 am

Hi,

can't the VFS then, in this scenario, keep tabs on /boot/vmlinuz and
only allow modification when the process in question properly
authenticates itself? As long as we're talking signed modules, why not
lock certain files down as well?

e.g. hand the kernel a signed list of files to watch for write access,
and allow a write only after the process authenticates via a private key.
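
A minimal user-space C sketch of that policy, under loud assumptions: the
names watched_paths, path_is_watched and write_allowed are hypothetical, the
list is hard-coded rather than delivered signed, and the key-based
authentication is reduced to a boolean decided elsewhere.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical watch list. In the proposal it would be handed to the
 * kernel signed and verified before use; that step is not shown here.
 */
static const char *const watched_paths[] = {
    "/boot/vmlinuz",
};

/* Returns true if path exactly matches an entry on the watch list. */
bool path_is_watched(const char *path)
{
    size_t n = sizeof(watched_paths) / sizeof(watched_paths[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(path, watched_paths[i]) == 0)
            return true;
    }
    return false;
}

/*
 * authenticated is assumed to be the outcome of a key-based check done
 * elsewhere. Unwatched files are always writable in this sketch.
 */
bool write_allowed(const char *path, bool authenticated)
{
    return !path_is_watched(path) || authenticated;
}
```

Matching on the exact path string ignores symlinks, hard links and raw
block device access, so this shows only the shape of the decision.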

-- Jeroen.

n.b. I understand this would slow down things more, but if we're
talking about taking extreme measures...
--

From: Pavel Machek
Date: Sunday, September 7, 2008 - 5:53 am

Well, install a rootkit in /boot/vmlinuz, sync, then wait for the user to
reboot their system?

Even well-kept servers are rebooted from time to time.

I agree -- the only way to win is not to play this game.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Arjan van de Ven
Date: Friday, September 5, 2008 - 9:05 am

On Fri, 05 Sep 2008 11:43:31 +0200


so I'm not going to say that the patch is important or good;
it's the result of ben mentioning the idea on irc and me thinking "sure
let's see what it would take and cost".
Nothing more than that.


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--
