Hi,
With a lot of help from Ingo Molnar and Pekka Enberg over the last couple of
weeks, we've been able to produce a new version of kmemcheck!

General description: kmemcheck is a patch to the linux kernel that detects
use of uninitialized memory. It does this by trapping every read and write to
memory that was allocated dynamically (e.g. using kmalloc()). If a memory
address is read that has not previously been written to, a message is printed
to the kernel log.

Changes since v2:
- Don't use change_page_attr()
- Clean up the use of PTE flags. In particular, we don't mess with the
  existing flags, we simply add a new flag called HIDDEN
- Clean up the use of gfp flags. We now have a separate SLAB_NOTRACK flag
  for slab caches in addition to __GFP_NOTRACK
- Mask interrupts while we are executing the faulting instruction
- Handle CMPS instructions correctly
- Provide a (faster) memset() that doesn't go through the #PF/#DB dance
- More debugging facilities
- Allow (optionally) partially uninitialized reads
- It actually boots!

In addition, we have a second patch (thanks to Ingo) to be applied on top of
the core kmemcheck -- this silences a few of the bogus reports.

The current version of the patch boots on real hardware, but we've seen
freezes on some machines, so it's not perfect yet. (In other words, this
patch is HIGHLY experimental, and run at your own risk, etc.)

Other errors that still need to be resolved:
- SLUB uses the first few bytes of an object as a pointer to the next free
  object (the so-called freepointer). Because of this, those bytes will
  almost always be marked "initialized" although they aren't really.
- DMA can be a problem since there's generally no way for kmemcheck to
  determine when/if a chunk of memory is used for DMA. Ideally, DMA should be
  allocated with untracked caches, but this requires annotation of the
  drivers in question.

The patch applies to linux-2.6.git.

Kind regards,
Vegard Nossum

arch/x86/Kconfig.debug ...
--
Impressive patch! On the other hand a lot of the interesting
data isn't in kmalloc anymore, but in slab. Does it really track
all that much?

Also i'm not sure how you handle initializedness of DMAed data
(like network buffers). Wouldn't you need hooks into pci_dma_*
for this?

Your assumption that only the string instructions can take
multiple page faults seems a little dangerous too.

-Andi
--
Yes, this is true. I cannot guarantee that there are no other
instructions that could access more than one memory location but only
take one page fault. However, since the kernel does boot, we at least
know that these instructions are not very frequently used. (If you
know of any other instructions we might be missing, I'll be happy to
know about it!)

There is also the point that if kmemcheck doesn't handle all the
faulting addresses, it will simply fault again and again, without
making any progress. I mean, it won't go unnoticed for very long :-)

This is also why we depend on M386 and !X86_GENERIC, to avoid those
MMX, etc. instructions, as we have no support for those currently.

Sincerely,
Vegard Nossum
--
Pretty much all in the right circumstances.
e.g. consider a segment reload in tracked memory.
Also there are various instructions which do all kinds of complicated
things internally; like IRET or INT: often with many memory accesses.
Just page through an instruction manual and look at the pseudo code.

I would not expect problems from MMX/SSE here (except for the generic
ones all instructions have)

-Andi
--
Yes, this is true. Then our task is to make sure that this memory is
never allocated from tracked caches. We do have some changes in this
area, for instance we never track task structs. Keep in mind that only
slab objects are tracked currently, so things like stacks never catch
page faults. I am not sure if this is exactly what you had in mind,
but I don't know other kernel code well enough to come up with
perhaps more relevant examples :-)

For now, I am simply assuming that we never load task segments, GDTs,
LDTs, or paging structures from tracked memory (e.g. regular

The problem with these instructions is not that they take page faults,
but that kmemcheck doesn't know how to handle them. Kmemcheck needs to
parse the instruction stream at EIP to determine what addresses were
accessed, their size, and the type (read or write). This can be done
currently with a surprisingly small amount of code.

But AFAIK the format for MMX and SSE is different from the "regular"
instructions, and so I don't know how to parse them. But this is
something we can look at later.

Vegard

PS: Thanks for telling me about how change_page_attr() was wrong in
kmemcheck v2. A lot of things were simply wrong in v2, but hopefully
they are better now :-)
--
Given that you don't seem to handle networking yet I wonder how

There's the stack for one too. And some others I'm probably

You only need this for the size and to detect string instructions, right?
The address should be delivered with the page fault and the r/w status too.

I think for string instructions you could probably detect it with
a little state machine that detects multiple page faults on the same
instruction.

Or just prevent the compiler/the code from generating string instructions.
There should not be that many once you stop gcc from generating inline
string ops (-Os is probably enough for that)

For size you could in theory use VT which has special support in the CPU
to help with parsing this, although that would limit it to modern CPUs

I'm pretty sure there are other special instructions that you will
eventually run into. Intel (and sometimes AMD) add new ones each CPU
generation :) Reimplementing instruction decoding on x86 is not
an easy job. Anyways if you really want to do it I would rather
recommend using one of the existing codes like the x86-emulate
that is in KVM, but even that one is far from complete. Trying
to avoid it would probably be better.

-Andi
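
The "little state machine" idea above could look roughly like the following. This is only a userspace sketch under stated assumptions: the struct, the function names and the cap of four addresses are all made up for illustration, and nothing here is taken from the kmemcheck patch. It records each distinct faulting address seen for one instruction pointer and resets when a different instruction faults or the instruction retires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FAULT_ADDRS 4 /* assumed cap for this sketch */

/* Hypothetical per-CPU state: which addresses faulted for the current instruction. */
struct fault_tracker {
	unsigned long ip;                     /* instruction currently being handled */
	unsigned long addrs[MAX_FAULT_ADDRS]; /* distinct faulting addresses seen so far */
	size_t n_addrs;
	bool active;
};

/*
 * Called on every page fault. Returns how many distinct addresses have
 * faulted for the instruction at 'ip'. A result above 1 means the same
 * instruction touched more than one tracked location.
 */
static size_t fault_tracker_on_fault(struct fault_tracker *t,
				     unsigned long ip, unsigned long addr)
{
	size_t i;

	if (!t->active || t->ip != ip) {
		/* A different instruction faulted: start over. */
		t->ip = ip;
		t->n_addrs = 0;
		t->active = true;
	}

	for (i = 0; i < t->n_addrs; i++)
		if (t->addrs[i] == addr)
			return t->n_addrs; /* repeat fault on a known address */

	assert(t->n_addrs < MAX_FAULT_ADDRS);
	t->addrs[t->n_addrs++] = addr;
	return t->n_addrs;
}

/* Called once the instruction has completed (e.g. from the debug trap). */
static void fault_tracker_on_retire(struct fault_tracker *t)
{
	t->active = false;
	t->n_addrs = 0;
}
```

With this, a caller would un-hide every address in addrs[] before single-stepping, and hide them all again in the retire path, without having to decode the instruction to know how many operands it has.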
--
you are wrong about no networking support: i have booted up a full
distro config on real hardware with full networking, etc.

Ingo
--
Hi Andi,
If the DMA'd memory is allocated from the page allocator, we don't
need to worry about it just yet. In case it's from kmalloc() you can
pass __GFP_NOTRACK to annotate those call sites where the memory is
filled by DMA (memory that is read needs to be initialized by the
caller obviously). There was some discussion with Ingo of a
__GFP_DMAFILL annotation to tag those call sites instead of
__GFP_NOTRACK which would work the same way.
--
RX Network packets are usually allocated with kmalloc
Ok you should add that then to skbuff.c.
-Andi
--
Indeed. If you look at the second patch, I think Ingo is already doing
that with __GFP_ZERO which accomplishes the same thing. But yeah,
we're probably missing a lot of callsites atm.
--
Hi Andi,
It tracks all slab caches. What we're not tracking is pages from the
page allocator that are directly used by callers. We had some
discussion of this already and we definitely want to extend it to
cover that too later on.
--
It's probably tricky; there are all kinds of hidden page faults
on x86 on data structures allocated as pages (e.g. GDT, LDT [which
is sometimes kmalloc too], stack etc.)

-Andi
--
Hi Andi,
Aah, I see. We can annotate those callers to disable the page faulting
but maybe that's not practical, dunno. Perhaps it's not such a big
problem to track slab objects only as those contain most of the
interesting ones anyway...
--
There is a fundamental misunderstanding here: GFP_DMA allocations have
nothing to do with DMA. Rather GFP_DMA means allocate memory in a special
range of physical memory that is required by legacy devices that cannot
use the high address bits for one reason or another. Any regular
memory can be used for DMA.

Could you refactor the patch a bit? This is quite a big patch.
--
Hi Christoph,
No there isn't and we've been over this with Vegard many times :-).
Christoph, can you actually see this in the patch? There shouldn't be
any __GFP_DMA confusion there. What we have is per-object
__GFP_NOTRACK which can be used to suppress false positives for
DMA-filled objects and SLAB_NOTRACK for whole _caches_ that contain
objects for which we must not take page faults at all.
--
Hmmmm... You seem to assume that __GFP_NOTRACK can be passed to slab
function calls like kmalloc. That is pretty unreliable. Could we add

Drop this one. create_kmalloc_cache is done only during bootstrap and
kmalloc caches either all have SLAB_NOTRACK set or all do not have it

Same here.
--
Hello,
Thank you for taking the time to look at this patch!
I don't understand. This is the point, __GFP_NOTRACK _can_ be passed
to slab functions like kmalloc. By default, when kmemcheck is enabled
in the config, all other allocations will be tracked implicitly. The
notrack flag exists to exempt certain (critical) allocations from this

The cache_cache is needed so that we have somewhere to allocate
kmem_cache objects from. These objects are accessed from kmemcheck in
the page fault handler. If the caches are allocated from tracked
memory, we get a recursive page fault, which is not nice, to say the

No. Exactly one kmalloc_cache is created with the NOTRACK flag set,

No. This is dma_kmalloc_cache(). No DMA memory should ever be tracked
by kmemcheck, because DMA doesn't cause page faults. (So in fact,
tracking DMA is by definition not possible.)

Are you sure you are not confusing tracking with tracing? It's only
one letter different in spelling, but makes a huge difference in
meaning :-)

Kind regards,
Vegard Nossum
--
Ok. Then the allocator manages the gfp flag. Then we need to make sure
to clear that flag at some point. The flag needs to be consistently set

So it breaks recursion. But this adds a new cache that is rarely
used. There will be only about 50-100 kmem_cache objects in the system. I
thought you could control the tracking on a per-object level? Would not a

More reasons to drop cache_cache. cache_cache is not a kmalloc array
cache by the way since it does not support power of two allocs. Its a

All slab memory allocations never cause page faults regardless of where
they are allocated. Only user space pages can be handled by page faults

No I am quite sure what tracking is since the slab allocators have their
own tracking.

But you may want to explain things better.
--
Hi Christoph,
Yes. __GFP_NOTRACK can be used to suppress tracking of objects (but we
still take the page fault for each access). That is required for things
like DMA filled pages that are never initialized by the CPU.
SLAB_NOTRACK is for not tracking a whole *cache* so that we _don't_ take
the page fault. This is needed for kmemcheck implementation (to avoid

No. We need to not track the whole page to avoid recursive faults. So
for kmemcheck we absolutely do need cache_cache but we can, of course,
hide that under an alloc_cache() function that only uses the extra cache
when CONFIG_KMEMCHECK is enabled?
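
To make the difference between the two flags as described in this message concrete, here is a small userspace model. It is a sketch only: the MODEL_* constants, the struct and the function are hypothetical names invented for illustration, and the flag values are not the ones used in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for SLAB_NOTRACK and __GFP_NOTRACK (values are arbitrary). */
#define MODEL_SLAB_NOTRACK 0x1u /* per-cache: pages are never hidden */
#define MODEL_GFP_NOTRACK  0x1u /* per-object: faults still happen, reports do not */

struct access_policy {
	bool takes_page_fault; /* every access to the object traps */
	bool reports_uninit;   /* reads of uninitialized bytes are reported */
};

/* Combine the cache-level and allocation-level flags into one policy. */
static struct access_policy model_policy(unsigned int cache_flags,
					 unsigned int gfp_flags)
{
	struct access_policy p;

	/* A NOTRACK cache never has its pages hidden, so no faults at all. */
	p.takes_page_fault = !(cache_flags & MODEL_SLAB_NOTRACK);

	/* A NOTRACK object still faults, but its contents are not checked. */
	p.reports_uninit = p.takes_page_fault &&
			   !(gfp_flags & MODEL_GFP_NOTRACK);
	return p;
}
```

Under this model a DMA-filled buffer allocated with the per-object flag from an ordinary cache still pays for the page fault on each access, while the objects kmemcheck itself touches in the fault handler have to come from a cache with the per-cache flag.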
--
Btw, one option is to have a new _page flag_ so that we no longer need
to look inside struct kmem_cache in the page fault handler.

Pekka
--
Ok, then I think we are still talking about different things :-)
The tracking that kmemcheck does is actually a byte-for-byte tracking
of whether memory has been initialized or not. Think of it as valgrind
for the kernel. We do this by "hiding" pages (marking them non-present
for the MMU) and taking the page faults, which effectively tells us
what memory is being read from or written to.

(This generally means that the tracking that kmemcheck does is
page-granular, but we can help this by making entire caches tracked or
non-tracked.)

(This also means that *all* tracked memory allocations require twice
the amount of memory that was requested -- but this is luckily
configurable by disabling kmemcheck entirely :-)).

I chose to implement this in the slab layer because this is probably
where most of the interesting allocations are coming from, and this
gives us better control over what most users/callers care about,
namely the specific objects.

In a way, kmemcheck is similar to slab poisoning, since that can also
be used to detect the cases where memory is used before it is
initialized. kmemcheck is a heavier-weight approach, however, and more
precise, as it gives you the exact location of the error.

I hope this clears it up.

Kind regards,
Vegard Nossum
--
Ahh. Okay. But ZONE_DMA pages are exempt from that scheme? You know

But the slab layer allocates pages < PAGE_SIZE. You need to take a fault
right? So each object would need its own page?
--
No. We allocate a shadow page for each data page which we then use as
a per-byte "bitmap." For every tracked _page_ we take the page fault
always.
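
The shadow-page scheme can be pictured with a small userspace model. This is a sketch under assumptions, not the code from the patch: the 4096-byte page size, the struct layout and every name below are invented for illustration, and the "trapped" accesses are ordinary function calls rather than real page faults.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096 /* assumed page size for this sketch */

/* Hypothetical shadow states; the patch may encode these differently. */
enum shadow_state { SHADOW_UNINITIALIZED = 0, SHADOW_INITIALIZED = 1 };

struct tracked_page {
	unsigned char data[PAGE_BYTES];   /* the page the caller sees */
	unsigned char shadow[PAGE_BYTES]; /* one status byte per data byte */
};

/* A freshly allocated tracked page starts out fully uninitialized. */
static void tracked_page_reset(struct tracked_page *p)
{
	memset(p->shadow, SHADOW_UNINITIALIZED, sizeof(p->shadow));
}

/* Model of a trapped write: store the bytes and mark them initialized. */
static void tracked_write(struct tracked_page *p, size_t off,
			  const void *src, size_t len)
{
	assert(off <= PAGE_BYTES && len <= PAGE_BYTES - off);
	memcpy(p->data + off, src, len);
	memset(p->shadow + off, SHADOW_INITIALIZED, len);
}

/* Model of a trapped read: count the bytes in the range never written. */
static size_t tracked_read_check(const struct tracked_page *p,
				 size_t off, size_t len)
{
	size_t i, uninit = 0;

	assert(off <= PAGE_BYTES && len <= PAGE_BYTES - off);
	for (i = 0; i < len; i++)
		if (p->shadow[off + i] != SHADOW_INITIALIZED)
			uninit++;
	return uninit;
}
```

In this model a 4-byte read after a 2-byte write to the same offset finds 2 uninitialized bytes, which is the "partially uninitialized read" case. The struct also shows where the doubled memory cost comes from: each data byte carries one shadow byte.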
--
it should also be made clear that not only does kmemcheck consume half
of the RAM to do byte granular tracking of the other half of RAM, it's
also slow, very slow, because almost every kernel-space instruction will
generate a pagefault and then it will be single-stepped and it takes a
debug fault as well.

That's of course totally crazy, but that's also OK and it's what makes
the feature so interesting and powerful.

For example, when CONFIG_DEBUG_PAGEALLOC=y was introduced 5 years ago,
it was almost unusable on modern hardware, due to the slowdown it gave.
People said "twiddling ptes and flushing the TLB for every allocation,
that's crazy!".

Today it can be enabled without noticing anything on a desktop, and it
catches lots of nasty bugs.

The many debugging helpers Linux has are our eyes and ears - they catch
stuff our real eyes did not catch. We need to sharpen these tools
constantly, and do all the things that current hardware allows us to do
sanely.

The same speedup will happen with kmemcheck as well in the long run. It
is a big slowdown currently due to the massive amount of pagefaults it
generates, even on top of the line hardware, but it's already fast
enough to boot up and to catch bugs. [and we can optimize it by quite a
degree - i've already extended the profiler to trace kmemcheck pagefault
sources.] It will never be usable in production, but the boundary of
where to enable it and why will move constantly.

So i'm convinced that the time has come for kmemcheck. It already caught
4 live kernel bugs and it's been tested on 2 boxes only. Please help us
make the SLUB bits squeaky clean :-)

Ingo
--
Applies on top of kmemcheck patch. Fixes/silences some reports of
use of uninitialized memory.

From: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>

diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 8786e01..b70cd97 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -278,11 +278,19 @@ void do_schedule_next_timer(struct siginfo *info);

static inline void copy_siginfo(struct siginfo *to, struct siginfo *from)
{
+#ifdef CONFIG_KMEMCHECK
+ memcpy(to, from, sizeof(*to));
+#else
+ /*
+ * Optimization, only copy up to the size of the largest known
+ * union member:
+ */
if (from->si_code < 0)
memcpy(to, from, sizeof(*to));
else
/* _sigchld is currently the largest know union member */
memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld));
+#endif
}

#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 109734b..fb6f6ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -875,8 +875,8 @@ struct file_lock {
struct pid *fl_nspid;
wait_queue_head_t fl_wait;
struct file *fl_file;
- unsigned char fl_flags;
- unsigned char fl_type;
+ unsigned int fl_flags;
+ unsigned int fl_type;
loff_t fl_start;
loff_t fl_end;

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 047d432..d4cc6ee 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -193,8 +193,8 @@ struct dev_addr_list
{
struct dev_addr_list *next;
u8 da_addr[MAX_ADDR_LEN];
- u8 da_addrlen;
- u8 da_synced;
+ unsigned int da_addrlen;
+ unsigned int da_synced;
int da_users;
int da_gusers;
};
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 412672a..7bdb37f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1294,7 +1294,11 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
static inline s...
