Hi,

With a lot of help from Ingo Molnar and Pekka Enberg over the last couple of weeks, we've been able to produce a new version of kmemcheck!

General description: kmemcheck is a patch to the linux kernel that detects use of uninitialized memory. It does this by trapping every read and write to memory that was allocated dynamically (e.g. using kmalloc()). If a memory address is read that has not previously been written to, a message is printed to the kernel log.

Changes since v2:
 - Don't use change_page_attr()
 - Clean up the use of PTE flags. In particular, we don't mess with the existing flags, we simply add a new flag called HIDDEN
 - Clean up the use of gfp flags. We now have a separate SLAB_NOTRACK flag for slab caches in addition to __GFP_NOTRACK
 - Mask interrupts while we are executing the faulting instruction
 - Handle CMPS instructions correctly
 - Provide a (faster) memset() that doesn't go through the #PF/#DB dance
 - More debugging facilities
 - Allow (optionally) partially uninitialized reads
 - It actually boots!

In addition, we have a second patch (thanks to Ingo) to be applied on top of the core kmemcheck -- this silences a few of the bogus reports.

The current version of the patch boots on real hardware, but we've seen freezes on some machines, so it's not perfect yet. (In other words, this patch is HIGHLY experimental, and run at your own risk, etc.)

Other errors that still need to be resolved:
 - SLUB uses the first few bytes of an object as a pointer to the next free object (the so-called freepointer). Because of this, those bytes will almost always be marked "initialized" although they aren't really.
 - DMA can be a problem since there's generally no way for kmemcheck to determine when/if a chunk of memory is used for DMA. Ideally, DMA should be allocated with untracked caches, but this requires annotation of the drivers in question.

The patch applies to linux-2.6.git.

Kind regards,
Vegard Nossum
...
Applies on top of kmemcheck patch. Fixes/silences some reports of
use of uninitialized memory.
From: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 8786e01..b70cd97 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -278,11 +278,19 @@ void do_schedule_next_timer(struct siginfo *info);
static inline void copy_siginfo(struct siginfo *to, struct siginfo *from)
{
+#ifdef CONFIG_KMEMCHECK
+ memcpy(to, from, sizeof(*to));
+#else
+ /*
+ * Optimization, only copy up to the size of the largest known
+ * union member:
+ */
if (from->si_code < 0)
memcpy(to, from, sizeof(*to));
else
/* _sigchld is currently the largest know union member */
memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld));
+#endif
}
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 109734b..fb6f6ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -875,8 +875,8 @@ struct file_lock {
struct pid *fl_nspid;
wait_queue_head_t fl_wait;
struct file *fl_file;
- unsigned char fl_flags;
- unsigned char fl_type;
+ unsigned int fl_flags;
+ unsigned int fl_type;
loff_t fl_start;
loff_t fl_end;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 047d432..d4cc6ee 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -193,8 +193,8 @@ struct dev_addr_list
{
struct dev_addr_list *next;
u8 da_addr[MAX_ADDR_LEN];
- u8 da_addrlen;
- u8 da_synced;
+ unsigned int da_addrlen;
+ unsigned int da_synced;
int da_users;
int da_gusers;
};
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 412672a..7bdb37f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1294,7 +1294,11 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
static inline struct sk_buff ...

Hmmmm... You seem to assume that __GFP_NOTRACK can be passed to slab function calls like kmalloc. That is pretty unreliable. Could we add

Drop this one. create_kmalloc_cache is done only during bootstrap and kmalloc caches either all have SLAB_NOTRACK set or all do not have it

Same here.
--
Hello,

Thank you for taking the time to look at this patch!

I don't understand. This is the point, __GFP_NOTRACK _can_ be passed to slab functions like kmalloc. By default, when kmemcheck is enabled in the config, all other allocations will be tracked implicitly. The notrack flag exists to exempt certain (critical) allocations from this

The cache_cache is needed so that we have somewhere to allocate kmem_cache objects from. These objects are accessed from kmemcheck in the page fault handler. If the caches are allocated from tracked memory, we get a recursive page fault, which is not nice, to say the

No. Exactly one kmalloc_cache is created with the NOTRACK flag set,

No. This is dma_kmalloc_cache(). No DMA memory should ever be tracked by kmemcheck, because DMA doesn't cause page faults. (So in fact, tracking DMA is by definition not possible.)

Are you sure you are not confusing tracking with tracing? It's only one letter different in spelling, but makes a huge difference in meaning :-)

Kind regards,
Vegard Nossum
--
Ok. Then the allocator manages the gfp flag. Then we need to make sure to clear that flag at some point. The flag needs to be consistently set

So it breaks recursion. But this adds a new cache that is rarely used. There will be only about 50-100 kmem_cache objects in the system. I thought you could control the tracking on a per-object level? Would not a

More reasons to drop cache_cache. cache_cache is not a kmalloc array cache by the way since it does not support power of two allocs. It's a

Slab memory allocations never cause page faults regardless of where they are allocated. Only user space pages can be handled by page faults

No I am quite sure what tracking is since the slab allocators have their own tracking. But you may want to explain things better.
--
Ok, then I think we are still talking about different things :-)

The tracking that kmemcheck does is actually a byte-for-byte tracking of whether memory has been initialized or not. Think of it as valgrind for the kernel.

We do this by "hiding" pages (marking them non-present for the MMU) and taking the page faults, which effectively tells us what memory is being attempted to be read from or written to. (This generally means that the tracking that kmemcheck does is page-granular, but we can help this by making entire caches tracked or non-tracked.)

(This also means that *all* tracked memory allocations require twice the amount of memory that was requested -- but this is luckily configurable by disabling kmemcheck entirely :-)).

I chose to implement this in the slab layer because this is probably where most of the interesting allocations are coming from, and this gives us better control over what most users/callers care about, namely the specific objects.

In a way, kmemcheck is similar to slab poisoning, since that can also be used to detect the cases where memory is used before it is initialized. This is a heavier-weight approach, however, and more precise, as it gives you the exact location of the error.

I hope this clears it up.

Kind regards,
Vegard Nossum
--
Ahh. Okay. But ZONE_DMA pages are exempt from that scheme? You know

But the slab layer allocates pages < PAGE_SIZE. You need to take a fault right? So each object would need its own page?
--
No. We allocate a shadow page for each data page which we then use as a per-byte "bitmap." For every tracked _page_ we take the page fault always. --
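The shadow-page scheme described above can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the code from the patch: the struct layout, the function names, and the one-shadow-byte-per-data-byte encoding are all hypothetical. It shows a write marking bytes as initialized and a read reporting how many of the bytes it touches were never written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical model: one tracked data page plus its shadow page. */
struct tracked_page {
	uint8_t data[MODEL_PAGE_SIZE];
	uint8_t shadow[MODEL_PAGE_SIZE]; /* 0 = uninitialized, 1 = initialized */
};

/* A freshly allocated tracked page starts out fully uninitialized. */
static void tracked_page_init(struct tracked_page *p)
{
	memset(p->shadow, 0, sizeof(p->shadow));
}

/* What a write fault would do in this model: mark the bytes initialized. */
static void shadow_mark_write(struct tracked_page *p, size_t off, size_t len)
{
	assert(off <= MODEL_PAGE_SIZE && len <= MODEL_PAGE_SIZE - off);
	memset(&p->shadow[off], 1, len);
}

/*
 * What a read fault would do in this model: count the bytes in the
 * accessed range that were never written. Zero means the read is fine.
 */
static size_t shadow_check_read(const struct tracked_page *p,
				size_t off, size_t len)
{
	size_t uninitialized = 0;
	size_t i;

	assert(off <= MODEL_PAGE_SIZE && len <= MODEL_PAGE_SIZE - off);
	for (i = 0; i < len; i++)
		if (!p->shadow[off + i])
			uninitialized++;
	return uninitialized;
}
```

Returning a count rather than a yes/no answer leaves room for a policy such as the optional "partially uninitialized reads" mentioned in the changelog, where a caller could choose to complain only when every byte of the access is uninitialized.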
it should also be made clear that not only does kmemcheck consume half of the RAM to do byte granular tracking of the other half of RAM, it's also slow, very slow, because almost every kernel-space instruction will generate a pagefault and then it will be single-stepped and it takes a debug fault as well.

That's of course totally crazy, but that's also OK and it's what makes the feature so interesting and powerful.

For example, when CONFIG_DEBUG_PAGEALLOC=y was introduced 5 years ago, it was almost unusable on modern hardware, due to the slowdown it gave. People said "twiddling ptes and flushing the TLB for every allocation, that's crazy!". Today it can be enabled without noticing anything on a desktop, and it catches lots of nasty bugs.

The many debugging helpers Linux has are our eyes and ears - they catch stuff our real eyes did not catch. We need to sharpen these tools constantly, and do all the things that current hardware allows us to do sanely.

The same speedup will happen with kmemcheck as well in the long run. It is a big slowdown currently due to the massive amount of pagefaults it generates, even on top of the line hardware, but it's already fast enough to boot up and to catch bugs. [and we can optimize it by quite a degree - i've already extended the profiler to trace kmemcheck pagefault sources.] It will never be usable in production, but the boundary of where to enable it and why will move constantly.

So i'm convinced that the time has come for kmemcheck. It already caught 4 live kernel bugs and it's been tested on 2 boxes only. Please help us make the SLUB bits squeaky clean :-)

Ingo
--
Hi Christoph,

Yes. __GFP_NOTRACK can be used to suppress tracking of objects (but we still take the page fault for each access). That is required for things like DMA filled pages that are never initialized by the CPU. SLAB_NOTRACK is for not tracking a whole *cache* so that we _don't_ take the page fault. This is needed for kmemcheck implementation (to avoid

No. We need to not track the whole page to avoid recursive faults. So for kmemcheck we absolutely do need cache_cache but we can, of course, hide that under an alloc_cache() function that only uses the extra cache when CONFIG_KMEMCHECK is enabled?
--
Btw, one option is to have a new _page flag_ so that we no longer need to look inside struct kmem_cache in the page fault handler. Pekka --
There is a fundamental misunderstanding here: GFP_DMA allocations have nothing to do with DMA. Rather GFP_DMA means allocate memory in a special range of physical memory that is required by legacy devices that cannot use the high address bits for one reason or another. Any regular memory can be used for DMA.

Could you refactor the patch a bit? This is quite a big patch.
--
Hi Christoph,

No there isn't and we've been over this with Vegard many times :-).

Christoph, can you actually see this in the patch? There shouldn't be any __GFP_DMA confusion there. What we have is per-object __GFP_NOTRACK, which can be used to suppress false positives for DMA-filled objects, and SLAB_NOTRACK for whole _caches_ that contain objects for which we must not take page faults at all.
--
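The difference between the two flags, as described in this subthread, can be summarized in a small decision function. This is a user-space sketch under assumptions, not kernel code: the MODEL_* constants are stand-ins rather than the real flag values, and "mark the object's shadow bytes as initialized up front" is a guess at how per-object suppression could work, not something the thread confirms.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the flags discussed above. */
#define MODEL_SLAB_NOTRACK 0x1u /* per-cache: pages are never hidden */
#define MODEL_GFP_NOTRACK  0x1u /* per-object: faults taken, no reports */

struct track_policy {
	bool page_hidden;        /* accesses to the page take page faults */
	bool shadow_preset_init; /* assumption: object pre-marked initialized */
};

static struct track_policy track_policy_for(unsigned int cache_flags,
					    unsigned int gfp_flags)
{
	struct track_policy p;

	/* A NOTRACK cache never hides its pages, so no faults at all. */
	p.page_hidden = !(cache_flags & MODEL_SLAB_NOTRACK);

	/*
	 * A NOTRACK object in a tracked cache still lives on a hidden
	 * page, so faults are taken, but reads from it are not reported.
	 */
	p.shadow_preset_init = p.page_hidden &&
			       (gfp_flags & MODEL_GFP_NOTRACK) != 0;
	return p;
}
```

In this model a cache such as cache_cache would use the per-cache flag, because the fault handler itself touches those objects, while a DMA-filled buffer would use the per-object flag.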
Impressive patch!

On the other hand a lot of the interesting data isn't in kmalloc anymore, but in slab. Does it really track all that much?

Also i'm not sure how you handle initializedness of DMAed data (like network buffers). Wouldn't you need hooks into pci_dma_* for this?

Your assumption that only the string instructions can take multiple page faults seems a little dangerous too.

-Andi
--
Hi Andi, It tracks all slab caches. What we're not tracking is pages from the page allocator that are directly used by callers. We had some discussion of this already and we definitely want to extend it to cover that too later on. --
It's probably tricky; there are all kinds of hidden page faults on x86 on data structures allocated as pages (e.g. GDT, LDT [which is sometimes kmalloc too], stack etc.) -Andi --
Hi Andi, Aah, I see. We can annotate those callers to disable the page faulting but maybe that's not practical, dunno. Perhaps it's not such a big problem to track slab objects only as those contain most of the interesting ones anyway... --
Hi Andi, If the DMA'd memory is allocated from the page allocator, we don't need to worry about it just yet. In case it's from kmalloc() you can pass __GFP_NOTRACK to annotate those call sites where the memory is filled by DMA (memory that is read needs to be initialized by the caller obviously). There was some discussion with Ingo of a __GFP_DMAFILL annotation to tag those call sites instead of __GFP_NOTRACK which would work the same way. --
RX Network packets are usually allocated with kmalloc

Ok you should add that then to skbuff.c.

-Andi
--
Indeed. If you look at the second patch, I think Ingo is already doing that with __GFP_ZERO which accomplishes the same thing. But yeah, we're probably missing a lot of callsites atm. --
Yes, this is true. I cannot guarantee that there are no other instructions that could access more than one memory location but only take one page fault. However, since the kernel does boot, we at least know that these instructions are not very frequently used. (If you know of any other instructions we might be missing, I'll be happy to know about it!) There is also the point that if kmemcheck doesn't handle all the faulting addresses, it will simply fault again and again, without making any progress. I mean, it won't go unnoticed for very long :-) This is also why we depend on M386 and !X86_GENERIC, to avoid those MMX, etc. instructions, as we have no support for those currently. Sincerely, Vegard Nossum --
Pretty much all in the right circumstances. e.g. consider a segment reload in tracked memory. Also there are various instructions which do all kinds of complicated things internally; like IRET or INT: often with many memory accesses. Just page through an instruction manual and look at the pseudo code

I would not expect problems from MMX/SSE here (except for the generic ones all instructions have)

-Andi
--
Yes, this is true. Then our task is to make sure that this memory is never allocated from tracked caches. We do have some changes in this area, for instance we never track task structs. Keep in mind that only slab objects are tracked currently, so things like stacks never catch page faults. I am not sure if this is exactly what you had in mind, but I don't know other kernel code well enough to come up with perhaps more relevant examples :-)

For now, I am simply assuming that we never load task segments, GDTs, LDTs, or paging structures from tracked memory (e.g. regular

The problem with these instructions is not that they take page faults, but that kmemcheck doesn't know how to handle them. Kmemcheck needs to parse the instruction stream at EIP to determine what addresses were accessed, their size, and the type (read or write). This can be done currently with a surprisingly small amount of code. But AFAIK the format for MMX and SSE is different from the "regular" instructions, and so I don't know how to parse them. But this is something we can look at later.

Vegard

PS: Thanks for telling me about how change_page_attr() was wrong in kmemcheck v2. A lot of things were simply wrong in v2, but hopefully they are better now :-)
--
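To give a feel for what "parsing the instruction stream" involves, here is a deliberately tiny user-space sketch. It is not kmemcheck's decoder: it assumes 32-bit mode and recognizes only the four basic MOV opcodes (0x88 to 0x8B) with an optional 0x66 operand-size prefix. The type and function names are hypothetical. For those opcodes, bit 1 of the opcode gives the direction and bit 0 selects byte versus full operand size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum access_type { ACCESS_NONE, ACCESS_READ, ACCESS_WRITE };

struct mem_access {
	enum access_type type;
	unsigned int size; /* bytes touched in memory, 0 if none */
};

/*
 * Classify the memory operand of a 32-bit-mode MOV (opcodes 0x88..0x8B).
 * Anything else, including register-to-register forms, yields ACCESS_NONE.
 */
static struct mem_access decode_mov(const uint8_t *insn, size_t len)
{
	struct mem_access a = { ACCESS_NONE, 0 };
	unsigned int opsize = 4; /* default operand size in 32-bit mode */
	size_t i = 0;
	uint8_t op, modrm;

	/* Operand-size override prefix switches 32-bit operands to 16-bit. */
	if (i < len && insn[i] == 0x66) {
		opsize = 2;
		i++;
	}

	/* Need at least an opcode byte and a ModRM byte. */
	if (i + 1 >= len)
		return a;

	op = insn[i];
	modrm = insn[i + 1];

	if (op < 0x88 || op > 0x8B)
		return a;

	/* ModRM mod == 3 means both operands are registers. */
	if ((modrm >> 6) == 3)
		return a;

	a.type = (op & 0x02) ? ACCESS_READ : ACCESS_WRITE;
	a.size = (op & 0x01) ? opsize : 1;
	return a;
}
```

Even this narrow case needs a prefix check and a ModRM check, which hints at why full coverage of the instruction set, including MMX and SSE encodings, is a much larger job.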
Given that you don't seem to handle networking yet I wonder how

There's the stack for one too. And some others I'm probably

You only need this for the size and to detect string instructions, right? The address should be delivered with the page fault and the r/w status too.

I think for string instructions you could probably detect it with a little state machine that detects multiple page faults on the same instruction. Or just prevent the compiler/the code from generating string instructions. There should not be that many once you stop gcc from generating inline string ops (-Os is probably enough for that)

For size you could in theory use VT which has special support in the CPU to help with parsing this, although that would limit it to modern CPUs

I'm pretty sure there are other special instructions that you will eventually run into. Intel (and sometimes AMD) add new ones each CPU generation :)

Reimplementing instruction decoding on x86 is not an easy job. Anyways if you really want to do it I would rather recommend using one of the existing codes like the x86-emulate that is in KVM, but even that one is far from complete. Trying to avoid it would probably be better.

-Andi
--
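The "little state machine" idea above could look roughly like the following user-space sketch. Everything here is hypothetical (the struct, the function names, and the notion of a "retire" call made once the single-stepped instruction has completed); it only illustrates recognizing a second page fault raised by the same instruction before that instruction finishes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state: which instruction is currently faulting. */
struct fault_tracker {
	uintptr_t last_ip;  /* instruction pointer of the pending fault */
	unsigned int count; /* faults seen for it so far, 0 = idle */
};

/*
 * Record a page fault at instruction pointer ip. Returns true when the
 * same instruction has already faulted and not yet completed, i.e. it
 * touches more than one tracked location.
 */
static bool fault_tracker_note(struct fault_tracker *t, uintptr_t ip)
{
	if (t->count > 0 && t->last_ip == ip) {
		t->count++;
		return true;
	}
	t->last_ip = ip;
	t->count = 1;
	return false;
}

/* Called once the instruction has completed (e.g. from the debug trap). */
static void fault_tracker_retire(struct fault_tracker *t)
{
	t->count = 0;
}
```

With such a tracker, a handler would not need to know in advance which instructions access several addresses; it would simply keep collecting faulting addresses until the instruction retires.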
you are wrong about no networking support: i have booted up a full distro config on real hardware with full networking, etc. Ingo --
