A kernel crash dump is a snapshot of system state taken at the time the kernel crashed, useful for finding and debugging the problem that caused the crash in the first place. There is no standard mechanism for automatically collecting a crash dump on Linux, but a number of existing projects are working toward that goal. A "Linux Kernel Dump Summit" was recently mentioned on the linux-kernel mailing list, with participants from several of the crash dump projects looking to standardize the dump process and the information collected. A followup email noted that "as memory size grows, the time and space for capturing kernel crash dumps really matter." It went on to examine two strategies: partial dumps and compressed full dumps. The former risks dropping information necessary for proper debugging, while the latter risks greatly increasing the time required to collect a dump.
There are a number of existing projects for collecting automatic kernel crash dumps on Linux, including Linux Kernel Crash Dump (LKCD), Mini Kernel Dump (mkdump), kdump, and diskdump (detailed here). Some of these projects also include tools for examining the resulting dump files. Other projects focus solely on analyzing kernel crash dumps, including the perl-based Alicia (the Advanced LInux Crash-dump Interactive Analyzer) and Red Hat's crash analysis tool, which is "loosely based on the SVR4 UNIX crash command, but significantly enhanced by completely merging it with the GNU gdb debugger."
From: Hiro Yoshioka [email blocked]
To: linux-kernel
Subject: Linux Kernel Dump Summit 2005
Date: Wed, 21 Sep 2005 20:55:50 +0900 (JST)

To whom may concern

We had a Linux Kernel Dump Summit 2005.

The participants are

Dump tools Session
  diskdump -- Fujitsu
  mkdump   -- NTT Data Intellilink
  LTD      -- Hitachi
  kdump    -- Turbolinux
  Summary  -- Miracle Linux

Dump Analysis tools Session
  Alicia/crash -- Uniadex

Other participants are VA Linux/NEC/NSSOL/IPA/OSDL/Toshiba

Some discussion topics are (but not limited to)

- What kind of information do we need?
  trace information
  all of registers
  the last log of panic, oops
  LTD (Linux Tough Dump) has some nice features

- We need a partial dump
- We have to minimize the down time

- We have to dump all memory
  how can we distinguish from the kernel and user if
  kernel data is corrupted

- How we are not able to dump data
  device power management
  we need a generic mechanism to reset a device

- Hang
  NMI watch dog
  mount

- It is very difficult to debug a memory corrupt bug
- hardware error

- Where will we go to?
  IHV and Linux Kernel community collaboration are needed
  Dump Analysis tools are very important
  - There is a concern that the development process of 'crash' is not open.
  - Do we have to extend gdb?
  - We'd like to collaborate 'crash'

- kexec/kdump, mkdump, LTD, all of them use the second kernel to dump it.

- We have to share the test data, check list, test tools of dump tool developments.

We agree to have the Linux Kernel Dump Summit.

Regards,
  Hiro
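Most of the tools named above (kexec/kdump, mkdump, LTD) take the second-kernel approach: after a crash, a small pre-loaded capture kernel boots and writes the old kernel's memory out to storage. With kdump, that memory is exposed to the capture environment as an ELF image at /proc/vmcore. The following is a minimal sketch of that capture step, under the assumption that /proc/vmcore is available; the output path and buffer size are arbitrary choices for illustration, and real dump tools layer format headers, page filtering, and error recovery on top of this simple copy loop.

    /* Minimal sketch: copy the crashed kernel's memory image, exposed by
     * the kdump capture kernel as /proc/vmcore, out to a file.  The
     * destination path and buffer size are illustrative only. */
    #include <stdio.h>

    int main(void)
    {
            static char buf[1 << 20];               /* 1MB copy buffer */
            FILE *in = fopen("/proc/vmcore", "rb");
            FILE *out = fopen("/var/crash/vmcore", "wb");
            size_t n;

            if (!in || !out) {
                    perror("fopen");
                    return 1;
            }
            while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
                    if (fwrite(buf, 1, n, out) != n) {
                            perror("fwrite");
                            return 1;
                    }
            }
            fclose(in);
            fclose(out);
            return 0;
    }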
From: OBATA Noboru [email blocked]
Subject: Re: Linux Kernel Dump Summit 2005
Date: Thu, 06 Oct 2005 21:17:18 +0900 (JST)

Hi, Hiro,

On Wed, 21 Sep 2005, Hiro Yoshioka wrote:
>
> We had a Linux Kernel Dump Summit 2005.
>
> - We need a partial dump
> - We have to minimize the down time
>
> - We have to dump all memory
>   how can we distinguish from the kernel and user if
>   kernel data is corrupted

As memory size grows, the time and space for capturing kernel crash dump really matter. We discussed two strategies in the dump summit.

1. Partial dump
2. Full dump with compression

PARTIAL DUMP
============

Partial dump captures only pages that are essential for later analysis, possibly by using some mark in mem_map. This certainly reduces both time and space of crash dump, but there is a risk because no one can guarantee that a dropped page is really unnecessary in analysis (it can be a tragedy if analysis went unsolved because of the dropped page). Another risk is a corruption of mem_map (or other kernel structure), which makes the identification of necessary pages unreliable.

So there would be best if a user can select the level of partial dump. A careful user may always choose a full dump, while a user who is tracking the well-reproducible kernel bug may choose fast and small dump.

FULL DUMP WITH COMPRESSION
==========================

Those who still want a full dump, including me, are interested in dump compression. For example, the LKCD format (at least v7 format) supports pagewise compression with the deflate algorithm. A dump analyze tool "crash" can transparently analyze the compressed dump file in this format. The compression will reduce the storage space at certain degree, and may also reduce the time if a dump process were I/O bounded.

WHICH IS BETTER?
================

I wrote a small compression tool for LKCD v7 format to see how effective the compression is, and it turned out that the time and size of compression were very much similar to that of gzip, not surprisingly.

Compressing a 32GB dump file took about 40 minutes on Pentium 4 Xeon 3.0GHz, which is not good enough because the dump without compression took only 5 minutes; eight times slower. Besides, the compress ratios were somewhat picky. Some dump files could not be compressed well (the worst case I found was only 10% reduction in size).

After examining the LKCD compress format, I must conclude that the partial dump is the only way to go when time and size really matter. Now I'd like to see how effective the existing partial dump functionalities are.

Regards,
--
OBATA Noboru [email blocked]
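As a point of reference, the pagewise deflate compression that the LKCD v7 format supports can be approximated with zlib by compressing each 4KB page independently and storing a page raw when deflate fails to shrink it. The sketch below follows that scheme; the eight-byte per-page header used here (stored length plus a compressed flag) is an illustrative placeholder, not LKCD's actual on-disk page record, but it shows why the results track gzip so closely: every page simply passes through the same deflate algorithm.

    /* Sketch of pagewise deflate compression, in the spirit of the LKCD v7
     * format discussed above.  Each 4KB page is compressed independently
     * with zlib; a page is stored uncompressed when deflate does not
     * shrink it.  The per-page header (length + flag) is an illustrative
     * placeholder, not LKCD's real on-disk layout. */
    #include <stdio.h>
    #include <stdint.h>
    #include <zlib.h>

    #define PAGE_SIZE 4096

    int main(int argc, char **argv)
    {
            if (argc != 3) {
                    fprintf(stderr, "usage: %s <raw-dump> <compressed-out>\n", argv[0]);
                    return 1;
            }
            FILE *in = fopen(argv[1], "rb");
            FILE *out = fopen(argv[2], "wb");
            if (!in || !out) {
                    perror("fopen");
                    return 1;
            }

            unsigned char page[PAGE_SIZE];
            unsigned char cbuf[PAGE_SIZE * 2];
            size_t n;

            while ((n = fread(page, 1, PAGE_SIZE, in)) > 0) {
                    uLongf clen = sizeof(cbuf);
                    uint32_t compressed = 0;
                    const unsigned char *data = page;
                    uint32_t len = (uint32_t)n;

                    if (compress2(cbuf, &clen, page, n, Z_DEFAULT_COMPRESSION) == Z_OK
                        && clen < n) {
                            data = cbuf;
                            len = (uint32_t)clen;
                            compressed = 1;
                    }
                    /* Illustrative per-page header: stored length, then a flag. */
                    fwrite(&len, sizeof(len), 1, out);
                    fwrite(&compressed, sizeof(compressed), 1, out);
                    fwrite(data, 1, len, out);
            }
            fclose(in);
            fclose(out);
            return 0;
    }

Linking against zlib (-lz) is all that is needed to build it; running something like this over a large raw dump image gives a quick feel for the compression ratios and CPU cost Obata describes above.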