Linux: The Journaling Block Device

Submitted by Kedar Sovani
on June 21, 2006 - 2:40am

Atomicity is the property of an operation that it either succeeds completely or fails completely. Disks assure atomicity at the sector level. This means that a write to a sector either goes through completely or not at all. But when an operation spans multiple sectors of the disk, a higher-level mechanism is needed. This mechanism should ensure that modifications to the entire set of sectors are handled atomically. Failure to do so leads to inconsistencies. This document talks about the implementation of the Journaling Block Device in Linux.

Let's look at how these inconsistencies could be introduced to a filesystem. Say we have an application that creates a file. The filesystem internally has to decrease the number of free inodes by one, initialize the inode on the disk and add an entry to the parent directory for the newly created file. But what happens if the machine crashes after only the first operation is executed? In this circumstance, an inconsistency has been introduced in the filesystem. The number of free inodes has decreased, but no initialization of the inode has been performed on the disk.
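The crash described above can be imitated with a small Python sketch. It is only an illustration: the dictionary layout, the function names and the `crash_after` parameter are all invented here and have nothing to do with real on-disk structures.

```python
class CrashError(Exception):
    """Simulated power failure (hypothetical, for illustration only)."""


def create_file(fs, name, crash_after=None):
    """Apply the three updates of a file create, optionally crashing midway."""
    def take_inode():
        fs["free_inodes"] -= 1

    def init_inode():
        fs["inodes"].append(name)

    def add_dir_entry():
        fs["directory"].append(name)

    for done, step in enumerate([take_inode, init_inode, add_dir_entry], start=1):
        step()
        if crash_after == done:
            raise CrashError(f"crashed after step {done}")


def is_consistent(fs, total_inodes):
    """The free count must agree with the inodes initialized and linked."""
    used = total_inodes - fs["free_inodes"]
    return used == len(fs["inodes"]) == len(fs["directory"])


fs = {"free_inodes": 8, "inodes": [], "directory": []}
try:
    create_file(fs, "a.txt", crash_after=1)
except CrashError:
    pass
print(is_consistent(fs, total_inodes=8))  # prints False
```

The free-inode count says one inode is in use, but no inode was initialized and no directory entry exists, which is exactly the mismatch a consistency check would have to find.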

The only way to detect these inconsistencies is by scanning the entire filesystem. This task is called fsck, filesystem consistency check. In large installations, the consistency check requires a significant amount of time (many hours) to check and fix inconsistencies. As you might have guessed, such downtime is not desirable. A better approach to solve this problem is to avoid introducing inconsistencies in the first place, and this could be accomplished by providing atomicity to operations. Journaling is one way to provide that atomicity.

Simply stated, using journaling is like using a scratch pad. You perform operations on the scratch pad, and once you are satisfied that the operations are correct, you reflect them in a fairer copy.

In the case of filesystems, all the metadata and data are stored on the block device for the filesystem. Journaling filesystems use a journal or the log area as the scratch pad. A journal may be a part of the same block device or it may be a separate device in itself. A journaling filesystem first records all the operations it has performed in the journal. Only once the set of operations that is part of one single atomic operation has completed and been recorded in the journal is it written to the actual block device. Henceforth, the term disk is used to indicate the actual block device, whereas the term journal is used for the log area.


Journal Recovery Scenarios

The example operation from above requires that three blocks be modified—the inode count block, the block containing the on-disk inode and the block holding the directory where the entry is to be added. All of these blocks first are written to the journal. After that, a special block, called the commit record, is written to the journal. The commit record is used to indicate that all the blocks belonging to a single atomic operation are written to the journal.

Given journaling behavior, then, here is how a journaling filesystem reacts in the following three basic scenarios:

  • The machine crashes after only the first block is flushed to the journal. In this case, when the machine comes back up again and checks the journal, it finds an operation with no commit record at the end. This indicates that it may not be a completed operation. Hence, no modifications are done to the disk, preserving the consistency.

  • The machine crashes after the commit record is flushed to the journal. In this case, when the machine comes back up again and checks the journal, it finds an operation with the commit record at the end. The commit record indicates that this is a completed operation and could be written to the disk. All the blocks belonging to this operation are written at their actual locations on the disk, replaying the journal.

  • The machine crashes after all the three blocks are flushed to the journal but the commit record is not yet flushed to the journal. Even in this case, because of the absence of the commit record, no modifications are done to the disk. The scenario thus is reduced to the scenario described in the first case.

Likewise, any other crash scenario could be reduced to any of the scenarios listed above.

Thus, journaling guarantees consistency for the filesystem. The time required for looking up the journal and replaying the journal is minimal as compared to that taken by the filesystem consistency check.
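The three scenarios can be tried out with a short Python sketch. It assumes a toy journal that is just a list of records, with the string `"COMMIT"` standing in for the commit record; the names `replay` and `crash_after` are made up for this example and are not JBD functions.

```python
COMMIT = "COMMIT"  # stand-in for the commit record


def replay(journal, disk):
    """Write journaled blocks to the disk only if their commit record exists."""
    pending = {}
    for record in journal:
        if record == COMMIT:
            disk.update(pending)  # operation is complete: safe to write home
            pending = {}
        else:
            block_no, contents = record
            pending[block_no] = contents
    # Whatever is still pending has no commit record, so it is dropped.
    return disk


updates = [(1, "free_inodes=7"), (2, "inode for a.txt"), (3, "dir entry for a.txt")]


def crash_after(n_records):
    """Return the disk after a crash once n_records reached the journal."""
    full_journal = updates + [COMMIT]
    return replay(full_journal[:n_records], {})


print(crash_after(1))  # first block only: disk untouched
print(crash_after(3))  # all blocks, no commit record: disk untouched
print(crash_after(4))  # commit record present: all three blocks replayed
```

Only the last call modifies the disk, which matches the reduction of every crash point to either "commit record present" or "commit record absent".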


Journaling Block Device

The Linux Journaling Block Device (JBD) provides this scratch pad for providing atomicity in operations. Thus, a filesystem controlling a block device can make use of JBD on the same or on another block device in order to maintain consistency. The JBD is a modular implementation that exposes a set of APIs for the use of such applications. The following sections describe the concepts and implementation of the Linux JBD as is present in the Linux 2.6 kernel.

Before we move on to the implementation details of the JBD, an understanding of some of the objects that JBD uses is required. A journal is a log that internally manages updates for a single block device. As mentioned above, the updates first are stored in the journal and then are reflected to their real locations on the disk. The area belonging to the journal is managed like a circular-linked list. That is, the journal reuses its area when the journal is full.
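The circular reuse of the journal area can be pictured with a minimal Python sketch. The class and its fields (`head`, `tail`, `used`) are assumptions chosen for the illustration, not the fields of the kernel's journal structure.

```python
class CircularJournal:
    """Toy fixed-size log area that wraps around (illustrative only)."""

    def __init__(self, nblocks):
        self.blocks = [None] * nblocks
        self.head = 0  # next slot to be written
        self.tail = 0  # oldest slot that is still needed
        self.used = 0

    def append(self, block):
        """Store one block and return the slot it landed in."""
        if self.used == len(self.blocks):
            raise RuntimeError("journal full: space must be reclaimed first")
        slot = self.head
        self.blocks[slot] = block
        self.head = (self.head + 1) % len(self.blocks)
        self.used += 1
        return slot

    def release(self, count):
        """Give back the oldest `count` slots so that they can be reused."""
        count = min(count, self.used)
        self.tail = (self.tail + count) % len(self.blocks)
        self.used -= count
```

With a three-block journal, the fourth `append` fails until `release` frees the oldest slots, after which new blocks land in slot 0 again.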

A handle represents a single atomic update. The entire set of changes/writes that should be performed atomically are carried out with reference to a single handle.

It may not be an efficient approach to flush each atomic update (handle) to the journal, however. To achieve better performance, the JBD bunches a set of handles together into a transaction and flushes this transaction to the journal. The JBD ensures that the transaction is atomic in nature. Hence, the handles, which are the subcomponents of the transaction, also are guaranteed to be atomic.
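As a rough illustration of the batching, the following Python sketch groups handles into the single running transaction and flushes them together. `Journal`, `Transaction`, `start_handle` and `commit` are hypothetical names for this sketch and do not mirror the JBD API.

```python
class Transaction:
    """A batch of handles that is flushed to the journal as one unit."""

    def __init__(self, tid):
        self.tid = tid
        self.handles = []


class Journal:
    """Toy journal holding at most one running transaction."""

    def __init__(self):
        self.next_tid = 1
        self.running = None
        self.flushed = []  # transactions already written to the journal

    def start_handle(self, name):
        """Attach a new atomic update to the running transaction."""
        if self.running is None:
            self.running = Transaction(self.next_tid)
            self.next_tid += 1
        handle = {"name": name, "blocks": {}}
        self.running.handles.append(handle)
        return handle

    def commit(self):
        """Flush the running transaction, with all its handles, in one go."""
        txn, self.running = self.running, None
        if txn is not None:
            self.flushed.append(txn)
        return txn
```

Two handles started back to back end up in the same transaction, and the next handle after `commit` opens a new transaction with the next ID.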

The most important property of a transaction is its state. When a transaction is being committed, it follows the lifecycle of states listed below.

  1. Running: the transaction currently is live and can accept new handles. In a system only one transaction can be in the running state.

  2. Locked: the transaction does not accept any new handles but existing handles are not complete. Once all the existing handles are completed, the transaction goes to the next state.

  3. Flush: all the handles in a transaction are complete. The transaction is writing itself to the journal.

  4. Commit: the entire transaction log has been written to the journal. The transaction is writing a commit block indicating that the transaction log in the journal is complete.

  5. Finished: the transaction is written completely to the journal. It has to remain there until the blocks are updated to the actual locations on the disk.
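The lifecycle above is a strictly forward-moving state machine, which a few lines of Python can capture. The enum and helper names are invented for this sketch; the kernel uses constants such as T_RUNNING and T_LOCKED instead.

```python
from enum import Enum


class TState(Enum):
    RUNNING = 1
    LOCKED = 2
    FLUSH = 3
    COMMIT = 4
    FINISHED = 5


def advance(state):
    """Move a transaction to the next state; states are never skipped."""
    if state is TState.FINISHED:
        raise ValueError("a finished transaction has no next state")
    return TState(state.value + 1)


def accepts_handles(state):
    """Only a running transaction can take new handles."""
    return state is TState.RUNNING
```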


Transaction Committing and CheckPointing

A running transaction is written to the journal area after a certain period. Thus, a transaction can be either in-memory (running) or on-disk. Flushing a transaction to the journal and marking that particular transaction as finished is a process called transaction commit.

The journal has a limited area under its control, and it needs to reuse this area. Committed transactions that have had all their blocks written to the disk no longer need to be kept in the journal. Checkpointing, then, is the process of flushing the finished transactions to the disk and reclaiming the corresponding space in the journal. It is discussed in more detail later in this article.


Implementation Briefs

The JBD layer performs journaling of the metadata, during which the data simply is written to the disk without being journaled. But this does not stop applications from journaling the data, as it could be presented to the JBD as metadata itself. This document takes the Linux kernel version 2.6.0 as a reference.


Commit

[journal_commit_transaction(journal object)]

A kjournald thread is associated with every journaled device. The kjournald thread ensures that the running transaction is committed after a specific interval. The transaction commit code is divided into phases numbered 0 through 8, described below. Figure 1 shows a logical layout of a journal.

Phase 0: moves the transaction from running state (T_RUNNING) to locked state (T_LOCKED), meaning the transaction no longer can issue new handles. The transaction waits until all the existing handles have completed. A transaction always has a set of buffers reserved for it when it is initiated. Some of these buffers may be unused and are unfiled in this phase. The transaction now is ready to be committed with no outstanding handles.

Phase 1: the transaction enters into the flush state (T_FLUSH). The transaction is marked as a currently committing transaction for the journal. This phase also marks that no running transaction exists for the journal; therefore, new requests for handles initiate a new transaction.

Phase 2: the actual buffers of the transaction are flushed to the disk. Data buffers go first. There are no complications here, as data buffers are not saved in the log area. Instead, they are flushed directly to their actual positions on the disk. This phase ends when the I/O completion notifications for all such buffers are received.

Phase 3: all the data buffers are written to the disk but their metadata still is in volatile memory. Metadata flushing is not as straightforward as data buffer flushing, because metadata needs to be written to the log area and the actual positions on the disk need to be remembered. This phase starts with flushing these metadata buffers, for which a journal descriptor block is acquired. The journal descriptor block stores the mapping of each metadata buffer in the journal to its actual location on the disk in the form of tags. After this, metadata buffers are flushed to the journal. Once the journal descriptor is full of tags or all metadata buffers are flushed to the journal, the journal descriptor also is flushed to the journal. Now we have all the metadata buffers in the journal, and their actual positions on the disk are remembered. This data, being persistent, can be used for recovery if failure occurs.

Phase 4 and Phase 5: both phase 4 and phase 5 wait on I/O completion notifications of metadata buffers and journal descriptor blocks, respectively. The buffers are unfiled from in-memory lists once I/O completion is received.

Phase 6: all the data and metadata are on safe storage, data at its actual locations and metadata in the journal. Now the transaction needs to be marked as committed so that it is known that all its updates are safe in the journal. For this reason, a journal descriptor block again is allocated. A tag is written stating that the transaction has committed successfully, and the block is synchronously written to its position in the journal. After this, the transaction is moved to the committed state, T_COMMIT.

Phase 7: occurs when a number of transactions are present in the journal, without yet being flushed to the disk. Some of the metadata buffers in this transaction already may be a part of some previous transaction. These need not be kept in the older transactions as we have their latest copy in the current committed transaction. Such buffers are removed from older transactions.

Phase 8: the transaction is marked as being in the finished state, T_FINISHED. The journal structure is updated to reflect this particular transaction as the latest committed transaction. It also is added to the list of transactions to be checkpointed.
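The heart of the commit path (data written in place, then tagged metadata in the journal, then the commit block) can be condensed into a Python sketch. Everything here is a simplification made for illustration: the journal is a plain list, the record tuples and the function name `commit_transaction` are invented, a single descriptor block is assumed to hold all tags, and the order of records within the list is not meant to reflect the real on-disk layout.

```python
def commit_transaction(tid, data, metadata, disk, journal):
    """Toy commit: `data` and `metadata` map home block numbers to contents."""
    # Data buffers are not journaled; they go straight to their
    # actual positions on the disk.
    disk.update(data)

    # The descriptor block carries one tag per metadata block, remembering
    # where that block really lives on the disk.
    tags = sorted(metadata)
    journal.append(("descriptor", tid, tags))

    # Copies of the metadata blocks go to the journal, in tag order.
    for block_no in tags:
        journal.append(("metadata", metadata[block_no]))

    # The commit block is written last; only now is the transaction
    # considered safe in the journal.
    journal.append(("commit", tid))


disk, journal = {}, []
commit_transaction(7, {100: "file data"}, {5: "inode", 2: "bitmap"}, disk, journal)
for record in journal:
    print(record)
```

Note that after this call the metadata exists only in the journal; the disk holds nothing but the data block until the transaction is checkpointed or replayed.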


Checkpointing

Checkpointing is initiated when the journal is being flushed to the disk (think of unmount) or when a new handle is started. A new handle can fall short of the guaranteed number of buffers, so it may be necessary to carry out a checkpointing process in order to free some space in the journal.

The checkpointing process flushes to the disk those metadata buffers of a transaction that have not yet been written to their actual locations. The transaction then is removed from the journal. The journal can have multiple checkpointing transactions, and each checkpointing transaction can have multiple buffers. The process considers each checkpointing transaction, and for each transaction, it finds the metadata buffers that need to be flushed to the disk. All these buffers are flushed in one batch. Once all the transactions are checkpointed, their log is removed from the journal.
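A minimal Python sketch of this loop follows. The transaction dictionaries, with their `buffers` and `journal_blocks` keys, and the `checkpoint` function are assumptions made for the example, not kernel structures.

```python
def checkpoint(checkpoint_list, disk):
    """Flush finished transactions, oldest first, and reclaim journal space.

    Each transaction is a dict with `buffers` (home block number -> contents)
    and `journal_blocks` (how many journal blocks its log occupies).
    Returns the number of journal blocks that became reusable.
    """
    reclaimed = 0
    while checkpoint_list:
        txn = checkpoint_list.pop(0)
        disk.update(txn["buffers"])  # all buffers of the transaction, one batch
        reclaimed += txn["journal_blocks"]  # its log is no longer needed
    return reclaimed


pending = [
    {"tid": 1, "buffers": {5: "inode v1", 2: "bitmap v1"}, "journal_blocks": 4},
    {"tid": 2, "buffers": {5: "inode v2"}, "journal_blocks": 3},
]
disk = {}
print(checkpoint(pending, disk))  # prints 7
print(disk[5])                    # prints inode v2
```

Processing the oldest transaction first means that when two transactions touched the same block, the newer copy is the one left on the disk.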


Recovery

[journal_recover(journal object)]

When the system comes up after a crash and sees that the log entries are not null, this indicates that the last unmount was not successful or never occurred. At this point, you need to attempt a recovery. Figure 2 depicts a sample physical layout of a journal. The recovery takes place in three phases.

  1. PASS_SCAN: the end of the log is found.

  2. PASS_REVOKE: a list of revoked blocks is prepared from the log.

  3. PASS_REPLAY: unrevoked blocks are rewritten (replayed) in order to guarantee the consistency of the disk.

For recovery, the available information is provided in terms of the journal. But the exact state of the journal is unknown, as we do not know the point at which the system crashed. Hence, the last transaction could be in the checkpointing or committing state. A running transaction cannot be found, as it was only in the memory.

For committing transactions, we have to forget the updates made, as all of the updates may not be in place. So in the PASS_SCAN phase, the last log entry in the log is found. From here, the recovery process knows which transactions need to be replayed.

Every transaction can have a set of revoked blocks. This is important to know in order to prevent older journal records from being replayed on top of newer data using the same block. In PASS_REVOKE, a hash table of all these revoked blocks is prepared. This table is used every time we need to find out whether a particular block should get written to the disk through a replay.

In the last phase, all the blocks that need to be replayed are considered. Each block is tested for its presence in the revoked blocks' hash table. If the block is not in there, it is safe to write the block to its actual location on the disk. If the block is there, only the newest version of the block is written to the disk. Notice that we have not changed anything in the on-disk journal. Hence, even if the system crashes again while the recovery is in progress, no harm is done. The same journal is present for the recovery next time, and no non-idempotent operation is performed during the process of recovery.
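The three passes can be sketched in Python as follows. The log format is invented for the example: each transaction is a dict with a `tid`, a `committed` flag, the `blocks` it logged and the set of block numbers it `revoked`. The rule that a revoke suppresses copies logged in the revoking transaction or any earlier one is this sketch's assumption about how "only the newest version" is selected.

```python
def recover(log, disk):
    """Toy three-pass recovery over a list of logged transactions."""
    # PASS_SCAN: find the end of the log, that is, the run of
    # transactions whose commit record made it to the journal.
    valid = []
    for txn in log:
        if not txn["committed"]:
            break
        valid.append(txn)

    # PASS_REVOKE: build a table of revoked blocks, remembering the
    # newest transaction that revoked each one.
    revoked = {}
    for txn in valid:
        for block_no in txn["revoked"]:
            revoked[block_no] = max(revoked.get(block_no, txn["tid"]), txn["tid"])

    # PASS_REPLAY: write blocks home unless a revoke record covers them.
    for txn in valid:
        for block_no, contents in txn["blocks"].items():
            if block_no in revoked and txn["tid"] <= revoked[block_no]:
                continue  # stale copy, must not overwrite newer data
            disk[block_no] = contents
    return disk


log = [
    {"tid": 1, "committed": True, "blocks": {10: "old", 11: "x"}, "revoked": set()},
    {"tid": 2, "committed": True, "blocks": {12: "y"}, "revoked": {10}},
    {"tid": 3, "committed": True, "blocks": {10: "new"}, "revoked": set()},
    {"tid": 4, "committed": False, "blocks": {13: "z"}, "revoked": set()},
]
print(recover(log, {}))
```

Because the log itself is never modified, calling `recover` a second time on the same log, as would happen after a crash during recovery, produces the same disk contents.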

Amey Inamdar (www.geocities.com/amey_inamdar) is a kernel developer working at Kernel Corporation. His interest areas include filesystems and distributed systems.

Kedar Sovani (www.geocities.com/kedarsovani) works for Kernel Corporation as a kernel developer. His areas of interest include filesystems and storage technologies.

Copyright (c) 2004-2006 Kedar Sovani and Amey Inamdar

Anonymous (not verified)
on
June 22, 2006 - 4:11am

So how does the journal get written atomically?

Martin Ebourne (not verified)
on
June 22, 2006 - 6:53am

It doesn't, that's the whole point. It doesn't need to.

The only thing that ever needs to be written atomically is the commit record in the journal. This is just a marker, so it easily fits within a single block.

FWIW...

on
June 22, 2006 - 10:24am

FWIW, this is a common strategy in other scenarios as well. For instance, RCU works in a similar way. RCU snapshots a structure that may have multiple readers to a working copy. It then does all its updates to the working copy. Finally, it changes the pointer to the structure to point to the updated version, so that all subsequent readers see the updated version, with all the updates appearing "atomically."

It's a particularly useful strategy.

Journaling Block device vs Journaling fs?

on
June 22, 2006 - 11:49am

So does this obviate the need for a journalling filesystem? You could set aside a 'journal' partition on your disk and then run whatever fs you want in the other partitions... and they're all block-journalled even if they're not fs-journalled.

RE-journaling Block device vs Journaling fs?

Amit Mitkar (not verified)
on
June 22, 2006 - 1:04pm

@pj

No, it's actually a facility meant to be used by a filesystem for its own journaling purposes. So, you could use the jbd layer to add a journaling feature to any filesystem that needs to use it. It's not automatic; the filesystem must be explicitly designed to use the facility.

on
June 23, 2006 - 3:48am

@pj
The journaling block device, in itself, cannot understand what set of blocks are a part of a single transaction. This information is very specific to the filesystem running on top. But once the jbd is informed of this (by the filesystem), jbd can deal with the rest of the journaling intricacies, and the filesystem need not bother.

Nice!

Anonymous (not verified)
on
June 22, 2006 - 4:03pm

Excellent article.

I believe that this is one of the first non-Jeremy articles since he extended the Google offer? Hope we get treated to more like this, it's a beaut! :)

Dennis

Anonymous (not verified)
on
June 26, 2006 - 4:42am

Firstly, thanks a lot for this valuable article.

Well, I think we can conclude that the idea behind journalling is just to have the new data (updated one) written to the storage without touching the old one. Then, once we have ensured that it has been successfully committed, the internal filesystem meta-data are just updated to point to the new version.

So, why buffer and then commit? I would say this happens only when the JBD is not the last storage device, right?

By the way, are log-structured FS and journalling FS different, or are they two names for the same thing?

thanks again.

on
June 26, 2006 - 1:36pm

Well, the difference between log-structured and journaling FS is that the scratch-pad operations for log-structured FS are performed on the FS itself as log entries. So once a transaction is committed, all the FS metadata points to the new log entry, while at the same time preserving the old on-disk log, which makes time-travelling within the FS possible. In a way the whole FS is a journal.

The thing I don't get is what happens when JBD is used in metadata journaling mode (ordered-data mode, ext3) and we are executing journal_commit_transaction. What if the crash occurs after Step-2, when the data buffers have been flushed to their on-disk locations and we haven't updated the metadata in the journal yet? (Consider this a file overwrite operation.) How is the consistency maintained then??

Re : Well the difference between

Amit Mitkar (not verified)
on
June 27, 2006 - 1:13pm

@prabir.

In this case the only guarantee that is given is of meta-data level consistency. So, even though you have data blocks written, their metadata is not updated, which is fs consistent :). So, in your example, though a file is being overwritten, maybe its existing (data) blocks will get changed. Any new blocks added to the file will be lost, but the filesystem is still consistent at the metadata level (since that's the journaling mode you chose).

HTH Amit.

what is the location of journal?

on
March 30, 2007 - 3:55pm

Where exactly is the journal stored? What if the journal is being updated at the time of the crash, and the data and metadata to be entered into the journal are inconsistent? What happens then??
Is it possible that a situation would arise, by any chance, in which the journal itself is deleted? And if so, how is the journal recovered?

wengang wang (not verified)
on
June 12, 2008 - 10:17pm

It seems that the state COMMIT means flushing the control blocks (descriptor, commit) and metadata blocks to the on-disk journal, not only the commit block.
Maybe the difference is caused by changes in the jbd source code?
