Address Space operations

Submitted by Kedar Sovani
on March 30, 2005 - 10:35am

Novice filesystem developers often confuse the ->prepare_write, ->commit_write and ->writepage functions in the address space operations. The same question was raised on the linux-fsdevel mailing list. Here is the response.

From: Nikita Danilov [email blocked]
To: linux-fsdevel
Date: Wed, 30 Mar 2005 17:55:16 +0400
Subject: Re: Address space operations questions

Martin Jambor writes:
> Hi,
>
> I have problems understanding the purpose of different entries of
> struct address_space_operations in 2.6 kernels:
>
> 1. What is bmap for and what is it supposed to do?

->bmap() maps a logical block offset within an "object" to a physical block number. It is used in a few places, notably in the implementation of the FIBMAP ioctl.

> 2. What is the difference between sync_page and write_page?

(It is spelt ->writepage(), by the way.)

->sync_page() is an awful misnomer. Usually, when a page IO operation is requested by calling ->writepage() or ->readpage(), the file system queues an IO request (e.g., a disk-based file system may do this by calling submit_bio()), but the underlying device driver does not proceed with this IO immediately, because IO scheduling is more efficient when there are multiple requests in the queue.

Only when something really wants to wait for IO completion (wait_on_page_{locked,writeback}() are used to wait for read and write completion respectively) is the IO queue processed. To do this, wait_on_page_bit() calls ->sync_page() (see block_sync_page(), the standard implementation of ->sync_page() for disk-based file systems).

So, the semantics of ->sync_page() are roughly "kick the underlying storage driver to actually perform all IO queued for this page and, maybe, for other pages on this device too".

> 3. What exactly (fs independent) is the relation in between
> write_page, prepare_write and commit_write? Does prepare make sure a
> page can be written (like allocating space), commit mark it dirty and
> write write it sometime later on?

->prepare_write() and ->commit_write() are only used by generic_file_write() (so, one may argue that they shouldn't be placed into struct address_space_operations at all).

generic_file_write() has a loop over each page overlapping with the portion of the file that the write goes into:

    a_ops->prepare_write(file, page, from, to);
    copy_from_user(...);
    a_ops->commit_write(file, page, from, to);

If a page is partially overwritten, ->prepare_write() has to read the parts of the page that are not covered by the write. ->commit_write() is expected to mark the page (or buffers) and the inode dirty, and to update the inode size if the write extends the file.

As for block allocation and transaction handling, this is up to the file system back end.

Usually ->commit_write() doesn't start IO by itself, it just marks pages dirty. Write-out is done by balance_dirty_pages_ratelimited(): when the number of dirty pages in the system exceeds some threshold, the kernel calls ->writepages() of dirty inodes.

->writepage() is used in two places:

- by the VM scanner, to write out a dirty page from the tail of the inactive list. This is the "rare" path, because balance_dirty_pages() is supposed to keep the amount of dirty pages under control.

- by mpage_writepages(): the default implementation of the ->writepages() method.

> Thank you very much for any insight,
>
> Martin

Hope this helps.

Nikita.
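The write loop Nikita describes can be modelled in plain userspace C. The following is only a sketch under stated assumptions, not kernel code: every toy_* name, the 16-byte page size, and the in-memory "disk" are hypothetical stand-ins invented for illustration, and memcpy() takes the place of copy_from_user(). It shows the prepare step reading in the uncovered part of a partially overwritten page, the commit step marking things dirty and extending the file size, and a separate write-out step that runs later.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16u /* tiny hypothetical page size, easy to trace */
#define TOY_NPAGES 4u

/* Hypothetical stand-ins for struct page and for an inode with its page cache. */
struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    int uptodate; /* contents reflect the backing store */
    int dirty;    /* must be written back later */
};

struct toy_file {
    unsigned char disk[TOY_NPAGES * TOY_PAGE_SIZE]; /* "backing store" */
    struct toy_page cache[TOY_NPAGES];              /* "page cache" */
    size_t size;                                    /* models the inode size */
    int inode_dirty;
};

/* Models ->prepare_write(): on a partial overwrite, read the rest of the page. */
static int toy_prepare_write(struct toy_file *f, size_t idx, size_t from, size_t to)
{
    struct toy_page *pg = &f->cache[idx];
    if (!pg->uptodate && (from > 0 || to < TOY_PAGE_SIZE)) {
        memcpy(pg->data, f->disk + idx * TOY_PAGE_SIZE, TOY_PAGE_SIZE);
        pg->uptodate = 1;
    }
    return 0;
}

/* Models ->commit_write(): mark page and inode dirty, extend size; no IO here. */
static int toy_commit_write(struct toy_file *f, size_t idx, size_t from, size_t to)
{
    size_t end = idx * TOY_PAGE_SIZE + to;
    (void)from;
    f->cache[idx].uptodate = 1;
    f->cache[idx].dirty = 1;
    f->inode_dirty = 1;
    if (end > f->size)
        f->size = end;
    return 0;
}

/* Models the per-page loop in generic_file_write(). Returns bytes written. */
static size_t toy_file_write(struct toy_file *f, const void *buf, size_t count, size_t pos)
{
    const unsigned char *src = buf;
    size_t written = 0;
    while (written < count) {
        size_t idx = pos / TOY_PAGE_SIZE;
        size_t from = pos % TOY_PAGE_SIZE;
        size_t chunk = TOY_PAGE_SIZE - from;
        size_t to;
        if (idx >= TOY_NPAGES)
            break;
        if (chunk > count - written)
            chunk = count - written;
        to = from + chunk;
        if (toy_prepare_write(f, idx, from, to) != 0)
            break;
        /* memcpy() stands in for copy_from_user() */
        memcpy(f->cache[idx].data + from, src + written, chunk);
        if (toy_commit_write(f, idx, from, to) != 0)
            break;
        written += chunk;
        pos += chunk;
    }
    return written;
}

/* Models ->writepage(): the deferred write-out of one dirty page. */
static void toy_writepage(struct toy_file *f, size_t idx)
{
    if (!f->cache[idx].dirty)
        return;
    memcpy(f->disk + idx * TOY_PAGE_SIZE, f->cache[idx].data, TOY_PAGE_SIZE);
    f->cache[idx].dirty = 0;
}
```

Writing 8 bytes at offset 12 with toy_file_write() touches two pages, both partially, so both are first filled from the "disk" and then marked dirty. The "disk" itself only changes once toy_writepage() is called, which mirrors the point that ->commit_write() does not start IO by itself.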

More on sync_page

on April 1, 2005 - 4:24am

Trond Myklebust explained more about the sync_page operation. He introduced ->sync_page() in the 2.4.x series.

From: Trond Myklebust [email blocked]
To: linux-kernel
Date: Apr 1, 2005 3:23 AM
Subject: Re: Address space operations questions

On Thu, 31.03.2005 at 13:40 (-0800), Bryan Henderson wrote:
> >what it
> >*really* means to be called in sync_page() is that you're being told
> >that some process is about to block on that page. For what reason, you
> >can't know from the call alone.
>
> Ugh. IOW it barely means anything.

It reflects the fact that the page lock can be held for a variety of reasons, some of which require you to kick the filesystem and some which don't.

I introduced the sync_page() call in 2.4.x partly in order to get rid of all those pathetic hard-coded calls to "run_task_queue(&tq_disk)" that used to litter the 2.4.x mm code (and still do in some places). As far as NFS is concerned, they are a useless distraction since only the block code uses the tq_disk queue.

The other reason was that the NFS client itself had to defer actually putting reads on the wire until someone requested the lock: there was no equivalent of the "readpages()" call, so when we wanted to coalesce more than 1 page worth of data into a single read call, we had to exit readpage() without actually starting I/O in the hope that the readahead code would then schedule a readpage() on a neighbouring page.

Cheers,
  Trond
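The "queue now, kick when somebody blocks" behaviour described in both replies can be illustrated with a small userspace model. This is a rough sketch under assumptions, not how the kernel implements it: all toy_* names and the fixed-size queue are hypothetical, IO "completes" synchronously when the queue is kicked, and nothing ever sleeps.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 8 /* hypothetical queue depth */

/* Hypothetical stand-in for a page whose lock is held while IO is pending. */
struct toy_io_page {
    int locked;
};

struct toy_queue {
    struct toy_io_page *pending[TOY_QUEUE_MAX];
    size_t npending;
    unsigned batches; /* how many times the "driver" was kicked */
};

/* Models ->readpage()/->writepage(): queue the request, do not start IO yet. */
static int toy_submit(struct toy_queue *q, struct toy_io_page *pg)
{
    if (q->npending == TOY_QUEUE_MAX)
        return -1;
    pg->locked = 1;
    q->pending[q->npending++] = pg;
    return 0;
}

/* Models ->sync_page(): kick the driver so that all queued IO is performed. */
static void toy_sync_page(struct toy_queue *q)
{
    size_t i;
    if (q->npending == 0)
        return;
    for (i = 0; i < q->npending; i++)
        q->pending[i]->locked = 0; /* IO "completes", page is unlocked */
    q->npending = 0;
    q->batches++;
}

/* Models wait_on_page_locked(): a waiter is about to block, so kick the queue. */
static void toy_wait_on_page(struct toy_queue *q, struct toy_io_page *pg)
{
    if (pg->locked)
        toy_sync_page(q);
    /* A real kernel would sleep here until the page is unlocked. */
}
```

Submitting several pages leaves them all locked with no IO done. The first toy_wait_on_page() call then processes the whole batch in one go, which is the efficiency argument for delaying the IO in the first place.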
