We have a large RAM area on a PCI board (think of a custom framebuffer
type application). We're using 2.6.20.
We have the PCI RAM mapped into kernel space, and know the physical addresses.
We have a raw partition on the block device which we reserve for this.
We want to be able to stick the contents of a selected portion of PCI RAM
onto a block device (disk). Past incarnations modified the disk driver and
developed a special API so the custom driver constructed scatter/gather
lists and fed them to the disk driver (bypassing the elevator algorithm, to
execute as the "next request").
What I'm looking for is a more generic/driver-independent way of sticking
the contents of PCI RAM onto a disk.
Is offset + length of each bio_vec < pagesize?
What's the best way to do this (much of my data is already in physically
contiguous memory [and mapped into virtual memory])?
Any good examples to look at?
marty
--
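On the bio_vec question: as far as I know, in kernels of this era each
bio_vec describes one piece of a single page, so offset + length is at most
the page size (equal when the piece runs to the end of the page). Below is a
user-space sketch of how a physically contiguous region splits into such
pieces. The 4 KiB page size and the names segment and split_into_segments
are made up for illustration; none of this is kernel API.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* One per-page piece of a buffer, mirroring a bio_vec's (page, offset, len). */
struct segment {
	size_t page_index;	/* which page of the region */
	unsigned int offset;	/* start within that page */
	unsigned int len;	/* bytes in this piece */
};

/*
 * Split `bytes` bytes, starting `first_offset` bytes into page 0, into
 * per-page segments. Returns the number of segments written to `out`
 * (at most `max`).
 */
static size_t split_into_segments(unsigned int first_offset, size_t bytes,
				  struct segment *out, size_t max)
{
	size_t n = 0, page = 0;
	unsigned int offset = first_offset;

	assert(first_offset < SKETCH_PAGE_SIZE);
	while (bytes && n < max) {
		unsigned int room = SKETCH_PAGE_SIZE - offset;
		unsigned int len = bytes < room ? (unsigned int)bytes : room;

		out[n].page_index = page++;
		out[n].offset = offset;
		out[n].len = len;
		n++;
		bytes -= len;
		offset = 0;	/* only the first page starts mid-page */
	}
	return n;
}
```

Only the first segment can start at a non-zero offset and only the last can
be shorter than the room left in its page, so offset + len never exceeds
the page size.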
You don't need it mapped into virtual memory. Whether the data is contig
Apart from where you get your memory from, you can easily use the
generic infrastructure for this. Something ala:
void my_end_io_function(struct bio *bio, int err)
{
	/*
	 * whatever you need to do here, once you get this call IO is
	 * done for that bio. put bio at the end to free it again.
	 */
	...
	bio_put(bio);
}
int write_my_data(struct block_device *bdev, sector_t sector, unsigned int bytes)
{
	struct bio *bio = NULL;
	struct page *page;
	unsigned int offset, length;
	offset = first_page_offset;
	while (bytes) {
		if (!bio) {
			/* count the partial first page too, and stay within one bio */
			unsigned int npages = (offset + bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
			if (npages > BIO_MAX_PAGES)
				npages = BIO_MAX_PAGES;
			bio = bio_alloc(GFP_KERNEL, npages);
			bio->bi_sector = sector;
			bio->bi_bdev = bdev;
			bio->bi_end_io = my_end_io_function; /* called on io end */
			bio->bi_private = some_data; /* if my_end_io_function wants that */
		}
		page = some_func_to_return_you_a_page_in_the_pci_mem(sector);
		/* a segment must not run past the end of its page */
		length = min_t(unsigned int, bytes, PAGE_SIZE - offset);
		/* if this fails, we can't map more at this offset. send
		 * what we have and force a new bio alloc at the top of
		 * the loop, then retry the same page
		 */
		if (!bio_add_page(bio, page, length, offset)) {
			if (!bio->bi_size) {
				/* even an empty bio can't take it, give up */
				bio_put(bio);
				return -EIO;
			}
			submit_bio(WRITE, bio);
			bio = NULL;
			continue;
		}
		bytes -= length;
		sector += length >> 9;
		offset = 0;
	}
	/* send off whatever is left in the last bio */
	if (bio)
		submit_bio(WRITE, bio);
	return 0;
}
totally untested, just typed into this email. So probably full of typos,
but you should get the ...
--
Ermm seriously why not have a userspace task with the PCI RAM mmapped and
just use write() like normal sane people do?
Alan
--
To avoid the fault and copy, I would assume.
Jens Axboe
--
On Fri, 26 Sep 2008 11:11:35 +0200
It's a write to a raw partition so with O_DIRECT you won't have to copy and
MAP_POPULATE will premap the object if even the first write wants to occur
without faulting overhead.
Alan
--
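A minimal user-space sketch of the mmap-and-write approach Alan describes.
It assumes the PCI driver exposes a device node whose mmap() gives access to
the board RAM, and that the caller opened the destination itself (with
O_DIRECT for the no-copy behaviour described above, which also brings the
usual alignment requirements). Whether O_DIRECT can pin the pages of such a
mapping depends on how the driver implements mmap; that is not settled in
this thread. The name write_mapped_region is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map `len` bytes of src_fd (standing in for the PCI RAM device node) and
 * write them to dst_fd at byte offset dst_off. Returns 0 or -errno.
 */
static int write_mapped_region(int src_fd, int dst_fd, size_t len, off_t dst_off)
{
	size_t done = 0;
	int ret = 0;
	char *src;

	assert(len > 0);

	/* MAP_POPULATE asks the kernel to prefault the mapping up front. */
	src = mmap(NULL, len, PROT_READ, MAP_SHARED | MAP_POPULATE, src_fd, 0);
	if (src == MAP_FAILED)
		return -errno;

	while (done < len) {
		ssize_t n = pwrite(dst_fd, src + done, len - done,
				   dst_off + (off_t)done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, just retry */
			ret = -errno;
			break;
		}
		if (n == 0) {
			ret = -EIO;	/* no progress, avoid spinning */
			break;
		}
		done += (size_t)n;
	}

	munmap(src, len);
	return ret;
}
```

Each pwrite() on an O_DIRECT descriptor still has to pin the user pages,
which is the get_user_pages() cost raised in the reply that follows.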
You are still going through get_user_pages() for each write. As I would
imagine the writes would generally be large, the hit would not be too bad
(but it's still there). Depending on the hardware, it may or may not be a
big deal. But the path from device to disk is definitely a lot bigger and
more complex with the mmap/write approach.
Another alternative would be using splice - if the pci device exposed a
char device node, you could support ->splice_read() there which would just
fill the pages into the pipe buffer. Then change the block device fops
->splice_write() to go direct to the block device through a bio instead of
using the page cache based generic_file_splice_write(). Such a change would
actually make sense to do, if the block device has been opened with
O_DIRECT.
And it would get you about the same performance as doing it in-kernel, the
only extra overhead would be two syscalls per 64k (well probably only one
extra syscall, since you probably need an ioctl/syscall to initiate the
in-kernel activity as well). So just about as free as you could get.
Jens Axboe
--
Something like this, totally untested but should get the point across.
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 57e2786..fd06032 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -24,6 +24,7 @@
 #include <linux/uio.h>
 #include <linux/namei.h>
 #include <linux/log2.h>
+#include <linux/splice.h>
 #include <asm/uaccess.h>
 #include "internal.h"
@@ -1224,6 +1225,77 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return blkdev_ioctl(file->f_mapping->host, file, cmd, arg);
 }
+static void block_splice_end_io(struct bio *bio, int err)
+{
+	bio_put(bio);
+}
+
+static int pipe_to_disk(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+			struct splice_desc *sd)
+{
+	struct block_device *bdev = I_BDEV(sd->u.file->f_mapping->host);
+	struct bio *bio;
+	int ret, bs;
+
+	bs = queue_hardsect_size(bdev_get_queue(bdev));
+	if (sd->pos & (bs - 1))
+		return -EINVAL;
+
+	ret = buf->ops->confirm(pipe, buf);
+	if (unlikely(ret))
+		return ret;
+
+	bio = bio_alloc(GFP_KERNEL, 1);
+	bio->bi_sector = sd->pos / bs;
+	bio->bi_bdev = bdev;
+	bio->bi_end_io = block_splice_end_io;
+
+	bio_add_page(bio, buf->page, buf->len, buf->offset);
+
+	submit_bio(WRITE, bio);
+	return buf->len;
+}
+
+/*
+ * Splice to file opened with O_DIRECT. Bypass caching completely and
+ * just go direct-to-bio
+ */
+static ssize_t __block_splice_write(struct pipe_inode_info *pipe,
+				    struct file *out, loff_t *ppos, size_t len,
+				    unsigned int flags)
+{
+	struct splice_desc sd = {
+		.total_len = len,
+		.flags = flags,
+		.pos = *ppos,
+		.u.file = out,
+	};
+	struct inode *inode = out->f_mapping->host;
+	ssize_t ret;
+
+	if (unlikely(*ppos & 511))
+		return -EINVAL;
+
+	inode_double_lock(inode, pipe->inode);
+	ret = __splice_from_pipe(pipe, &sd, pipe_to_disk);
+	inode_double_unlock(inode, pipe->inode);
+
+	if (ret > 0)
+		*ppos += ret;
+
+	return ret;
+}
+
+static ssize_t block_splice_write(struct pipe_inode_info ...
--
Also:
a) to deal with interrupts from the hardware
b) using legacy code/design/architecture
The splice approaches look very interesting...thanks...
marty
--
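The patch above rejects file positions that are not aligned to the hardware
sector size by testing against a mask, which only works when the block size
is a power of two. A small user-space sketch of that check and of the
byte-position-to-sector conversion follows. As far as I know the block
layer counts bi_sector in 512-byte units regardless of the hardware sector
size, so the sketch shifts by 9; treat that as my reading rather than
something stated in the thread. The names is_aligned and pos_to_sector are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT_512 9	/* assumption: sector numbers count 512-byte units */

/* True if `value` is a multiple of the block size `bs`. */
static int is_aligned(uint64_t value, uint32_t bs)
{
	uint32_t mask = bs - 1;

	assert(bs && (bs & mask) == 0);	/* the mask trick needs a power of two */
	return (value & mask) == 0;
}

/* Convert a byte position into a 512-byte sector number. */
static uint64_t pos_to_sector(uint64_t pos)
{
	return pos >> SECTOR_SHIFT_512;
}
```

The same test applies to anything that has to line up with the device
block size, for example a buffer length or an offset within a page.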
Just for kicks, I did the read part of the fast bdev interface as well.
As with the write, it's totally untested (apart from compiled). Just in
case anyone is curious... I plan to do a bit of testing on this this
week.
IMHO, this interface totally rocks. It's really async like splice was
intended, and it's fast too. I may have to look into some generic IO
mechanism to unify them all, O_DIRECT/page cache/splice. Famous last
words, I'm sure.
diff --git a/fs/block_dev.c b/fs/block_dev.c
index aff5421..f8df781 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -24,6 +24,7 @@
 #include <linux/uio.h>
 #include <linux/namei.h>
 #include <linux/log2.h>
+#include <linux/splice.h>
 #include <asm/uaccess.h>
 #include "internal.h"
@@ -1155,6 +1156,264 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return blkdev_ioctl(file->f_mapping->host, file, cmd, arg);
 }
+static void block_splice_write_end_io(struct bio *bio, int err)
+{
+	bio_put(bio);
+}
+
+static int pipe_to_disk(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+			struct splice_desc *sd)
+{
+	struct block_device *bdev = I_BDEV(sd->u.file->f_mapping->host);
+	struct bio *bio;
+	int ret, bs;
+
+	bs = queue_hardsect_size(bdev_get_queue(bdev));
+	if (sd->pos & (bs - 1))
+		return -EINVAL;
+
+	ret = buf->ops->confirm(pipe, buf);
+	if (unlikely(ret))
+		return ret;
+
+	bio = bio_alloc(GFP_KERNEL, 1);
+	bio->bi_sector = sd->pos / bs;
+	bio->bi_bdev = bdev;
+	bio->bi_end_io = block_splice_write_end_io;
+
+	bio_add_page(bio, buf->page, buf->len, buf->offset);
+
+	submit_bio(WRITE, bio);
+	return buf->len;
+}
+
+/*
+ * Splice to file opened with O_DIRECT. Bypass caching completely and
+ * just go direct-to-bio
+ */
+static ssize_t __block_splice_write(struct pipe_inode_info *pipe,
+				    struct file *out, loff_t *ppos, size_t len,
+				    unsigned int flags)
+{
+	struct splice_desc sd = {
+		.total_len = len,
+		.flags = flags,
+		.pos = ...
--
Alright, so this one actually works :-)
Apart from fixing the bugs in it, it's also more clever in using the bio
for the write part. It'll reuse the same bio in the splice actor until
it's full, only then submitting it and allocating a new one. The read
part works the same way.
diff --git a/fs/block_dev.c b/fs/block_dev.c
index aff5421..1e807a3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -24,6 +24,7 @@
 #include <linux/uio.h>
 #include <linux/namei.h>
 #include <linux/log2.h>
+#include <linux/splice.h>
 #include <asm/uaccess.h>
 #include "internal.h"
@@ -1155,6 +1156,346 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return blkdev_ioctl(file->f_mapping->host, file, cmd, arg);
 }
+static void block_splice_write_end_io(struct bio *bio, int err)
+{
+	bio_put(bio);
+}
+
+/*
+ * No need going above PIPE_BUFFERS, as we cannot fill that anyway
+ */
+static inline unsigned len_to_max_pages(unsigned int len)
+{
+	unsigned pages = (len + PAGE_SIZE - 1) / PAGE_SIZE;
+
+	return min_t(unsigned, pages, PIPE_BUFFERS);
+}
+
+/*
+ * A bit of state data, to allow us to make larger bios
+ */
+struct block_splice_data {
+	struct file *file;
+	struct bio *bio;
+};
+
+static int pipe_to_disk(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+			struct splice_desc *sd)
+{
+	struct block_splice_data *bsd = sd->u.data;
+	struct block_device *bdev = I_BDEV(bsd->file->f_mapping->host);
+	unsigned int mask;
+	struct bio *bio;
+	int ret;
+
+	mask = queue_hardsect_size(bdev_get_queue(bdev)) - 1;
+	if ((sd->pos & mask) || (buf->len & mask) || (buf->offset & mask))
+		return -EINVAL;
+
+	ret = buf->ops->confirm(pipe, buf);
+	if (unlikely(ret))
+		return ret;
+
+	bio = bsd->bio;
+	if (!bio) {
+new_bio:
+		bio = bio_alloc(GFP_KERNEL, len_to_max_pages(sd->total_len));
+		bio->bi_sector = sd->pos;
+		do_div(bio->bi_sector, mask + 1);
+		bio->bi_bdev = bdev;
+		bio->bi_end_io = block_splice_write_end_io;
+		bsd->bio = ...
--
Hello Jens,
I have been following this thread trying to grasp a very nifty use case
(high speed acquisition and storage of data) of splice. I think it would
make a perfect example of splice functionality. What would the user space
part look like to exercise this interface?
And whoever writes Linux Device Drivers 4th edition or one of the kernel
books; make sure this topic is in :-)
Regards,
Leon
--
Download:
http://brick.kernel.dk/snaps/splice-git-latest.tar.gz
which has lots of little examples for splice. You would want to do
something ala
# splice-in /dev/my-pci-device | splice-out /dev/sda
in one app of course, but take a look at the examples and get a feel for
the interface...
BTW, in my splice branch I have this queued as well. Not going anywhere
for now, but should get updated and tested every now and then.
http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=refs/heads/splice
Jens Axboe
--
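To make the "in one app" suggestion concrete, here is a hedged user-space
sketch that folds the splice-in | splice-out pipeline into one function
using splice(2) and an internal pipe. It is not taken from the splice
examples tarball; the name splice_copy and the 64 KiB chunk size are my own
choices. In real use in_fd would be the PCI device node (whose driver would
need ->splice_read() support, as discussed earlier) and out_fd the block
device; without the patch in this thread, the block device side would still
go through the page cache.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK (64 * 1024)	/* matches the default Linux pipe capacity */

/*
 * Move up to `len` bytes from in_fd to out_fd through a pipe, without
 * copying the data through user space. Returns bytes moved or -errno.
 */
static ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
	int pfd[2];
	ssize_t total = 0;

	if (pipe(pfd) < 0)
		return -errno;

	while (len) {
		size_t want = len < CHUNK ? len : CHUNK;
		/* source -> pipe */
		ssize_t in = splice(in_fd, NULL, pfd[1], NULL, want,
				    SPLICE_F_MOVE);

		if (in < 0) {
			total = -errno;
			break;
		}
		if (in == 0)
			break;	/* end of input */
		len -= (size_t)in;

		/* pipe -> destination; drain everything just queued */
		while (in > 0) {
			ssize_t out = splice(pfd[0], NULL, out_fd, NULL,
					     (size_t)in, SPLICE_F_MOVE);

			if (out <= 0) {
				total = out < 0 ? -errno : -EIO;
				goto done;
			}
			in -= out;
			total += out;
		}
	}
done:
	close(pfd[0]);
	close(pfd[1]);
	return total;
}
```

The inner loop matters because the second splice() may move less than was
queued; leaving data in the pipe would stall the next round.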
