[RFC] An alternative interface to device mapper

From: Daniel Phillips
Date: Wednesday, March 5, 2008 - 2:29 am

Q: If you have an axe with a rusty head and a rotten handle, how do you 
repair it?

A: It is a two-step process.  First you replace the handle, then you 
replace the head.

At one time, volume management on Linux was new and shiny.  Today it has 
fallen well behind Sun, FreeBSD, NetApp and even Microsoft.  In order 
to avoid losing even more of the storage "market" than we have already 
lost, we must make a concerted effort to catch up with the state of the 
art, or ideally take the lead as has proved possible with so many other 
aspects of Linux.

This RFC is about replacing the handle of our not-so-shiny LVM axe.  
Design goals for this alternative "ddsetup" interface are:

  * Convenient to embed in a C program
  * Make it simple enough that a library is unnecessary
  * Support for creating detailed, accurate error messages
  * Error messages delivered to caller rather than logged
  * Naturally extensible as new requirements emerge
  * 32-bit ABI works on a 64-bit kernel without translation
  * Avoid bad API practices identified by [ARND 07]
  * Do not break the existing ioctl interface

The patch below includes a new kernel interface generator called ddlink.  
The ddsetup device mapper interface is an instance of a ddlink 
interface, instantiated by supplying domain-specific methods for read, 
write, ioctl and poll.

In more detail: ddlink is a generic pipe-like interface for controlling 
device drivers.  It was inspired by Trond's venerable and successful 
rpc-pipefs, which he invented to control various aspects of NFS server 
and client operation.  ddlink takes the form of a virtual filesystem 
with no namespace.  It provides application programs with fd objects 
that can be read, written, ioctled and polled, suitable for efficient 
binary communication with kernel components.  Read, write and poll 
operations act similarly to a pipe.  Unlike a pipe, there is no write 
buffering.  Each write to a ddlink directly triggers some kernel 
handler.  Reads are buffered via ...
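[Editorial sketch, not part of the original post or patch.]  The write
and read semantics described above can be modelled in plain user-space C:
every write runs a handler before returning, and reads only drain
whatever the handler queued.  All names here (struct model_link,
model_write, model_read, model_ping_handler) are hypothetical, and the
flat reply buffer is an assumption, since the post is cut off before it
says how reads are actually buffered.

```c
/* User-space model of the described semantics; NOT the kernel ddlink code.
 * Every identifier below is hypothetical and exists only for illustration. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_BUF_SIZE 256

struct model_link;
typedef int (*model_handler_t)(struct model_link *link,
                               const void *data, size_t len);

struct model_link {
    model_handler_t handler;     /* domain-specific write method */
    char reply[MODEL_BUF_SIZE];  /* data queued for later reads (assumed layout) */
    size_t reply_len;
};

/* Queue reply bytes for a later read; returns 0, or -1 if they do not fit. */
static int model_queue_reply(struct model_link *link,
                             const void *data, size_t len)
{
    if (len > MODEL_BUF_SIZE - link->reply_len)
        return -1;
    memcpy(link->reply + link->reply_len, data, len);
    link->reply_len += len;
    return 0;
}

/* Write: nothing is buffered, the handler runs before the call returns. */
static int model_write(struct model_link *link, const void *data, size_t len)
{
    return link->handler(link, data, len);
}

/* Read: drain up to len queued bytes; returns 0 at once when nothing is queued. */
static size_t model_read(struct model_link *link, void *out, size_t len)
{
    size_t n = len < link->reply_len ? len : link->reply_len;

    memcpy(out, link->reply, n);
    memmove(link->reply, link->reply + n, link->reply_len - n);
    link->reply_len -= n;
    return n;
}

/* Example handler: "ping" queues "pong"; anything else queues an error
 * message for the caller to read back and returns -1. */
static int model_ping_handler(struct model_link *link,
                              const void *data, size_t len)
{
    if (len == 4 && memcmp(data, "ping", 4) == 0)
        return model_queue_reply(link, "pong", 4);
    model_queue_reply(link, "unknown command", 15);
    return -1;
}
```

A caller would initialise a link with
struct model_link link = { .handler = model_ping_handler }; and then
alternate model_write() and model_read().  Note that the failure path
hands the error text back to the caller through the read side instead of
logging it, which mirrors one of the design goals listed earlier.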

From: Jonathan Corbet
Date: Wednesday, March 5, 2008 - 9:39 am

I'm not in a position to say much about the wider picture at the moment,
but one quibble comes immediately to mind: why do you create yet another
communication path into the kernel rather than using netlink, which is
already there and used in a number of other contexts?

jon
--

From: Daniel Phillips
Date: Wednesday, March 5, 2008 - 12:23 pm

Good question.  It is for the same reason that we are moving away from
Unix domain sockets, which also work but are clumsy and force us to
structure the application in a less than desirable way.  One could equally
well ask why it was necessary to invent netlink when Unix domain sockets
already existed.

ddlink is very different from netlink.  Netlink is socket-oriented while
ddlink is file-oriented.  Compare:

 int rc;
 void *msg_head;
 /* create the message headers */
 msg_head = genlmsg_put(skb, pid, seq, type, 0, flags, DOC_EXMPL_C_ECHO, 1);
 if (msg_head == NULL) {
     rc = -ENOMEM;
     goto failure;
 }
 /* add a DOC_EXMPL_A_MSG attribute */
 rc = nla_put_string(skb, DOC_EXMPL_A_MSG, "Generic Netlink Rocks");
 if (rc != 0)
     goto failure;
 /* finalize the message */
 genlmsg_end(skb, msg_head);

static void selnl_add_payload(struct nlmsghdr *nlh, int len, int msgtype, void *data)
{
        switch (msgtype) {
        case SELNL_MSG_SETENFORCE: {
                struct selnl_msg_setenforce *msg = NLMSG_DATA(nlh);

                memset(msg, 0, len);
                msg->val = *((int *)data);
                break;
        }

        case SELNL_MSG_POLICYLOAD: {
                struct selnl_msg_policyload *msg = NLMSG_DATA(nlh);

                memset(msg, 0, len);
                msg->seqno = *((u32 *)data);
                break;
        }

        default:
                BUG();
        }
}

or:

static void selnl_notify(int msgtype, void *data)
{
        int len;
        sk_buff_data_t tmp;
        struct sk_buff *skb;
        struct nlmsghdr *nlh;

        len = selnl_msglen(msgtype);

        skb = alloc_skb(NLMSG_SPACE(len), GFP_USER);
        if (!skb)
                goto oom;

        tmp = skb->tail;
        nlh = NLMSG_PUT(skb, 0, 0, msgtype, len);
        selnl_add_payload(nlh, len, msgtype, data);
        nlh->nlmsg_len = skb->tail - tmp;
        NETLINK_CB(skb).dst_group = SELNLGRP_AVC;
        ...

From: Daniel Phillips
Date: Saturday, March 8, 2008 - 5:06 am

Hi Pavel,


It doesn't feel strange in practice.  The ddlink framework itself does
not implement this; the module (e.g. ddsetup) does.  So you can put a
poll wait in your read method if that suits your interface.  It just
does not seem useful for ddsetup, which does not produce the kind of
data that requires an application to sit in a loop waiting for
something to arrive.  An application like that would probably want to
poll the ddlink anyway, to avoid dedicating a whole thread to the wait.

Maybe the reason it does not feel strange to omit the wait is that
reading from proc never waits.  A ddlink fd is more like proc than like
a pipe.
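
[Editorial sketch, not part of the original post.]  The "poll instead of
a blocking read" pattern mentioned above might look like this from user
space.  An ordinary pipe can stand in for the ddlink fd when trying it
out, because how a ddlink fd is actually obtained is not shown in this
thread.  The helper names fd_readable_now and read_if_ready are
hypothetical; poll() and read() are the standard POSIX calls.

```c
/* Check-then-read pattern on a generic file descriptor.
 * Helper names are hypothetical; any pollable fd can be passed in. */
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if fd has data ready to read, 0 if not, -1 on poll error.
 * A timeout of 0 means poll() reports the current state and never waits. */
static int fd_readable_now(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, 0);

    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Read only when poll() says data is pending.
 * Returns bytes read, 0 if nothing is pending, -1 on error. */
static ssize_t read_if_ready(int fd, void *buf, size_t len)
{
    int ready = fd_readable_now(fd);

    if (ready <= 0)
        return ready;
    return read(fd, buf, len);
}
```

An event loop that already polls several descriptors can add one more
fd to its pollfd array in the same way, which is what avoids dedicating
a thread to a blocking read.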

Yes, will fix.

Daniel
--
