Q: If you have an axe with a rusty head and a rotten handle, how do you
repair it?

A: It is a two step process. First you replace the handle, then you
replace the head.

At one time, volume management on Linux was new and shiny. Today it has
fallen well behind Sun, FreeBSD, NetApp and even Microsoft. In order to
avoid losing even more of the storage "market" than we have already
lost, we must make a concerted effort to catch up with the state of the
art, or ideally take the lead as has proved possible with so many other
aspects of Linux.

This RFC is about replacing the handle of our not-so-shiny LVM axe.

Design goals for this alternative "ddsetup" interface are:

  * Convenient to embed in a C program
  * Make it simple enough that a library is unnecessary
  * Support for creating detailed, accurate error messages
  * Error messages delivered to caller rather than logged
  * Naturally extensible as new requirements emerge
  * 32 bit ABI works on 64 bit kernel without translation
  * Avoid bad API practices identified by [ARND 07]
  * Do not break the existing ioctl interface

The patch below includes a new kernel interface generator called ddlink.
The ddsetup device mapper interface is an instance of a ddlink
interface, instantiated by supplying domain-specific methods for read,
write, ioctl and poll.

In more detail: ddlink is a generic pipe-like interface for controlling
device drivers. It was inspired by Trond's venerable and successful
rpc-pipefs, which he invented to control various aspects of NFS server
and client operation. ddlink takes the form of a virtual filesystem with
no namespace. It provides application programs with fd objects that can
be read, written, ioctled and polled, suitable for efficient binary
communication with kernel components.

Read, write and poll operations act similarly to a pipe. Unlike a pipe,
there is no write buffering. Each write to a ddlink directly triggers
some kernel handler. Reads are buffered via ...
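
To make the "32 bit ABI works on 64 bit kernel without translation" goal
concrete, here is a minimal user-space sketch of the usual technique:
build messages only from fixed-width integers, keep every 64-bit field
at an 8-byte-aligned offset, and fill any gap with an explicit field.
The struct demo_msg layout and the demo_msg_pack() helper are
hypothetical names made up for this illustration; the actual ddsetup
message format is not shown in this excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical message layout (NOT the real ddsetup format).
 * Only fixed-width types are used, and the two 32-bit fields together
 * occupy the first 8 bytes, so the 64-bit fields start at offset 8
 * under both 32-bit and 64-bit alignment rules. Types such as long or
 * pointers are avoided because their size differs between the two.
 */
struct demo_msg {
	uint32_t opcode;	/* which operation is requested */
	uint32_t flags;		/* explicit field, so no hidden padding */
	uint64_t sector;	/* starts at byte offset 8 */
	uint64_t length;	/* starts at byte offset 16 */
};

/* Fail the build if the layout ever drifts from the intended one. */
static_assert(sizeof(struct demo_msg) == 24, "unexpected message size");
static_assert(offsetof(struct demo_msg, sector) == 8, "sector misplaced");
static_assert(offsetof(struct demo_msg, length) == 16, "length misplaced");

/*
 * Copy a message into a byte buffer exactly as it would be handed to
 * write(2). The caller must supply at least sizeof(struct demo_msg)
 * bytes. Returns the number of bytes stored.
 */
size_t demo_msg_pack(unsigned char *out, uint32_t opcode,
		     uint64_t sector, uint64_t length)
{
	struct demo_msg m = {
		.opcode = opcode,
		.flags = 0,
		.sector = sector,
		.length = length,
	};

	memcpy(out, &m, sizeof m);
	return sizeof m;
}
```

Because the layout is the same for a 32-bit and a 64-bit build, a
handler on the receiving side could interpret the bytes directly, with
no compat translation layer in between.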
I'm not in a position to say much about the wider picture at the moment,
but one quibble comes immediately to mind: why do you create yet another
communication path into the kernel rather than using netlink, which is
already there and used in a number of other contexts?

jon

Good question. It is for the same reason that we are moving away from
unix domain sockets, which also work but are clumsy and force us to
structure the application in a less than desirable way. One could equally
well ask why it was necessary to invent Netlink when unix domain sockets
already existed.
ddlink is very different from netlink. Netlink is socket-oriented while
ddlink is file-oriented. Compare:
	int rc;
	void *msg_head;

	/* create the message headers */
	msg_head = genlmsg_put(skb, pid, seq, type, 0, flags, DOC_EXMPL_C_ECHO, 1);
	if (msg_head == NULL) {
		rc = -ENOMEM;
		goto failure;
	}
	/* add a DOC_EXMPL_A_MSG attribute */
	rc = nla_put_string(skb, DOC_EXMPL_A_MSG, "Generic Netlink Rocks");
	if (rc != 0)
		goto failure;
	/* finalize the message */
	genlmsg_end(skb, msg_head);

static void selnl_add_payload(struct nlmsghdr *nlh, int len, int msgtype, void *data)
{
	switch (msgtype) {
	case SELNL_MSG_SETENFORCE: {
		struct selnl_msg_setenforce *msg = NLMSG_DATA(nlh);

		memset(msg, 0, len);
		msg->val = *((int *)data);
		break;
	}

	case SELNL_MSG_POLICYLOAD: {
		struct selnl_msg_policyload *msg = NLMSG_DATA(nlh);

		memset(msg, 0, len);
		msg->seqno = *((u32 *)data);
		break;
	}

	default:
		BUG();
	}
}
or:
static void selnl_notify(int msgtype, void *data)
{
	int len;
	sk_buff_data_t tmp;
	struct sk_buff *skb;
	struct nlmsghdr *nlh;

	len = selnl_msglen(msgtype);

	skb = alloc_skb(NLMSG_SPACE(len), GFP_USER);
	if (!skb)
		goto oom;

	tmp = skb->tail;
	nlh = NLMSG_PUT(skb, 0, 0, msgtype, len);
	selnl_add_payload(nlh, len, msgtype, data);
	nlh->nlmsg_len = skb->tail - tmp;
	NETLINK_CB(skb).dst_group = SELNLGRP_AVC;
...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Hi Pavel,

It doesn't feel strange in practice. The ddlink framework itself does
not implement this; the module (e.g. ddsetup) does. So you can put a
poll wait in your read method if that suits your interface. It just does
not seem to be useful for ddsetup, which does not produce the kind of
data that requires an application to sit in a loop waiting for something
to arrive. An application like that would probably want to poll the
ddlink anyway, to avoid dedicating a whole thread to the wait.

Maybe the reason it does not feel strange to omit the wait is that
reading from proc never waits. A ddlink fd is more like proc than like a
pipe.

Yes, will fix.

Daniel
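
For readers who want to see the "read never waits, poll if you need to
wait" pattern from the application side, here is a minimal user-space
sketch. No ddlink fd is available outside the patch, so a non-blocking
pipe stands in for one; that substitution, and the helper names
make_nonblocking_pipe(), fd_readable_now() and read_nowait(), are
assumptions made for illustration only. Only standard POSIX calls are
used.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Stand-in for opening a ddlink: create a pipe whose read end is
 * non-blocking. Returns 0 on success, -1 on failure.
 */
int make_nonblocking_pipe(int fds[2])
{
	int fl;

	assert(fds != NULL);
	if (pipe(fds) < 0)
		return -1;
	fl = fcntl(fds[0], F_GETFL);
	if (fl < 0 || fcntl(fds[0], F_SETFL, fl | O_NONBLOCK) < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	return 0;
}

/*
 * Ask whether fd has data to read right now. The zero timeout means
 * this call itself never sleeps. Returns 1 if readable, 0 if not,
 * -1 if poll() failed.
 */
int fd_readable_now(int fd)
{
	struct pollfd p = { .fd = fd, .events = POLLIN };
	int rc = poll(&p, 1, 0);

	if (rc < 0)
		return -1;
	return (rc > 0 && (p.revents & POLLIN)) ? 1 : 0;
}

/*
 * Read whatever is available without waiting. Returns the number of
 * bytes read, 0 if nothing is available (or at end of file), and -1
 * on any other error.
 */
ssize_t read_nowait(int fd, void *buf, size_t len)
{
	ssize_t n = read(fd, buf, len);

	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return 0;
	return n;
}
```

An application that does want to wait would pass a non-zero timeout to
poll() from its existing event loop and call read_nowait() only when
POLLIN is reported, so no thread has to be parked in a blocking read.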
