* devname.c: probe_one(): call dm_device_is_leaf() with *constant*
devno argument in the list_for_each_safe loop does not make sense at
all.
* devname.c: probe_one(): call dm_device_is_leaf() for already
checked DM devices is unnecessary. DM devices are already checked
in dm_probe_all(). The dm_probe_all() function never returns slave
devices.
* devname.c: dm_device_is_leaf(): stop checking for dependences when
a first dependence is detected.
Performance test:
# for i in $(seq 1 100); do dmsetup create $i --table "0 1 zero"; done
new version:
# blkid &> /dev/null
real 0m0.267s
user 0m0.047s
sys 0m0.212s
old version:
# blkid &> /dev/null
real 0m2.149s
user 0m0.291s
sys 0m1.820s
Signed-off-by: Karel Zak <kzak@redhat.com>
---
lib/blkid/devname.c | 37 ++++++++++++++++++++-----------------
1 files changed, 20 insertions(+), 17 deletions(-)
diff --git a/lib/blkid/devname.c b/lib/blkid/devname.c
index 86fd44c..4967b96 100644
--- a/lib/blkid/devname.c
+++ b/lib/blkid/devname.c
@@ -133,25 +133,27 @@ static void probe_one(blkid_cache cache, const char *ptname,
const char **dir;
char *devname = NULL;
- /* See if we already have this device number in the cache. */
- list_for_each_safe(p, pnext, &cache->bic_devs) {
- blkid_dev tmp = list_entry(p, struct blkid_struct_dev,
- bid_devs);
#ifdef HAVE_DEVMAPPER
- if (!dm_device_is_leaf(devno))
- continue;
+ /* call dm_device_is_leaf() for non-DM device only */
+ if (pri == BLKID_PRI_DM || dm_device_is_leaf(devno))
#endif
- if (tmp->bid_devno == devno) {
- if (only_if_new && !access(tmp->bid_name, F_OK))
- return;
- dev = blkid_verify(cache, tmp);
- if (dev && (dev->bid_flags & BLKID_BID_FL_VERIFIED))
- break;
- dev = 0;
+ {
+ /* See if we already have this device number in the cache. */
+ list_for_each_safe(p, pnext, &cache->bic_devs) {
+ blkid_dev tmp = list_entry(p, struct blkid_struct_dev,
+ ...Hi Karel,
Thanks! Your patch forced me to look much more closely at the device
mapper code, and after looking at it, I've been led to the inescapable
conclusion that the whole thing is total cr*p. :-)
My only excuse was that at the time when I accepted the patch, I
wasn't using a device-mapper based system, and couldn't do live
testing of the code, so I didn't realize how much of what it was doing
was totally unnecessary.
What I've commited into the git tree completely rips out the need to
use the device mapper library, and instead relies on the already
existing and well-tested code paths that supports non-dm devices.
This has the nice side benefit that initrd's that want to use blkid
will no longer need to pull in libdevmapper, libselinux, and libsepol
--- or at least, if initrd's need them, it will no longer be on
/sbin/blkid's account. And, we can yank all of the configure
machinery for including libdevmapper, et. al., from e2fsprogs.
It has an even more salutory performance benefit (about 3-6 times
faster than yours, I estimate, since we don't end up calling into
libdevmapper *at* *all*). On an X61s laptop, with:
# for i in $(seq 1 100); do dmsetup create $i --table "0 1 zero"; done
old version:
# time /sbin/blkid >& /dev/null
real 0m4.227s
user 0m0.623s
sys 0m3.406s
new version:
# time ./misc/blkid >& /dev/null
real 0m0.080s
user 0m0.000s
sys 0m0.010s
- Ted
commit 3f66ecf24e896377997b909edef040be98ac76b3
Author: Theodore Ts'o <tytso@mit.edu>
Date: Tue Aug 26 08:13:56 2008 -0400
libblkid: Optimize devicemapper support
This commit works by removing all calls from libdevicemapper
altogether, and using the standard support for "normal" non-dm
devices.
It depends on dm devices being placed in /dev/mapper (but the previous
code had this dependency anyway), and /proc/partitions containing dm
devices.
We don't actually rip out the libdevicemapper code in this commit, ...Well, I see few problems:
* /proc/partitions containing internal dm device names (e.g. dm-0).
The libdevmapper provides translation from internal to the "real"
names (e.g /dev/mapper/foo). I guess (hope:-) /sys provides the
real names too.
* we probably need to resolve dependencies for multi-path devices
where the same FS is accessable by more than one physical device.
If I good remember it was the original purpose for DM support
in libblkid.
# mount LABEL=blabla /mnt
has to mount the "right" device. I guess that only DM is able to
answer which device is the "right" one ;-)
The /sys/block/<devname>/slaves/ provides information about
dependencies.
I think "/dev/mapper" does not make any sense here, ...because the
names from /proc/partitions are in dm-<N> format, but the names in
Karel
--
Karel Zak <kzak@redhat.com>
--
You're right. So seaching for /dev/mapper/dm-n doesn't make any sense; adding /dev/mapper to the dirlist doesn't help, and in fact is a waste of time. However, the patch actually *did* work, and the reason why it does is because we are also are searching /dev/mapper by I don't think you mean multipath support in terms of where there are multiple paths to the same physical device ala fiber channel, but rather where are multiple devices which are built on each other, right? So where /dev/sda is used to create the LVM PV's, and so on, Right. And here, we are solving the problem already in that we prefer the /dev/mapper/* names over names like /dev/sda and /dev/sdb. That's what the priority field is all about. What we don't solve is the problem where one devicemapper device is used to build another device mapper device. This could happen in a number of circumstances. You might have some wierd circumstance where /dev/mapper/part1 and /dev/mapper/part2 are glued together to make /dev/mapper/whole-filesystem. Why you might do this instead of simply using something like lvextend is beyond me, but that is something legitimate can be done with the low-level device mapper primitives. But, #1, there are times when picking a leaf dm device over a non-leaf dm device is not the right thing to do (which would be the case when you make a live snapshot of a filesystem), and #2, your patch only checks non-leaf dm devices for non-dm devices, probably because of #1. So with both of our patches, we have the problem where we could pick the wrong dm device if the user builds one dm device on top of another dm device. (Although for transient devices, such as temporary snapshots, if the permanent devices is already in /etc/blkid.tab, the cache makes us win since we'll prefer the device already in the cache file to the other.) But in terms of making sure we pick the device mapper file over the raw file, which is the 99.99% common case, we do the right thing, and This does work, ...
Here's the new patch.
- Ted
commit 2e6c2735e19ec19821eeff1cbdf11c09c25540f0
Author: Theodore Ts'o <tytso@mit.edu>
Date: Tue Aug 26 08:13:56 2008 -0400
libblkid: Optimize devicemapper support
This commit works by removing all calls from libdevmapper altogether,
and using the standard support for "normal" non-dm devices.
It depends on dm devices being placed in /dev/mapper (but the previous
code had this dependency anyway), and /proc/partitions containing dm
devices.
We don't actually rip out the libdevmapper code in this commit, but
just disable it via #undef HAVE_DEVMAPPER, just so it's easier to
review and understand the fundamental code changes. A subsequent
commit will remove the libdevmapper code, as well as unexport
the blkid_devdirs string array.
Thanks to Karel Zak for inspiring me to look at the dm code in blkid,
so I could realize how much it deserved to ripped out by its roots. :-)
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
diff --git a/lib/blkid/blkidP.h b/lib/blkid/blkidP.h
index dfe0eca..9d1e459 100644
--- a/lib/blkid/blkidP.h
+++ b/lib/blkid/blkidP.h
@@ -153,6 +153,13 @@ extern void blkid_debug_dump_dev(blkid_dev dev);
extern void blkid_debug_dump_tag(blkid_tag tag);
#endif
+/* devno.c */
+struct dir_list {
+ char *name;
+ struct dir_list *next;
+};
+extern void blkid__scan_dir(char *, dev_t, struct dir_list **, char **);
+
/* lseek.c */
extern blkid_loff_t blkid_llseek(int fd, blkid_loff_t offset, int whence);
diff --git a/lib/blkid/devname.c b/lib/blkid/devname.c
index 86fd44c..ec3cff3 100644
--- a/lib/blkid/devname.c
+++ b/lib/blkid/devname.c
@@ -38,6 +38,8 @@
#include "blkidP.h"
+#undef HAVE_DEVMAPPER
+
#ifdef HAVE_DEVMAPPER
#include <libdevmapper.h>
#endif
@@ -122,6 +124,9 @@ blkid_dev blkid_get_dev(blkid_cache cache, const char *devname, int flags)
static int dm_device_is_leaf(const dev_t dev);
#endif
+/* ...No, in fact DM has actual mutliple-paths-to-the-same-device support, via "dm-multipath". Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. --
Well, *that* we have, as long as the parent devices are real (non-devicemapper) devices. So if /dev/sdc and /dev/sdg are both paths to the same filesystem, and dm-multipath has created /dev/mapper/filesystem as the multipath device to that filesystem, any devices with /dev/mapper have priority over non-dm devices, so /dev/mapper/filesystem will get returned first. - Ted --
Ignore this point. You are right. The physical devices are slaves to
the final DM device (when dm-multipath is on). BTW, I look forward to
There is worse scenario (thanks to Milan Broz from DM camp):
dmsetup create x --table "0 100 linear /dev/sdb 0"
dmsetup create y --table "0 100 linear /dev/mapper/x 0"
dmsetup create z --table "0 100 linear /dev/mapper/y 0"
# dmsetup ls --tree
z (254:3)
`-y (254:2)
`-x (254:1)
`- (8:16)
it means all these devices are exactly same, but
mount LABEL=foo
has to mount /dev/mapper/z (from top of the tree). The sdb, x and
I don't think so. The dm_probe_all() function never returns any
DM device which is slave to any other device. It means it always
returns the device from top of the hierarchy.
All devices from dm_probe_all() have greater priority than other
stuff from /proc/partitions (for example dm-N devs).
So back to your example...
/dev/mapper/part1 + /dev/mapper/part2 = /dev/mapper/whole-filesystem
the /dev/mapper/part1 and /dev/mapper/part2 will be visible for the
library (e.g. blkid.tab), but with *smaller priority* than
/dev/mapper/whole-filesystem.
In your non-libdevmapper implementation you need to check
/sys/block/dm-N/holders/ and prefer devices without holders.
I think we can ignore this minor problem for now. I'll try to found a
better solution for dependencies resolution without libdevmapper. My
Not elegant, but... good enough :-)
It would be nice to have /sys/block/dm-N/name where you can translate
the internal dm-N name to the real device name. Alasdair? Milan? :-)
Karel
--
Karel Zak <kzak@redhat.com>
--
Sure, but consider what happens when you create a snapshot (either read-only or read-write) of an existing filesystem? In that case, both the parent and the child filesystem is mountable, and if the child filesystem is transient, the praent one may not want to be transient. In fact, suppose the scenario is a virtualization scenario, where you create the parent, create child snapshots, then use "tune2fs -U random -L virt1 /dev/mapper/snap1" ; "tune2fs -U random -L virt2 /dev/mapper/snap2" and so on, so each of the child snapshots have their own independent identity separate from the parent, it may very *well* be the case that the parent device should be visible to mount. I don't think we can make the general argument that the leaf device is always mountable, and anything above it is *never* mountable. Maybe that's the default, but it's certainly not always true. I'm beginning the right answer is we need some assist from the device mapper infrastructure, where when we create the device mapper device, we specify whether or not it is mountable, and this information is made available somehow, either by trying to sneaking it into /proc/partitions (which will be tricky without breaking legacy Speaking of which, what is your plan for caching versus non-caching in libfsprobe? It seems to me that if you are going to be caching, you'll just be re-inventing blkid. If you don't cache, you'll either (a) have to iterate over all possible devices, which is what we did before blkid (it was Ric Wheeler pointed out to me this problem and I wrote blkid in response to his request, because it becomes a problem if you have hundred of LUN's getting exported by a large EMC storage array :-), or (b) do what vol_id does, which is depend on /dev/disk/by-label and /dev/disk/by-uuid, which has the charming Windows-like attribute of not getting updated until the next reboot --- which means after you create a new filesystem or swap device on an existing device, or change a label or UUID using ...
That's misunderstanding. I'm talking about LABEL/UUId resolution
where we need *priorities* for duplicate tags. I think dep-tree is
good enough for this purpose.
The snapshot (or arbitrary other device) is mountable when you
Both. I think you remember our (+ Kay Sievers) discussion about it.
We need a library which provides both ways. The smart way (cache,
dependencies, ...) for mount(8) and others standard utils, and the
Hehe.. I will directly copy code from blkid and vol_id. It's open
Yes, I have no clue why the dm-N crap is in /proc/partitions.
I'm wrong person for these questions. DM guys are in CC: :-)
Karel
--
Karel Zak <kzak@redhat.com>
--
OK, so what you're saying is that that a leaf dm-device is always more important (and should therefore have a higher priority) than a non-leaf device. But I'm not sure that is *still* not always the right thing to do. Suppose someone creates a snapshot of a device in order to run e2fsck, or to do create a coherent snapshot. Now suppose the machine crashes while the snapshot still exists; even though the read-only snapshot is the "leaf" device, you don't want to try use that snapshot to be mounted as the root filesystem. There are a number of solutions of course --- most of which do not require adding more smarts into blkid (or some other probing library). We could make the scripts that create the snapshots update the UUID and LABEL of the snapshot, although that means adding some kind filesystem-specific hook to lvcreate. Or we could create the concept of "ephemeral snapshots" that don't survive a reboot. Or we could try to mark certain LVM volumes with explicit priorities that would be pulled into blkid (possibly via the dm interfaces). Or we could try to put a lot of that smarts into the blkid library. Personally, I like the idea of emphermal snapshots as the best system-wide solution, but the point is we need to think about this not from a single component's point of view, but what is the best solution Sure, but if that's the case, we already have most of the "smart way" from blkid. What's the point of making fsprobe re-invent the caching Well, for one, it means changing every single mkfs/mkswap/tune*fs program on the system; so it seems more than a bit of kludge. Sometimes these tools are run by an unprivileged user, so there are some security problems have to be carefully thought out. Obviously you can't just tell udev the new label and uuid, since the source might be untrusted. The userspace program send an event to udevd that a device has changed, but that means that you have to allow an unprivileged process to kick udevd and then reprobe a device, ...
Here's a patch that I've been working on which gives a priority bonus
to dm leaf devices, without needing libdevmapper. As I've mentioned
before, I'm not 100% convinced this is always the right thing. But
it's probably a not-half-bad hueristic...
- Ted
diff --git a/lib/blkid/devname.c b/lib/blkid/devname.c
index 48a1dcc..2b3855b 100644
--- a/lib/blkid/devname.c
+++ b/lib/blkid/devname.c
@@ -25,6 +25,7 @@
#if HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif
+#include <dirent.h>
#if HAVE_SYS_STAT_H
#include <sys/stat.h>
#endif
@@ -117,6 +118,38 @@ blkid_dev blkid_get_dev(blkid_cache cache, const char *devname, int flags)
/* Directories where we will try to search for device names */
static const char *dirlist[] = { "/dev", "/devfs", "/devices", NULL };
+static int is_dm_leaf(const char *devname)
+{
+ struct dirent *de, *d_de;
+ DIR *dir, *d_dir;
+ char path[256];
+ int ret = 1;
+
+ if ((dir = opendir("/sys/block")) == NULL)
+ return 0;
+ while ((de = readdir(dir)) != NULL) {
+ if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, "..") ||
+ !strcmp(de->d_name, devname) ||
+ strncmp(de->d_name, "dm-", 3) ||
+ strlen(de->d_name) > sizeof(path)-32)
+ continue;
+ sprintf(path, "/sys/block/%s/slaves", de->d_name);
+ if ((d_dir = opendir(path)) == NULL)
+ continue;
+ while ((d_de = readdir(d_dir)) != NULL) {
+ if (!strcmp(d_de->d_name, devname)) {
+ ret = 0;
+ break;
+ }
+ }
+ closedir(d_dir);
+ if (!ret)
+ break;
+ }
+ closedir(dir);
+ return ret;
+}
+
/*
* Probe a single block device to add to the device cache.
*/
@@ -180,9 +213,11 @@ set_pri:
if (dev) {
if (pri)
dev->bid_pri = pri;
- else if (!strncmp(dev->bid_name, "/dev/mapper/", 11))
+ else if (!strncmp(dev->bid_name, "/dev/mapper/", 11)) {
dev->bid_pri = BLKID_PRI_DM;
- else if (!strncmp(ptname, "md", 2))
+ if (is_dm_leaf(ptname))
+ dev->bid_pri += 5;
+ } else if (!strncmp(ptname, "md", 2))
dev->bid_pri ...Seems good. This patch and the brute-force for conversion from dm-N to
real names is the way how replace the HAVE_DEVMAPPER crap in libblkid.
Thanks!
Karel
--
Karel Zak <kzak@redhat.com>
--
Couldn't you just change libblkid to export the probe functions? It always makes me cringe when code like this is copied, because I just _know_ one or the other will become out of date, and it will take twice as much effort to keep them in sync. I'd rather see people doing "high value" work instead of watching for and copying patches I've wanted that for so long... It's always a pain that LVM tools work by using /dev/vgfoo/lvfoo (which is fine) but /proc/partitions doesn't give any clue to what each dm-X device is. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. --
Yes, good point. Frankly, I think about it in last weeks. I have done
work on the low-lever part of libfsprobe -- it shouldn't be a problem
port this code to libblkid. The advantage will be a library that is
usable for udev and backwardly compatible for the current blkid
applications. I will try it...
Karel
--
Karel Zak <kzak@redhat.com>
--
