Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

Previous thread: FIEMAP patches by Andreas Dilger on Monday, August 25, 2008 - 1:22 pm. (16 messages)

Next thread: [PATCH] e2fsck shouln't consider superblock summaries as fatal by Andreas Dilger on Tuesday, August 26, 2008 - 3:45 am. (7 messages)
From: Karel Zak
Date: Monday, August 25, 2008 - 1:48 pm

* devname.c: probe_one(): call dm_device_is_leaf() with *constant*
   devno argument in the list_for_each_safe loop does not make sense at
   all.

 * devname.c: probe_one(): call dm_device_is_leaf() for already
   checked DM devices is unnecessary. DM devices are already checked
   in dm_probe_all(). The dm_probe_all() function never returns slave
   devices.

 * devname.c: dm_device_is_leaf(): stop checking for dependences when
   a first dependence is detected.

Performance test:

  # for i in $(seq 1 100); do dmsetup create $i --table "0 1 zero"; done

 new version:

  # blkid &> /dev/null

  real    0m0.267s
  user    0m0.047s
  sys     0m0.212s

 old version:

  # blkid &> /dev/null

  real    0m2.149s
  user    0m0.291s
  sys     0m1.820s

Signed-off-by: Karel Zak <kzak@redhat.com>
---
 lib/blkid/devname.c |   37 ++++++++++++++++++++-----------------
 1 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/lib/blkid/devname.c b/lib/blkid/devname.c
index 86fd44c..4967b96 100644
--- a/lib/blkid/devname.c
+++ b/lib/blkid/devname.c
@@ -133,25 +133,27 @@ static void probe_one(blkid_cache cache, const char *ptname,
 	const char **dir;
 	char *devname = NULL;
 
-	/* See if we already have this device number in the cache. */
-	list_for_each_safe(p, pnext, &cache->bic_devs) {
-		blkid_dev tmp = list_entry(p, struct blkid_struct_dev,
-					   bid_devs);
 #ifdef HAVE_DEVMAPPER
-		if (!dm_device_is_leaf(devno))
-			continue;
+	/* call dm_device_is_leaf() for non-DM device only */
+	if (pri == BLKID_PRI_DM || dm_device_is_leaf(devno))
 #endif
-		if (tmp->bid_devno == devno) {
-			if (only_if_new && !access(tmp->bid_name, F_OK))
-				return;
-			dev = blkid_verify(cache, tmp);
-			if (dev && (dev->bid_flags & BLKID_BID_FL_VERIFIED))
-				break;
-			dev = 0;
+	{
+		/* See if we already have this device number in the cache. */
+		list_for_each_safe(p, pnext, &cache->bic_devs) {
+			blkid_dev tmp = list_entry(p, struct blkid_struct_dev,
+						 ...
From: Theodore Tso
Date: Tuesday, August 26, 2008 - 5:24 am

Hi Karel,

Thanks!  Your patch forced me to look much more closely at the device
mapper code, and after looking at it, I've been led to the inescapable
conclusion that the whole thing is total cr*p.  :-)

My only excuse was that at the time when I accepted the patch, I
wasn't using a device-mapper based system, and couldn't do live
testing of the code, so I didn't realize how much of what it was doing
was totally unnecessary.

What I've commited into the git tree completely rips out the need to
use the device mapper library, and instead relies on the already
existing and well-tested code paths that supports non-dm devices.
This has the nice side benefit that initrd's that want to use blkid
will no longer need to pull in libdevmapper, libselinux, and libsepol
--- or at least, if initrd's need them, it will no longer be on
/sbin/blkid's account.  And, we can yank all of the configure
machinery for including libdevmapper, et. al., from e2fsprogs.

It has an even more salutory performance benefit (about 3-6 times
faster than yours, I estimate, since we don't end up calling into
libdevmapper *at* *all*).  On an X61s laptop, with:

# for i in $(seq 1 100); do dmsetup create $i --table "0 1 zero"; done

old version:

# time /sbin/blkid >& /dev/null

real   0m4.227s
user   0m0.623s
sys    0m3.406s

new version:

# time ./misc/blkid >& /dev/null

real   0m0.080s
user   0m0.000s
sys    0m0.010s

						- Ted


commit 3f66ecf24e896377997b909edef040be98ac76b3
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Tue Aug 26 08:13:56 2008 -0400

    libblkid: Optimize devicemapper support
    
    This commit works by removing all calls from libdevicemapper
    altogether, and using the standard support for "normal" non-dm
    devices.
    
    It depends on dm devices being placed in /dev/mapper (but the previous
    code had this dependency anyway), and /proc/partitions containing dm
    devices.
    
    We don't actually rip out the libdevicemapper code in this commit, ...
From: Karel Zak
Date: Tuesday, August 26, 2008 - 6:51 am

Well, I see few problems:

 * /proc/partitions containing internal dm device names (e.g. dm-0).
   The libdevmapper provides translation from internal to the "real"
   names (e.g /dev/mapper/foo). I guess (hope:-) /sys provides the
   real names too.

 * we probably need to resolve dependencies for multi-path devices
   where the same FS is accessable by more than one physical device.
   If I good remember it was the original purpose for DM support 
   in libblkid.

        # mount LABEL=blabla /mnt

   has to mount the "right" device. I guess that only DM is able to
   answer which device is the "right" one ;-)

   The /sys/block/<devname>/slaves/ provides information about
   dependencies.


I think "/dev/mapper" does not make any sense here, ...because the
names from /proc/partitions are in dm-<N> format, but the names in


    Karel

-- 
 Karel Zak  <kzak@redhat.com>
--

From: Theodore Tso
Date: Tuesday, August 26, 2008 - 7:47 am

You're right.  So seaching for /dev/mapper/dm-n doesn't make any
sense; adding /dev/mapper to the dirlist doesn't help, and in fact is
a waste of time.  However, the patch actually *did* work, and the
reason why it does is because we are also are searching /dev/mapper by

I don't think you mean multipath support in terms of where there are
multiple paths to the same physical device ala fiber channel, but
rather where are multiple devices which are built on each other,
right?  So where /dev/sda is used to create the LVM PV's, and so on,

Right.  And here, we are solving the problem already in that we prefer
the /dev/mapper/* names over names like /dev/sda and /dev/sdb.  That's
what the priority field is all about.

What we don't solve is the problem where one devicemapper device is
used to build another device mapper device.  This could happen in a
number of circumstances.  You might have some wierd circumstance where
/dev/mapper/part1 and /dev/mapper/part2 are glued together to make
/dev/mapper/whole-filesystem.  Why you might do this instead of simply
using something like lvextend is beyond me, but that is something
legitimate can be done with the low-level device mapper primitives.

But, #1, there are times when picking a leaf dm device over a non-leaf
dm device is not the right thing to do (which would be the case when
you make a live snapshot of a filesystem), and #2, your patch only
checks non-leaf dm devices for non-dm devices, probably because of #1.

So with both of our patches, we have the problem where we could pick
the wrong dm device if the user builds one dm device on top of another
dm device.  (Although for transient devices, such as temporary
snapshots, if the permanent devices is already in /etc/blkid.tab, the
cache makes us win since we'll prefer the device already in the cache
file to the other.)

But in terms of making sure we pick the device mapper file over the
raw file, which is the 99.99% common case, we do the right thing, and


This does work, ...
From: Theodore Tso
Date: Tuesday, August 26, 2008 - 11:04 am

Here's the new patch.

						- Ted

commit 2e6c2735e19ec19821eeff1cbdf11c09c25540f0
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Tue Aug 26 08:13:56 2008 -0400

    libblkid: Optimize devicemapper support
    
    This commit works by removing all calls from libdevmapper altogether,
    and using the standard support for "normal" non-dm devices.
    
    It depends on dm devices being placed in /dev/mapper (but the previous
    code had this dependency anyway), and /proc/partitions containing dm
    devices.
    
    We don't actually rip out the libdevmapper code in this commit, but
    just disable it via #undef HAVE_DEVMAPPER, just so it's easier to
    review and understand the fundamental code changes.  A subsequent
    commit will remove the libdevmapper code, as well as unexport
    the blkid_devdirs string array.
    
    Thanks to Karel Zak for inspiring me to look at the dm code in blkid,
    so I could realize how much it deserved to ripped out by its roots.  :-)
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/lib/blkid/blkidP.h b/lib/blkid/blkidP.h
index dfe0eca..9d1e459 100644
--- a/lib/blkid/blkidP.h
+++ b/lib/blkid/blkidP.h
@@ -153,6 +153,13 @@ extern void blkid_debug_dump_dev(blkid_dev dev);
 extern void blkid_debug_dump_tag(blkid_tag tag);
 #endif
 
+/* devno.c */
+struct dir_list {
+	char	*name;
+	struct dir_list *next;
+};
+extern void blkid__scan_dir(char *, dev_t, struct dir_list **, char **);
+
 /* lseek.c */
 extern blkid_loff_t blkid_llseek(int fd, blkid_loff_t offset, int whence);
 
diff --git a/lib/blkid/devname.c b/lib/blkid/devname.c
index 86fd44c..ec3cff3 100644
--- a/lib/blkid/devname.c
+++ b/lib/blkid/devname.c
@@ -38,6 +38,8 @@
 
 #include "blkidP.h"
 
+#undef HAVE_DEVMAPPER
+
 #ifdef HAVE_DEVMAPPER
 #include <libdevmapper.h>
 #endif
@@ -122,6 +124,9 @@ blkid_dev blkid_get_dev(blkid_cache cache, const char *devname, int flags)
 static int dm_device_is_leaf(const dev_t dev);
 #endif
 
+/* ...
From: Andreas Dilger
Date: Tuesday, August 26, 2008 - 12:44 pm

No, in fact DM has actual mutliple-paths-to-the-same-device support,
via "dm-multipath".

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--

From: Theodore Tso
Date: Tuesday, August 26, 2008 - 1:00 pm

Well, *that* we have, as long as the parent devices are real
(non-devicemapper) devices.

So if /dev/sdc and /dev/sdg are both paths to the same filesystem, and
dm-multipath has created /dev/mapper/filesystem as the multipath
device to that filesystem, any devices with /dev/mapper have priority
over non-dm devices, so /dev/mapper/filesystem will get returned
first.

					- Ted
--

From: Karel Zak
Date: Tuesday, August 26, 2008 - 1:47 pm

Ignore this point. You are right. The physical devices are slaves to
 the final DM device (when dm-multipath is on). BTW, I look forward to

 There is worse scenario (thanks to Milan Broz from DM camp):

 dmsetup create x --table "0 100 linear /dev/sdb 0"
 dmsetup create y --table "0 100 linear /dev/mapper/x 0"
 dmsetup create z --table "0 100 linear /dev/mapper/y 0"

 # dmsetup ls --tree
  z (254:3)
  `-y (254:2)
    `-x (254:1)
      `- (8:16)

 it means all these devices are exactly same, but

    mount LABEL=foo

 has to mount /dev/mapper/z (from top of the tree). The sdb, x and

 I don't think so. The dm_probe_all() function never returns any
 DM device which is slave to any other device. It means it always
 returns the device from top of the hierarchy.

 All devices from dm_probe_all() have greater priority than other
 stuff from /proc/partitions (for example dm-N devs).

 So back to your example...

 /dev/mapper/part1 + /dev/mapper/part2 = /dev/mapper/whole-filesystem

 the /dev/mapper/part1 and /dev/mapper/part2 will be visible for the
 library (e.g. blkid.tab), but with *smaller priority* than
 /dev/mapper/whole-filesystem.

 In your non-libdevmapper implementation you need to check
 /sys/block/dm-N/holders/ and prefer devices without holders.

 I think we can ignore this minor problem for now. I'll try to found a
 better solution for dependencies resolution without libdevmapper. My


 Not elegant, but... good enough :-)

 It would be nice to have /sys/block/dm-N/name where you can translate
 the internal dm-N name to the real device name. Alasdair? Milan? :-)

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
--

From: Theodore Tso
Date: Tuesday, August 26, 2008 - 4:32 pm

Sure, but consider what happens when you create a snapshot (either
read-only or read-write) of an existing filesystem?  In that case,
both the parent and the child filesystem is mountable, and if the
child filesystem is transient, the praent one may not want to be
transient.

In fact, suppose the scenario is a virtualization scenario, where you
create the parent, create child snapshots, then use "tune2fs -U random
-L virt1 /dev/mapper/snap1" ; "tune2fs -U random -L virt2
/dev/mapper/snap2" and so on, so each of the child snapshots have
their own independent identity separate from the parent, it may very
*well* be the case that the parent device should be visible to mount.

I don't think we can make the general argument that the leaf device is
always mountable, and anything above it is *never* mountable.  Maybe
that's the default, but it's certainly not always true.

I'm beginning the right answer is we need some assist from the device
mapper infrastructure, where when we create the device mapper device,
we specify whether or not it is mountable, and this information is
made available somehow, either by trying to sneaking it into
/proc/partitions (which will be tricky without breaking legacy

Speaking of which, what is your plan for caching versus non-caching in
libfsprobe?  It seems to me that if you are going to be caching,
you'll just be re-inventing blkid.  If you don't cache, you'll either
(a) have to iterate over all possible devices, which is what we did
before blkid (it was Ric Wheeler pointed out to me this problem and I
wrote blkid in response to his request, because it becomes a problem
if you have hundred of LUN's getting exported by a large EMC storage
array :-), or (b) do what vol_id does, which is depend on
/dev/disk/by-label and /dev/disk/by-uuid, which has the charming
Windows-like attribute of not getting updated until the next reboot
--- which means after you create a new filesystem or swap device on an
existing device, or change a label or UUID using ...
From: Karel Zak
Date: Tuesday, August 26, 2008 - 5:19 pm

That's misunderstanding. I'm talking about LABEL/UUId resolution
 where we need *priorities* for duplicate tags. I think dep-tree is
 good enough for this purpose.

 The snapshot (or arbitrary other device) is mountable when you


 Both. I think you remember our (+ Kay Sievers) discussion about it.
 We need a library which provides both ways. The smart way (cache,
 dependencies, ...) for mount(8) and others standard utils, and the

 Hehe.. I will directly copy code from blkid and vol_id. It's open


 Yes, I have no clue why the dm-N crap is in /proc/partitions.

 I'm wrong person for these questions. DM guys are in CC: :-)

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
--

From: Theodore Tso
Date: Tuesday, August 26, 2008 - 6:21 pm

OK, so what you're saying is that that a leaf dm-device is always more
important (and should therefore have a higher priority) than a
non-leaf device.

But I'm not sure that is *still* not always the right thing to do.
Suppose someone creates a snapshot of a device in order to run e2fsck,
or to do create a coherent snapshot.  Now suppose the machine crashes
while the snapshot still exists; even though the read-only snapshot is
the "leaf" device, you don't want to try use that snapshot to be
mounted as the root filesystem.

There are a number of solutions of course --- most of which do not
require adding more smarts into blkid (or some other probing library).
We could make the scripts that create the snapshots update the UUID
and LABEL of the snapshot, although that means adding some kind
filesystem-specific hook to lvcreate.  Or we could create the concept
of "ephemeral snapshots" that don't survive a reboot.  Or we could try
to mark certain LVM volumes with explicit priorities that would be
pulled into blkid (possibly via the dm interfaces).  Or we could try
to put a lot of that smarts into the blkid library.

Personally, I like the idea of emphermal snapshots as the best
system-wide solution, but the point is we need to think about this not
from a single component's point of view, but what is the best solution

Sure, but if that's the case, we already have most of the "smart way"
from blkid.  What's the point of making fsprobe re-invent the caching

Well, for one, it means changing every single mkfs/mkswap/tune*fs
program on the system; so it seems more than a bit of kludge.
Sometimes these tools are run by an unprivileged user, so there are
some security problems have to be carefully thought out.  Obviously
you can't just tell udev the new label and uuid, since the source
might be untrusted.  The userspace program send an event to udevd that
a device has changed, but that means that you have to allow an
unprivileged process to kick udevd and then reprobe a device, ...
From: Theodore Tso
Date: Tuesday, August 26, 2008 - 9:40 pm

Here's a patch that I've been working on which gives a priority bonus
to dm leaf devices, without needing libdevmapper.  As I've mentioned
before, I'm not 100% convinced this is always the right thing.  But
it's probably a not-half-bad hueristic...

						- Ted

diff --git a/lib/blkid/devname.c b/lib/blkid/devname.c
index 48a1dcc..2b3855b 100644
--- a/lib/blkid/devname.c
+++ b/lib/blkid/devname.c
@@ -25,6 +25,7 @@
 #if HAVE_SYS_TYPES_H
 #include <sys/types.h>
 #endif
+#include <dirent.h>
 #if HAVE_SYS_STAT_H
 #include <sys/stat.h>
 #endif
@@ -117,6 +118,38 @@ blkid_dev blkid_get_dev(blkid_cache cache, const char *devname, int flags)
 /* Directories where we will try to search for device names */
 static const char *dirlist[] = { "/dev", "/devfs", "/devices", NULL };
 
+static int is_dm_leaf(const char *devname)
+{
+	struct dirent	*de, *d_de;
+	DIR		*dir, *d_dir;
+	char		path[256];
+	int		ret = 1;
+
+	if ((dir = opendir("/sys/block")) == NULL)
+		return 0;
+	while ((de = readdir(dir)) != NULL) {
+		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, "..") ||
+		    !strcmp(de->d_name, devname) ||
+		    strncmp(de->d_name, "dm-", 3) ||
+		    strlen(de->d_name) > sizeof(path)-32)
+			continue;
+		sprintf(path, "/sys/block/%s/slaves", de->d_name);
+		if ((d_dir = opendir(path)) == NULL)
+			continue;
+		while ((d_de = readdir(d_dir)) != NULL) {
+			if (!strcmp(d_de->d_name, devname)) {
+				ret = 0;
+				break;
+			}
+		}
+		closedir(d_dir);
+		if (!ret)
+			break;
+	}
+	closedir(dir);
+	return ret;
+}
+
 /*
  * Probe a single block device to add to the device cache.
  */
@@ -180,9 +213,11 @@ set_pri:
 	if (dev) {
 		if (pri)
 			dev->bid_pri = pri;
-		else if (!strncmp(dev->bid_name, "/dev/mapper/", 11))
+		else if (!strncmp(dev->bid_name, "/dev/mapper/", 11)) {
 			dev->bid_pri = BLKID_PRI_DM;
-		else if (!strncmp(ptname, "md", 2))
+			if (is_dm_leaf(ptname))
+				dev->bid_pri += 5;
+		} else if (!strncmp(ptname, "md", 2))
 			dev->bid_pri ...
From: Karel Zak
Date: Wednesday, August 27, 2008 - 1:32 am

Seems good. This patch and the brute-force for conversion from dm-N to
 real names is the way how replace the HAVE_DEVMAPPER crap in libblkid.
 Thanks!

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
--

From: Andreas Dilger
Date: Wednesday, August 27, 2008 - 12:26 am

Couldn't you just change libblkid to export the probe functions?  It
always makes me cringe when code like this is copied, because I just
_know_ one or the other will become out of date, and it will take
twice as much effort to keep them in sync.  I'd rather see people
doing "high value" work instead of watching for and copying patches

I've wanted that for so long...  It's always a pain that LVM tools
work by using /dev/vgfoo/lvfoo (which is fine) but /proc/partitions
doesn't give any clue to what each dm-X device is.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--

From: Karel Zak
Date: Wednesday, August 27, 2008 - 1:10 am

Yes, good point. Frankly, I think about it in last weeks. I have done
 work on the low-lever part of libfsprobe -- it shouldn't be a problem
 port this code to libblkid. The advantage will be a library that is
 usable for udev and backwardly compatible for the current blkid
 applications. I will try it...

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
--

Previous thread: FIEMAP patches by Andreas Dilger on Monday, August 25, 2008 - 1:22 pm. (16 messages)

Next thread: [PATCH] e2fsck shouln't consider superblock summaries as fatal by Andreas Dilger on Tuesday, August 26, 2008 - 3:45 am. (7 messages)