RE: Auto Rebuild on hot-plug

From: Neil Brown
Date: Wednesday, March 24, 2010 - 5:35 pm

Greetings.
 I find myself in the middle of two separate off-list conversations on the
 same topic and it has reached the point where I think the conversations
 really need to be united and brought on-list.

 So here is my current understanding and thoughts.

 The topic is about making rebuild after a failure easier.  It strikes me as
 particularly relevant after the link Bill Davidsen recently forwarded to the
 list:

       http://blogs.techrepublic.com.com/opensource/?p=1368

 The most significant thing I got from this was a complaint in the comments
 that managing md raid was too complex and hence error-prone.

 I see the issue as breaking down into two parts.
  1/ When a device is hot plugged into the system, is md allowed to use it as
     a spare for recovery?
  2/ If md has a spare device, what set of arrays can it be used in if needed?

 A typical hot plug event will need to address both of these questions in
 turn before recovery actually starts.

 Part 1.

  A newly hotplugged device may have metadata for RAID (0.90, 1.x, IMSM, DDF,
  other vendor metadata) or LVM or a filesystem.  It might have a partition
  table which could be subordinate to or super-ordinate to other metadata.
  (i.e. RAID in partitions, or partitions in RAID).  The metadata may or may
  not be stale.  It may or may not match - either strongly or weakly -
  metadata on devices in currently active arrays.

  A newly hotplugged device also has a "path" which we can see
  in /dev/disk/by-path.  This is somehow indicative of a physical location.
  This path may be the same as the path of a device which was recently
  removed.  It might be one of a set of paths which make up a "RAID chassis".
  It might be one of a set of paths on which we happen to find other RAID
  arrays.

  Somehow from all of that information we need to decide if md can use the
  device without asking, or possibly with a simple yes/no question, and we
  need to decide what to actually do with the device.

  Options ...
From: Michael Evans
Date: Wednesday, March 24, 2010 - 7:47 pm

My feeling on the entire subject matter is that this is /not/ an easy
decision.  Computers are rarely correct when they guess at what an
administrator wants, and attempting to implement the functionality
within mdadm is prone to many limitations or re-inventing the wheel.

If mdadm / mdmon is part of the process at all, I think it should be
used to either fork an executable (script or otherwise) which invokes
the administrative actions that have been pre-determined.

I believe that the default action should be to do /nothing/.  That is
the only safe thing to do.  If an administrative framework is desired
that seems to fall under a larger project goal which is likely better
covered by programs more aware of the overall system state.  This
route also allows for a range of scalability.

It may be sufficient in an initramfs context to either spawn a shell
or even just wait in a recovery console after the mdadm invocation
returns failure.  It might also be desired to use a very simple
reaction which assumes any spare of sufficient size which is added
should be allocated to the largest or closest comparable area based on
pre-determined preferences.

At the same time, I could see the value in mapping actual physical
locations to an array, remembering any missing or failed device
layouts, and re-creating the same layouts on the new device.  However
those actions are a little above what mdadm should be operating at.

With both of those viewpoints I see the following solution.

The most specific action match is followed.

Action-matches should be restrict-able by path wildcard, simple size
comparisons, AND state for metadata.
As a final deciding factor action-matches should also have an optional
priority value, so that when all else matches one rule out of a set
will be known to run first.

The result of matching an action, once again, should be an external
program or shell to allow for maximum flexibility.
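
A minimal sketch of the matching scheme described above, in Python purely
for illustration. Nothing here is an existing mdadm interface: the Rule
fields, the pick_rule helper, and the choice to measure "most specific" by
the number of constraints a rule sets are all assumptions.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """One hypothetical action-match: constraints plus a program to run."""
    command: str                          # external program or script
    path_glob: Optional[str] = None       # wildcard on the by-path name
    min_size: Optional[int] = None        # simple size comparison, in bytes
    metadata_state: Optional[str] = None  # e.g. "none", "stale", "current"
    priority: int = 0                     # final tie-breaker

    def matches(self, path, size, state):
        # An unset constraint (None) matches anything.
        if self.path_glob is not None and not fnmatchcase(path, self.path_glob):
            return False
        if self.min_size is not None and size < self.min_size:
            return False
        if self.metadata_state is not None and state != self.metadata_state:
            return False
        return True

    def specificity(self):
        # Assumption: more constraints set means "more specific".
        fields = (self.path_glob, self.min_size, self.metadata_state)
        return sum(f is not None for f in fields)


def pick_rule(rules, path, size, state):
    """Return the most specific matching rule, or None (do nothing)."""
    candidates = [r for r in rules if r.matches(path, size, state)]
    if not candidates:
        return None
    # Most specific first; priority only decides between equals.
    return max(candidates, key=lambda r: (r.specificity(), r.priority))
```

When no rule matches, pick_rule returns None, which corresponds to the
default of doing /nothing/; otherwise the caller would fork the rule's
command.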

I am not at all opposed to adding good default choices for ...
From: Neil Brown
Date: Tuesday, March 30, 2010 - 6:18 pm

On Wed, 24 Mar 2010 19:47:59 -0700

I agree that /nothing/ should be the default action for a device with
unrecognised content.
If the content of the device is recognised, it is OK to have a default which
does what the content implies - i.e. build a device into an array.
But maybe that is what you meant.

I think there is useful stuff that can be done entirely inside mdadm but it
is worth thinking about where to draw the line.  I'm not convinced that mdadm
should "know" about partition tables and MBRs.  Possibly the task of copying
those is best placed in a script.

Thanks,
NeilBrown
--

From: Michael Evans
Date: Tuesday, March 30, 2010 - 7:46 pm

My larger context was looking at non-recognized devices; assembling
pre-marked containers is fine, provided that basic safety checks
validate that outcome: is the uuid correct, does the home-host match
the current array, is the update count valid (or else add as a prior
stale member that should be marked as hot spare).

For anything else mdadm might be better off taking the approach that
an administratively selected set of actions should be performed.  If
the task is JUST doing stuff that mdadm would already be invoked to do
anyway then it is tolerable for those reactions to be configurable
within the .conf file, though I fear the syntax may be uglier than
assuming there's also at least a basic /bin/sh that could interpret a
set of more standard commands.  It would also provide a good example
to extend in to custom scripts.

Another advantage of using a shell script instead is that
administrators can hack in whatever tricks they want.  If they have a
partition tool or method they like they can script it and get the
results they want.  More complicated tricks could also be performed,
such as first preparing the disc for cryptographic storage by filling
it with random data, or performing SMART checks, or any other
operation of their choice.

Alternatively, if an administrator or device maker needs something
different they could produce a binary to run instead.
--

From: Luca Berra
Date: Thursday, March 25, 2010 - 1:01 am

well, i would not be upset by j. random jerk complaining in a blog
comments, as soon as you make it one click you will find another one
I really feel there is much room for causing disasters with a similar
approach.

The main difference from an hw raid controller is that the hw raid
controller _requires_ full control on the individual disks.
MD does not. Trying to do things automatically without full control is
very dangerous.
this may be different when using ddf or imsm since they are usually
working on whole drives attached to a raid-like controller (even if one
of the strengths of md is being able to activate those arrays even
without the original controller).

If you want to be user-friendly just add a simple script
/usr/bin/md-replace-drive
It will take as input either an md array or a working drive as source,
and the new drive as target.
In the first case it has to examine the components of the source md and
determine if they are partitions or whole devices (sysfs); if they are
partitions, find the whole drives and ensure they are partitioned in the
same way.
It will examine the source drive for partitions and all md arrays it is
part of, and ensure that those arrays have a failed device.
Check the size of the components and match them to the new drive (no
sense replacing a 1T drive with a 750GB one)

ask the user for confirmation in big understandable letters

replicate any mbr and partition table, and include the device (or all
newly created partitions) in the relevant md device.

an improvement would be not needing user to specify a source in the most
simple of cases, by checking for all arrays with a failed device.


makes sense

-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \
--

From: Neil Brown
Date: Tuesday, March 30, 2010 - 6:26 pm

On Thu, 25 Mar 2010 09:01:08 +0100

We can learn something from any opinion that is different from our own.

It is clear to me that using mdadm requires a certain level of understanding
to be used effectively and safely.
I don't think that can be entirely addressed in mdadm: there is a place for a
higher level framework that encodes policies and gives advice.  But there is
still room to improve mdadm to make it more powerful, more informative, and

You mean completely raw data, no partitions, no filesystem structure etc?
Yes, that is possible.  People who are likely to handle devices like that

I imagine an Email to the admin "Hey boss, I just noticed you plugged in a
drive that looks like it used to be part of some array.  We need a spare on
this other array and the new device is big enough.  Shall I huh huh huh?  Go
on let me..."


Yes, there is a place for something like that certainly.

NeilBrown
--

From: Luca Berra
Date: Tuesday, March 30, 2010 - 11:10 pm

yes, i realize my comment was rude, sorry for that, but that comment on
I can think of two scenarios.
1) an encrypted device (without LUKS header)
2) a device where the metadata is corrupted, and we plugged it in in a
hurry to attempt data recovery (oh, we were in a hurry and forgot about
the mdadm policy)

What i am scared of are distributions thinking it would be cool and
ah. ok, i thought you meant something real-time.

--

From: John Robinson
Date: Thursday, March 25, 2010 - 7:10 am

Or indeed it may have no metadata at all - it may be a fresh disc. I 
didn't see that you stated this specifically at any point, though it was 
there by implication, so I will: you're going to have to pick up hotplug 
events for bare drives, which presumably means you'll also get events 

Indeed, I would like to be able to declare any 
/dev/disk/by-path/pci-0000:00:1f.2-scsi-[0-4] to be suitable candidates 
for hot-plugging, because those are the 5 motherboard SATA ports I've 
hooked into my hot-swap chassis.

As an aside, I just tried yanking and replugging one of my drives, on 
CentOS 5.4, and it successfully went away and came back again, but 

Definitely want this for bare drives. In my case I'd like the MBR and 
first 62 sectors copied from one of the live drives, or a copy saved for 
the purpose, so the disc can be bootable.

My concern is that this is surely outwith the regular scope of 
mdadm/mdmon, as is handling bare drives/CD-ROMs/USB sticks. Do we need 

Definitely, just so I can pull a drive and plug it in again and point 
and say ooh, everything's up and running again, to demonstrate how cool 
Linux md is. I imagine some distros' udev/hotplug rules do this already, 

I think in my situation I'd quite like the first partition, type fd 
metadata 0.90 RAID-1 mounted as /boot, added as an active mirror not a 
spare, again so that if this new drive appears as sda at the next power 
cycle, the system will boot.

The second partition, a RAID-5 with LVM on it, could be added as a 
spare, because it would then automatically be rebuilt onto if the array 
[...]

I'm afraid I have nothing to add here, it all sounds good.

Cheers,

John.

--

From: Neil Brown
Date: Tuesday, March 30, 2010 - 6:30 pm

On Thu, 25 Mar 2010 14:10:05 +0000

Correct.  We would expect that "domain path=" matching to say that those
should only be used if they already have recognisable metadata on them.
To make use of a device with no metadata already present, it would need to

No.  That is because we have not yet implemented anything that has been
described in this document...

Thanks,
NeilBrown
--

From: Labun, Marcin
Date: Thursday, March 25, 2010 - 8:04 am

I think that metadata keyword can be used to identify scope of devices to which the DOMAIN line applies.
For instance we could have:
DOMAIN path=glob-pattern metadata=imsm hotplug=mode1  spare-group=name1
DOMAIN path=glob-pattern metadata=0.90 hotplug=mode2  spare-group=name2

Keywords: 
Path, metadata and spare-group shall define to which arrays the hotplug definition (or other definition of action) applies. The user could define any subset of them.
For instance, to define that all imsm arrays shall use hotplug mode2, the user shall define:
DOMAIN metadata=imsm hotplug=mode2

In the above example the user need not define spare-group in his/her configuration file for each array.
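
To make the scoping idea concrete, here is a rough Python sketch of how such lines could be read and applied. It only models the syntax proposed in this message; the helper names parse_domain and domain_applies are hypothetical, and treating a missing keyword as "matches everything" is an assumption drawn from the "any subset" wording above.

```python
from fnmatch import fnmatchcase


def parse_domain(line):
    """Split a proposed 'DOMAIN key=value ...' line into a dict of options."""
    words = line.split()
    if not words or words[0] != "DOMAIN":
        raise ValueError("not a DOMAIN line: %r" % line)
    opts = {}
    for word in words[1:]:
        key, sep, value = word.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % word)
        opts[key] = value
    return opts


def domain_applies(opts, path, metadata):
    """A keyword that is absent places no restriction on the device."""
    if "path" in opts and not fnmatchcase(path, opts["path"]):
        return False
    if "metadata" in opts and opts["metadata"] != metadata:
        return False
    return True
```

With this reading, "DOMAIN metadata=imsm hotplug=mode2" applies to an imsm device on any path and to no 0.90 device at all.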


Please consider:
      spare_add - add any spare device that matches the metadata container/volume in case of native metadata regardless of array state, so later such a spare can be used in rebuild process.

Can we assume for all external metadata that spares added to any container can potentially be moved between all containers with the same metadata?
I expect that this could be the default behavior if no spare groups are defined for some metadata.
Moreover, each metadata handler could impose built-in rules on spare assignment to a specific container.

Thanks,
Marcin Labun
--

From: Dan Williams
Date: Friday, March 26, 2010 - 5:37 pm

For the 'platform' case we could automate some decisions, but I think
I would rather extend the --detail-platform option to dump the
recommended/compatible DOMAIN entries for the platform, perhaps via
the --brief modifier.  This mirrors what can be done with --examine
--brief to generate an initial configuration file that can be modified

This is the same as 'incr' above.  If the device has metadata and

Yes, that can be the default action, and the spare-group keyword can
be specified to override.

--
Dan
--

From: Doug Ledford
Date: Monday, March 29, 2010 - 11:10 am

[...] definition (or other definition of action) applies. User could define a [...]
[...] rules of accepting the spare in the container. Rules can be derived from [...]
[...] platform dependencies or metadata. Notice that user can disable platform [...]

So, a few things that I think can be said about the DOMAIN line type
(I'm assuming for now that this is what we'll use, mainly because I'm
implementing it right now):

There is an assumed, default DOMAIN line that is the equivalent of:

DOMAIN path=* metadata=* action=incremental spare-group=<none>

This is what you get simply by normal udev incremental assembly rules
(notice I used action instead of hotplug, action makes more sense to me
as all the words we use to define hotplug mode are in fact actions to
take on hotplug).  We will treat this as a given.  Anything else
requires an explicit DOMAIN line in mdadm.conf.

The second thing I'm having a hard time with is the spare-group.  To be
honest, if I follow what I think I should, and make it a hard
requirement that any action other than none and incremental must use a
non-global path glob (aka, path= MUST be present and can not be *), then
spare-group loses all meaning.  I say this because if a disk matches
the path glob it is in a specific spare group already (the one that this
DOMAIN represents) and ditto if arrays are on disks in this DOMAIN, then
they are automatically part of the same spare-group.  In other words, I
think spare-group becomes entirely redundant once we have a DOMAIN keyword.
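
The hard requirement above could be checked along these lines. This is
only a sketch in Python: the function name domain_problems is made up,
the options are assumed to arrive as a plain dict, and only a literal "*"
is treated as the global glob.

```python
# Actions that only use what is already on the device (names from the thread).
PASSIVE_ACTIONS = {"none", "incremental"}


def domain_problems(opts):
    """Return a list of problems with one DOMAIN entry (empty if it is fine)."""
    problems = []
    action = opts.get("action", "incremental")  # assumed default action
    path = opts.get("path")
    if action not in PASSIVE_ACTIONS:
        if path is None:
            problems.append("action=%s requires an explicit path=" % action)
        elif path == "*":
            problems.append("action=%s may not use the global path=*" % action)
    return problems
```

The implied default line (path=* action=incremental) passes, while any
more invasive action without a restricted path is reported.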

I'm also having a hard time justifying the existence of the metadata
keyword.  The reason is that the metadata is already determined for us
by the path glob.  Specifically, if we assume that an array's members
can not cross domain boundaries (a reasonable requirement in my opinion,
we can't make an array where we can guarantee to the user that hot
plugging a replacement disk will do what they expect if some of the
array's members are inside the domain and some are outside the domain),
then we ...
From: John Robinson
Date: Monday, March 29, 2010 - 11:36 am

On 29/03/2010 19:10, Doug Ledford wrote:

I think I agree; in my limited scenario I might want to use 0.90 
metadata on my sdX1 to make my /boot, but 1.x on my other partitions, 
and it'll be whole discs that match my path spec so one metadata type 
wouldn't apply uniformly.

[...]

Yes, but do create the partition(s), boot sector, etc and set up the 
spare(s). The user installed the system with anaconda or whatever, and 
may not know the incantations to partition his new disc or install a 
boot loader, so if he's managed to configure a mdadm.conf which says the 
spare slots in his RAID chassis should belong to mdadm, prepare them for 
him. Then all he needs to do is issue whatever grow command.

I think the exception to this is /boot on RAID-1, where I would prefer 
to be able to have the system automatically add the new partition as an 
active mirror instead of a hot spare, in case this new drive is what we 
have to boot off next time.

I suppose there might be circumstances where you want to do something 
else, like Netgear do on their ReadyNAS, but while it might be nice to 
be able to configure that sort of automatic growing and reshaping, it 
doesn't belong in the default config.

Cheers,

John.

--

From: Doug Ledford
Date: Monday, March 29, 2010 - 11:57 am

Really, we should never have to do this in the situation I listed: aka
no degraded arrays exist.  This implies that if you had a raid1 /boot
array, that it's still intact.  So partitioning and setting up boot
loaders doesn't make sense as the new disk isn't going in to replace
anything.  You *might* want to add it to the raid1 /boot, but we don't

Again, I'm drawing a distinction here between a degraded array and a
non-degraded array.  If the current array isn't degraded, then we won't
be booting off the new drive next time unless the user goes into the
BIOS and sets the new drive as the active boot device.  And if the user
is going to do that, then they ought to be able to setup their new boot


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband

From: John Robinson
Date: Monday, March 29, 2010 - 3:36 pm

Actually I've just recently had the scenario where it would have made 
perfect sense. I hooked up the RAID chassis SATA[0-4] ports to the RAID 
chassis and put 3 drives in the first 3 slots. Actually it turned out 
I'd wired it up R-L not L-R so if I'd added a new drive in one of the 
two right-hand slots it would have turned up as sda on the next boot. 
OK, to some extent that's me being stupid, but at the same time I 
correctly hooked up the first 5 SATA ports to the hot-swap chassis and 
would want them considered the same group etc.

Cheers,

John.

--

From: Dan Williams
Date: Monday, March 29, 2010 - 3:41 pm

On Mon, Mar 29, 2010 at 3:36 PM, John Robinson

This kind of situation is where an option-rom comes in handy i.e. the
platform firmware knows to boot from a defined raid volume.  However,
it comes with quirky constraints like not supporting > 2-drive raid1.
But I see your point that it would be nice to at least have the option
to auto-grow raid1 boot arrays.

--
Dan
--

From: John Robinson
Date: Monday, March 29, 2010 - 3:46 pm

As it happens this was on an Intel-chipset board with ICH10-R and option 
ROM, and I would have used IMSM if RHEL/CentOS had supported it at the 
time, so I'm following IMSM support developments closely.

Cheers,

John.
--

From: Doug Ledford
Date: Monday, March 29, 2010 - 4:35 pm

Yes, but how do you want to fix that situation?  Would you want to make
the new drives be new boot drives, or would you prefer to shut down,
move all the previous drives over two slots, and then put the new drive
into the fourth slot that you previously thought was the second slot?  I
understand your situation, but were I in that position I'd just shuffle
my drives to correct my original mistake and go on with things, I
wouldn't make the new drives be boot drives.  So I'm still not sure I
see the point to making a new drive that isn't replacing an existing

I understand wanting them in the same group, but unless something is
degraded, just being in the same group doesn't tell us if you want to
keep it as a spare or use it to grow things.



From: John Robinson
Date: Tuesday, March 30, 2010 - 5:10 am

I wouldn't want to take the server down to shuffle the drives or cables. 
But my point really is that if I have decided that I would want all the 
drives in my chassis to have identical partition tables and carry an 
active mirror of an array - in my example /boot - I would like to be 
able to configure the hotplug arrangement to make it so, rather than 
leaving me to have to manually regenerate the partition table, install 
grub, add the spare and perhaps even grow the array.

Of course this is a per-installation policy decision of what to do when 
an extra drive is added to a non-degraded array, I'm certainly not 
suggesting this should be the default action, though I think it would be 

I quite agree. All I'm getting at is that I'd like to be able to say 
something in my mdadm.conf or wherever to say what I'd like done. This 
might mean that I end up with something like the following:
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0       action=include
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part1 action=grow
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part2 action=replace
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part3 action=include

The first line gets the partition table and grub boot code regenerated 
even when nothing's degraded. This in turn may trigger the other lines. 
In the second line my action=grow means fix up my /boot if it's degraded 
and both --add and --grow so it gets mirrored onto a fresh disc. The 
third line says fix up my swap array if it's degraded, but leave it alone 
otherwise. The fourth line says fix up my data array if it's degraded, 
and add as a spare if it's a fresh disc. This last lets me decide later 
what (if any) kind of --grow I want to do - make it larger or reshape 
from RAID-5 to RAID-6.
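
As a rough illustration of how those four lines might be resolved for a 
hot-plugged device, here is a Python sketch. It is not anything mdadm 
does today: the action_for helper is hypothetical, the names are matched 
as whole /dev/disk/by-path entries, and the fallback to incremental is 
assumed from the default discussed in this thread.

```python
from fnmatch import fnmatchcase

# The four example lines, as (path glob, action) pairs.
DOMAINS = [
    ("pci-0000:00:1f.2-scsi-[0-4]:0:0:0", "include"),
    ("pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part1", "grow"),
    ("pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part2", "replace"),
    ("pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part3", "include"),
]
DEFAULT_ACTION = "incremental"


def action_for(by_path_name, domains=DOMAINS):
    """Return the action whose glob matches the whole by-path name."""
    matches = [action for glob, action in domains
               if fnmatchcase(by_path_name, glob)]
    if len(matches) > 1:
        # Overlapping globs would be a configuration error worth reporting.
        raise ValueError("overlapping DOMAIN lines for %s" % by_path_name)
    return matches[0] if matches else DEFAULT_ACTION
```

Because each glob must match the full name, the whole-disk line and the 
per-partition lines never apply to the same device node, and a disk on a 
sixth port simply falls through to the default.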

But as you say, the default should be
DOMAIN path=* action=incremental

and the installer (automated or human) probably wants to edit that to 
include at least
DOMAIN path=something action=replace
to take advantage of this auto-rebuild on ...
From: Doug Ledford
Date: Tuesday, March 30, 2010 - 8:53 am

I can (sorta) understand this.  I personally never create any more /boot
partitions than the number of drives I can lose from my / array + 1.
So, if I have raid5 / array, I do 2 /boot partitions.  Anything more is
a waste since if you lose both of those boot drives, you also have too
few drives for the / array.  But, if you want any given drive bootable,

This I'm not so sure about.  I can try to make this a reality, but the
issue here is that when you are allowed to specify things on a partition
by partition basis, it becomes very easy to create conflicting commands.
For example, let's say you have part1 action=grow, but for the bare disk
you have action=incremental.  And let's assume you plug in a bare disk.
In order to honor the part1 action=grow, we would have to partition the
disk, which is in conflict with the bare disk action of incremental
since that implies we would only use preexisting md raid partitions.  I
could *easily* see the feature of allowing per partition actions causing
the overall code complexity to double or more.  You know, I'd rather
provide a simple grub script that automatically set up all raid1 members
as boot devices any time it was run than try to handle this
automatically ;-)  Maybe I should add that to the mdadm package on

As pointed out above, some of these are conflicting commands in that
they tell us to modify the disk in one place, and leave it alone in
another.  The basic assumption you are making here is that we will
always be able to duplicate the partition table because all drives in a
domain will have the same partition table.  And that's not always the case

I see where you are going, I'm a little worried about getting there ;-)



From: John Robinson
Date: Friday, April 2, 2010 - 4:01 am

A very fair point. But it's not really all that wasteful - I've had to 
use the first 100MB from at least two drives, meaning that space would 
effectively go to waste on the others. And 100MB out of 1TB isn't an 
awfully big waste anyway.


Yes, but in that case I've given specific instructions about what to do 
with bare drives. It'd be a bad configuration, and you might warn about 
it, but you couldn't honour the grow. Bear in mind, the two domain lines 
here don't overlap. If they did you've more of a quandary, or at least 
you should shout louder about it. I don't think you should be writing 
partition tables unless I've told you to - which I would have done in 
the following more general case:

I'm not sure why, since you probably ought to be doing some fairly 
rigorous checking of the configuration anyway to make sure domains and 

That would be fine too, as long as there's some way of calling it from 

If the paths overlapped I'd agree, but they didn't, and I made sure the 
whole-drive action was sufficient to make sure the partition actions 
could be carried out. I agree though that there's plenty of scope for 
people writing duff configurations like the one you suggested, but I 
think there'll be scope for that whatever you do - even if it's 

It might be a reasonable restriction for a first implementation, though. 
If not, you're going to have to store copies of the partition tables, 
boot areas, etc somewhere else so that when the drives they were on are 
hot-swapped, you can write the correct stuff back.


I don't blame you. Isn't it just typical of a user who doesn't 
understand the work involved to demand the sky and the stars? Anyway 
thank you very much for taking the time to consider my thoughts.

Cheers,

John.

--

From: Dan Williams
Date: Monday, March 29, 2010 - 2:36 pm

I agree once you have a DOMAIN you implicitly have a spare-group.  So
DOMAIN would supersede the existing spare-group identifier in the
ARRAY line and cause mdadm --monitor to auto-migrate spares between
0.90 and 1.x metadata arrays in the same DOMAIN.  For the imsm case
the expectation is that spares migrate between containers regardless
of the DOMAIN line as that is what the implementation expects.
However this is where we get into questions of DOMAIN conflicting with
'platform' expectations, under what conditions, if any, should DOMAIN
be allowed to conflict/override the platform constraint?  Currently
there is an environment variable IMSM_NO_PLATFORM, do we also need a

...but this assumes we already have an array assembled in the domain
before the first hot plug event.  The 'metadata' keyword would be
helpful at assembly time for ensuring only arrays of a certain type
are brought up in the domain.

We also need some consideration for reporting and enforcing 'platform'
boundaries if the user requests it.  By default mdadm will block
attempts to create/assemble configurations that the option-rom does
not support (i.e. disk attached to third-party controller).  For the
hotplug case if the  DOMAIN is configured incorrectly I can see cases
where a user would like to specify "enforce platform constraints even
if my domain says otherwise", and the inverse "yes, I know the
option-rom does not support this configuration, but I know what I am
doing".

So I see a couple options:
1/ path=platform: auto-determine/enforce the domain(s) for all
platform raid controllers in the system
2/ Allow the user to manually enter a DOMAIN that is compatible with, but
different from, the default platform constraints, like your example above
of 3 ahci ports for imsm-RAID with the remainder reserved for 1.x arrays
3/ Allow the user to turn off platform constraints and define 'exotic'
domains (mixed controller configurations).

--
Dan
--

From: Doug Ledford
Date: Monday, March 29, 2010 - 4:30 pm

Give me some clearer explanation here because I think you and I are
using terms differently and so I want to make sure I have things right.
 My understanding of imsm raid containers is that all the drives that
belong to a single option rom, as long as they aren't listed as jbod in
the option rom setup, belong to the same container.  That container is
then split up into various chunks and that's where you get logical
volumes.  I know there are odd rules for logical volumes inside a
container, but I think those are mostly irrelevant to this discussion.
So, when I think of a domain for imsm, I think of all the sata ports or
sas ports under a single option rom.  From that perspective, spares can
*not* move between domains as a spare on a sas port can't be added to a
sata option rom container array.  I was under the impression that if you
had, say, a 6 port sata controller option rom, you couldn't have the
first three ports be one container and the next three ports be another
container.  Is that impression wrong?  If so, that would explain our
confusion over domains.

However, that just means (to me anyway) that I would treat all of the
sata ports as one domain with multiple container arrays in that domain
just like we can have multiple native md arrays in a domain.  If a disk
dies and we hot plug a new one, then mdadm would look for the degraded
container present in the domain and add the spare to it.  It would then
be up to mdmon to determine what logical volumes are currently degraded
and slice up the new drive to work as spares for those degraded logical
volumes.  Does this sound correct to you, and can mdmon do that already

I'm not sure I would ever allow breaking valid platform limitations.  I
think if you want to break platform limitations, then you need to use
native md raid arrays and not imsm/ddf.  It seems to me that if you
allow the creation of an imsm/ddf array that the BIOS can't work with
then you've potentially opened an entire can of worms we don't want to
open ...
From: Dan Williams
Date: Monday, March 29, 2010 - 5:46 pm

I think the disconnect in the imsm case is that the container to
DOMAIN relationship is N:1, not 1:1.  The mdadm notion of an
imsm-container correlates directly with a 'family' in the imsm
metadata.  The rules of a family are:

1/ All family members must be a member of all defined volumes.  For
example with a 4-drive container you could not simultaneously have a
4-drive (sd[abcd]) raid10 and a 2-drive (sd[ab]) raid1 volume because
any volume would need to incorporate all 4 disks.  Also, per the rules
if you create two raid1 volumes sd[ab] and sd[cd] those would show up
as two containers.

2/ A spare drive does not belong to any particular family
('family_number' is undefined for a spare).  The Windows driver will
automatically use a spare to fix any degraded family in the system.
In the mdadm/mdmon case since we break families into containers we
need a mechanism to migrate spare devices between containers because
they are equally valid hot spare candidates for any imsm container in
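
Rule 1 can be illustrated with a small Python sketch. This is not how
mdadm or mdmon implement it: the volume names and the two helper functions
are hypothetical, and a volume is modelled simply as the list of disks it
uses.

```python
from collections import defaultdict


def family_is_valid(family_members, volumes):
    """Rule 1: every defined volume must span every member of the family."""
    members = frozenset(family_members)
    return all(frozenset(disks) == members for disks in volumes.values())


def group_into_containers(volumes):
    """Group volumes by exact disk set; each distinct set is one container."""
    containers = defaultdict(list)
    for name, disks in volumes.items():
        containers[frozenset(disks)].append(name)
    return dict(containers)
```

A raid10 on sd[abcd] together with a raid1 on sd[ab] fails the first
check, while two raid1 volumes on sd[ab] and sd[cd] come out of the second
helper as two separate containers, matching the examples in rule 1.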

Yes, we can have exactly this situation.

This begs the question, why not change the definition of an imsm
container to incorporate anything with imsm metadata?  This definitely
would make spare management easier.  This was an early design decision
and had the nice side effect that it lined up naturally with the
failure and rebuild boundaries of a family.  I could give it more
thought, but right now I believe there is a lot riding on this 1:1

This sounds correct, and no, mdmon cannot do this today.  The current
discussion we (Marcin and I) had with Neil off-list was about extending
mdadm --monitor to handle spare migration for containers since it
already handles spare migration for native md arrays.  It will need
some mdmon coordination since mdmon is the only agent that can

Agreed.

--
Dan
--

From: Doug Ledford
Date: Tuesday, March 30, 2010 - 8:23 am

This explains the weird behavior I got when trying to create arrays on
my IMSM box via the BIOS.  Thanks for the clear explanation of family

I'm fine with the container being family based and not domain based.  I
just didn't realize that distinction existed.  It's all cleared up now ;-)

So we'll need to coordinate on this aspect of things then.  I'll keep
you updated as I get started implementing this if you want to think
about how you would like to handle this interaction between mdadm/mdmon.

As far as I can tell, we've reached a fairly decent consensus on things.
 But, just to be clear, I'll reiterate that consensus here:

Add a new linetype: DOMAIN with options path= (must be specified at
least once for any domain action other than none and incremental and
must be something other than a global match for any action other than
none and incremental) and metadata= (specifies the metadata type
possible for this domain as one of imsm/ddf/md, and where for imsm or
ddf types, we will verify that the path portions of the domain do not
violate possible platform limitations) and action= (where action is
none, incremental, readd, safe_use, force_use where action is specific
to a hotplug when a degraded array in the domain exists and can possibly
have slightly different meanings depending on whether the path specifies
a whole disk device or specific partitions on a range of devices, and
where there is the possibility of adding more options or a new option
name for the case of adding a hotplug drive to a domain where no arrays
are degraded, in which case issues such as boot sectors, partition
tables, hot spare versus grow, etc. must be addressed).
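
To make the proposed rules concrete, here is a minimal sketch of how such a line might be parsed and validated.  It is only an illustration of the proposal as worded above, not mdadm's actual config parser.  The function name `parse_domain`, the default action of "none" when action= is omitted, and the treatment of "*" and "any" as the "global match" are all assumptions.  The platform-limitation check for imsm/ddf is left out.

```python
import shlex

ACTIONS = {"none", "incremental", "readd", "safe_use", "force_use"}
METADATA = {"imsm", "ddf", "md"}
GLOBAL_PATHS = {"*", "any"}  # assumption: what counts as a "global match"


def parse_domain(line):
    """Parse one proposed DOMAIN line; raise ValueError if it breaks a rule."""
    words = shlex.split(line)
    if not words or words[0] != "DOMAIN":
        raise ValueError("not a DOMAIN line")
    # assumption: action defaults to "none" when it is not given
    domain = {"path": [], "metadata": None, "action": "none"}
    for word in words[1:]:
        key, sep, value = word.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {word!r}")
        if key == "path":
            domain["path"].append(value)  # path= may be given more than once
        elif key == "metadata":
            if value not in METADATA:
                raise ValueError(f"unknown metadata type {value!r}")
            domain["metadata"] = value
        elif key == "action":
            if value not in ACTIONS:
                raise ValueError(f"unknown action {value!r}")
            domain["action"] = value
        else:
            raise ValueError(f"unknown option {key!r}")
    # Actions other than none/incremental need a specific, non-global path.
    if domain["action"] not in ("none", "incremental"):
        if not domain["path"]:
            raise ValueError("path= is required for this action")
        if any(p in GLOBAL_PATHS for p in domain["path"]):
            raise ValueError("path= must not be a global match for this action")
    return domain


example = parse_domain('DOMAIN path="any" metadata=imsm action=none')
```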

Modify udev rules files to cover the following scenarios (it's
unfortunate that we have to split things up like this, but in order to
deal with either bare drives or drives that have things like lvm data
and we are using force_use, we must trigger on *all* drive hotplug
events, we must trigger early, and we must override other ...

From: Labun, Marcin
Date: Tuesday, March 30, 2010 - 10:47 am

I understand that there are the following defaults:
- Platform/metadata limitations create default domains.
- Metadata handlers deliver default actions.
The equivalent configuration line for imsm is:
DOMAIN path="any" metadata=imsm action=none
A user could additionally split default domains using spare groups and the path keyword.
For instance, for imsm the default domain area is the platform controller.
If any metadata is served by multiple controllers, each of them creates its own domain.
If we allow "any" for the path keyword, a user could simply override metadata defaults for all his controllers by:
I think the implementation can be something like this:
We shall set a cookie to store the path of a disk which is removed from the md device. Later, if a new device is plugged into that port, it can be used for rebuild.
We shall set a timer for when cookies expire. I propose to clean them on start-up (mdadm --monitor can be a candidate; the default action shall be cookie clean-up).
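
The cookie idea could be sketched roughly as follows.  This is an in-memory illustration only; the class name `RemovedPathCookies`, its methods, and the choice of a monotonic clock are assumptions, and where the cookies would really be stored (a file, udev state, etc.) is left open.

```python
import time


class RemovedPathCookies:
    """Remember the paths of recently removed array members, with expiry."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds a cookie stays valid
        self.clock = clock          # injectable so expiry can be tested
        self._cookies = {}          # path -> (array name, removal time)

    def record_removal(self, path, array):
        """Called when a member at `path` is removed from `array`."""
        self._cookies[path] = (array, self.clock())

    def expire(self):
        """Drop every cookie older than the timeout."""
        now = self.clock()
        stale = [p for p, (_, t) in self._cookies.items()
                 if now - t > self.timeout]
        for path in stale:
            del self._cookies[path]

    def claim(self, path):
        """Return the array a device plugged into `path` may rebuild, or None.

        A cookie is consumed when it is claimed.
        """
        self.expire()
        entry = self._cookies.pop(path, None)
        return entry[0] if entry else None

    def clear(self):
        """Start-up clean-up: forget all cookies."""
        self._cookies.clear()
```

Cleaning on start-up then amounts to calling `clear()` once before monitoring begins.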

Enable spare disk sharing between containers if they belong to the same domain and do not have conflicting spare-group assignments. This will allow spare sharing by default.
Additionally, we can consult metadata handlers before moving spares between containers. We can do that by adding another metadata handler function which shall test metadata and controller dependencies (I can imagine that a user can define metadata-stored domains of spare sharing; controller (OROM) dependent constraints shall be handled in this function, too).
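
A minimal sketch of that check, assuming containers are described by plain dicts with a "domain" key and an optional "spare_group" key, and assuming that an unset spare group never conflicts.  The name `may_share_spare` and the `handler_check` callback (standing in for the proposed extra metadata handler function) are hypothetical.

```python
def may_share_spare(src, dst, handler_check=None):
    """Decide whether a spare may move from container `src` to `dst`."""
    if src["domain"] != dst["domain"]:
        return False
    # Two different explicit spare groups conflict; an unset group does not.
    groups = {src.get("spare_group"), dst.get("spare_group")} - {None}
    if len(groups) > 1:
        return False
    # Optional metadata/controller (OROM) veto from the metadata handler.
    if handler_check is not None and not handler_check(src, dst):
        return False
    return True
```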
Thanks,
Marcin Labun
--

From: Dan Williams
Date: Tuesday, March 30, 2010 - 4:47 pm

A single DOMAIN can span several controllers, but only if that does
not violate the 'platform' constraints for that metadata type (which

Yes, spare sharing by default within the domain and as Doug said
ignore any conflicts with the spare-group identifier i.e. DOMAIN

This really is just a variation of load_super() performed on a
container with an extra disk added to report whether the device is
spare, failed, or otherwise out of sync.

In the imsm case this is load_super_imsm_all() with another disk
(outside of the current container list) to compare against.

--
Dan
--

From: Dan Williams
Date: Tuesday, March 30, 2010 - 4:36 pm

Why not 0.90 and 1.x instead of 'md'?  These match the 'name'

I have been thinking that the path= option specifies controller paths,
not disk devices.  Something like "pci-0000:00:1f.2-scsi-[0-3]*" to
pick the first 4 ahci ports.  This also purposefully excludes virtual
devices dm/md.  I think we want to limit this functionality to
physical controller ports... or were you looking to incorporate
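
As a rough sketch of what matching such a pattern could look like, assuming shell-style glob semantics against the names found in /dev/disk/by-path (the matching rules are not settled in this thread).  The helper name `ports_in_domain` and the sample by-path names are made up for illustration.

```python
from fnmatch import fnmatchcase


def ports_in_domain(pattern, by_path_names):
    """Return the by-path names that match a DOMAIN path= glob."""
    return [name for name in by_path_names if fnmatchcase(name, pattern)]


# Hypothetical by-path names: three ahci ports and one usb device.
names = [
    "pci-0000:00:1f.2-scsi-0:0:0:0",
    "pci-0000:00:1f.2-scsi-3:0:0:0",
    "pci-0000:00:1f.2-scsi-4:0:0:0",
    "pci-0000:00:1d.7-usb-0:1:1.0-scsi-0:0:0:0",
]
selected = ports_in_domain("pci-0000:00:1f.2-scsi-[0-3]*", names)
```

With plain glob semantics the trailing `*` also swallows further digits, so a host number such as 10 would match `[0-3]*` as well; a real implementation would likely need a stricter pattern.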

Can't we limit the scope to the hotplug events we care about by
filtering the udev scripts based on the current contents of the
configuration file?  We already need a step in the process that
verifies if the configuration heeds the platform constraints.  So,
something like mdadm --activate-domains that validates the
configuration, generates the necessary udev scripts and enables

Yes, but this also reminds me about the multiple superblock case.  It
should usually only happen to people that experiment with different
metadata types, but we should catch and probably ignore drives that

Let's also limit this to ports that were recently (as specified by a
timeout= option to the DOMAIN) unplugged.  This limits the potential


Modulo the ability to have a global enable / disable for domains via


I think we have a consensus.  The wrinkle that comes to mind is the
case we talked about before where some ahci ports have been reserved
for jbod support in the DOMAIN configuration.  If the user plugs in an
imsm-metadata disk into a "jbod port" and reboots the option-rom will
assemble the array across the DOMAIN boundary.  You would need to put
explicit "passthrough" metadata on the disk to get the option-rom to
ignore it, but then you couldn't put another metadata type on that
disk.  So maybe we can't support the subset case and need to force the
platform's full expectation of the domain boundaries or honor the
DOMAIN line and let the user figure out/remember why this one raid
member slot does not respond to hotplug events.

Thanks for the detailed write up.

--
Dan
--

From: Neil Brown
Date: Tuesday, March 30, 2010 - 9:53 pm

On Tue, 30 Mar 2010 11:23:08 -0400

Thoughts ... yes ... all over the place.  I won't try to group them, just a
random list:

"bare devices"
  To make sure we are on the same page, we should have a definition for this.
  How about "every byte in the first megabyte and last megabyte of the device
  is the same (e.g. 0x00 or 0x5a or 0xff)"?
  We would want a program (mdadm option?) to be able to make a device into a
  bare device.
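
A minimal sketch of that test, under the definition proposed above.  The function name `is_bare` and the `window` parameter are assumptions; the function takes any seekable binary file object, so it can be tried on an in-memory buffer before pointing it at a real block device.  Treating a zero-length device as not bare is also an assumption.

```python
import io
import os

MEGABYTE = 1024 * 1024


def is_bare(dev, window=MEGABYTE):
    """True if the first and last `window` bytes all hold one byte value."""
    size = dev.seek(0, os.SEEK_END)
    if size == 0:
        return False  # assumption: an empty device is not considered bare
    dev.seek(0)
    head = dev.read(min(window, size))
    dev.seek(max(0, size - window))
    tail = dev.read(window)
    fill = head[:1]  # the single byte value everything must equal
    return head == fill * len(head) and tail == fill * len(tail)


# Small in-memory stand-ins for devices, using a 1 KiB window.
blank = io.BytesIO(b"\x5a" * 4096)
dirty = io.BytesIO(b"\x00" * 4000 + b"\x01" + b"\x00" * 95)
blank_result = is_bare(blank, window=1024)
dirty_result = is_bare(dirty, window=1024)
```

Note that only the two end windows are examined, so data in the middle of the device does not affect the result.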

Dan's "--activate-domains" option which creates a targeted udev rules file for
  "force_use" - at first I thought "yuck, no", but then it grew on me.  I think
  I quite like the idea now.  We can put a rules file in /dev/.udev/rules.d/
  which targets just the path that we want to over-ride.
  I can see two approaches:
    1/ create the file during boot with something like "mdadm --activate-domains"
    2/ create a file whenever a device in an md-array is hot-removed which
       targets just that path and claims it immediately for md.
       Removing these after a timeout would be needed.

  The second feels elegant but could be racy.  The first is probably the
  better approach.

Your idea of only performing an action if there is a degraded array doesn't
  seem quite right.
  If I have a set of ports dedicated to raid and I plug in a bare device,
  I want it to become a hot-spare whether there are degraded arrays that
  will use it immediately or not.
  You say that making it a hot spare doesn't actually "do" anything, but it
  does.  It makes it available for recovery.

  If a device fails, then I plug in a spare, I want it to recover - so do you.
  If I plug in a spare and then a device fails, I want it to recover, but it
  seems you don't.  I cannot reconcile that difference.

  Yes, the admin might want to grow the array, but that is fine:  the spare
  is ready to be used for growth, or to replace a failure, or whatever is
  needed.

Native metadata: on partitions or on whole device.
  We need to make sure we understand the distinctions between ...

From: linbloke
Date: Thursday, March 25, 2010 - 11:41 pm

Hi Neil,

I look forward to being able to update my mdadm.conf with the paths to 
devices that are important to my RAID so that if a fault were to develop 
on an array, then I'd be really happy to fail and remove the faulty 
device, insert a blank device  of sufficient size into the defined path 
and have the RAID auto restore. If the disk is not blank or too small, 
provide a useful error message (insert disk of larger capacity, delete 
partitions, zero superblocks) and exit.  I think you do an amazing job 
and it worries me that you and the other contributors to mdadm could 
spend your valuable time trying to solve problems about how to cater for 
every metadata, partition type etc when a simple blank device is easy to 
achieve and could then "Auto Rebuild on hot-plug".

Perhaps as we nominate a spare disk, we could nominate a spare path. I'm 
certainly no expert and my use case is simple (raid 1's and 10's) but it 
seems to me a lot of complexity can be avoided for the sake of a blank disk.

Cheers,
Josh




--

From: Neil Brown
Date: Tuesday, March 30, 2010 - 6:35 pm

On Fri, 26 Mar 2010 17:41:02 +1100

On the one hand, we should always look beyond the immediate problem we are
trying to solve in order to see the big picture and make sure the solution we
choose doesn't cut us off from solving other more general problems when they
arrive.
On the other hand, we don't want to expand the scope so much that we end up
biting off more than we can chew.

A general design with a specific implementation is probably a good target....

Thanks,

--

From: Majed B.
Date: Friday, March 26, 2010 - 12:52 am

Why not treat this similar to how hardware RAID manages disks & spares?
Disk has no metadata -> new -> use as spare.
Disk has metadata -> array exists -> add to array.
Disk has metadata -> array doesn't exist (disk came from another
system) -> sit idle & wait for an admin to do the work.
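
Those three rules fit in a tiny decision function.  This is only a sketch of the policy as listed above; the function name `hotplug_action` and the returned labels are made up for illustration.

```python
def hotplug_action(has_metadata, array_exists):
    """Map the state of a newly plugged disk to one of the three actions."""
    if not has_metadata:
        return "use-as-spare"      # new disk, no metadata
    if array_exists:
        return "add-to-array"      # metadata matches an array on this system
    return "wait-for-admin"        # metadata from another system: sit idle
```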

As for identifying disks and knowing which disks were removed and put back
into an array, there's the metadata & there's the disk's serial number,
which can be obtained using hdparm. I also think that all disks now
include a World Wide Name (WWN) which is more suitable for use in
this case than a disk's serial number.

Some people rant because they see things only from their own
perspective and assume that there's no case or scenario but their own.
So don't pay too much attention :p

Here's a scenario: What if I had an existing RAID1 array of 3 disks. I
bought a new disk and I wanted to make a new array in the system. So I
add the new disk, and I want to use one of the RAID1 array disks in
this new array.

Being lazy, instead of failing the disk then removing it using the
console, I just removed it from the port then added it again. I
certainly don't want mdadm to start resyncing, forcing me to wait!

As you can see in this scenario, it includes the situation where an
admin is a lazy bum who is going to use the command line anyway to
make the new array but didn't bother to properly remove the disk he
wanted. And there's the case of the newly added disk.

Why assume things & guess when an admin should know what to do?
I certainly don't want to risk my arrays in mdadm guessing for me. And
keep one thing in mind: How often do people interact with storage
systems?

If I configure mdadm today, the next time I may want to add or replace a
disk would be a year later. I certainly would have forgotten whatever
configuration was there! And depending on the situation I have, I
certainly wouldn't want mdadm to guess.




-- 
       Majed B.
--

From: Neil Brown
Date: Tuesday, March 30, 2010 - 6:42 pm

On Fri, 26 Mar 2010 10:52:07 +0300


Lazy people often do cause themselves more work in the long run, there is

That is a point worth considering.  Where possible we should discourage
configurations that would be 'surprising'.
Unfortunately a thing that is surprising to one person in one situation may
be completely obvious to someone else in a different situation.

Thanks,
NeilBrown

--
