Greetings.
I find myself in the middle of two separate off-list conversations on the
same topic and it has reached the point where I think the conversations
really need to be united and brought on-list.
So here is my current understanding and thoughts.
The topic is about making rebuild after a failure easier. It strikes me as
particularly relevant after the link Bill Davidsen recently forwarded to the
list:
http://blogs.techrepublic.com.com/opensource/?p=1368
The most significant thing I got from this was a complaint in the comments
that managing md raid was too complex and hence error-prone.
I see the issue as breaking down into two parts.
1/ When a device is hot plugged into the system, is md allowed to use it as
a spare for recovery?
2/ If md has a spare device, what set of arrays can it be used in if needed?
A typical hot plug event will need to address both of these questions in
turn before recovery actually starts.
Part 1.
A newly hotplugged device may have metadata for RAID (0.90, 1.x, IMSM, DDF,
other vendor metadata) or LVM or a filesystem. It might have a partition
table which could be subordinate to or super-ordinate to other metadata.
(i.e. RAID in partitions, or partitions in RAID). The metadata may or may
not be stale. It may or may not match - either strongly or weakly -
metadata on devices in currently active arrays.
A newly hotplugged device also has a "path" which we can see
in /dev/disk/by-path. This is somehow indicative of a physical location.
This path may be the same as the path of a device which was recently
removed. It might be one of a set of paths which make up a "RAID chassis".
It might be one of a set of paths on which we happen to find other RAID
arrays.
Somehow from all of that information we need to decide if md can use the
device without asking, or possibly with a simple yes/no question, and we
need to decide what to actually do with the device.
Options ...
My feeling on the entire subject matter is that this is /not/ an easy decision. Computers are rarely correct when they guess at what an administrator wants, and attempting to implement the functionality within mdadm is prone to many limitations or re-inventing the wheel.
If mdadm / mdmon is part of the process at all, I think it should be used to fork an executable (script or otherwise) which invokes the administrative actions that have been pre-determined.
I believe that the default action should be to do /nothing/. That is the only safe thing to do. If an administrative framework is desired, that seems to fall under a larger project goal which is likely better covered by programs more aware of the overall system state. This route also allows for a range of scalability.
It may be sufficient in an initramfs context to either spawn a shell or even just wait in a recovery console after the mdadm invocation returns failure. It might also be desired to use a very simple reaction which assumes any spare of sufficient size which is added should be allocated to the largest or closest comparable area based on pre-determined preferences. At the same time, I could see the value in mapping actual physical locations to an array, remembering any missing or failed device layouts, and re-creating the same layouts on the new device. However those actions are a little above what mdadm should be operating at.
With both of those viewpoints I see the following solution. The most specific action match is followed. Action-matches should be restrict-able by path wildcard, simple size comparisons, AND state for metadata. As a final deciding factor action-matches should also have an optional priority value, so that when all else matches one rule out of a set will be known to run first. The result of matching an action, once again, should be an external program or shell to allow for maximum flexibility.
I am not at all opposed to adding good default choices for ...
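The action-match idea above (restrict by path wildcard, size and metadata state, most specific match wins, priority breaks ties, result is an external program) could be sketched roughly as follows. This is only an illustration in Python, not anything mdadm implements: the ActionRule fields, the metadata state names, the "higher priority wins" convention and the command paths are all assumptions made up for the example.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass(frozen=True)
class ActionRule:
    path_glob: str                 # glob matched against the /dev/disk/by-path name
    min_size: int                  # minimum device size in bytes; 0 = no restriction
    metadata_state: Optional[str]  # assumed names: "none", "stale", "current"; None = any
    priority: int                  # tie-breaker; assumed here that higher wins
    command: str                   # external program to run (hypothetical path)


def specificity(rule: ActionRule) -> int:
    """Count how many of the three restrictions the rule actually uses."""
    return ((rule.path_glob != "*")
            + (rule.min_size > 0)
            + (rule.metadata_state is not None))


def pick_rule(rules: List[ActionRule], path: str, size: int,
              metadata_state: str) -> Optional[ActionRule]:
    """Return the most specific matching rule, or None, meaning do nothing."""
    matching = [r for r in rules
                if fnmatchcase(path, r.path_glob)
                and size >= r.min_size
                and r.metadata_state in (None, metadata_state)]
    if not matching:
        return None
    return max(matching, key=lambda r: (specificity(r), r.priority))
```

Returning None when no rule matches keeps "do /nothing/" as the default; the caller would only fork the rule's command when a rule is actually found.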
On Wed, 24 Mar 2010 19:47:59 -0700
I agree that /nothing/ should be the default action for a device with unrecognised content. If the content of the device is recognised, it is OK to have a default which does what the content implies - i.e. build a device into an array. But maybe that is what you meant.
I think there is useful stuff that can be done entirely inside mdadm but it is worth thinking about where to draw the line. I'm not convinced that mdadm should "know" about partition tables and MBRs. Possibly the task of copying those is best placed in a script.
Thanks,
NeilBrown
--
My larger context was looking at non-recognized devices; assembling pre-marked containers is fine, with the provision that basic safety checks validate that outcome: is the uuid correct, does the home-host match the current array, is the update count valid (or else add as a prior stale member that should be marked as hot spare).
For anything else mdadm might be better off taking the approach that an administratively selected set of actions should be performed. If the task is JUST doing stuff that mdadm would already be invoked to do anyway then it is tolerable for those reactions to be configurable within the .conf file, though I fear the syntax may be uglier than assuming there's also at least a basic /bin/sh that could interpret a set of more standard commands. It would also provide a good example to extend into custom scripts.
Another advantage of using a shell script instead is that administrators can hack in whatever tricks they want. If they have a partition tool or method they like they can script it and get the results they want. More complicated tricks could also be performed, such as first preparing the disc for cryptographic storage by filling it with random data, or performing SMART checks, or any other operation of their choice. Alternatively, if an administrator or device maker needs something different they could produce a binary to run instead.
--
well, i would not be upset by j. random jerk complaining in blog comments; as soon as you make it one click you will find another one
I really feel there is much room for causing disasters with a similar approach. The main difference from a hw raid controller is that the hw raid controller _requires_ full control of the individual disks. MD does not. Trying to do things automatically without full control is very dangerous.
This may be different when using ddf or imsm since they are usually working on whole drives attached to a raid-like controller (even if one of the strengths of md is being able to activate those arrays even without the original controller).
If you want to be user-friendly just add a simple script /usr/bin/md-replace-drive
It will take as input either an md array or a working drive as source, and the new drive as target.
In the first case it has to examine the components of the source md and determine whether they are partitions or whole devices (sysfs); if they are partitions, find the whole drives and ensure they are partitioned in the same way.
It will examine the source drive for partitions and all md arrays it is part of.
It will ensure that those arrays have a failed device,
check the size of the components and match them to the new drive (no sense replacing a 1T drive with a 750GB one),
ask the user for confirmation in big understandable letters,
replicate any mbr and partition table, and include the device (or all newly created partitions) in the relevant md device.
An improvement would be not needing the user to specify a source in the simplest of cases, by checking for all arrays with a failed device.
makes sense
-- Luca Berra -- bluca@comedia.it Communication Media & Services S.r.l. /"\ \ / ASCII RIBBON CAMPAIGN X AGAINST HTML MAIL / \ --
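The checks such an md-replace-drive script would run before asking for confirmation might look something like the sketch below. It is written in Python purely for illustration and works on values passed in by the caller rather than reading sysfs; SourceArray and preflight are hypothetical names, and summing component sizes ignores partition alignment and the space taken by the MBR and partition table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SourceArray:
    name: str             # e.g. "md0"
    degraded: bool        # True if the array currently has a failed/missing member
    component_bytes: int  # size of the component the source drive contributes


def preflight(arrays: List[SourceArray], target_bytes: int) -> List[str]:
    """Return a list of reasons not to proceed; an empty list means go ahead and ask."""
    problems = []
    for array in arrays:
        if not array.degraded:
            problems.append(f"{array.name} has no failed device to replace")
    needed = sum(array.component_bytes for array in arrays)
    if needed > target_bytes:
        problems.append(
            f"target too small: need {needed} bytes, have {target_bytes}")
    return problems
```

Only when the list comes back empty would the script go on to the confirmation prompt and the partition table copy.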
On Thu, 25 Mar 2010 09:01:08 +0100
We can learn something from any opinion that differs from our own. It is clear to me that using mdadm requires a certain level of understanding to be used effectively and safely. I don't think that can be entirely addressed in mdadm: there is a place for a higher level framework that encodes policies and gives advice. But there is still room to improve mdadm to make it more powerful, more informative, and
You mean completely raw data, no partitions, no filesystem structure etc? Yes, that is possible. People who are likely to handle devices like that
I imagine an Email to the admin "Hey boss, I just noticed you plugged in a drive that looks like it used to be part of some array. We need a spare on this other array and the new device is big enough. Shall I huh huh huh? Go on let me..."
Yes, there is a place for something like that certainly.
NeilBrown
--
yes, i realize my comment was rude, sorry for that, but that comment on
I can think of two scenarios.
1) an encrypted device (without LUKS header)
2) a device where the metadata is corrupted, and we plugged it in in a hurry to attempt data recovery (oh, we were in a hurry and forgot about the mdadm policy)
What i am scared of are distributions thinking it would be cool and
ah. ok, i thought you meant something real-time.
-- Luca Berra -- bluca@comedia.it --
Or indeed it may have no metadata at all - it may be a fresh disc. I didn't see that you stated this specifically at any point, though it was there by implication, so I will: you're going to have to pick up hotplug events for bare drives, which presumably means you'll also get events Indeed, I would like to be able to declare any /dev/disk/by-path/pci-0000:00:1f.2-scsi-[0-4] to be suitable candidates for hot-plugging, because those are the 5 motherboard SATA ports I've hooked into my hot-swap chassis. As an aside, I just tried yanking and replugging one of my drives, on CentOS 5.4, and it successfully went away and came back again, but Definitely want this for bare drives. In my case I'd like the MBR and first 62 sectors copied from one of the live drives, or a copy saved for the purpose, so the disc can be bootable. My concern is that this is surely outwith the regular scope of mdadm/mdmon, as is handling bare drives/CD-ROMs/USB sticks. Do we need Definitely, just so I can pull a drive and plug it in again and point and say ooh, everything's up and running again, to demonstrate how cool Linux md is. I imagine some distros' udev/hotplug rules do this already, I think in my situation I'd quite like the first partition, type fd metadata 0.90 RAID-1 mounted as /boot, added as an active mirror not a spare, again so that if this new drive appears as sda at the next power cycle, the system will boot. The second partition, a RAID-5 with LVM on it, could be added as a spare, because it would then automatically be rebuilt onto if the array [...] I'm afraid I have nothing to add here, it all sounds good. Cheers, John. --
On Thu, 25 Mar 2010 14:10:05 +0000 Correct. We would expect that "domain path=" matching to say that those should only be used if they already have recognisable metadata on them. To make use of a device with no metadata already present, it would need to No. That is because we have not yet implemented anything that has been described in this document... Thanks, NeilBrown --
I think that the metadata keyword can be used to identify the scope of devices to which the DOMAIN line applies.
For instance we could have:
DOMAIN path=glob-pattern metadata=imsm hotplug=mode1 spare-group=name1
DOMAIN path=glob-pattern metadata=0.90 hotplug=mode2 spare-group=name2
Keywords:
path, metadata and spare-group define which arrays the hotplug definition (or other definition of action) applies to. The user could define any subset of them.
For instance, to define that all imsm arrays shall use hotplug mode2, the user would write:
DOMAIN metadata=imsm hotplug=mode2
In the above example the user need not define a spare-group in the configuration file for each array.
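To make the "any subset of keywords" idea concrete, here is a small Python sketch of how such a DOMAIN line might be split into keyword=value pairs and matched against a device, with a missing keyword matching everything. It only illustrates the proposal in this thread: parse_domain_line and domain_applies are hypothetical helpers, and treating the line-type word case-insensitively is an assumption.

```python
import shlex
from fnmatch import fnmatchcase
from typing import Dict, List


def parse_domain_line(line: str) -> Dict[str, List[str]]:
    """Split 'DOMAIN key=value ...' into a dict of keyword -> list of values."""
    words = shlex.split(line)
    if not words or words[0].upper() != "DOMAIN":
        raise ValueError("not a DOMAIN line")
    options: Dict[str, List[str]] = {}
    for word in words[1:]:
        key, sep, value = word.partition("=")
        if not sep or not key:
            raise ValueError(f"expected keyword=value, got {word!r}")
        options.setdefault(key, []).append(value)
    return options


def domain_applies(options: Dict[str, List[str]], path: str, metadata: str) -> bool:
    """A keyword that is absent from the line places no restriction on the device."""
    paths = options.get("path", ["*"])
    metas = options.get("metadata", ["*"])
    return (any(fnmatchcase(path, p) for p in paths)
            and any(fnmatchcase(metadata, m) for m in metas))
```

With the line "DOMAIN metadata=imsm hotplug=mode2" this applies to an imsm device on any path and to nothing with 0.90 metadata.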
Please consider:
spare_add - add any spare device that matches the metadata container/volume in case of native metadata regardless of array state, so later such a spare can be used in rebuild process.
Can we assume for all external metadata that spares added to any container can potentially be moved between all containers of the same metadata?
I expect that this could be the default behavior if no spare groups are defined for some metadata.
Moreover each metadata handler could impose built-in rules on spare assignment to a specific container.
Thanks,
Marcin Labun
--
For the 'platform' case we could automate some decisions, but I think I would rather extend the --detail-platform option to dump the recommended/compatible DOMAIN entries for the platform, perhaps via the --brief modifier. This mirrors what can be done with --examine --brief to generate an initial configuration file that can be modified This is the same as 'incr' above. If the device has metadata and Yes, that can be the default action, and the spare-group keyword can be specified to override. -- Dan --
[...] definition (or other definition of action) applies. User could define a [...] rules of accepting the spare in the container. Rules can be derived from platform dependencies or metadata. Notice that user can disable platform [...]
So, a few things that I think can be said about the DOMAIN line type (I'm assuming for now that this is what we'll use, mainly because I'm implementing it right now):
There is an assumed, default DOMAIN line that is the equivalent of:
DOMAIN path=* metadata=* action=incremental spare-group=<none>
This is what you get simply by normal udev incremental assembly rules (notice I used action instead of hotplug, action makes more sense to me as all the words we use to define hotplug mode are in fact actions to take on hotplug). We will treat this as a given. Anything else requires an explicit DOMAIN line in mdadm.conf.
The second thing I'm having a hard time with is the spare-group. To be honest, if I follow what I think I should, and make it a hard requirement that any action other than none and incremental must use a non-global path glob (aka, path= MUST be present and can not be *), then spare-group loses all meaning. I say this because if a disk matches the path glob it is in a specific spare group already (the one that this DOMAIN represents) and ditto if arrays are on disks in this DOMAIN, then they are automatically part of the same spare-group. In other words, I think spare-group becomes entirely redundant once we have a DOMAIN keyword.
I'm also having a hard time justifying the existence of the metadata keyword. The reason is that the metadata is already determined for us by the path glob. Specifically, if we assume that an array's members can not cross domain boundaries (a reasonable requirement in my opinion, we can't make an array where we can guarantee to the user that hot plugging a replacement disk will do what they expect if some of the array's members are inside the domain and some are outside the domain), then we ...
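The argument that a DOMAIN path glob already implies a spare group, and that arrays must not straddle a domain boundary, can be illustrated with a short Python sketch. Nothing here is mdadm code: the domain names, the dict-based inputs and the function name are assumptions for the example, and overlapping domains are not handled specially.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Tuple


def implicit_spare_groups(
        domains: Dict[str, str],
        arrays: Dict[str, List[str]]) -> Tuple[Dict[str, List[str]], List[str]]:
    """domains maps a domain name to its path glob; arrays maps an array name
    to the by-path names of its members.  Returns (groups, crossing): arrays
    wholly inside each domain, and arrays that straddle a domain boundary."""
    groups: Dict[str, List[str]] = {name: [] for name in domains}
    crossing: List[str] = []
    for array, members in arrays.items():
        for name, glob in domains.items():
            inside = [fnmatchcase(member, glob) for member in members]
            if all(inside):
                groups[name].append(array)      # implicitly one spare group
            elif any(inside) and array not in crossing:
                crossing.append(array)          # some members in, some out
    return groups, crossing
```

Arrays listed together under one domain are the ones that could share spares without any spare-group keyword; anything in the crossing list is the case where hot plug behaviour cannot be guaranteed.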
On 29/03/2010 19:10, Doug Ledford wrote: I think I agree; in my limited scenario I might want to use 0.90 metadata on my sdX1 to make my /boot, but 1.x on my other partitions, and it'll be whole discs that match my path spec so one metadata type wouldn't apply uniformly. [...] Yes, but do create the partition(s), boot sector, etc and set up the spare(s). The user installed the system with anaconda or whatever, and may not know the incantations to partition his new disc or install a boot loader, so if he's managed to configure a mdadm.conf which says the spare slots in his RAID chassis should belong to mdadm, prepare them for him. Then all he needs to do is issue whatever grow command. I think the exception to this is /boot on RAID-1, where I would prefer to be able to have the system automatically add the new partition as an active mirror instead of a hot spare, in case this new drive is what we have to boot off next time. I suppose there might be circumstances where you want to do something else, like Netgear do on their ReadyNAS, but while it might be nice to be able to configure that sort of automatic growing and reshaping, it doesn't belong in the default config. Cheers, John. --
Really, we should never have to do this in the situation I listed: aka
no degraded arrays exist. This implies that if you had a raid1 /boot
array, that it's still intact. So partitioning and setting up boot
loaders doesn't make sense as the new disk isn't going in to replace
anything. You *might* want to add it to the raid1 /boot, but we don't
Again, I'm drawing a distinction here between a degraded array and a
non-degraded array. If the current array isn't degraded, then we won't
be booting off the new drive next time unless the user goes into the
BIOS and sets the new drive as the active boot device. And if the user
is going to do that, then they ought to be able to setup their new boot
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
Actually I've just recently had the scenario where it would have made perfect sense. I hooked up the RAID chassis SATA[0-4] ports to the RAID chassis and put 3 drives in the first 3 slots. Actually it turned out I'd wired it up R-L not L-R so if I'd added a new drive in one of the two right-hand slots it would have turned up as sda on the next boot. OK, to some extent that's me being stupid, but at the same time I correctly hooked up the first 5 SATA ports to the hot-swap chassis and would want them considered the same group etc. Cheers, John. --
On Mon, Mar 29, 2010 at 3:36 PM, John Robinson
This kind of situation is where an option-rom comes in handy i.e. the platform firmware knows to boot from a defined raid volume. However, it comes with quirky constraints like not supporting > 2-drive raid1. But I see your point that it would be nice to at least have the option to auto-grow raid1 boot arrays.
-- Dan --
As it happens this was on an Intel-chipset board with ICH10-R and option ROM, and I would have used IMSM if RHEL/CentOS had supported it at the time, so I'm following IMSM support developments closely. Cheers, John. --
Yes, but how do you want to fix that situation? Would you want to make
the new drives be new boot drives, or would you prefer to shut down,
move all the previous drives over two slots, and then put the new drive
into the fourth slot that you previously thought was the second slot? I
understand your situation, but were I in that position I'd just shuffle
my drives to correct my original mistake and go on with things, I
wouldn't make the new drives be boot drives. So I'm still not sure I
see the point to making a new drive that isn't replacing an existing
I understand wanting them in the same group, but unless something is
degraded, just being in the same group doesn't tell us if you want to
keep it as a spare or use it to grow things.
--
Doug Ledford <dledford@redhat.com>
I wouldn't want to take the server down to shuffle the drives or cables. But my point really is that if I have decided that I would want all the drives in my chassis to have identical partition tables and carry an active mirror of an array - in my example /boot - I would like to be able to configure the hotplug arrangement to make it so, rather than leaving me to have to manually regenerate the partition table, install grub, add the spare and perhaps even grow the array. Of course this is a per-installation policy decision of what to do when an extra drive is added to a non-degraded array, I'm certainly not suggesting this should be the default action, though I think it would be
I quite agree. All I'm getting at is that I'd like to be able to say something in my mdadm.conf or wherever to say what I'd like done. This might mean that I end up with something like the following:
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0 action=include
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part1 action=grow
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part2 action=replace
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part3 action=include
The first line gets the partition table and grub boot code regenerated even when nothing's degraded. This in turn may trigger the other lines.
In the second line my action=grow means fix up my /boot if it's degraded and both --add and --grow so it gets mirrored onto a fresh disc.
The third line says fix up my swap array if it's degraded, but leave alone otherwise.
The fourth line says fix up my data array if it's degraded, and add as a spare if it's a fresh disc. This last lets me decide later what (if any) kind of --grow I want to do - make it larger or reshape from RAID-5 to RAID-6.
But as you say, the default should be
DOMAIN path=* action=incremental
and the installer (automated or human) probably wants to edit that to include at least
DOMAIN path=something action=replace
to take advantage of this auto-rebuild on ...
I can (sorta) understand this. I personally never create any more /boot
partitions than the number of drives I can lose from my / array + 1.
So, if I have raid5 / array, I do 2 /boot partitions. Anything more is
a waste since if you lose both of those boot drives, you also have too
few drives for the / array. But, if you want any given drive bootable,
This I'm not so sure about. I can try to make this a reality, but the
issue here is that when you are allowed to specify things on a partition
by partition basis, it becomes very easy to create conflicting commands.
For example, let's say you have part1 action=grow, but for the bare disk
you have action=incremental. And let's assume you plug in a bare disk.
In order to honor the part1 action=grow, we would have to partition the
disk, which is in conflict with the bare disk action of incremental
since that implies we would only use preexisting md raid partitions. I
could *easily* see the feature of allowing per partition actions causing
the overall code complexity to double or more. You know, I'd rather
provide a simple grub script that automatically set up all raid1 members
as boot devices any time it was run than try to handle this
automatically ;-) Maybe I should add that to the mdadm package on
As pointed out above, some of these are conflicting commands in that
they tell us to modify the disk in one place, and leave it alone in
another. The basic assumption you are making here is that we will
always be able to duplicate the partition table because all drives in a
domain will have the same partition table. And that's not always the case.
I see where you are going, I'm a little worried about getting there ;-)
--
Doug Ledford <dledford@redhat.com>
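The conflict described in this message (a per-partition action that needs the disk partitioned while the whole-disk action says to leave bare disks alone) is something a configuration checker could flag up front. A rough Python sketch follows; the action names include, grow and replace are the ones proposed in this thread rather than keywords mdadm is known to accept, whole-disk and partition globs are paired by plain text comparison instead of real overlap analysis, and a missing whole-disk rule is assumed to mean the default incremental behaviour.

```python
from typing import List, Tuple

# Proposed action names from the thread, split by whether honouring them on a
# bare disk would require writing a partition table (an assumption made here).
MAY_WRITE = {"include", "grow", "replace"}
LEAVE_ALONE = {"none", "incremental"}


def find_conflicts(rules: List[Tuple[str, str]]) -> List[str]:
    """rules is a list of (path glob, action) pairs, one per DOMAIN line."""
    whole_disk = {glob: action for glob, action in rules if "-part" not in glob}
    conflicts = []
    for glob, action in rules:
        disk_glob, sep, part = glob.partition("-part")
        if not sep or action not in MAY_WRITE:
            continue  # whole-disk rule, or a partition rule that changes nothing
        disk_action = whole_disk.get(disk_glob, "incremental")
        if disk_action in LEAVE_ALONE:
            conflicts.append(
                f"part{part} action={action} needs a partition table, "
                f"but the whole-disk action is {disk_action}")
    return conflicts
```

A configuration whose whole-disk line is action=include alongside the per-partition lines passes this check, while part1 action=grow combined with a bare-disk action=incremental is reported.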
A very fair point. But it's not really all that wasteful - I've had to use the first 100MB from at least two drives, meaning that space would effectively go to waste on the others. And 100MB out of 1TB isn't an awfully big waste anyway.
Yes, but in that case I've given specific instructions about what to do with bare drives. It'd be a bad configuration, and you might warn about it, but you couldn't honour the grow. Bear in mind, the two domain lines here don't overlap. If they did you've more of a quandary, or at least you should shout louder about it. I don't think you should be writing partition tables unless I've told you to - which I would have done in the following more general case:
I'm not sure why, since you probably ought to be doing some fairly rigorous checking of the configuration anyway to make sure domains and
That would be fine too, as long as there's some way of calling it from
If the paths overlapped I'd agree, but they didn't, and I made sure the whole-drive action was sufficient to make sure the partition actions could be carried out. I agree though that there's plenty of scope for people writing duff configurations like the one you suggested, but I think there'll be scope for that whatever you do - even if it's
It might be a reasonable restriction for a first implementation, though. If not, you're going to have to store copies of the partition tables, boot areas, etc somewhere else so that when the drives they were on are hot-swapped, you can write the correct stuff back.
I don't blame you. Isn't it just typical of a user who doesn't understand the work involved to demand the sky and the stars? Anyway thank you very much for taking the time to consider my thoughts.
Cheers, John.
--
I agree once you have a DOMAIN you implicitly have a spare-group. So DOMAIN would supersede the existing spare-group identifier in the ARRAY line and cause mdadm --monitor to auto-migrate spares between 0.90 and 1.x metadata arrays in the same DOMAIN. For the imsm case the expectation is that spares migrate between containers regardless of the DOMAIN line as that is what the implementation expects. However this is where we get into questions of DOMAIN conflicting with 'platform' expectations: under what conditions, if any, should DOMAIN be allowed to conflict/override the platform constraint? Currently there is an environment variable IMSM_NO_PLATFORM, do we also need a
...but this assumes we already have an array assembled in the domain before the first hot plug event. The 'metadata' keyword would be helpful at assembly time for ensuring only arrays of a certain type are brought up in the domain.
We also need some consideration for reporting and enforcing 'platform' boundaries if the user requests it. By default mdadm will block attempts to create/assemble configurations that the option-rom does not support (i.e. disk attached to third-party controller). For the hotplug case if the DOMAIN is configured incorrectly I can see cases where a user would like to specify "enforce platform constraints even if my domain says otherwise", and the inverse "yes, I know the option-rom does not support this configuration, but I know what I am doing".
So I see a couple options:
1/ path=platform: auto-determine/enforce the domain(s) for all platform raid controllers in the system
2/ Allow the user to manually enter a DOMAIN that is compatible but different than the default platform constraints like your 3-ahci ports for imsm-RAID remainder reserved for 1.x arrays example above
3/ Allow the user to turn off platform constraints and define 'exotic' domains (mixed controller configurations).
--
Dan
--
Give me some clearer explanation here because I think you and I are using terms differently and so I want to make sure I have things right. My understanding of imsm raid containers is that all the drives that belong to a single option rom, as long as they aren't listed as jbod in the option rom setup, belong to the same container. That container is then split up into various chunks and that's where you get logical volumes. I know there are odd rules for logical volumes inside a container, but I think those are mostly irrelevant to this discussion. So, when I think of a domain for imsm, I think of all the sata ports or sas ports under a single option rom. From that perspective, spares can *not* move between domains as a spare on a sas port can't be added to a sata option rom container array. I was under the impression that if you had, say, a 6 port sata controller option rom, you couldn't have the first three ports be one container and the next three ports be another container. Is that impression wrong? If so, that would explain our confusion over domains. However, that just means (to me anyway) that I would treat all of the sata ports as one domain with multiple container arrays in that domain just like we can have multiple native md arrays in a domain. If a disk dies and we hot plug a new one, then mdadm would look for the degraded container present in the domain and add the spare to it. It would then be up to mdmon to determine what logical volumes are currently degraded and slice up the new drive to work as spares for those degraded logical volumes. Does this sound correct to you, and can mdmon do that already I'm not sure I would ever allow breaking valid platform limitations. I think if you want to break platform limitations, then you need to use native md raid arrays and not imsm/ddf. It seems to me that if you allow the creation of an imsm/ddf array that the BIOS can't work with then you've potentially opened an entire can of worms we don't want to open ...
I think the disconnect in the imsm case is that the container to
DOMAIN relationship is N:1, not 1:1. The mdadm notion of an
imsm-container correlates directly with a 'family' in the imsm
metadata. The rules of a family are:
1/ All family members must be a member of all defined volumes. For
example with a 4-drive container you could not simultaneously have a
4-drive (sd[abcd]) raid10 and a 2-drive (sd[ab]) raid1 volume because
any volume would need to incorporate all 4 disks. Also, per the rules
if you create two raid1 volumes sd[ab] and sd[cd] those would show up
as two containers.
2/ A spare drive does not belong to any particular family
('family_number' is undefined for a spare). The Windows driver will
automatically use a spare to fix any degraded family in the system.
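Rule 1 can be read as: volumes that share any disk must share all of their disks, and each distinct disk set becomes its own container. A small Python sketch of that reading follows. It is an illustration of the rule as stated here, not of the imsm metadata format or of mdadm's code, and containers_from_volumes is a hypothetical name.

```python
from typing import Dict, FrozenSet, Iterable, List


def containers_from_volumes(
        volumes: Dict[str, Iterable[str]]) -> Dict[FrozenSet[str], List[str]]:
    """Group volumes into containers (families) keyed by their exact disk set.
    Raises ValueError if two volumes share some, but not all, of their disks."""
    containers: Dict[FrozenSet[str], List[str]] = {}
    for name, disks in volumes.items():
        disk_set = frozenset(disks)
        for existing in containers:
            if existing != disk_set and existing & disk_set:
                raise ValueError(
                    f"volume {name} shares only some disks with an existing family")
        containers.setdefault(disk_set, []).append(name)
    return containers
```

Two raid1 volumes on sda+sdb and sdc+sdd come out as two containers, while a 4-drive raid10 plus a 2-drive raid1 on a subset of the same disks is rejected. Spares carry no family and so are deliberately not part of this grouping.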
In the mdadm/mdmon case since we break families into containers we
need a mechanism to migrate spare devices between containers because
they are equally valid hot spare candidates for any imsm container in
Yes, we can have exactly this situation.
This begs the question, why not change the definition of an imsm
container to incorporate anything with imsm metadata? This definitely
would make spare management easier. This was an early design decision
and had the nice side effect that it lined up naturally with the
failure and rebuild boundaries of a family. I could give it more
thought, but right now I believe there is a lot riding on this 1:1
This sounds correct, and no, mdmon cannot do this today. The current
discussions we (Marcin and I) had with Neil offlist were about extending
mdadm --monitor to handle spare migration for containers since it
already handles spare migration for native md arrays. It will need
some mdmon coordination since mdmon is the only agent that can
Agreed.
--
Dan
--
This explains the weird behavior I got when trying to create arrays on my IMSM box via the BIOS. Thanks for the clear explanation of family
I'm fine with the container being family based and not domain based. I just didn't realize that distinction existed. It's all cleared up now ;-)
So we'll need to coordinate on this aspect of things then. I'll keep you updated as I get started implementing this if you want to think about how you would like to handle this interaction between mdadm/mdmon.
As far as I can tell, we've reached a fairly decent consensus on things. But, just to be clear, I'll reiterate that consensus here:
Add a new linetype: DOMAIN with options
path= (must be specified at least once for any domain action other than none and incremental and must be something other than a global match for any action other than none and incremental) and
metadata= (specifies the metadata type possible for this domain as one of imsm/ddf/md, and where for imsm or ddf types, we will verify that the path portions of the domain do not violate possible platform limitations) and
action= (where action is none, incremental, readd, safe_use, force_use where action is specific to a hotplug when a degraded array in the domain exists and can possibly have slightly different meanings depending on whether the path specifies a whole disk device or specific partitions on a range of devices, and where there is the possibility of adding more options or a new option name for the case of adding a hotplug drive to a domain where no arrays are degraded, in which case issues such as boot sectors, partition tables, hot spare versus grow, etc. must be addressed).
Modify udev rules files to cover the following scenarios (it's unfortunate that we have to split things up like this, but in order to deal with either bare drives or drives that have things like lvm data and we are using force_use, we must trigger on *all* drive hotplug events, we must trigger early, and we must override other ...
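The DOMAIN rules in the consensus summary above translate fairly directly into a validity check. The Python sketch below covers only what is stated there (the allowed action and metadata values, and the path requirement for actions other than none and incremental); validate_domain is a hypothetical helper, only a literal "*" is treated as a global match, and the platform-limitation check for imsm/ddf is left out.

```python
from typing import List, Optional

ACTIONS = ("none", "incremental", "readd", "safe_use", "force_use")
METADATA_TYPES = ("imsm", "ddf", "md")


def validate_domain(paths: List[str], metadata: Optional[str],
                    action: str) -> List[str]:
    """Return a list of rule violations for one DOMAIN line; empty means valid."""
    errors = []
    if action not in ACTIONS:
        errors.append(f"unknown action {action!r}")
    if metadata is not None and metadata not in METADATA_TYPES:
        errors.append(f"unknown metadata type {metadata!r}")
    if action not in ("none", "incremental"):
        if not paths:
            errors.append(f"action={action} requires at least one path=")
        elif any(path == "*" for path in paths):
            errors.append(f"action={action} must not use a global path match")
    return errors
```

For instance, force_use with path=* is rejected, whereas incremental with no path at all is accepted as equivalent to the default behaviour.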
I understand that there are the following defaults:
- Platform/metadata limitations create default domains
- metadata handlers deliver default actions

The equivalent configuration line for imsm is:
DOMAIN path="any" metadata=imsm action=none

User could additionally split default domains using spare groups and the path keyword. For instance for imsm, the default domain area is the platform controller. If any metadata is served by multiple controllers, each of them creates its own domain. If we allow "any" for the path keyword, a user could simply override metadata defaults for all his controllers by:

I think that the implementation can be something like this: We shall set a cookie to store the path of a disk which is removed from the md device. Later, if a new device is plugged into that port, it can be used for rebuild. We shall set a timer for when cookies expire. I propose to clean them on start-up (mdadm --monitor can be a candidate; default action shall be cookies clean-up).

Enable spare disk sharing between containers if they belong to the same domain and do not have conflicting spare group assignments. This will allow for spare sharing by default. Additionally, we can consult metadata handlers before moving spares between containers. We can do that by adding another metadata handler function which shall test metadata and controller dependencies (I can imagine that a user can define metadata-stored domains of spare sharing; controller (OROM) dependent constraints shall be handled in this function, too).

Thanks,
Marcin Labun
--
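To make the proposed DOMAIN line concrete, here is a minimal sketch of a parser that also applies the path rule from the consensus summary (an active action needs at least one path, and that path must not be a global match). The syntax is still only a proposal in this thread, so the function name parse_domain_line, the default action of "none", and the spellings treated as a global match are all assumptions made for illustration.

```python
import shlex

# Actions listed in the proposal. None of this is existing mdadm behaviour.
ACTIONS = {"none", "incremental", "readd", "safe_use", "force_use"}
PASSIVE_ACTIONS = {"none", "incremental"}
GLOBAL_PATHS = {"any", "*"}  # assumption: spellings that count as a global match


def parse_domain_line(line):
    """Parse one proposed 'DOMAIN key=value ...' line into a dict."""
    words = shlex.split(line)  # also strips the quotes in path="any"
    if not words or words[0] != "DOMAIN":
        raise ValueError("not a DOMAIN line")

    # Assumption: a line without action= behaves like action=none.
    domain = {"path": [], "metadata": None, "action": "none"}
    for word in words[1:]:
        key, sep, value = word.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % word)
        if key == "path":
            domain["path"].append(value)  # path= may be given several times
        elif key in ("metadata", "action"):
            domain[key] = value
        else:
            raise ValueError("unknown option %r" % key)

    if domain["action"] not in ACTIONS:
        raise ValueError("unknown action %r" % domain["action"])

    # Any action that actually touches a disk needs an explicit,
    # non-global path.
    if domain["action"] not in PASSIVE_ACTIONS:
        if not domain["path"]:
            raise ValueError("action needs at least one path=")
        if any(p in GLOBAL_PATHS for p in domain["path"]):
            raise ValueError("action needs a non-global path=")
    return domain
```

With this sketch, the imsm default line quoted above parses cleanly, while a line such as DOMAIN path=any action=force_use is rejected.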
A single DOMAIN can span several controllers, but only if that does not violate the 'platform' constraints for that metadata type (which

Yes, spare sharing by default within the domain and as Doug said ignore any conflicts with the spare-group identifier i.e. DOMAIN

This really is just a variation of load_super() performed on a container with an extra disk added to report whether the device is spare, failed, or otherwise out of sync. In the imsm case this is load_super_imsm_all() with another disk (outside of the current container list) to compare against.

--
Dan
--
Why not 0.90 and 1.x instead of 'md'? These match the 'name'

I have been thinking that the path= option specifies controller paths, not disk devices. Something like "pci-0000:00:1f.2-scsi-[0-3]*" to pick the first 4 ahci ports. This also purposefully excludes virtual devices dm/md. I think we want to limit this functionality to physical controller ports... or were you looking to incorporate

Can't we limit the scope to the hotplug events we care about by filtering the udev scripts based on the current contents of the configuration file? We already need a step in the process that verifies if the configuration heeds the platform constraints. So, something like mdadm --activate-domains that validates the configuration, generates the necessary udev scripts and enables

Yes, but this also reminds me about the multiple superblock case. It should usually only happen to people that experiment with different metadata types, but we should catch and probably ignore drives that

Let's also limit this to ports that were recently (as specified by a timeout= option to the DOMAIN) unplugged. This limits the potential

Modulo the ability to have a global enable / disable for domains via

I think we have a consensus. The wrinkle that comes to mind is the case we talked about before where some ahci ports have been reserved for jbod support in the DOMAIN configuration. If the user plugs an imsm-metadata disk into a "jbod port" and reboots, the option-rom will assemble the array across the DOMAIN boundary. You would need to put explicit "passthrough" metadata on the disk to get the option-rom to ignore it, but then you couldn't put another metadata type on that disk. So maybe we can't support the subset case and need to force the platform's full expectation of the domain boundaries, or honor the DOMAIN line and let the user figure out/remember why this one raid member slot does not respond to hotplug events.

Thanks for the detailed write up.

--
Dan
--
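As a rough illustration of matching a controller-path glob like the one quoted above, the sketch below uses Python's fnmatch module. The helper in_domain and the sample by-path names are hypothetical; real names under /dev/disk/by-path vary by kernel and udev version, and how mdadm would actually match paths is not decided in this thread.

```python
from fnmatch import fnmatchcase

# The glob quoted in the message above.
DOMAIN_PATH = "pci-0000:00:1f.2-scsi-[0-3]*"


def in_domain(by_path_name, pattern=DOMAIN_PATH):
    """Return True if a /dev/disk/by-path name falls under the glob."""
    # fnmatchcase keeps the comparison case-sensitive on every platform.
    return fnmatchcase(by_path_name, pattern)


# Hypothetical by-path names, for illustration only.
for name in ("pci-0000:00:1f.2-scsi-0:0:0:0",
             "pci-0000:00:1f.2-scsi-3:0:0:0",
             "pci-0000:00:1f.2-scsi-4:0:0:0"):
    print(name, in_domain(name))
```

One thing the sketch makes visible: because "*" matches anything after the first digit, "[0-3]*" would also accept a name starting with scsi-10 or scsi-25, so a controller with ten or more ports would need a tighter pattern.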
On Tue, 30 Mar 2010 11:23:08 -0400
Thoughts ... yes ... all over the place. I won't try to group them, just a
random list:
"bare devices"
To make sure we are on the same page, we should have a definition for this.
How about "every byte in the first megabyte and last megabyte of the device
is the same" (e.g. 0x00 or 0x5a or 0xff)?
We would want a program (mdadm option?) to be able to make a device into a
bare device.
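A minimal sketch of that "bare device" test, under two assumptions that the definition leaves open: the first and last megabyte must share one and the same fill byte, and a device shorter than a megabyte is checked over whatever it has. The name is_bare_device is made up for illustration; nothing like it exists in mdadm.

```python
import os

MIB = 1024 * 1024


def _single_fill_byte(chunk):
    """Return the byte value if chunk is one repeated byte, else None."""
    if not chunk:
        return None
    first = chunk[0]
    return first if chunk == bytes([first]) * len(chunk) else None


def is_bare_device(path, span=MIB):
    """True if the first and last `span` bytes all hold one fill byte."""
    with open(path, "rb") as dev:
        size = dev.seek(0, os.SEEK_END)  # works for files and block devices
        if size == 0:
            return False  # nothing to inspect
        dev.seek(0)
        head = dev.read(min(span, size))
        dev.seek(max(size - span, 0))
        tail = dev.read(span)
    fill = _single_fill_byte(head)
    return fill is not None and _single_fill_byte(tail) == fill
```

Making a device bare would then be the reverse operation: overwrite the same two regions with a chosen fill byte.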
Dan's "--activate-domains" option which creates a targeted udev rules file for
"force_use" - at first I thought "yuck, no", but then it grew on me. I think
I quite like the idea now. We can put a rules file in /dev/.udev/rules.d/
which targets just the path that we want to over-ride.
I can see two approaches:
1/ create the file during boot with something like "mdadm --activate-domains"
2/ create a file whenever a device in an md-array is hot-removed which
targets just that path and claims it immediately for md.
Removing these after a timeout would be needed.
The second feels elegant but could be racy. The first is probably the
better approach.
Your idea of only performing an action if there is a degraded array doesn't
seem quite right.
If I have a set of ports dedicated to raid and I plug in a bare device,
I want it to become a hot-spare whether there are degraded arrays that
will use it immediately or not.
You say that making it a hot spare doesn't actually "do" anything, but it
does. It makes it available for recovery.
If a device fails, then I plug in a spare, I want it to recover - so do you.
If I plug in a spare and then a device fails, I want it to recover, but it
seems you don't. I cannot reconcile that difference.
Yes, the admin might want to grow the array, but that is fine: the spare
is ready to be used for growth, or to replace a failure, or whatever is
needed.
Native metadata: on partitions or on whole device.
We need to make sure we understand the distinctions between ...

Hi Neil,

I look forward to being able to update my mdadm.conf with the paths to devices that are important to my RAID so that if a fault were to develop on an array, then I'd be really happy to fail and remove the faulty device, insert a blank device of sufficient size into the defined path and have the RAID auto restore. If the disk is not blank or too small, provide a useful error message (insert disk of larger capacity, delete partitions, zero superblocks) and exit.

I think you do an amazing job and it worries me that you and the other contributors to mdadm could spend your valuable time trying to solve problems about how to cater for every metadata, partition type etc. when a simple blank device is easy to achieve and could then "Auto Rebuild on hot-plug".

Perhaps as we nominate a spare disk, we could nominate a spare path.

I'm certainly no expert and my use case is simple (raid 1's and 10's) but it seems to me a lot of complexity can be avoided for the sake of a blank disk.

Cheers,
Josh
--
On Fri, 26 Mar 2010 17:41:02 +1100

On the one hand, we should always look beyond the immediate problem we are trying to solve in order to see the big picture and make sure the solution we choose doesn't cut us off from solving other more general problems when they arrive.

On the other hand, we don't want to expand the scope so much that we end up biting off more than we can chew.

A general design with a specific implementation is probably a good target....

Thanks,
--
Why not treat this similar to how hardware RAID manages disks & spares?
Disk has no metadata -> new -> use as spare.
Disk has metadata -> array exists -> add to array.
Disk has metadata -> array doesn't exist (disk came from another
system) -> sit idle & wait for an admin to do the work.
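The three rules above can be sketched as a small decision function. This is only an illustration of the rule set as written: the names Action and classify_hotplug are made up, and identifying an array by a UUID string read from the disk's metadata is an assumption.

```python
from enum import Enum


class Action(Enum):
    USE_AS_SPARE = "use as spare"
    ADD_TO_ARRAY = "add to array"
    WAIT_FOR_ADMIN = "sit idle and wait for an admin"


def classify_hotplug(array_uuid, known_arrays):
    """Apply the three rules to a newly plugged disk.

    array_uuid   -- array UUID found in the disk's metadata, or None
                    if the disk carries no metadata at all
    known_arrays -- set of array UUIDs that exist on this system
    """
    if array_uuid is None:
        return Action.USE_AS_SPARE    # no metadata: a new disk
    if array_uuid in known_arrays:
        return Action.ADD_TO_ARRAY    # metadata for an array we have
    return Action.WAIT_FOR_ADMIN      # disk came from another system
```

For example, classify_hotplug(None, {"uuid-a"}) gives USE_AS_SPARE, while a disk whose UUID is not in the known set is left alone for the admin.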
As for identifying disks and knowing which disks were removed and put back into
an array, there's the metadata & there's the disk's serial number,
which can be obtained using hdparm. I also think that all disks now
include a World Wide Name (WWN), which is more suitable for use in
this case than a disk's serial number.
Some people rant because they see things only from their own
perspective and assume that there's no case or scenario but their own.
So don't pay too much attention :p
Here's a scenario: What if I had an existing RAID1 array of 3 disks. I
bought a new disk and I wanted to make a new array in the system. So I
add the new disk, and I want to use one of the RAID1 array disks in
this new array.
Being lazy, instead of failing the disk then removing it using the
console, I just removed it from the port then added it again. I
certainly don't want mdadm to start resyncing, forcing me to wait!
As you can see in this scenario, it includes the situation where an
admin is a lazy bum who is going to use the command line anyway to
make the new array but didn't bother to properly remove the disk he
wanted. And there's the case of the newly added disk.
Why assume things & guess when an admin should know what to do?
I certainly don't want to risk my arrays in mdadm guessing for me. And
keep one thing in mind: How often do people interact with storage
systems?
If I configure mdadm today, the next time I may want to add or replace a
disk would be a year later. I certainly would have forgotten whatever
configuration was there! And depending on the situation I have, I
certainly wouldn't want mdadm to guess.
--
Majed B.
--
On Fri, 26 Mar 2010 10:52:07 +0300

Lazy people often do cause themselves more work in the long run, there is

That is a point worth considering. Where possible we should discourage configurations that would be 'surprising'. Unfortunately a thing that is surprising to one person in one situation may be completely obvious to someone else in a different situation.

Thanks,
NeilBrown
--
