Hello list!
I was wondering if and when the excellent geom_raid5 module created by
Arne Wörner will be committed to HEAD or RELENG_7. Right now there is
gvinum with RAID5 support, and geom_raid3. The latter is nice but only
works with a 'strange' number of disks (3, 5, 9, 17, 33, etc., i.e.
2^n + 1) and has the disadvantages associated with RAID3 (no parallel
access; all disks are involved in every single write operation), while
gvinum offers poor RAID5 write performance (~20MB/s in my testing).

geom_raid5, on the other hand, offers excellent performance due to write
combining, which allows 2-phase requests to be converted into 1-phase
requests. This is achieved by waiting for contiguous requests and
splitting those into 'full stripes' of stripesize * (number_of_disks - 1)
bytes. This way, no parity information has to be read from any disk,
since all required data is already in memory. In plain English this
means excellent sequential performance. In my own testing I have
achieved over 400MB/s of RAID5 write performance using a simple
"dd if=/dev/zero of=/dev/raid5/data bs=1m" with at least 10GB worth of
data. raidtest benchmarks show that geom_raid5 scales nicely with the
number of disks, though of course RAID5's weak point is random write.

Considering this project has been around for quite a while, and sister
project FreeNAS has been using it for more than a year in its official
(stable) branch, I think it should become available in the official
FreeBSD distribution as well. I don't know who decides on this, but I
would like to get a discussion going on this matter.

Looking forward to any replies!
Kind regards,
Veronica
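
For readers who want the write-combining argument spelled out, here is a
minimal C sketch of the arithmetic in the message above. It is only an
illustration: the function names (full_stripe_size, is_full_stripe_write,
compute_parity) are made up for this sketch and are not taken from the
geom_raid5 sources. The only assumption beyond the text is that parity is
the usual RAID5 XOR over the data chunks of a stripe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Data bytes in one full stripe: stripesize * (number_of_disks - 1). */
static uint64_t
full_stripe_size(unsigned ndisks, uint64_t stripesize)
{
	assert(ndisks >= 3);
	return (stripesize * (ndisks - 1));
}

/*
 * A (combined) write needs no read phase when it starts on a full-stripe
 * boundary and covers a whole number of full stripes.
 */
static int
is_full_stripe_write(uint64_t offset, uint64_t length, unsigned ndisks,
    uint64_t stripesize)
{
	uint64_t fs;

	fs = full_stripe_size(ndisks, stripesize);
	return (length > 0 && offset % fs == 0 && length % fs == 0);
}

/*
 * Parity for one full stripe that is already in memory: XOR of the
 * ndisks - 1 data chunks, each stripesize bytes, stored back to back.
 */
static void
compute_parity(const uint8_t *data, uint8_t *parity, unsigned ndisks,
    size_t stripesize)
{
	size_t i;
	unsigned d;

	for (i = 0; i < stripesize; i++) {
		parity[i] = 0;
		for (d = 0; d + 1 < ndisks; d++)
			parity[i] ^= data[(size_t)d * stripesize + i];
	}
}
```

With 5 disks and a 128 KB stripe size (example values, not from the
message), a full stripe holds 512 KB of data, so only writes that are
512 KB-aligned multiples of 512 KB can skip the read phase.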
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"
Hi,
FWIW, I would also like to see geom_raid5 going into the
tree (after proper review).

Just a small question: I noticed that the new gvinum
raid5 implementation (in P4) allows adding disks to an
existing RAID5, even while it is running. Does geom_raid5
support that, too? (ZFS doesn't, unfortunately.)

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Commercial register: Registergericht Muenchen, HRA 74606, management:
secnetix Verwaltungsgesellsch. mbH, commercial register: Registergericht
Muenchen, HRB 125758, managing directors: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD services, products and more: http://www.secnetix.de/bsd

"Software gets slower faster than hardware gets faster."
-- Niklaus Wirth
Hi Oliver (or should I say inof)? :-)
Nope... graid5 doesn't do such things... I found no way that could do it
without hurting the disks too much (I was afraid that a power failure could
destroy the necessary knowledge about the size of the new-config-area; and I
didn't know how to do the beginning: it seemed like the first few blocks need
special treatment, because there the new-config-area and the old-config-area
overlap)...

But Veronica is developing a tool that can do it in offline mode... With
service interruption...

But growfs induces a service interruption anyway, and it is buggy if you do not
zero the new area... Veronica filed a bug report about this...

Nowadays it is common practice to have 2 or more hosts that can substitute
each other (hot-standby or whatever they call it today), so that it doesn't
matter if a box is damaged or in maintenance mode or... isn't it?

P. S.: "It is... lovely weather we are having. I hope the weather continues."
(taken from "The Pink Panther (2006)") *rotfl*

Bye
Arne
Arne Wörner wrote:
> Oliver Fromme wrote:
> > Just a small question: I noticed that the new gvinum
> > raid5 implementation (in P4) allows adding disks to an
> > existing RAID5, even while it is running. Does geom_raid5
> > support that, too? (ZFS doesn't, unfortunately.)
>
> Nope... graid5 doesn't do such things... I found no way that could do it
> without hurting the disks too much (I was afraid that a power failure could
> destroy the necessary knowledge about the size of the new-config-area; and I
> didn't know how to do the beginning: it seemed like the first few blocks need
> special treatment, because there the new-config-area and the old-config-area
> overlap)...

OK. I don't know the inner workings of geom_raid5, so I
can't tell how difficult it would be to implement there.

Here's a little description and some ASCII graphics that
explain how growing RAID5 was implemented in the new
gvinum:

http://lists.freebsd.org/pipermail/p4-projects/2007-July/020082.html

> But Veronica is developing a tool that can do it in offline mode... With
> service interruption...
>
> But growfs induces a service interruption anyway, and it is buggy if you do
> not zero the new area... Veronica filed a bug report about this...

Hm. I used growfs only once, and it worked fine. Was
there a regression introduced at some point? It should
certainly be fixed, because growfs seems to be very
useful.

About service interruption: growfs only takes a few
seconds, which might be acceptable in most cases.
But taking a whole RAID5 down to add disks and then
rebuilding it takes a _lot_ longer. Therefore I think
the feature to add disks to a live RAID5 would be very
valuable.

> Nowadays it is common practice to have 2 or more hosts that can substitute
> each other (hot-standby or whatever they call it today), so that it doesn't
> matter if a box is damaged or in maintenance mode or... isn't it?

It ...
It should be easy to reproduce, if it is not fixed yet:
just fill the new part of the device with bytes from /dev/random, then use
growfs and be astonished... :-) You can use a bsdlabel (small in the beginning
and larger later) on an md device (mdconfig)...

http://www.freebsd.org/cgi/query-pr.cgi?pr=bin/115174
-Arne
I would like to see this get wider testing and eventually be included
as well. A few questions (some of which I know the answer to but would
like to have you field for the list):

Where can I get/test/review the code?
Is there still any focus on the "previous" generation code or is "TNG"
where everything is happening now? Which generation are you talking
about above?

Does the code in the FreeBSD perforce repository represent the latest
improvements? Which generation?

How much and what kinds of testing has it already received? What is the
relationship between the FreeNAS project and the geom_raid5
developer(s)? What version(s) of the code is/are in FreeNAS? Has the
FreeNAS project or anyone else done any (preferably thorough and
repeatable) testing for correctness, stability and performance? Are the
methods and/or results publicly available?

Just who is (/are) the developer(s) behind this code?
Where can I get more information? Can the information on your blog be
considered authoritative?

What is the status of the code? Are there any known outstanding issues
or poorly tested functionality/configurations?

Are there any threads on the freebsd-geom or other mailing lists that
would be enlightening? Has a call for testing and/or code review been
made on that list recently?

Thanks. :)
JN
http://wgboome.homepage.t-online.de./geom_raid5.tbz
aka TOS (the original *ah*eh*... maybe B4 would be a nice name, too)
the most stable version
but slower than TNG/PP if there are many requests within a short time
that could be combined...

http://wgboome.homepage.t-online.de./geom_raid5-eff.tbz
aka TNG
quite memory hungry, but faster than without any "-..."

http://wgboome.homepage.t-online.de./geom_raid5-pp.tbz
pp = double-plus (I have a little thing with "1984" (by Orwell or so))
not as memory hungry as "-tng" and possibly not so fast...
but more memory hungry than without any "-..."

TNG was just for fun, because we wanted to know if "bcopy" causes a big
performance penalty (that is why TNG has a big memory appetite - we chose a
quick and dirty approach in order to eliminate most bcopy calls)...

I personally like "-pp" best and I think it should be as stable as TOS...
But the only version that has been tested often is TOS...
The test was: several GB (from my TV cards) every day for several months...
There were just minor things, which are fixed now... I tested PP in this way,
too, but not so long (today I use *cough* another *cough* OS, that supports my

Just that real life test...
I did a consistency test with TOS according to Pawel's recommendation:
1. create this: gmirror (graid5 (3 disks), graid3 (3 disks))
2. write some random data with raidtest (I don't know if it can do that?) or
   with UFS+dd
3. wait for the gmirror device to enter state "SYNC-ED" (or whatever it is
   called)...

Hmm... I don't know... I never saw one of them... And I am not a developer in

They did some HOWTOs (removing a disk and then running in "degraded" mode and
such things)...
But no real long-term test or correctness test...
They have a knowledge base with some performance graphs, but I don't remember

I wanted to see if I can still write programs... And so I tried to write a
RAID5 kernel module and took the structure from gstripe and gmirror (Pawel J.'s
sources)...
Then I sent some emails to the -geom list...
Well, first I'd like to say I like the effort and the concept of the geom_raid5
class itself, and it's definitely something useful. But I agree with the point
that it is some very complex code. I took some quick sweeps over the code, and
the general impression I got was:

- Many style(9) issues.
- Lack of documentation. There are many small comments, but there is little
  description on top of functions describing their purpose and what they do.
  This makes it hard to get into it for reviewers and other developers.
- As to the code logic itself, I was a bit sceptical about having the malloc
  saving queue. Does it really improve performance that much? It's just the
  sort of thing that could easily lead to bugs.
- I also wonder a bit why you use two worker threads, as this also increases
  complexity (but again, does it improve performance to the point that it's
  worth it?).

And last but not least: all of this has to be reviewed before going into the
tree, and there are not many people who can do that right now. However, I
really like your work and would gladly help improve it.

--
Ulf Lilleengen
Hmm... Would "cb" help? Are some functions too long? I tried to comply with
Pawel's style, but obviously I deviated from it after some weeks... :)

Hmm... Yup...

There are interface functions to the GEOM system: ..._start(), ..._done(),
..._create(), and so on...
Then there are 2 worker threads (one for the graid5 start queue (...worker())
and one for the graid5 done queue (...workerD())).
The other functions are helper functions...
I could add the function-purpose-comment in PP and then try to merge it to TNG

Hmm... if I understood correctly, FreeBSD's kernel memory suffers from
fragmentation if many big mem areas are needed... There might even be a
deadlock if UFS uses 64kb block size... So I thought it would be a good idea to
avoid those sleeps by "hamster-ing" the big chunks... :) But I am not sure
anymore that it improved performance (but performance was the reason for

Hmm... I think so... At least on MP boxes, since both threads do some XOR-ing
(worker() uses XOR for writing "full-stripes" (where no read is necessary) and

OK... review sounds good... maybe we should concentrate on PP then (it is quite
space (in comparison to TNG but not TOS) + time (in comparison to TOS; maybe in
comparison to TNG, too?) efficient and has a read cache)? Although fluffles
favors TNG, even though it is quite nasty (a write request of size 4KB costs 3
full stripes ((<number of disks>-1)*<stripe size>) plus 2*128KB... *giggle*)...

-Arne
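
To put a number on the TNG cost quoted above (3 full stripes plus
2*128KB for a single small write), here is a small C sketch of that
arithmetic. The function name tng_small_write_cost and the macro
TNG_EXTRA_BYTES are hypothetical and exist only for this illustration.
The formula is taken from the message as stated, not checked against the
TNG sources.

```c
#include <assert.h>
#include <stdint.h>

/* The "plus 2*128KB" term from the message. */
#define TNG_EXTRA_BYTES	(2 * 128 * 1024)

/*
 * Memory cost, in bytes, that the message attributes to one small
 * (e.g. 4KB) write under TNG:
 * 3 * ((ndisks - 1) * stripesize) + 2 * 128KB.
 */
static uint64_t
tng_small_write_cost(unsigned ndisks, uint64_t stripesize)
{
	uint64_t full_stripe;

	assert(ndisks >= 3);
	full_stripe = (uint64_t)(ndisks - 1) * stripesize;
	return (3 * full_stripe + TNG_EXTRA_BYTES);
}
```

For example, with 5 disks and a 128 KB stripe size (assumed values),
tng_small_write_cost(5, 131072) gives 1835008 bytes, i.e. 1.75 MB held
for a 4KB write.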
Well, this might sound picky, but it's still a style issue:

- Parentheses around return values:

    static __inline u_int
    g_raid5_nvalid(struct g_raid5_softc *sc)
    {
            u_int i, no;

            no = 0;
            for (i = 0; i < sc->sc_ndisks; i++)
                    if (g_raid5_disk_good(sc,i))
                            no++;

    -       return no;
    +       return (no);
    }

- Proper spacing between variable declaration and function body.

            struct g_consumer **cpp = cp->private;
    +
            if (cpp == NULL)
                    return -1;

- Declaration of variables should be in the top of the function.

            struct g_consumer **cpp = cp->private;
            if (cpp == NULL)
                    return -1;
            struct g_consumer *rcp = *cpp;
            if (rcp == NULL)
                    return -1;
            int dn = cpp - sc->sc_disks;

- Proper indenting when breaking a line, should be 4 spaces etc.

All of this can be found in the style(9) manpage, so I'd rather just suggest
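
As an illustration of the points above, here is how the third fragment
might look with the declarations moved to the top, a blank line after
them, and parenthesized return values. This is a sketch only: the two
structs are minimal stand-ins so that it compiles outside the kernel, and
the function name example_disk_index is hypothetical, not a function from
geom_raid5.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel types (assumptions for this sketch). */
struct g_consumer {
	void *private;
};

struct g_raid5_softc {
	struct g_consumer **sc_disks;
};

/* Returns the index of cp's slot in sc->sc_disks, or -1 if unset. */
static int
example_disk_index(struct g_raid5_softc *sc, struct g_consumer *cp)
{
	struct g_consumer **cpp, *rcp;
	int dn;

	assert(sc != NULL && cp != NULL);
	cpp = cp->private;
	if (cpp == NULL)
		return (-1);
	rcp = *cpp;
	if (rcp == NULL)
		return (-1);
	dn = (int)(cpp - sc->sc_disks);
	return (dn);
}
```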

Hmm, I'm not sure what you mean about this deadlock, but it sounds weird
to have to deadlock because of your filesystem. Maybe this could be
solved in another way, or is this not a graid5-thing at all?

The general thing is that I don't think one should start optimizing for
performance before everything works correctly and before having made sure
that it improves performance statistically. (I know this isn't a completely new

First of all, disk I/O is generally much slower than CPU anyway, so I would
doubt that having to use one thread would decrease performance noticeably.

In my ears, this is a good argument for using one thread only. But then

I'm starting to get busier and busier with exams coming up now, but I'll try
to take a look when I can, but don't expect too much :) Also, as I said, I've
only looked at TOS so far.

--
Ulf Lilleengen
Hmm... OK... I can do that, although "return" is no function... It feels more
like an operator for me (like "=" or "++")... But if we want to do it that way,
I have uploaded a PP version now, that has some explanations for each function
It seems to be a general problem with kernel and/or VFS memory...
You can provoke it with heavy UFS access with several bonnie processes and
Yup...
Hmm... Yup... But reducing short delays (like for bcopy) increased throughput
OK - I cannot spend so much time on this, either...
I have a backlog of more than 10 hours of "important" TV series... :-)

-Arne
I don't remember entire discussions about geom_raid5 but I seem to
recall there was concern about its aggressive caching, possibly in the
write path, which could make recovery in case of e.g. power outage
problematic. I'm possibly mis-remembering this so feel free to ignore if
it's not relevant to your geom_raid5.
graid5 puts write requests for about kern.geom.raid5.wdt seconds (but not less
than 1-2 seconds) into the write cache (if there is enough space left in
graid5's write cache)... I would guess that this behaviour is pretty
incompatible with soft-updates in case of a power outage...

Then there still is the write cache of the hard discs (I don't know how long it
waits, but that time would come in addition to graid5's delay)...

Maybe gjournal could help, because graid5 honors BIO_FLUSH, but that is
untested...

-Arne
On 07/11/2007, Arne Wörner <arne_woerner@yahoo.com> wrote:

> graid5 puts write requests for about kern.geom.raid5.wdt seconds (but not less
> than 1-2 seconds) into the write cache (if there is enough space left in
> graid5's write cache)... I would guess that this behaviour is pretty
> incompatible with soft-updates with power outage...

Can this cache be disabled?

> Then there still is the write cache of the hard discs (I dont know how long it
> waits, but that time would come in addition to graid5's delay)...
>
> Maybe gjournal could help, because graid5 honors the BIO_FLUSH, but that is
> untested...

Yes, AFAIK this would work.

Probably - but recent info shows it to be the prime mover in providing decent
A RAID5 is one of the harder ones to do both fast and well in software-only.
The better hardware ($$$) controllers have fast hardware XOR engines as well as
CPU-as-state-machines and battery-backed cache, and THEY have to work hard.

Further, a hardware controller sits in the right place to do the job well, the
'GP' CPU(s) - no matter they have spare cycles to burn - do not.

I don't think even GEOM magic can get around that w/o user willingness to take
on some unavoidable compromises.

Given decent hardware & any UPS that costs less than the hardware controller,
these are 'choices' - not really show-stoppers.

Bill
I agree. But regarding the immediate topic of gjournal on graid5:
gjournal has hooks in the UFS code to do full sync before journal switch
(commit), which it then propagates to the devices and issues BIO_FLUSH,

In theory this is correct, in practice still many people don't know the
choices they are implicitly making.
I'm all for having it / improving it.
GEOM in general and GMIRROR in particular have been *magic* for us as they are
much more safely managed over ssh in the absence of an IPMI, IP KVM, or serial
link than even a good hardware RAID controller.

But I'd not like to see yet-another iteration of 'a little knowledge..' folk
follow geom_raid5 as flavor-of-the-month, then expect coders to yet-again defy
gravity when the inevitable bites 'em in the anatomy, either.

First we walk. THEN we run....
Bill
Nope...
But it would be quite easy to implement that (e. g. .wdt=-1 with no range
checking -- currently graid5 sets everything below 2 back to 2 [seconds])...

-Arne
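
The range check described above, and the suggested escape hatch, could
look roughly like this in C. This is a hedged sketch: the names
wdt_clamp_current, wdt_clamp_proposed, WDT_MIN_SECONDS and WDT_DISABLED
are invented for illustration, and the real sysctl handling in graid5 may
be structured quite differently. Treating -1 as "write cache off" is the
idea floated in the message, not existing behaviour.

```c
#define WDT_MIN_SECONDS	2	/* lower bound mentioned in the message */
#define WDT_DISABLED	(-1)	/* hypothetical "no write delay" value */

/* Behaviour as described: everything below 2 is set back to 2 seconds. */
static int
wdt_clamp_current(int wdt)
{
	return (wdt < WDT_MIN_SECONDS ? WDT_MIN_SECONDS : wdt);
}

/* Suggested variant: -1 bypasses the range check, all else is clamped. */
static int
wdt_clamp_proposed(int wdt)
{
	if (wdt == WDT_DISABLED)
		return (WDT_DISABLED);
	return (wdt_clamp_current(wdt));
}
```

The caller would then have to treat WDT_DISABLED specially and pass
write requests through without queueing them, which is the part that
does not exist yet according to the message.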
You may be interested in some tests made by Michael Monashev in his
personal blog (in Russian):
http://michael.mindmix.ru/168-958-rezul-taty-testirovanija-graid5-graid3...
ache-i-raidz.zhtml

He is testing graid5, graid3+gcache and ZFS raidz:
Hardware:
Motherboard: Intel S5000PAL (Alcolu), Intel E5000P
CPU Dual-Core Intel Xeon 5130 2,00 GHz, cache 4 MB, FSB 1333 MHz
RAM 4 GB DDR2-677 Fully Buffered ECC (2*2 GB)
HDD SATA Seagate 750Gb 7200 rpm
HDD SATA Seagate 750Gb 7200 rpm
HDD SATA Seagate 750Gb 7200 rpm
HDD SATA Seagate 750Gb 7200 rpm
HDD SATA Seagate 750Gb 7200 rpm
HDD SATA Seagate 750Gb 7200 rpm

Software:
FreeBSD 7.0-CURRENT amd64
# mount
/dev/ad4s2d on /home (ufs, local, noatime, soft-updates)
/dev/ad4s1h on /opt/log (ufs, local, noatime, soft-updates)
...
tank/opt on /opt (zfs, local)
tank on /tank (zfs, local)
/dev/raid3/g3 on /opt2 (ufs, local, noatime, soft-updates)
/dev/raid5/g5 on /opt3 (ufs, local, noatime, soft-updates)

# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad6s1   ONLINE       0     0     0
            ad8s2   ONLINE       0     0     0
            ad10s3  ONLINE       0     0     0
            ad12s1  ONLINE       0     0     0
            ad14s2  ONLINE       0     0     0

errors: No known data errors

# gcache list
Geom name: cache_ad6s2a
WroteBytes: 98304
Writes: 42
CacheFull: 63546
CacheMisses: 70050
CacheHits: 647
CacheReadBytes: 99614208
CacheReads: 7151
ReadBytes: 1494126592
Reads: 92918
InvalidEntries: 0
UsedEntries: 6
Entries: 6
TailOffset: 250048479232
BlockSize: 65536
Size: 100
Providers:
1. Name: cache/cache_ad6s2a
Mediasize: 250048503296 (233G)
Sectorsize: 512
Mode: r1w1e1
Consumers:
1. Name: ad6s2a
Mediasize: 250048503808 (233G)
Sectorsize: 512
Mode: r1w1e1

Geom name: cache_ad8s3a
WroteBytes: 98304
Writes: 42
CacheFull: 63659
CacheMisses: 70150
CacheHits: 652
CacheReadBytes: 99899904
CacheReads: 7143
ReadBytes: 1492060160
Reads: 92918
InvalidEntries: 0
UsedEntries: 6
Entries: 6
TailOffset: 250048479232
Bloc...
