-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

05:03 < flips> we started 3 minutes ago
05:03 < flips> no maze
05:04 < flips> so we will take a slight change in session plan
05:04 < flips> instead of doing bio transfers we will continue drilling down into generic_write
05:05 < flips> ok, somebody summarize where we got to, please... mention _2copy
05:06 * flips looks at RazvanM
05:06 < RazvanM> http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2063
05:06 < flips> and the summary?
05:07 < RazvanM> and we got there from here: http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2319
05:07 < RazvanM> the 2copy is used when there is no support for write_begin
05:08 < flips> what is happening in this function?
05:08 < RazvanM> and we use prepare_write and commit_write
05:09 < RazvanM> the data is moved to some kernel pages and then to some user memory? :P
05:09 <@shapor> hi all
05:09 < flips> hi
05:09 * shapor takes a seat at the back of the room
05:09 < flips> the data is moved from user memory onto buffer pages
05:09 < flips> then the buffer pages are committed to disk
05:10 < RazvanM> sorry... I got the order wrong :P
05:10 < flips> 2copy is the lamest name anybody could have possibly chosen :p
05:10 < flips> appears to be the real thing though
05:10 < flips> just where we should be reading
05:11 < flips> __grab_cache_page is the heart of it
05:11 < flips> other things are decoration
05:11 < flips> such as fault_in_readable
05:12 < RazvanM> just a quick q: why some functions start with uppercase?
05:12 < flips> attempts to deal with the many dangerous recursions
05:12 < flips> with varying degrees of success in terms of robustness and readability
05:12 < flips> razvanm, random hackers
05:12 <@shapor> what is write_begin?
05:12 < flips> sometimes have studly caps days
05:12 < MaZe> hey
05:13 < flips> write_begin is a hook for some specialized user I don't know about
05:13 < flips> "completely general interface used in exactly one place" like as not
05:13 < flips> or "homework for shapor"
05:13 < flips> hey maze
05:13 <@shapor> :)
05:13 <@shapor> ok
05:13 < flips> ok, we can return to the original session plan
05:14 < flips> maze, the plan is for you to report your findings on basic bio transfers
05:14 < MaZe> lol
05:14 < flips> point to code (you might want to pastie it)
05:14 < MaZe> uhm, lol
05:14 < MaZe> how about I put a tar.gz up?
05:14 < flips> don't copy in the channel unless it's 1/2 lines
05:14 < flips> that too
05:14 < flips> pastie is good, use your taste
05:15 < flips> if you had it checked in you could point at urls
05:15 < flips> so... remember to check in next time ;)
05:15 < MaZe> uploading
05:15 < flips> since your code is so short I'd suggest just pasting the whole thing
05:16 < MaZe> http://m.a.z.e.pl/junkfs.tar.gz
05:16 <@shapor> lol nice domain!
05:16 < flips> really
05:16 < flips> leet
05:16 < MaZe> yeah, I own z.e.pl
05:17 < MaZe> so I also have m.a@z.e.pl
05:17 <@shapor> heh
05:17 < flips> "opened with ark"
05:17 < MaZe> or m@z.e.pl - whichever you prefer
05:17 < flips> ok, who has got the code open, and who not?
05:17 < MaZe> me not
05:17 < MaZe> ok, got it open
05:18 < flips> ark works pretty fscking well
05:18 < flips> I'm impressed
05:18 < MaZe> mind you - this is very rough, and mostly was debugging plus getting it working
05:18 < MaZe> I'm still not quite sure of everything, and although I fixed the last hang bug I found
05:18 < MaZe> I haven't since tested
05:18 < MaZe> so I'm not sure ;-)
05:18 < flips> don't worry, shapor will hurt you if you get anything wrong
05:18 < MaZe> lol
05:19 * shapor wields axe
05:19 < flips> so... where does the bio read setup start?
05:19 < MaZe> do you want me answering?
05:20 < flips> yes
05:20 < flips> you should have been asking ;)
05:20 < MaZe> hmm.
05:20 < MaZe> right
05:20 < MaZe> so pretty much everything except super.c is either makefile or debug
05:20 < flips> noticed
05:21 < MaZe> and the bottom of super.c is pretty standard module init stuff
05:21 < flips> nicely lindented
05:21 < flips> for the moment we only care about the bio transfer
05:21 < MaZe> and above that is the standard fs registering and fs_ops stuff
05:21 < MaZe> and from there we get to junkfs_get_sb which calls into get_sb_bdev
05:22 < MaZe> which calls junkfs_fill_super as a callback
05:22 < MaZe> and that's where all the action is
05:22 < flips> action :)
05:22 < MaZe> get_sb_bdev also exclusively opens the block device for us, so that's nice
05:22 < flips> finally, after 4 days of tux3 U
05:22 < MaZe> at the point we enter into junkfs_fill_super, we have an exclusively opened block device
05:22 < MaZe> which is passed in the superblock
05:23 < MaZe> sb->s_bdev
05:23 < MaZe> in junkfs_fill_super we then proceed to allocate memory for 3 basic objects
05:23 < MaZe> 1) memory to read in the 512 byte (SB_SIZE) superblock
05:23 < flips> 1 sector sb, leet
05:23 < MaZe> 2) an object to store state (in the bio->bi_private field)
05:24 < MaZe> 3) a bio
05:24 < MaZe> 1 and 2 are just normal kmalloc's
05:24 < MaZe> 3 is via bio_alloc
05:24 < MaZe> thus 1 and 2 will need to be kfree'd
05:24 -!- Bushman [~marcin@c-76-23-106-132.hsd1.sc.comcast.net] has joined #tux3
05:24 < MaZe> and 3 will need to be bio_put'ed at some point before the end of junkfs_fill_super
05:24 < MaZe> or we'll leak
05:24 < MaZe> anyway, standard handling of error returns on all the allocs
05:25 < MaZe> and we get to:
05:25 < MaZe> bio->bi_bdev = sb->s_bdev;
05:25 < MaZe> bio->bi_sector = 0; // first sector
05:25 < MaZe> s = bio_add_page(bio, virt_to_page(buf), SB_SIZE, offset_in_page(buf));
05:25 < MaZe> which is most of the bio preparation stage
05:25 <@shapor> Bushman: hi Marcin
05:25 < flips> the real meat
05:25 < MaZe> we set the bio to refer to the correct block device
05:25 < flips> marcin, hi
05:25 < MaZe> and (for now - this is all junkfs ;-) ) we just read the first sector
05:25 < MaZe> sectors in new linux are always exactly 512 bytes
05:25 < flips> that's leet nuff for us
05:26 < MaZe> so we're saying here offset 0 * 512 into the block dev
05:26 < MaZe> then we need to tell the bio where to store the data
05:26 < MaZe> (or read from, since a write would be identical)
05:26 < flips> right, struct bio is sector-addressed for no good reason
05:26 < MaZe> s = bio_add_page(bio, virt_to_page(buf), SB_SIZE, offset_in_page(buf))
05:26 < Bushman> hello Daniel
05:27 < MaZe> this actually gives our carefully allocated memory to the bio as memory
05:27 < flips> bushman, enjoy ;)
05:27 < MaZe> note that bio_add_page takes (bio, struct page*, len, ofs)
05:27 < Bushman> i dunno if enjoy is the right word for kernel code just before bedtime ;)
05:27 < MaZe> so we pass in the bio, then convert the buf's address to a page via virt_to_page
05:27 < flips> and you could write it out in full in about as much code as the function call takes
05:27 < MaZe> pass the length of the block
05:28 < MaZe> and calc the offset from the page struct for the ofs via offset_in_page
05:28 < flips> bushman, then just enjoy the geek banter
05:28 <@shapor> virt_to_page?
05:28 < MaZe> I'm assuming at this point that a kmalloc can't give us memory split across pages
05:28 < MaZe> - not sure if this is correct
05:28 < flips> shapor, great question
05:28 < flips> maze, correct
05:28 < MaZe> so buf was kmalloc'ed, so it's a virtual kernel memory address
05:29 < flips> maze, unless the kmalloc is bigger than a page
05:29 < MaZe> virt_to_page gives us the struct page * for the kaddr we pass to it
05:29 < MaZe> [flips: of course]
05:29 < flips> maze, and why do we need the struct page?
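
The page/offset arithmetic discussed above can be sketched in userspace C. This is only an illustration, not junkfs or kernel code: the sketch_ names are hypothetical, a 4096-byte page is assumed, and a plain page index stands in for the struct page * that virt_to_page would return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration; the kernel's PAGE_SIZE depends on the architecture. */
#define SKETCH_PAGE_SIZE 4096u

/* Userspace stand-in for offset_in_page(): byte offset of addr inside its page. */
static inline size_t sketch_offset_in_page(uintptr_t addr)
{
    return (size_t)(addr & (SKETCH_PAGE_SIZE - 1));
}

/* Index of the page holding addr (a stand-in for the struct page lookup). */
static inline uintptr_t sketch_page_index(uintptr_t addr)
{
    return addr / SKETCH_PAGE_SIZE;
}

/*
 * True if [addr, addr + len) stays inside one page, so that a single
 * (page, len, ofs) triple is enough to describe the buffer.
 */
static inline int sketch_fits_in_one_page(uintptr_t addr, size_t len)
{
    assert(len > 0);
    return sketch_page_index(addr) == sketch_page_index(addr + len - 1);
}
```

With these assumptions, a 512-byte buffer at 0x1e00 fits in one page, while the same buffer at 0x1f00 would straddle two pages, which is the case MaZe's kmalloc assumption rules out.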
05:29 < MaZe> because that's what bios want
05:29 < MaZe> if you look at what a bio is
05:29 < MaZe> it's 3 things
05:29 < MaZe> the struct bio
05:30 < MaZe> which has a lot of management fields
05:30 < MaZe> the bvec which
05:30 < MaZe> is an array of a tiny struct with 3 fields
05:30 < MaZe> { struct page * p; int len; int ofs; }
05:30 < MaZe> so basically a list of where to put the next len bytes, specifying memory via page/ofs pairs
05:31 < MaZe> this is for two reasons:
05:31 < MaZe> [at least as far as i can tell]
05:31 < MaZe> a) most hw (ie. stuff the blockdevice drivers care about)
05:31 < MaZe> cares about physical addresses and not virtual kernel addresses
05:31 < flips> right
05:31 < MaZe> ie. for dma and all that good for performance goodness
05:31 < MaZe> b) this can also be used for data xfr into userspace
05:32 < MaZe> and there is no guarantee userspace memory has a mapping into kernel space
05:32 < MaZe> [high mem]
05:32 < flips> the big reason: scatter gather
05:32 < flips> this is a dma interface in disguise
05:32 < flips> very effective one
05:32 < MaZe> this also makes it easier to coalesce physically neighboring memory together into the bvecs
05:32 < MaZe> precisely
05:32 < flips> right, another way of saying scatter gather
05:33 < MaZe> notice that in bio_alloc
05:33 < MaZe> we passed in a 1
05:33 < MaZe> that 1 is the number of bvecs in the bvec area allocated to the bio
05:33 < MaZe> so that limits how many non-contig pieces of memory we can have in the bio
05:33 <@shapor> ah
05:33 < MaZe> here - all we need is 1
05:33 < flips> and because you did that, you could have initialized your one bvec with a simple structure assignment
05:33 < flips> instead of the function call
05:33 < MaZe> right.
05:33 < flips> which does a bunch of stuff you don't need
05:34 < MaZe> oh well.
05:34 < RazvanM> does a bio_vec describes exactly one page?
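
A toy userspace sketch of the scatter-gather list described above. It assumes, purely for illustration, that each entry stays inside a single page and that pages are 4096 bytes; struct sketch_vec and sketch_build_vecs are hypothetical names and are not the kernel's struct bio_vec or bio_add_page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for illustration */

/* Hypothetical analogue of the { page, len, ofs } triple. */
struct sketch_vec {
    uintptr_t page; /* page index, standing in for struct page * */
    size_t len;     /* bytes covered by this entry */
    size_t ofs;     /* byte offset inside the page */
};

/*
 * Describe [addr, addr + len) as a list of page-bounded segments.
 * Returns the number of entries written, or -1 if max_vecs entries
 * are not enough (compare the count passed to bio_alloc).
 */
static int sketch_build_vecs(uintptr_t addr, size_t len,
                             struct sketch_vec *vecs, int max_vecs)
{
    int n = 0;

    assert(vecs != NULL && max_vecs >= 0);
    while (len > 0) {
        size_t ofs = (size_t)(addr % SKETCH_PAGE_SIZE);
        size_t room = SKETCH_PAGE_SIZE - ofs;   /* bytes left in this page */
        size_t chunk = len < room ? len : room;

        if (n == max_vecs)
            return -1;
        vecs[n].page = addr / SKETCH_PAGE_SIZE;
        vecs[n].len = chunk;
        vecs[n].ofs = ofs;
        n++;
        addr += chunk;
        len -= chunk;
    }
    return n;
}
```

Under these assumptions a 512-byte buffer that sits inside one page needs a single entry, which matches the 1 passed to bio_alloc; a buffer that straddles a page boundary would need two.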
05:34 < flips> maze, exactly
05:34 < MaZe> no
05:34 < RazvanM> bv_len
05:34 < MaZe> it describes a start page with offset and a length
05:34 < MaZe> the length may exceed that page and cross into however many next ones
05:34 < MaZe> the precise rules for merging are overridable
05:34 < flips> it describes a data region that resides within one page
05:34 <@shapor> so the bio interface will be quite good for extents
05:35 < MaZe> many device drivers have limits on how many sectors they can transfer in one go (ie. 200 or so)
05:35 < flips> maze, you can't cross a page with a bvec
05:35 < MaZe> flips, you sure?
05:35 < flips> sadly, or perhaps sanely
05:35 < MaZe> I certainly ain't ;-)
05:35 < flips> pretty sure
05:35 < MaZe> but then I don't know what I'm talking about here
05:36 < flips> never seen it done ;)
05:36 < MaZe> these are still all guesses
05:36 < Bushman> pollacks ain't sane, just ask Shap
05:36 < MaZe> I thought they merged by themselves
05:36 < MaZe> hmm, well, first homework I'd guess
05:36 < RazvanM> one more q: bv_len is counting bytes or sectors? :P
05:36 < flips> merging happens in the physical driver
05:37 < flips> good question
05:37 < MaZe> anyway bio_add_page returns how much it successfully added (or what the current total is, not sure) in bytes
05:37 < flips> bytes I think
05:37 < MaZe> so if everything is good it should be 512 at this point
05:37 < MaZe> hence the check
05:37 < flips> it's pretty badly braindamaged in that respect, counting in different units for no good reason
05:37 < MaZe> if it doesn't match, we've got a problem - which mind you - AFAICT - can't happen
05:37 < MaZe> and we bio_put to free the structure and basically error out
05:38 < MaZe> [of course here we always error out, because this is junkfs (tm)]
05:38 < MaZe> anyway if s==512 then we're good
05:38 < flips> oh bv_len is definitely bytes
05:38 < MaZe> we set up two more fields in the bio
05:38 < MaZe> bi_end_io is the call back for when the bio is processed (or errors out)
05:39 < flips> when the disk completion interrupt fires
05:39 < flips> key point
05:39 < MaZe> bi_private is a pointer to our data (the mz struct) so that we can figure out what we're talking about in the endio handler
05:39 < MaZe> and then we submit the bio for READ
05:39 < MaZe> now this (ie. bios) are inherently asynchronous
05:39 < MaZe> so at this point it might have already completed - it could have been cached and come back immediately
05:39 < flips> right... it's the _only_ way to recover a memory context for a completed bio
05:40 < MaZe> [I think]
05:40 < MaZe> or we might need to wait some indeterminate amount of time
05:40 < flips> it's much more direct than that
05:40 < MaZe> here's where we make use of the waitqueue which we helpfully placed in the mz struct
05:40 < flips> disk raises interrupt -> endio gets called
05:40 < flips> in interrupt context
05:40 < flips> this is as on the metal as you will get without going hypervisor
05:41 < MaZe> oh, so basically end_io should do as little as feasibly possible
05:41 < MaZe> preferably as simple as it is here
05:41 < flips> yes
05:41 < flips> again yes
05:41 <@shapor> is it the right place to call bio_put ?
05:41 < flips> though I often get excessive there ;)
05:41 < MaZe> anyway, earlier on, we'd already initialized the waitqueue, so now we can just wait on it
05:41 <@shapor> in the endio handler?
05:41 < MaZe> except wait needs not only a waitqueue (wq) but also a condition
05:42 < MaZe> [which it checks _first_]
05:42 < flips> maze, _interruptible?
05:42 < MaZe> hence mz struct also contains a boolean
05:42 < MaZe> flips: yeah, no idea what the right choice is there, meaning to ask about this
05:42 < flips> shapor, yes
05:42 < flips> very important question
05:42 < Bushman> flips, so how would it behave in a hypervisor? any changes? does it lose determinism?
05:42 <@shapor> why does it matter?
05:42 < flips> if interruptible, you better be prepared to field anything that can be thrown at you
05:43 < flips> if uninterruptible, you'd better be able to prove it always completes
05:43 < Bushman> is that the basis for atomicity then?
05:43 < MaZe> so what could get thrown at us, and will the bio always complete?
05:43 <@shapor> flips: what happens if there is an error
05:43 < flips> bushman, we don't touch hypervisors
05:43 <@shapor> disk io error or something
05:43 < flips> if we did, it would be to implement hard realtime or something
05:43 < MaZe> hypervisors should be transparent to the os
05:43 <@shapor> does the endio handler get called?
05:44 < MaZe> yes endio has err parameter
05:44 < flips> bushman, there is some sense of atomicity here in the interruptible/noninterruptible distinction
05:44 < flips> loose sense
05:44 < MaZe> just to finish off this (junkfs_fill_super) function, we then dump the superblock via printk and free everything and return an error (junkfs remember.?)
05:44 < flips> maze, in kernel interrupts don't just happen, you have to ask for them
05:44 < MaZe> even with preemption
05:44 < MaZe> ?
05:45 < flips> or they get fielded on syscall exit
05:45 < Bushman> SHOULD be transparent, but since most of them mangle time into nonlinear, doesnt it screw up our predictions when interrupt is gonna finish?
05:45 < flips> task switch is not interrupt
05:45 < flips> it's caused by an interrupt
05:45 <@shapor> oh i see you just aren't checking the err parameter in end_io_read
05:45 < flips> you can get a task switch even with wait_uninterruptible
05:45 <@shapor> probably should ;)
05:45 < MaZe> so while in kernel space, my thread of execution is guaranteed not to get interrupted by anything?
05:45 < MaZe> right I should ;-)
05:45 < flips> all that means is, an interrupt won't cause the wait to bail early
05:46 < flips> you have to wrap your interruptible wait in a loop
05:46 < flips> or write uninterruptible
05:46 < MaZe> so interruptible here refers to what? can be interrupted by killing the mount process?
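
The two points made above (record the err that the completion handler receives, and re-check the condition in a loop because a sleep may end before the I/O is done) can be sketched in single-threaded userspace C. Everything here is a hypothetical stand-in: struct sketch_req plays the role of the mz struct, sketch_end_io the endio handler, and the sleep_once callback simulates one sleep on the wait queue. It is not the kernel wait queue API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-request state, in the role of the "mz struct". */
struct sketch_req {
    int done; /* the condition the waiter checks */
    int err;  /* error recorded by the completion handler, 0 on success */
};

/* Completion handler: do as little as possible, record the error, flag completion. */
static void sketch_end_io(struct sketch_req *req, int err)
{
    req->err = err;
    req->done = 1;
    /* a real handler would wake the wait queue here */
}

/*
 * Wait for completion. sleep_once() stands in for one sleep that may
 * return before the request is done, so the condition is checked first
 * and re-checked after every wakeup. Returns the recorded error.
 */
static int sketch_wait(struct sketch_req *req,
                       void (*sleep_once)(void *ctx), void *ctx)
{
    assert(req != NULL && sleep_once != NULL);
    while (!req->done)
        sleep_once(ctx);
    return req->err;
}

/* Simulated device: a few early wakeups, then the "interrupt" fires. */
struct sketch_sim {
    struct sketch_req *req;
    int early_wakeups; /* wakeups that arrive before completion */
    int err;           /* error the simulated device reports */
    int sleeps;        /* how many times the waiter slept */
};

static void sketch_sim_sleep(void *ctx)
{
    struct sketch_sim *sim = ctx;

    sim->sleeps++;
    if (sim->early_wakeups > 0)
        sim->early_wakeups--;              /* woke up, nothing completed yet */
    else
        sketch_end_io(sim->req, sim->err); /* completion arrives */
}
```

In this sketch the waiter cannot mistake an early wakeup for completion, and an I/O error reaches the caller as the return value instead of being dropped in the handler.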
05:46 < flips> which is probably what you want here
05:46 < flips> just means the wait may bail before the wak
05:46 < flips> wake
05:47 < flips> so has to be in a loop, and you can't assume that what you were waiting for actually happened
05:47 < Bushman> so i guess the big question here is how do we guarantee that the write is gonna complete?
05:47 < MaZe> so I'd want uninterruptible? or interruptible and then on some interrupts somehow cancel and free the bio
05:47 < flips> just write uninterruptible until you know kernel scheduling better ;)
05:47 < MaZe> (read here)
05:47 <@shapor> uninterruptable will cause it to be D too iirc
05:47 < flips> bushman, it always completes
05:47 <@shapor> D state
05:47 < flips> with or without an error
05:47 <@shapor> Bushman: it may complete with an error
05:48 <@shapor> which gets passed to the endio handler
05:48 < flips> yes, this is d state, the real thing
05:48 < MaZe> which as written ignores all errors, and just marks the io as completed, frees the bio, and wakes the wq
05:48 <@shapor> interruptable is not quite so severe i guess
05:48 < flips> you are in d state any time you're waiting in kernel
05:48 <@shapor> even interruptable?
05:48 < flips> yes
05:48 < MaZe> unless you're doing wait_interruptible?
05:49 < flips> hmm
05:49 <@shapor> flips: didn't we find that not to be the case
05:49 <@shapor> with ddsnap
05:49 < flips> even then I think
05:49 < MaZe> hmm, so how could I get this to be abortable, in case for example the block device hangs on network?
05:49 <@shapor> remember our threads were all D state
05:49 < flips> you get a qualifier on your ps output
05:49 <@shapor> until we changed it to interruptable
05:49 < flips> maze, that's not your job, it's the job of the device insert/remove
05:50 < flips> which of course means it's badly mismanaged ;)
05:50 < flips> but...
05:50 < flips> not your problem for now
05:50 < MaZe> well what if we're running this off of a nbd or something like that, and the network gets pulled
05:50 < MaZe> would the bio then just (eventually) return with an error to endio?
05:50 < flips> that's nbd's problem
05:50 < flips> again not yours
05:51 < flips> you can try to do timeouts and things, but you're risking redundancy
05:51 < flips> and confusion
05:51 < MaZe> right
05:51 <@shapor> risking redundancy ?
05:51 < flips> duplicating functionality that is better performed at some other layer
05:52 < flips> constant risk with the blind leading the blind ;)
05:52 <@shapor> yeah
05:52 <@shapor> good point
05:52 <@shapor> but the blind leading the deaf is ok
05:52 < flips> maze, that was a great walkthrough, and the code is great too
05:52 <@shapor> yes!
05:52 < flips> not perfect, but you don't need that to be great in linux ;)
05:52 < MaZe> I still don't quite understand a bunch of it
05:52 <@shapor> MaZe: thanks, i was following closely with little time to type
05:52 < flips> a few warts make it more real, like a european movie
05:53 <@shapor> hah
05:53 * Bushman rolls eyeballs
05:53 < MaZe> lol
05:53 < flips> maze, I am going to cut and paste your code into fs/tux3/super.c
05:53 < flips> and tux3 is going to read a leet sector sized sb too
05:53 <@shapor> heh
05:54 <@shapor> s/junkfs/tux3/
05:54 < MaZe> hehe
05:54 < flips> exactly
05:54 < flips> or s/tux3/junkfs/
05:54 < flips> depending on leetness or lack of it
05:54 <@shapor> so it seems silly for every fs to have to do this
05:54 <@shapor> is the vfs totally useless?
05:54 < flips> yes
05:54 < flips> pretty much
05:54 < MaZe> what I still haven't found is how to specify the io priority of the bio you submit
05:54 < flips> pretty close
05:54 < flips> not completely
05:55 < flips> lame but not useless
05:55 < flips> better than NT
05:55 < MaZe> I'm assuming it inherits from the ionice'ness of the process in whose context you're running
05:55 < flips> maze, completely separate
05:55 < flips> it's part of the elevator abstraction
05:55 <@shapor> oh?
05:56 < MaZe> huh?
05:56 <@shapor> i was wondering that too
05:56 < flips> inheriting anything is completely a property of the elevator plugin
05:56 < MaZe> shouldn't submitting a read/write request to a blockdevice be exactly when this matters?
05:56 < flips> see "request queue"
05:56 < MaZe> oh, the mysterious q parameter
05:56 < flips> one of the harder code reading projects in kernel
05:56 < flips> it's a mess
05:56 < MaZe> I saw all over the place
05:56 < MaZe> that is apparently a field in the bio struct
05:57 < flips> q is a carpet under which all kinds of doggie poo is swept
05:57 < flips> it's really a bag tied onto the side of the bio
05:57 < flips> we'll get rid of it before next christmas
05:57 < flips> I hope
05:57 < MaZe> I just want a nice aio read/write with priority interface for my coding
05:57 < flips> you got it
05:57 < flips> already
05:58 < flips> well s/nice/nicer than what we had before/
05:58 <@shapor> that would be a good project.. a new aio interface
05:58 < MaZe> right, I have the aio rw
05:58 <@shapor> sounds like it should map easily enough....
05:58 < flips> bio transfer is aio at its purest
05:58 <@shapor> yeah
05:58 < MaZe> right, but you want prioritization in there
05:58 <@shapor> should be easier than non aio really
05:58 < MaZe> and that's what I'm failing to see
05:58 < flips> maze, in the elevator
05:58 < Bushman> 'scuze my newbness, but wouldnt priority be at odds with queuing that the controllers try to do?
05:58 < MaZe> so does the bio go through the elevator?
05:59 < flips> bushman, interactions, yes
05:59 < flips> not all good
05:59 < MaZe> well, you want something htb like for io
05:59 < flips> best to try and harmonize with them
05:59 < MaZe> wait a minute, what's the layering here?
05:59 < MaZe> is the physical hw under the elevator under the bio
05:59 < flips> vfs <-> bio <-> driver
06:00 < MaZe> and where's the elevator?
06:00 <@shapor> between bio and driver
06:00 < flips> vfs <-> bio <-> elevator <-> driver
06:00 <@shapor> right?
06:00 < MaZe> vfs <-> bio <-> elevator <-> driver
06:00 < MaZe> ?
06:00 < flips> heh
06:00 <@shapor> heh
06:00 < flips> exactly
06:00 < MaZe> so by choosing the request queue in the bio, I choose priority of the request with regards to other requests?
06:00 < flips> and the presence/lack of the elevator is up to the driver or virtual driver even
06:01 < flips> so the elevator can appear at multiple or no places in the stack
06:01 <@shapor> so the elevator messes with fields in the bios?
06:01 < MaZe> is this screwy? or is this just me...?
06:01 < flips> and vice versa in an idiotic way... sometimes useful way
06:01 < flips> maze, it's screwy
06:01 < flips> not just you
06:01 < flips> but better than we had in 2.4
06:02 < flips> it's damn fast actually, compared to a disk
06:02 < flips> we didn't have that a few years ago
06:02 < flips> now it's looking slow again
06:02 < flips> and people are asking me to fix it
06:02 < flips> it shall be done
06:02 < MaZe> wait a minute - what is slow?
06:03 < MaZe> the interfaces / kernel code?
06:03 < flips> this who kooky chain
06:03 < flips> whole
06:03 < flips> vfs <-> bio <-> elevator <-> driver
06:03 < flips> layering is right
06:03 < flips> implementation is faulty
06:03 < MaZe> agreed
06:04 < flips> anyway
06:04 < flips> we're using the existing one for now
06:04 < flips> it will work for tux3 as well as it works for anybody
06:04 < flips> better, because we will use it more directly
06:04 < flips> and have fewer strange waits and so on
06:04 < MaZe> right
06:04 < flips> and when we do see a strange wait, we will be able to pounce on it
06:04 < MaZe> that's why I wanted to go all the way down to the bio on the sb read
06:04 < MaZe> a) for practice
06:05 < MaZe> b) because it's the way it should be done
06:05 < flips> unlike if you use the... odd... vfs block io helpers
06:05 < flips> well I think we are going to stay all the way down here for tux3
06:05 < flips> tux3 has no use asking other subsystems to submit bios on its behalf, unless that subsystem is an lvm
06:06 < flips> and even then, we just submit a bio to the lvm without caring it's not a real device
06:06 < MaZe> still have to figure out how to do mmap like stuff (ie. trigger read in, on page fault, or write out, both for kernel and userspace, and cow, etc)
06:06 < flips> maze, handled for you
06:06 < flips> like magic
06:06 < MaZe> cool - assuming it does the right thing (tm)
06:06 < flips> see filemap.c -> nopage
06:06 < flips> kinda right
06:06 < flips> some messed locking
06:06 < MaZe> which I'm not sure it does for cache coherency netfs
06:07 < flips> bottlenecks on i_mutex during fault in
06:07 < flips> bad
06:07 < MaZe> so it probably needs to be gone through with a fine comb then
06:07 < flips> even nfs is cache coherent/consistent with respect to mmap
06:07 < MaZe> as I was expecting
06:07 < flips> yes
06:07 < flips> right in to the danger zone
06:08 < flips> speaking of which
06:08 <@shapor> what bottlenecks on i_mutex?
06:08 < flips> time to turn on the ghetto blaster
06:08 < flips> and get back to coding
06:08 < MaZe> I'm assuming the code in filemap.c which deals with page-in/outs of mmapped pages
06:08 < MaZe> oh, right it's already 10 past 9
06:08 < MaZe> so is that it for this time?
06:08 * flips puts on Holst's the planets, performed by korean rock band
06:08 * shapor scrolls back to remember his homework
06:09 < flips> that's it, nice one maze
06:09 < Bushman> is anybody sticking around to ask lame(er) questions?
06:09 < flips> next time it will be razvanm's turn
06:09 < RazvanM> :P
06:09 < MaZe> oh, awesome, what's he doing?
06:09 < flips> to explain some more of _2copy
06:09 < MaZe> ah -- begin of open questions
06:09 < flips> lame question period is officially open
06:10 < flips> intelligent questions banned
06:10 < Bushman> what's an elevator?
06:10 * RazvanM doesn't have anything to ask this time
06:10 < flips> a kernel elevator
06:10 < MaZe> when you read/write data to a hard disk
06:10 < flips> otherwise you're going to get some dumb jokes
06:10 < MaZe> which is a spinning platter with a seeking head
06:10 <@shapor> elevator = io scheduler
06:10 < MaZe> then depending on the order you send out request
06:10 < tim_dimm_> just caught up
06:10 < MaZe> you may need to do a small or large number of seeks
06:10 < tim_dimm_> like tivo for geeks
06:10 < flips> yup, and its algorithms are the same as a busy elevator in a skyscraper
06:10 < MaZe> seeks are very expensive
06:11 < MaZe> so you try to minimize seeks
06:11 < MaZe> for good performance (b/w), but higher latency
06:11 < tim_dimm_> so are tlb misses
06:11 < flips> and page cache misses
06:11 < MaZe> you basically scan the disk from top to bottom, doing read writes at increasing lba addresses
06:11 < MaZe> irregardless of the order they were submitted in
06:11 < MaZe> then do the same thing going downwards
06:12 < flips> somewhat downwards
06:12 < Bushman> ok great, but from this level, can we be aware of what media we're writing to so we dont make it overinvolved in cases it doesnt matter, like solid state disks?
06:12 < MaZe> right
06:12 < flips> the disk doesn't like going backwards as much as forwards
06:12 < MaZe> the consecutive read/write sectors are still upwards
06:12 <@shapor> Bushman: you can pick an io scheduler on a per-block-device basis
06:12 < MaZe> and sometimes you skip the backwards step entirely
06:12 < MaZe> depends
06:12 < flips> bushman, mostly we don't care, where we do care we care a lot
06:12 < MaZe> lots of fine tuning required to get optimal performance
06:12 < MaZe> and it heavily depends on usecases
06:13 <@shapor> /sys/block/sda/queue/scheduler
06:13 < Bushman> as long as it's adjustable from userspace i'm good ;)
06:13 < MaZe> plus you can throw in individual io priorities into the mix (ie. reading this sector is more important)
06:13 < flips> we try to design for whole classes of usecases, rather than one at a time
06:13 < MaZe> and b/w per job, and hard read/write deadlines, etc
06:13 < MaZe> and it all gets complex
06:13 <@shapor> http://friedcpu.wordpress.com/2007/07/17/why-arent-you-using-ionice-yet/
06:13 < Bushman> shapor, nice, i havent gotten used to the new linux, i've been bsd'ing since '03
06:13 <@shapor> i only recently discovered ionice
06:13 < MaZe> and the elevator is the piece of code which gets requests thrown at it
06:13 <@shapor> i think mentioned on here
06:14 < MaZe> does some algo mumbo jumbo to put them in the 'best' order
06:14 < flips> shapor, because it doesn't work that well?
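
A toy userspace sketch of the reordering idea described above: requests are serviced at increasing sector numbers from the current head position, and the backwards step is skipped by wrapping around to the lowest pending sector. This is an assumed, simplified one-way sweep for illustration only and does not reproduce any actual Linux I/O scheduler; sketch_elevator_order is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>

/* Compare two sector numbers for qsort(). */
static int sketch_cmp_sector(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;

    return (x > y) - (x < y);
}

/*
 * Reorder reqs[0..n) in place into the order a one-way sweep starting
 * at sector `head` would service them: everything at or beyond head in
 * ascending order, then the remaining lower sectors, also ascending.
 */
static void sketch_elevator_order(unsigned long *reqs, size_t n,
                                  unsigned long head)
{
    size_t first = 0, i;
    unsigned long *tmp;

    assert(reqs != NULL || n == 0);
    if (n == 0)
        return;
    qsort(reqs, n, sizeof(*reqs), sketch_cmp_sector);
    while (first < n && reqs[first] < head)
        first++;                 /* skip sectors already behind the head */
    tmp = malloc(n * sizeof(*tmp));
    if (tmp == NULL)
        return;                  /* fall back to plain ascending order */
    for (i = 0; i < n; i++)
        tmp[i] = reqs[(first + i) % n]; /* rotate so the sweep starts at head */
    for (i = 0; i < n; i++)
        reqs[i] = tmp[i];
    free(tmp);
}
```

The submission order is ignored entirely, which is where the throughput gain and the extra latency for unlucky requests both come from.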
06:14 < MaZe> and throws them at the disk
06:14 <@shapor> flips: yes but the interface is there
06:14 <@shapor> if people use it they can report bugs
06:14 < flips> sure
06:14 <@shapor> if people dont report bugs or say it sucks on lkml it wont get fixed
06:14 <@shapor> same problem with posix_fadvise
06:14 < MaZe> note that for a network nic
06:14 < flips> we will take it for a spin at some point
06:14 < MaZe> you have a certain amount of b/w
06:14 < flips> maze will ;)
06:15 < MaZe> and it's all pretty easy - conceptually
06:15 < flips> and shapor will make some nice charts of the event logs
06:15 < flips> vfs + bio events
06:15 <@shapor> oh i almost forgot about that
06:15 < MaZe> sending each packet involves a fixed amount of headroom, (header fields), the packet itself, and a fixed footer
06:15 <@shapor> still no clue how to glue those together
06:15 < MaZe> so when you send a packet you know exactly how much of the nic (ie. for how long) you're using it up
06:15 < MaZe> thus you can make very nice guarantees
06:16 < MaZe> and this is what htb + sfq does for networking
06:16 <@shapor> htb? sfq?
06:16 < MaZe> you can partition your network card pretty much arbitrarily between different apps
06:16 < MaZe> giving different apps different priorities, then different priorities different amounts of bw
06:16 < MaZe> and the priorities don't need to be strictly linear either
06:16 < flips> htb? sfq?
06:16 < MaZe> htb
06:16 < Bushman> oh could i get in on the testing? i've done a lot of work visualizing sequences of events in temporal OSPF loops, this should be i could do ;)
06:16 < MaZe> htb is basically a tree structure
06:17 < MaZe> the nodes are where requests come in
06:17 < flips> what's the tla mean?
06:17 < MaZe> the root is where requests come out
06:17 < MaZe> so each application (or tcp stream, or whatever you're using) gets assigned to a leaf node in this tree
06:17 < flips> (Stochastic Fairness Queueing)
06:17 < MaZe> and the network driver then (when it wants to send) always pulls from the root
06:18 < flips> gah
06:18 < MaZe> each node in this tree has a certain speed of accumulating tokens
06:18 < MaZe> (htb = hierarchical token buckets)
06:18 < MaZe> that it accumulates in the bucket in that node
06:18 < Bushman> wouldnt stochastic approach mean that every client is equally unhappy? ;)
06:19 < MaZe> Bushman: sfq is used in the leafs to randomly select between clients / tcp streams you consider equivalent
06:19 < MaZe> you hang an sfq off of each leaf node in htb, so you actually throw the packets at the correct sfq, and the htb leaf pulls it from the attached sfq
06:19 < flips> network peeps are always reinventing the world ;)
06:19 < Bushman> ah, so you use the hierarchical token buckets to assign different classes of service to different apps/streams?
06:19 < MaZe> anyway, you divide up each node's bandwidth among its children
06:20 < MaZe> and then define how and when they can borrow/lend tokens to each other
06:20 < MaZe> I'm not doing a very good job of defining it here
06:20 < MaZe> but it's wicked!
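
A minimal userspace sketch of the token-bucket idea described above: each node accumulates tokens at its own rate up to a cap, and a node that is short may borrow the shortfall from its ancestors. The names, the tick-based refill and the borrowing rule are all assumptions made for illustration; the real Linux HTB qdisc is considerably more involved and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token bucket node; parent is NULL at the root. */
struct sketch_bucket {
    long tokens; /* bytes this node may currently send */
    long rate;   /* tokens gained per tick */
    long cap;    /* upper bound on stored tokens */
    struct sketch_bucket *parent;
};

/* Accumulate tokens for `ticks` time units, clamped to the cap. */
static void sketch_refill(struct sketch_bucket *b, long ticks)
{
    assert(b != NULL && ticks >= 0);
    b->tokens += b->rate * ticks;
    if (b->tokens > b->cap)
        b->tokens = b->cap;
}

/*
 * Try to pay for `bytes`. The node spends its own tokens first and
 * borrows any shortfall from its ancestors. Returns 1 if the packet
 * may be sent, 0 if it has to wait; nothing is consumed on failure.
 */
static int sketch_try_send(struct sketch_bucket *b, long bytes)
{
    long own, shortfall;

    assert(b != NULL && bytes >= 0);
    own = b->tokens < bytes ? b->tokens : bytes;
    shortfall = bytes - own;
    if (shortfall > 0 &&
        (b->parent == NULL || !sketch_try_send(b->parent, shortfall)))
        return 0;
    b->tokens -= own;
    return 1;
}
```

Because the cost of a send is known up front (the byte count), the check is exact, which is the property MaZe contrasts with disk I/O where the cost of an operation is not known in advance.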
06:20 < tim_dimm_> no- you're doing a great job
06:20 < flips> maze, I'm getting the idea
06:20 < tim_dimm_> sounds wicked
06:20 < Bushman> yea i just did a project with filtering/limiting at work, so i'm getting it
06:21 < Bushman> it sounds a lot smarter than it is ;)
06:21 < flips> well, disk layer doesn't have any such pretentions to sophistication
06:21 < flips> yet
06:21 < tim_dimm_> heh
06:21 < Bushman> damn academics justifying their existence
06:21 < MaZe> anyway, basically htb + sfq is the best I've seen for networking, and would probably be awesome for other stuff as well like scheduling cpus
06:21 < flips> I can imagine the mess if it did
06:21 <@shapor> Bushman: gee filtering and limiting, i wouldn't have guessed :P
06:21 < MaZe> except it's probably too compute intensive for that and can't take cache-heat or memory nearness into account
06:21 < Bushman> shapor: stfu ;)
06:21 < flips> :)
06:22 < MaZe> anyway, with disk it gets tougher
06:22 < tim_dimm_> if it did, could be interesting as a cache coherency protocol
06:22 < MaZe> because you can't just up and calculate how long a particular operation will take
06:22 < flips> network peeps always trying to find the most obscure TLA
06:22 <@shapor> Bushman: don't you guys use bullets for limiting ? :P
06:22 < flips> mot <- most obscure tla
06:22 <@shapor> haha
06:22 < MaZe> (with the nic, you know its line rate, you know how many bytes you're sending, the size of the pre and post-amble, the wait between packets, you thus know the _entire_ cost of sending any given packet)
06:22 < Bushman> dont make me whip out stories about invalidating keys with thermite grenades
06:22 < tim_dimm_> motley cru
06:23 < MaZe> tla?
06:23 < MaZe> mot?
06:23 < flips> maze, and you don't know how much carrier sense backout is going to cost ;)
06:23 <@shapor> most obscure three letter acronym
06:23 < MaZe> ah, so you use the hierarchical token buckets to assign different classes of service to different apps/streams? - precisely
06:23 < flips> and that's where your pretentions to realtime control come crashing down
06:23 <@shapor> which is a fla
06:24 <@shapor> which is a tla
06:24 <@shapor> which is a tla
06:24 < flips> third time lucky
06:24 < MaZe> for example I would give each user in my network their own sfq for local traffic to another nic (just switching) to another network via wireless and to the internet (via the same wireless)
06:24 < Bushman> to make delivery time guaranteed, wouldn't you have to have full preempt kernel? (oh i miss 80ties Amigas)
06:24 * flips thinks of some keys he'd like invalidated
06:24 < MaZe> and then use htb to make sure everything was fair on the slow internet link, and on the others at the same time - worked awesome
06:25 < MaZe> be right back in 10.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkjWoO8ACgkQydrGfzV1md20igCg7GnJrsik45uVvCqX1i4QN7Q8
VkEAmgLDkQOSpb3bS0Hi/mpxc+UTstgD
=oZ7Z
-----END PGP SIGNATURE-----
_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3
