In fact, even with mmap(), it's not guaranteed. There are really crappy mmap implementations out there, partly due to bad CPU design (virtual CPU caches without coherency), but more often due to total crap OS. (Yeah, Linux did count in that area at some point. Long ago. Early 90's. Maybe) I think HP-UX used to have non-coherent mmap for the longest time, due to carrying around some totally crap memory management based on some ancient BSD version that everybody else (including the BSD's) had long since jettisoned. That said, I suspect any unix you can run today (without calling it a retro setup) probably has coherent-enough mmap. The possible virtual cache coherency issue is unlikely to be able to trigger this (and not relevant on any sane hardware anyway). Linus -
I've just checked the Mac OS X build and it looks like there is a mmap and git is indeed using it, so this is obviously an example of a "really crappy" mmap implementation. This adds more ammunition to the fight against the whole Mac OS X is powered/built/based on UNIX myth. Charles. -
Looking closer, this is not necessarily the case here. Git uses MAP_PRIVATE, because that whole pack-file mapping was really *meant* to map an existing read-only pack-file, and fast-import seems to really be screwing with it. It so happens that Linux has a particularly clean and streamlined VM, and if you do only reads to a MAP_PRIVATE mapping on a normal filesystem, you'll always be as coherent as with MAP_SHARED because Linux will simply map in the page cache pages directly. But this is definitely not portable, and the git fast-import mmap window usage before Shawn's patch it was simply wrong. So in this case, it really was git that was crap. It just happened to work because the Linux mmap handling is just generally pretty sane. It probably also worked fine on pretty much any other modern UNIX (ie Solaris). I'm not quite sure what OS X does to MAP_PRIVATE mappings, but if OS X is still based on Mach (with FreeBSD just as a single-server on top), I suspect that may be why it broke on OS X. The Mach VM is insanely complex and does really odd things. But the fact is, without MAP_SHARED, you shouldn't expect things to be coherent, even if they often will be (especially for PROT_READ). Btw, even with Shawn's patch, I wonder if the index_data usage is correct. Linus -
Hmph. gfi uses data in a "pack" in two quite different ways. A new object is written to an unfinalized pack. Such a pack already has "struct packed_git" allocated for it and a pointer to it is held in pack_data. As far as the core part of git (that is, sha1_file.c) is concerned, however, this pack does not even exist. It is still not part of packed_git list in sha1_file.c, and read_sha1_file() will not see objects in it, as no idx into the packfile exists yet. gfi has a table of objects in this pack and uses gfi_unpack_entry() function to retrieve data from it. A packfile is finalized in end_packfile(). The pack header and footer is recomputed, an idx file is written, and the pack is finally registered. Before that time p->index_data is not even used for that pack (it is initialized with NULL). So I do not think "index_data usage" can be incorrect, as there won't be any index_data usage with unfinalized pack, and the core part of git would not even have any mmap(2) (nor open fd) into its idx file before it is finalized. By the way, I was quite puzzled how the gfi_unpack_entry() function manages to work correctly when it has to read an object it deltified based on another object it wrote into the same unfinalized pack earlier. It knows where in the unfinalized pack it wrote the object, so it can find from its own "struct object_entry" the offset for the object, and calls unpack_entry() defined in the core to do the rest. However, most of the core does not really know about the other objects in this half-built pack. If the object is a delta, unpack_delta_entry() needs to find the delta base. And it needs to do that without having the idx. The trick (the code really needs a bit more documentation) is that gfi never writes anything but OFS_DELTA. So the core, even though it does not have the corresponding idx file, does not have to look up the object (in fact it does not even know what object to look up for the base, it only knows the offset). -
Oh, ok. I did try to grep for index_data, and didn't find anything that looked bad, but the incestuous things that fast-import.c does just made me worry - but I was too lazy to really follow it all. It's one of those In that case, I think Shawn fixed it all, and we're all good, and it's not just hidden well enough that it "just happens" to work. Linus -
Yup. Older fast-imports (pre OFS_DELTA) had to replicate a good chunk of the unpack_entry() logic directly inside of fast-import. But it later occurred to me that OFS_DELTA simplifies the code and lets me reuse more of the existing sha1_file.c implementation. So yea, fast-import only emits OFS_DELTA, and does so only so that it can pull this delta-base-unpack trick on the core, without actually giving the core the corresponding idx file first. -- Shawn. -
