Re: [PATCH v3] Prevent megablobs from gunking up git packs

To: Junio C Hamano <junkio@...>
Cc: Git Mailing List <git@...>, <danahow@...>
Date: Saturday, May 26, 2007 - 7:48 pm

On 5/26/07, Junio C Hamano <junkio@cox.net> wrote:

No problem.

When the code decides that a blob should not be written to the output file,
I must make sure it is not used as a delta base.  A large blob
that triggers the size test and _was_ a delta base can be the result
of maxblobsize being decreased or newly specified,
in both cases without -f/--no-object-reuse,
and we need to tolerate the user forgetting the option.

To make sure that it is not so used, I re-use the trick from maxpacksize
which ensures that a delta base is not in the previous split pack:
I set the offset field to -1.  Unfortunately, I only checked for this magic
value when computing usable_delta if pack_size_limit was set.  It turns
out the test doesn't need to be conditional on pack_size_limit; it works
in all cases.  Since I need to do the test when maxblobsize is specified
and maxpacksize isn't, I deleted the pack_size_limit test.
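
A minimal sketch of the sentinel-offset idea described above.  The struct,
the macro and mark_not_written() are hypothetical simplifications for
illustration, not the actual definitions in the patch; only the name
usable_delta and the magic value -1 come from the text.

```c
#include <stddef.h>

/* Hypothetical, pared-down stand-in for an entry in the object list. */
struct entry {
	long long offset;      /* where the object lands in the output pack */
	struct entry *delta;   /* delta base, or NULL if stored whole */
};

/* Magic offset meaning "this object is not written to the current pack". */
#define OFFSET_NOT_WRITTEN (-1LL)

/* Mark an object (e.g. an oversized blob) as excluded from the output. */
static void mark_not_written(struct entry *e)
{
	e->offset = OFFSET_NOT_WRITTEN;
}

/*
 * A delta is usable only if it has a base and that base is being written.
 * The check is unconditional: it does not depend on any pack size limit.
 */
static int usable_delta(const struct entry *e)
{
	if (!e->delta)
		return 0;
	return e->delta->offset != OFFSET_NOT_WRITTEN;
}
```

Because the sentinel is checked regardless of which limit caused the
exclusion, the same test covers both the split-pack case and the
oversized-blob case.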

Now for the second hunk.  The facts above mean we could have marked
this entry as a re-used delta, yet be unable to re-use the delta
because its delta base is not being written to this pack.  So we fall into
the !to_reuse case even though the size field in the object_entry is the
size of the delta, not of the object.  We can detect this by the type coming
from read_sha1_file being unequal to the type set from the pack (which is
one of OBJ_{REF,OFS}_DELTA).  So I disable the size-matching
test in this case.
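
The detection described above can be sketched as follows.  This is an
illustration under assumptions: the struct and both helper functions are
hypothetical, and only the idea (skip the size comparison when the entry was
recorded as a delta but the data read back is a full object) comes from the
text.

```c
/* Subset of object types, enough for the illustration. */
enum object_type { OBJ_BLOB = 3, OBJ_OFS_DELTA = 6, OBJ_REF_DELTA = 7 };

/* Hypothetical, pared-down entry: type and size as recorded from the pack. */
struct entry {
	enum object_type type;   /* type taken from the pack header */
	unsigned long size;      /* for a delta type, the size of the delta */
};

static int is_delta_type(enum object_type t)
{
	return t == OBJ_OFS_DELTA || t == OBJ_REF_DELTA;
}

/*
 * Returns 1 when the data read back is consistent with the entry.
 * If the entry was recorded as a delta but we read the full object,
 * e->size holds the delta's size, so there is nothing to compare.
 */
static int size_is_consistent(const struct entry *e,
			      enum object_type read_type,
			      unsigned long read_size)
{
	if (is_delta_type(e->type) && read_type != e->type)
		return 1;
	return read_size == e->size;
}
```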


Recently Nicolas Pitre improved the code as follows:
(1) Tree-walking etc., which calls add_object_entry.
    We learn sha1, type, name (path), pack & offset, and no_try_delta
    during this step.
(2) NEW: sort a table of pointers to these objects by pack_offset.
(3) Now call check_object on each object, but in the order
    determined in (2).  We learn each object's size during
    this step.  This requires us to inspect each object's header
    in the pack(s).

The result is that we smoothly scan through the pack(s),
instead of jumping all over the place.
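
Step (2) can be sketched with qsort over a table of pointers.  This is a
simplified illustration, not Nicolas's code: the struct, its field names and
sort_by_offset() are hypothetical, and it assumes all objects live in one
pack (with several packs, the comparison would need to order by pack first).

```c
#include <stdlib.h>

/* Hypothetical, pared-down entry: just a name and its offset in the pack. */
struct entry {
	const char *name;
	long long in_pack_offset;
};

/* qsort comparator over an array of pointers to entries. */
static int by_pack_offset(const void *a, const void *b)
{
	const struct entry *ea = *(const struct entry *const *)a;
	const struct entry *eb = *(const struct entry *const *)b;

	if (ea->in_pack_offset < eb->in_pack_offset)
		return -1;
	if (ea->in_pack_offset > eb->in_pack_offset)
		return 1;
	return 0;
}

/* Sort the pointer table so headers are later visited in on-disk order. */
static void sort_by_offset(struct entry **table, size_t n)
{
	qsort(table, n, sizeof(*table), by_pack_offset);
}
```

Sorting pointers rather than the entries themselves leaves the object list
untouched, so anything that refers to entries by position stays valid.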

If I move sha1_object_info earlier, before (2), then I undo
his optimization.  This fact ultimately justifies the first two
hunks that you commented on, since it means we want
the objects to appear in the object list _before_ we can
decide not to write them, and thus we need to handle
objects not written and all their consequences
(which didn't seem too strange to me,
since you already have preferred bases).

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell