Hi, I am observing very large data transfers when pulling from the repository at <URL:http://repo.or.cz/w/wortliste.git>. This repository contains one 13 MB text file that compressed is roughly 3 MB large. While I'd expect pulling commits that change only a few lines of the large text file to result in a download of less than, say 10kB, git pull seems to transfer the complete, compressed file. I have observed this several times for different commits. On the other hand, pushing my own commits to the repository is fast (with git+ssh access method). Any ideas what's going on and how to make pulling faster? Best regards, --
When pushing you can use --thin to get a "thin pack" transfer, where the delta-base is not transferred. Not sure how you request this from git fetch these days though :-/ Search the list for "thin packs" perhaps? cheers, m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff --
Oh, ok. By the way, how long are you hitting this issue? Just today, I have upgraded the chroot Git from some anonymous 2007-12-08 version to the almost-latest #next. -- Petr "Pasky" Baudis The last good thing written in C++ was the Pachelbel Canon. -- J. Olson --
I don't know for sure. But I think I've had the issue since I started accessing that repository in October 2007 (then with WinGit-0.2-alpha). Best regards, Stephan Hennig PS: Attached is my .git/config. Could the repositoryformatversion line --
Hi, do you use HTTP or native Git protocol for pulling? If HTTP, you have to live with this, sorry - the automatic repacks will create a new pack every time and you will have to redownload the whole history; I tried to avoid this problem but in the end I had to bow down to the agressive repacking strategy because the number of packs was getting out of hand. It is technically possible to implement some alternative more HTTP-friendly packing strategy, but this has always stayed only in idea stage. If you want to implement something, repo.or.cz will become a glad user. :-) -- Petr "Pasky" Baudis The last good thing written in C++ was the Pachelbel Canon. -- J. Olson --
After the line containing "remote: Compressing objects:" had been printed several MB have been transferred. Does it matter that the original clone has been performed with plain git protocol? I have only later changed the url in .git/config to use git+ssh. Best regards, Stephan Hennig --
Seems like you're being bitten by a bug we had some months back, where the client requested full history for new tag objects. Does this bug still show up if you use the latest git from master of git.git? I *think* v1.5.4.3-440-g41fa7d2 fixed the issue, but I'm not 100% certain as the commit message doesn't mention anything about any bugs being solved. Otoh, I vaguely remember the first bug-reporter being told to try 'next', and afair, that solved it No, that doesn't matter in the slightest. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 --
Hi, Thanks for having a look at this! What does "problem with the pack" mean? Do you think it is a Git problem (client or server side?) or just a misconfiguration? Best regards, Stephan Hennig --
Hi, I thought that the blobs in the pack are just too similar. That makes for a good compression, since you get many relatively small deltas. But it also makes for a lot of work to reconstruct the blobs. I suspected that you run out of space for the cache holding some reconstructed blobs (to prevent reconstructing all of them from scratch). To see what I mean, just look at $ git -p verify-pack -v \ .git/objects/pack/pack-563c2d83940c7e2d8c20a35206a390e2e567282f.pack (or whatever pack you have there). It has this: -- snip -- chain length = 40: 7 objects chain length = 41: 8 objects chain length = 42: 4 objects chain length = 43: 8 objects chain length = 44: 6 objects chain length = 45: 2 objects chain length = 46: 6 objects chain length = 47: 2 objects chain length = 48: 2 objects chain length = 49: 2 objects chain length = 50: 2 objects -- snap -- ... but that could not be the reason, as my current git.git's pack shows this: -- snip -- chain length = 40: 122 objects chain length = 41: 99 objects chain length = 42: 77 objects chain length = 43: 76 objects chain length = 44: 69 objects chain length = 45: 72 objects chain length = 46: 66 objects chain length = 47: 103 objects chain length = 48: 77 objects chain length = 49: 111 objects chain length = 50: 86 objects chain length > 50: 60 objects -- snap -- ... which is much worse. So I tried this: -- snip -- wortliste$ /usr/bin/time git index-pack -o /dev/null .git/objects/pack/pack-563c2d83940c7e2d8c20a35206a390e2e567282f.pack fatal: unable to create /dev/null: File exists Command exited with non-zero status 128 27.12user 11.21system 2:51.02elapsed 22%CPU (0avgtext+0avgdata 0maxresident)k 81848inputs+0outputs (1134major+2042348minor)pagefaults 0swaps -- snap -- Compare that to git.git: -- snip -- git$ /usr/bin/time git index-pack -o /dev/null .git/objects/pack/pack-355b54f45778b56c00099bf45369f8a4f2704a51.pack fatal: unable to create /dev/null: File exists Command ...
Yea, that's going to be ugly. The "cache" you speak of above is held on the call stack as resolv_delta() recurses through the delta chain to reconstruct objects and generate their SHA-1s. There isn't a way to release these objects when memory gets low so your worst case scenario is a 100M+ blob with a delta chain of 50 or more - that will take you 5G of memory to pass through index-pack. jgit isn't any better here. What we need to do is maintain a list of the objects we are holding on the call stack, and reduce ones up near the top when memory gets low. Then upon recursing back up we can just recreate the object if we had to throw it out. The higher up on the stack the object is, the less likely we are to need it in the near future. The more that I think about this, the easier it sounds to implement. It wouldn't be very difficult to get that unpacked size. We just have to deflate enough of the delta to see the delta header and obtain the inflated object size from that. Unfortunately there is not an API in sha1_file.c to offer that information to callers. -- Shawn. --
Hi, Actually, there is... You would only need to tap into the "release_pack_memory()" mechanism, adding a sort of a "smart pointer" that knows how to reconstruct its contents. Ciao, Dscho --
I guess you don't really know how index-pack is organized than. It is not quite that simple. I think we want to use the core.deltabasecachelimit parameter inside of index-pack to limit the size of the cache, but evict based on the callstack depth. Objects deeper back in the callstack should evict before objects closer to the top of the stack. Reconstruction is a bit complex as we don't have the machinary in sha1_file.c available to us, as the index is not available, as we are still in the process of creating the damn thing. I'm working up a patch series to resolve this. I'm going to shutup now and just try to post working code sometime this evening. -- Shawn. --
This should resolve the obscene memory usage of index-pack when resolving deltas for very large files. Shawn O. Pearce (4): index-pack: Refactor base arguments of resolve_delta into a struct index-pack: Chain the struct base_data on the stack for traversal index-pack: Track the object_entry that creates each base_data index-pack: Honor core.deltaBaseCacheLimit when resolving deltas index-pack.c | 137 +++++++++++++++++++++++++++++++++++++++++++++------------ 1 files changed, 108 insertions(+), 29 deletions(-) --
We need to discard base objects which are not recently used if our
memory gets low, such as when we are unpacking a long delta chain
of a very large object.
To support tracking the available base objects we combine the
pointer and size into a struct. Future changes would allow the
data pointer to be free'd and marked NULL if memory gets low.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
---
index-pack.c | 60 +++++++++++++++++++++++++++++++--------------------------
1 files changed, 33 insertions(+), 27 deletions(-)
diff --git a/index-pack.c b/index-pack.c
index 25db5db..db03478 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -26,6 +26,11 @@ union delta_base {
off_t offset;
};
+struct base_data {
+ void *data;
+ unsigned long size;
+};
+
/*
* Even if sizeof(union delta_base) == 24 on 64-bit archs, we really want
* to memcmp() only the first 20 bytes.
@@ -426,25 +431,25 @@ static void sha1_object(const void *data, unsigned long size,
}
}
-static void resolve_delta(struct object_entry *delta_obj, void *base_data,
- unsigned long base_size, enum object_type type)
+static void resolve_delta(struct object_entry *delta_obj,
+ struct base_data *base_obj, enum object_type type)
{
void *delta_data;
unsigned long delta_size;
- void *result;
- unsigned long result_size;
union delta_base delta_base;
int j, first, last;
+ struct base_data result;
delta_obj->real_type = type;
delta_data = get_data_from_pack(delta_obj);
delta_size = delta_obj->size;
- result = patch_delta(base_data, base_size, delta_data, delta_size,
- &result_size);
+ result.data = patch_delta(base_obj->data, base_obj->size,
+ delta_data, delta_size,
+ &result.size);
free(delta_data);
- if (!result)
+ if (!result.data)
bad_object(delta_obj->idx.offset, "failed to apply delta");
- sha1_object(result, result_size, type, delta_obj->idx.sha1);
+ sha1_object(result.data, result.size, type, delta_obj->idx.sha1);
...We need to release earlier inflated base objects when memory gets
low, which means we need to be able to walk up or down the stack
to locate the objects we want to release, and free their data.
The new link/unlink routines allow inserting and removing the struct
base_data during recursion inside resolve_delta, and the global
base_cache gives us the head of the chain (bottom of the stack)
so we can traverse it.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
---
index-pack.c | 34 +++++++++++++++++++++++++++++++---
1 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/index-pack.c b/index-pack.c
index db03478..6c59fd3 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -27,6 +27,8 @@ union delta_base {
};
struct base_data {
+ struct base_data *base;
+ struct base_data *child;
void *data;
unsigned long size;
};
@@ -48,6 +50,7 @@ struct delta_entry
static struct object_entry *objects;
static struct delta_entry *deltas;
+static struct base_data *base_cache;
static int nr_objects;
static int nr_deltas;
static int nr_resolved_deltas;
@@ -216,6 +219,27 @@ static void bad_object(unsigned long offset, const char *format, ...)
die("pack has bad object at offset %lu: %s", offset, buf);
}
+static void link_base_data(struct base_data *base, struct base_data *c)
+{
+ if (base)
+ base->child = c;
+ else
+ base_cache = c;
+
+ c->base = base;
+ c->child = NULL;
+}
+
+static void unlink_base_data(struct base_data *c)
+{
+ struct base_data *base = c->base;
+ if (base)
+ base->child = NULL;
+ else
+ base_cache = NULL;
+ free(c->data);
+}
+
static void *unpack_entry_data(unsigned long offset, unsigned long size)
{
z_stream stream;
@@ -452,6 +476,8 @@ static void resolve_delta(struct object_entry *delta_obj,
sha1_object(result.data, result.size, type, delta_obj->idx.sha1);
nr_resolved_deltas++;
+ link_base_data(base_obj, &result);
+
hashcpy(delta_base.sha1, delta_obj->idx.sha1);
if ...If we free the data stored within a base_data we need the struct
object_entry to get the data back again for use with another
dependent delta. Storing the object_entry* makes it simple to call
get_data_from_pack() to recover the compressed information.
This however means we must add the missing baes object to the end
of our packfile prior to calling resolve_delta() on each of the
dependent deltas. Adding the base first ensures we can read the
base back from the pack we indexing, as if it had been included by
the remote side.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
---
index-pack.c | 18 ++++++++++++------
1 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/index-pack.c b/index-pack.c
index 6c59fd3..7239e89 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -29,6 +29,7 @@ union delta_base {
struct base_data {
struct base_data *base;
struct base_data *child;
+ struct object_entry *obj;
void *data;
unsigned long size;
};
@@ -476,6 +477,7 @@ static void resolve_delta(struct object_entry *delta_obj,
sha1_object(result.data, result.size, type, delta_obj->idx.sha1);
nr_resolved_deltas++;
+ result.obj = delta_obj;
link_base_data(base_obj, &result);
hashcpy(delta_base.sha1, delta_obj->idx.sha1);
@@ -588,6 +590,7 @@ static void parse_pack_objects(unsigned char *sha1)
continue;
base_obj.data = get_data_from_pack(obj);
base_obj.size = obj->size;
+ base_obj.obj = obj;
link_base_data(NULL, &base_obj);
if (ref)
@@ -633,7 +636,8 @@ static int write_compressed(int fd, void *in, unsigned int size, uint32_t *obj_c
return size;
}
-static void append_obj_to_pack(const unsigned char *sha1, void *buf,
+static struct object_entry *append_obj_to_pack(
+ const unsigned char *sha1, void *buf,
unsigned long size, enum object_type type)
{
struct object_entry *obj = &objects[nr_objects++];
@@ -654,6 +658,7 @@ static void append_obj_to_pack(const unsigned char *sha1, void *buf,
...If we are trying to resolve deltas for a long delta chain composed
of multi-megabyte objects we can easily run into requiring 500M+
of memory to hold each object in the chain on the call stack while
we recurse into the dependent objects and resolve them.
We now use a simple delta cache that discards objects near the
bottom of the call stack first, as they are the most least recently
used objects in this current delta chain. If we recurse out of a
chain we may find the base object is no longer available, as it was
free'd to keep memory under the deltaBaseCacheLimit. In such cases
we must unpack the base object again, which will require recursing
back to the root of the top of the delta chain as we released that
root first.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
---
index-pack.c | 43 +++++++++++++++++++++++++++++++++++++++++--
1 files changed, 41 insertions(+), 2 deletions(-)
diff --git a/index-pack.c b/index-pack.c
index 7239e89..84c9fc1 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -52,6 +52,7 @@ struct delta_entry
static struct object_entry *objects;
static struct delta_entry *deltas;
static struct base_data *base_cache;
+static size_t base_cache_used;
static int nr_objects;
static int nr_deltas;
static int nr_resolved_deltas;
@@ -220,6 +221,20 @@ static void bad_object(unsigned long offset, const char *format, ...)
die("pack has bad object at offset %lu: %s", offset, buf);
}
+static void prune_base_data(void)
+{
+ struct base_data *b = base_cache;
+ for (b = base_cache;
+ base_cache_used > delta_base_cache_limit && b && b->child;
+ b = b->child) {
+ if (b->data) {
+ free(b->data);
+ b->data = NULL;
+ base_cache_used -= b->size;
+ }
+ }
+}
+
static void link_base_data(struct base_data *base, struct base_data *c)
{
if (base)
@@ -229,6 +244,8 @@ static void link_base_data(struct base_data *base, struct base_data *c)
c->base = base;
c->child = NULL;
+ base_cache_used += ...OK now I see what the 'base' pointer I previously dismissed is really needed for. But this patch is suboptimal as it actually recreate the same memory What you actually do is to read the delta data in memory, then recurse down to read more delta data, then recurse down to read the base which might be yet more delta data in memory, etc. etc. Only when you reach the bottom of the stack will you start resolving all those deltas in memory. Instead, the check for a delta object should be done first, and if so then recursion for the base object be performed _before_ reading the currently wanted object data. This way you won't have more than one delta buffer at any time in memory. Nicolas --
Arrgh. Good catch on review. I'll post a replacement patch shortly -- Shawn. --
If we are trying to resolve deltas for a long delta chain composed of multi-megabyte objects we can easily run into requiring 500M+ of memory to hold each object in the chain on the call stack while we recurse into the dependent objects and resolve them. We now use a simple delta cache that discards objects near the bottom of the call stack first, as they are the most least recently used objects in this current delta chain. If we recurse out of a chain we may find the base object is no longer available, as it was free'd to keep memory under the deltaBaseCacheLimit. In such cases we must unpack the base object again, which will require recursing back to the root of the top of the delta chain as we released that root first. The astute reader will probably realize that we can still exceed the delta base cache limit, but this happens only if the most recent base plus the delta plus the inflated dependent sum up to more than the base cache limit. Due to the way patch_delta is currently implemented we cannot operate in less memory anyway. Signed-off-by: Shawn O. Pearce <spearce@spearce.org> --- Version 2 plugs the case Nico noticed, where the patch was causing the exact behavior it was trying to prevent while recovering from what it did to avoid the excessive memory usage in the first place. The change was in get_base_data() where we now unpack the delta after we have unpacked the base. Nico and I both missed that we must also bump base_cache_used when we restore the base, and we must also prune the bases in case this base has caused us to go over the limit. :-) index-pack.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++-- 1 files changed, 46 insertions(+), 2 deletions(-) diff --git a/index-pack.c b/index-pack.c index 7239e89..b4ec736 100644 --- a/index-pack.c +++ b/index-pack.c @@ -52,6 +52,7 @@ struct delta_entry static struct object_entry *objects; static struct delta_entry *deltas; static struct base_data *base_cache; +static size_t ...
You may also set the limit lower than the size of the largest object which is going to bust it too. Same goes for the cache in sha1_file.c. Nicolas --
Hi, s/baes/base/ Otherwise very clear. Thanks, Dscho --
Yea, Dscho also pointed it out. Junio, if you can, please fix this typo in the commit message. Its not really a big deal. I have no plans on posting a replacement patch just for this one small typo. Plenty of my current commits in git.git have typos too. I'm not that good of a typist. And gcc doesn't check my commit messages. ;-) -- Shawn. --
No, it is not a big deal. I do write crappy english myself. I initially had some comments on the patch itself which whould have warranted another patch and then changed my mind. Nicolas --
You don't really need that 'base' pointer, do you? When linking a new child, then the 'child' pointer simply has to be overwritten. there is a window where that 'child' pointer would be invalid after the child structure has been discarded, but no walking of the list should occur at Nicolas --
Well... I don't like this. Not your patch, but rather the direction this whole thing took in the first place, and now we have to bolt extra complexity on top. I have the feeling this is not the best approach, especially since many long delta chains are going deep and all those intermediate objects are simply useless once the _only_ delta child has been resolved and therefore could be freed right away instead. But I have to sleep on it first. Nicolas --
I thought about trying to free a base object if this is the last resolve_delta() call which would require that data, but I realized this is just a "tail-call optimization" and doesn't work in the more general case. This patch series is the best solution I could come up with to handle even the general case. The only other alternative I can come up with is to change pack-objects (or at least its defaults) so we don't generate these massive delta chains. But there are already packs in the wild that use these chains, resulting in performance problems for clients. -- Shawn. --
Hi, However, for cases like Stephan's I assume this is the rule. If you think of it, most use cases for such a big blob are one-user, But the long chains make the pack actually as efficient as it is... Ciao, Dscho --
Perhaps Shawn thought here about limiting delta chain not by its *length*, but by its *size* (as required when unpacking last object in a delta chanin). What do you think about this idea? -- Jakub Narebski Poland ShadeHawk on #git --
Hi, So you mean once the sizes of the reconstructed objects are too big, i.e. when the delta chain is actually _most useful_, we should not do it? Don't think so. Ciao, Dscho --
Sorry for being clueless here, but why does the older versions need to be kept in-memory anyway? Aren't we applying the delta each time we find one, repeatedly creating a new base-object in-memory for each delta? If we aren't doing that, why do we need more than just a small amount of memory just for keeping the delta? Feel free to tell me to go away if I'm being stupid. I'm just curious and probably won't be able to hack up any patches anyway. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 --
Hi, Think of a delta chain of 49 delta objects, 1 base object. Now reconstruct all of the objects. If you do it one by one, not storing the intermediate results, you end up applying 1 delta for the first, 2 for the second (first the first, then the second), and so on, in total 1 + 2 + 3 + ... + 49 = 49 * 50 / 2 = 1225 times. Compare that to the 49 times when reusing the intermediate results. Ciao, Dscho --
That's only true if you discard the result of applying D1 to DB though.
What I'm wondering is; Why isn't it done like this (pseudo-code):
while (delta = find_earlier_delta(sha1)) {
if (is_base_object(delta)) {
base_object = delta;
break;
}
push_delta(delta, patch_stack);
}
while (pop_delta(patch_stack))
apply_delta(base_object, delta);
where "apply_delta" replaces the base_object->blob with the delta
applied, releasing the previously used memory?
That way, you'll never use more memory than the largest ever size
of the object + 1 delta at a time and still not apply the delta
more than delta_chain_length-1 times.
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
--
Hi, Delta chains are not (necessarily) independent. Ciao, Dscho --
Those delta chains aren't simple chained lists. They are trees of deltas where one object might be the base for an unlimited number of deltas of depth 1, and in turn each of those deltas might constitute the base for an unlimited number of deltas of depth 2, and so on. So what the code does is to find out which objects are not deltas but are the base for a delta. Then, for each of them, all deltas having given object for base are found and they are recursively resolved so each resolved delta is then considered a possible base for more deltas, etc. In other words, those deltas are resolved by walking the delta tree in a "depth first" fashion. If we discard previous delta bases, we will have to recreate them each time a delta sibling is processed. And if those delta bases are themselves deltas then you have an explosion of delta results to re-compute. Nicolas --
Yes, it would be horrible if we had to recompute 10 deltas in order to recover a previously discarded delta base in order to visit new siblings. But its even more horrible that we use 512M of memory in our working set size on a 256M machine to process a pack that is only 300M in size, due to long delta chains on large objects. In such a case the system will swap and perform fairly poorly due to the huge disk IO necessary to keep moving the working set around. We're better off keeping our memory usage low and recomputing the delta base when we need to return to it to process a sibling. Please. Remember that index-pack, unlike unpack-objects, does not hold the unresolved deltas in memory while processing the input. It assumes the total size of the unresolved deltas may exceed the available memory for our working set and writes them to disk, to be read back in later during the resolving phase. At some point it is possible for the completely inflated delta chain to exceed the physical memory of the system. As soon as you do that you are committed to some form of swapping. We can probably do that better from the packfile by reinflating the super compressed deltas than letting the OS page in huge tracts of the virtual address space off the swap device. Plus the OS does not need to expend disk IO to swap out the pages, we have already spent that cost when we wrote the pack file down to disk as part of our normal operation. I don't like adding code either. But I think I'm right. We really need to not allow index-pack to create these massive working sets and assume the operating system is going to be able to handle it magically. Memory is not infinite, even if the Turing machine theory claims it is. -- Shawn. --
Please relax! ;-) And have a look in your mbox. Nicolas --
Thanks to all who have had a look at this issue! From a user's perspective I have one more suggestions and a question: First, it would have helped me to bring this issue onto the list if I had earlier known that this was no misconfiguration, but a memory problem. Even though Git now makes some efforts to substitute runtime for memory to be able to operate with low(er) memory, I think it would still be informative for a user that repository and hardware, resp. core.deltaBaseCacheLimit, are, say, incompatible. If valuable objects have to be discarded due to memory restrictions a warning could be issued to make the user aware of this fact, e.g., Warning! Low memory. Git might be slowing down. Second, while there have been some changes to Git now, as a poor user, how can I make use of that changes? I think, updating my client should only help with pushing. For pulling, I have to wait for repo.or.cz to update to Git 1.6.0, right? Best regards, Stephan Hennig --
Well, if we had known before that this could be a problem, we'd have Well, I disagree. First we don't know how slow git would effectively be since all (my) concerns so far were totally theoretical. It will still work better than, say, 'git verify-pack' nevertheless. And git should Actually that's the other way around. This change will help the receiving side of a transfer. So it should help you when pulling. Nicolas --
Actually, this warning may be a good idea. I'll post an RFC patch for it in a few minutes. If people hate the idea, that's what an RFC is for. :) -- Shawn. --
Its rare that we should exceed deltaBaseCacheLimit while resolving
delta compressed objects. By default this limit is 16M, and most
chains are under 50 objects in length. This affords about 327K per
object in the chain, which is quite large by source code standards.
If we have to recreate a prior delta base because we evicted it to
stay within the deltaBaseCacheLimit we can warn the user that their
configured limit is perhaps too low for this repository data set.
If the user keeps seeing the warning they can research it in the
documentation, and consider setting it higher on this repository,
or just globally on their system.
Suggested-by: Stephan Hennig <mailing_list@arcor.de>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
---
index-pack.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/index-pack.c b/index-pack.c
index ac20a46..97533d6 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -53,6 +53,7 @@ static struct object_entry *objects;
static struct delta_entry *deltas;
static struct base_data *base_cache;
static size_t base_cache_used;
+static int oom_warning;
static int nr_objects;
static int nr_deltas;
static int nr_resolved_deltas;
@@ -481,6 +482,13 @@ static void *get_base_data(struct base_data *c)
if (!c->data) {
struct object_entry *obj = c->obj;
+ if (!oom_warning && verbose) {
+ if (progress)
+ fputc('\n', stderr);
+ warning("One or more delta chains are larger than deltaBaseCache.");
+ oom_warning = 1;
+ }
+
if (obj->type == OBJ_REF_DELTA || obj->type == OBJ_OFS_DELTA) {
void *base = get_base_data(c->base);
void *raw = get_data_from_pack(obj);
--
1.5.6.3.569.ga9185
--
As I said earlier, I don't think this is a good idea, but I'll elaborate a bit more. First, this is a really bad clue for setting deltaBaseCacheLimit. The likelyhood of this warning to actually show up during an initial clone is relatively high, yet this doesn't mean that deltaBaseCacheLimit has to be changed at all. For one, the real time usage of deltaBaseCacheLimit is to cap a cache of objects for multiple delta chains with random access, and not only one chain traversed linearly like in the index-pack case, and that cache is likely to always be full and in active eviction mode -- that's the point of a cap after all. In the index-pack this is only used to avoid excessive memory usage for intermediate delta results and not really a cache. In other words, we have two rather different usages for the same settings. Now don't read me wrong: I think that reusing this setting is sensible, but its value should not be determined by what index-pack may happen to do with it, especially on a first clone. And issuing warnings on the first clone is not the way to give new users confidence either. Secondly, on subsequent fetches, the warning is likely to never appear again due to the fact that the delta chains will typically be much shorter. And that would be true even if in reality the runtime access to the repository would benefit a lot from deltaBaseCacheLimit being raised. And it is the runtime access which is important here, not the occasional fetch. Yet the full delta chains are not likely to be walked in their entirety very often anyway either. Thirdly, if such indication is considered useful, then it should really be part of some statistic/analysis tool, such as verify-pack for example. Such a tool could compute the exact memory requirements for a given repository usage and possibly provide suggestions as to what the optimal deltaBaseCacheLimit value could be. But yet that cache has a hardcoded number of entries at the moment and its hash ...
I was talking about that. Or at least thinking it. But its not a good solution, as Dscho points out those long chains is what makes the pack file such a powerful source code compressor. IMHO the right solution is to continue to allow these long chains, especially since they are typically used for the older, less often accessed objects, and use bounded caching where necessary in the readers to avoid unpacking costs. We already have a bounded cache for the random access case used by most programs. What we missed was the bounded cache in index-pack, and my 4 patch series is an attempt to define that. -- Shawn. --
Thank you for the analysis! As a workaround, I think about splitting the file. Can you suggest a proper file size? Best regards, Stephan Hennig --
To be precise, the original pull has been done via url = git://repo.or.cz/wortliste.git which is a read-only URL. I changed .git/config manually to point to url = git+ssh://xxx@repo.or.cz/srv/git/wortliste.git later, which is the push URL. Best regards, Stephan Hennig --
