Re: .git/info/refs

Previous thread: .git/info/refs by H. Peter Anvin on Wednesday, January 24, 2007 - 12:38 am. (1 message)

Next thread: Re: How to pull only a few files from one branch to another? by Jakub Narebski on Wednesday, January 24, 2007 - 2:32 am. (2 messages)
From: Jakub Narebski
Date: Wednesday, January 24, 2007 - 2:28 am

With new gitweb and new git it is not that expensive. It is now one call
to git-for-each-ref per repository.

Besides, we can't rely on .git/info/refs being up to date, or even existing.
It is for dumb protocols, not for gitweb.
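
To make the "one call per repository" point concrete, here is a rough
sketch in Python (gitweb itself is Perl, so this is only an illustration).
The name list_refs and the injectable `run` parameter are made up for the
example; the git-for-each-ref --format atoms %(objectname), %(refname) and
the %00 escape are real.

```python
import subprocess

# NUL-separated fields, one ref per output line.
FORMAT = "%(objectname)%00%(refname)"


def list_refs(git_dir, run=None):
    """Return [(sha1, refname), ...] for all refs, using a single git call."""
    if run is None:
        def run(argv):
            # Default runner: really invoke git and return its stdout.
            return subprocess.run(
                argv, check=True, capture_output=True, text=True
            ).stdout
    out = run(["git", "--git-dir=" + git_dir, "for-each-ref",
               "--format=" + FORMAT])
    refs = []
    for line in out.split("\n"):
        if not line:
            continue  # skip the empty piece after the final newline
        sha1, refname = line.split("\0", 1)
        refs.append((sha1, refname))
    return refs
```

However many refs the repository has, git is started once; the per-ref
work is just string splitting.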
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


-

From: H. Peter Anvin
Date: Wednesday, January 24, 2007 - 8:55 am

Well, SOMETHING needs to be done for this page, since it can take 15 
minutes or more to generate.  Caching doesn't help one iota, since it's 
stale before being generated.

	-hpa
-

From: Johannes Schindelin
Date: Wednesday, January 24, 2007 - 9:02 am

Hi,


To me, it seems it all boils down to caching parsed data structures. 
I.e. parse the config, then serialize the parsed data to a file. Don't 
reparse the config unless that file is 1 hour older than the config.

Likewise, run for-each-ref, and serialize the parsed data into a file. 
Don't rerun for-each-ref if that file is younger than 15 minutes.

Maybe the same for the first 200 commits of each branch.

(I made those times up, but you get the idea.)
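
A minimal sketch of that age-based scheme, in Python rather than gitweb's
Perl. The helper names (is_fresh, cached) and the idea of passing the
expensive step in as `produce` are assumptions for illustration only; the
15-minute figure in the usage note is the made-up one from above.

```python
import os
import time


def is_fresh(cache_path, max_age_seconds, now=None):
    """True if cache_path exists and was modified less than max_age_seconds ago."""
    try:
        mtime = os.path.getmtime(cache_path)
    except OSError:
        return False  # no cache file yet
    if now is None:
        now = time.time()
    return (now - mtime) < max_age_seconds


def cached(cache_path, max_age_seconds, produce):
    """Return the cached text if it is fresh; otherwise call produce() and store it."""
    if is_fresh(cache_path, max_age_seconds):
        with open(cache_path, encoding="utf-8") as fh:
            return fh.read()
    text = produce()
    tmp = cache_path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(text)
    os.replace(tmp, cache_path)  # readers never see a half-written file
    return text
```

Something like cached(path, 15 * 60, run_for_each_ref) would then rerun
the (hypothetical) run_for_each_ref at most once per 15 minutes per file.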

Ciao,
Dscho

-

From: H. Peter Anvin
Date: Wednesday, January 24, 2007 - 9:24 am

A much better idea is to have that data structure updated on repository 
updates, which is the whole point behind .git/info/refs.  On kernel.org, 
at least, if you don't keep .git/info/refs up to date you need to get 
your fingers whacked anyway, since it damages usability for one 
particular class of users.

	-hpa
-

From: Johannes Schindelin
Date: Wednesday, January 24, 2007 - 9:38 am

Hi,


Granted, for some things this might work. However, I would not wreak havoc 
by changing the format of .git/info/refs, rather put the details you 
wanted into .git/info/refs-details.

However, for other things (like showing a certain number of commits), it 
_might_ make sense to cache them (e.g. when literally thousands of people 
look at the 100 last commits of linux-2.6.git), but not for others (e.g. 
the 100th last to the 200th last commit of git-tools.git).

Having said that, it should be relatively easy to store the (parsed, or at 
least easily parseable) 500 last commits of a branch into 
.git/info/commits-<branch>.

This would raise the cost of publishing a branch, but ease the overall 
load on the server.

Jakub?

Ciao,
Dscho

-

From: H. Peter Anvin
Date: Wednesday, January 24, 2007 - 9:41 am

It's not clear to me if it would be wreaking havoc.  After all, if a 
format can't be expanded *at all*, there is something wrong, and adding 
things to the end of a line is a common structured way of expansion. 

Any query that's within a repository is fairly easily cachable 
post-generation.  The front page (and its RSS variant) is a bit of an 
exception, because it involves all repositories at once.

Doesn't mean we couldn't do better, but...

	-hpa
-

From: Johannes Schindelin
Date: Wednesday, January 24, 2007 - 9:52 am

Hi,


The idea of .git/info/refs is to enable dumb transports to fetch something 
akin to intelligently. They don't need that information, and frankly, I 
don't think they should need to understand it.

I also expect that they interpret everything after the sha1 as refname, 
what with our having become quite liberal with refnames (they can contain 
spaces, tabs, and even a small amount of special K). So I don't see a way 
to upgrade the file format.
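
To illustrate the concern: each line of .git/info/refs is a sha1, a tab,
and a refname. The sketch below (Python, purely illustrative) shows what a
simple client that takes "everything after the sha1" as the refname would
do with a line that had an extra field appended. The parser name and the
10-digit value are made up; no real client is being quoted here.

```python
def parse_info_refs_line(line):
    """Split one '<sha1> TAB <refname>' line the way a simple client might."""
    sha1, refname = line.rstrip("\n").split("\t", 1)
    return sha1, refname


# A line in today's format, and one with a hypothetical extra field appended.
plain = "a" * 40 + "\trefs/heads/master\n"
extended = "a" * 40 + "\trefs/heads/master\t1169627280\n"

print(parse_info_refs_line(plain))
print(parse_info_refs_line(extended))
```

For the second line the refname comes back as
'refs/heads/master\t1169627280', i.e. the appended field is silently
treated as part of the ref's name.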

But as should be clear by now, I'd prefer additional information -- that 
is of no interest to dumb transports anyway -- to be put in its own file.

That also opens the possibility of, say .git/info/perl/, which contains 
_only_ serialized perl objects! I imagine this could be a performance 

... and here we have a problem, right? No single update hook can update 
the _whole_ information.

Ciao,
Dscho

-

From: H. Peter Anvin
Date: Wednesday, January 24, 2007 - 10:06 am

I don't think adding 10 digits to each line is going to be a sizable 




I don't see a problem.

	-hpa

-

From: Jakub Narebski
Date: Wednesday, January 24, 2007 - 1:40 pm

The simple and fast solution would be to make post-update hook contain
the git-for-each-ref with parameters like in git_get_last_activity,
saving e.g. to .git/info/last-committer, and in gitweb read this file
if it exists, and run git-for-each-ref otherwise (similar to what we used to
do with .git/info/refs and git-peek-remote in gitweb).
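
Roughly, in Python for illustration (the real thing would be a hook script
plus Perl in gitweb). The function names are made up, and the
for-each-ref options are written as I recall them from
git_get_last_activity, so treat them as an assumption rather than a quote:

```python
import os
import subprocess

# File name taken from the proposal above, relative to the git directory.
CACHE_NAME = os.path.join("info", "last-committer")

# Assumed to mirror gitweb's git_get_last_activity: newest committer on any head.
FOR_EACH_REF = ["for-each-ref", "--format=%(committer)",
                "--sort=-committerdate", "--count=1", "refs/heads"]


def run_git(git_dir, args):
    """Run git against git_dir and return its stdout as text."""
    return subprocess.run(["git", "--git-dir=" + git_dir] + args,
                          check=True, capture_output=True, text=True).stdout


def write_last_committer(git_dir, run=run_git):
    """What a post-update hook could do: refresh the cache file."""
    text = run(git_dir, FOR_EACH_REF)
    path = os.path.join(git_dir, CACHE_NAME)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(text)
    os.replace(tmp, path)


def read_last_committer(git_dir, run=run_git):
    """What gitweb could do: prefer the file, fall back to running git."""
    path = os.path.join(git_dir, CACHE_NAME)
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except FileNotFoundError:
        return run(git_dir, FOR_EACH_REF)
```

Repositories without the hook keep working, just more slowly, because the
reader falls back to git-for-each-ref.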

-- 
Jakub Narebski
Poland
-

From: hpa
Date: Wednesday, January 24, 2007 - 1:44 pm

Right, this is basically what I'm saying; the question is only whether or
not this fits into .git/info/refs or should be a separate file.

Either way, I think git-update-server-info should generate all these files.

    -hpa

-

From: Johannes Schindelin
Date: Thursday, January 25, 2007 - 1:14 am

Hi,


Well, no. At least not per default. What you want is _very_ special to 
gitweb. It is _only_ needed by gitweb. And .git/info/refs is for _dumb 
transports_, _not_ for gitweb.

That said, I think it makes sense _in your setup_ to trigger updating 
_another_ file for use in gitweb.

Remember, this is all very, very special for gitweb. So let's separate it 
cleanly from all which is not special for gitweb.

I hope I have made it clear why (at least IMHO) it would be wrong, wrong, 
wrong to change the format of .git/info/refs _only_ for gitweb, which it 
is not meant for to begin with.

So let's introduce another file in .git/info/ especially dedicated to 
gitweb.

Then we are free to introduce real cool performance hacks, like using 
Storable to store the parsed data structures (I was alluding to this in an 
earlier reply, as "serializing"). Then you just retrieve the file -- if it 
exists -- or call for-each-ref (like Jakub said).

By separating this gitweb-special thing cleanly, maybe into a hook, we can 
have a perl script which writes this file. We can write a simple hash, 
which may or may not contain keys, thus being of "extensible format".
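
A sketch of the "simple hash" idea, using Python and JSON in place of Perl
and Storable, so the mechanics differ from what is proposed above. The
function names and the key names are hypothetical.

```python
import json
import os


def save_summary(path, summary):
    """Serialize a dict of per-repository data, replacing the file atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(summary, fh)
    os.replace(tmp, path)


def load_summary(path):
    """Return the stored dict, or None if the file does not exist."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return None  # caller falls back to for-each-ref


def last_change(summary):
    """Read an optional key; files written before the key existed still load."""
    return summary.get("last_change")
```

Because readers only look up the keys they know and tolerate missing
ones, new keys can be added later without breaking older readers or
older files.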

By having this perl script, you can -- as root -- run it as the 
appropriate user for each repository where it does not exist yet.

Remains the problem: how do we _force_ this hook to be enabled site-wide, 
i.e. in _all_ repos?

But that is too easy: just edit the existing template, and then replace 
the update hooks in all repos (possibly verifying that the existing update 
hook indeed matches the old template).

So what problems remain with this approach?

Ciao,
Dscho

-

From: H. Peter Anvin
Date: Thursday, January 25, 2007 - 9:12 am

No, you keep using circular reasoning.

	-hpa
-

From: Johannes Schindelin
Date: Thursday, January 25, 2007 - 9:50 am

Hi,


No. Once again, .git/info/refs is _not_ for gitweb. But I will stop 
arguing about that topic, because I don't have enough time for that.

Ciao,
Dscho

-

From: Junio C Hamano
Date: Thursday, January 25, 2007 - 2:28 pm

Do you mean you have 24k _REPOSITORIES_ served by gitweb on
kernel.org?

-

From: H. Peter Anvin
Date: Thursday, January 25, 2007 - 2:37 pm

No, we currently have 250 repositories with a total of 24175 refs.

	-hpa
-

From: Junio C Hamano
Date: Thursday, January 25, 2007 - 2:51 pm

Then that would mean 250 calls to git-for-each-ref, wouldn't it?

-

From: H. Peter Anvin
Date: Thursday, January 25, 2007 - 3:01 pm

Well, I think it was Johannes that said once for each ref.  But either 
way, it's a totally unacceptable load with resulting unacceptable 
latency.

	-hpa

-

From: Johannes Schindelin
Date: Thursday, January 25, 2007 - 4:33 pm

Hi,


No. I would never say that you have to run for-each-ref for each ref. 
That's plain stupid.

BTW I take some satisfaction in that you finally agreed (in another email) 
that some post-creation caching is necessary.

I would be even more satisfied if you finally agreed that it is a good 
practice to separate conceptually different things, and did not continue 
arguing ad infinitum (and ad nauseam) that .git/info/refs should serve 
dumb transports, and gitweb, and eventually bring peace to everybody on 
this planet.

Ciao,
Dscho

-

From: H. Peter Anvin
Date: Saturday, January 27, 2007 - 3:07 pm

I went back and looked at the thread, and I had indeed misread the 
original message, which was from Jakub, not you.  I think I got in the 
"this is surreal" mode as a result of that (invoking for-each-ref 250 

I don't believe I have ever disputed that (in fact, I have pushed very 

I've already said I think it's an aesthetic argument, but I don't really 
care either way, as long as there is only one hook that updates all the 
caches.  I don't want the user to have to juggle an arbitrary and 
increasing number of hooks.

Fair?

	-hpa
-

From: sbejar
Date: Wednesday, January 31, 2007 - 8:38 am

Normally I'm not interested in the "Last Change" column, I just want
to go to the project summary page, and normally I'm not interested in
the last 16 tags (the last three are just enough). For me they should
be shown only when explicitly asked.

Santi
-

From: Johannes Schindelin
Date: Thursday, February 1, 2007 - 7:03 am

Hi,

I just had another idea: why not generate the content of the "cover page" 
in a cron job, every minute or so, and save it into a static index.html? 
This should take quite a load from the server, since not even Perl has to 
be started to serve that page.

Ciao,
Dscho


-

From: H. Peter Anvin
Date: Thursday, February 1, 2007 - 9:16 am

Ehm... because it often takes longer than that to generate the page?

We can pre-generate the page before the first hit, but that's not a 
replacement for update-time caching.

	-hpa
-

From: Johannes Schindelin
Date: Thursday, February 1, 2007 - 9:52 am

Hi,


Sorry, I should have been clearer. Plan:

1. echo "Generating" > /htdocs/git/index.html
2. edit crontab to do this every minute:
2.1 gitweb is called _directly_, to generate /htdocs/git/index.html.new
2.2 /htdocs/git/index.html.new is _moved_ into /htdocs/git/index.html, 
    overwriting the existing one.

Yes, there could be two instances of this task concurrently. No, it does 

It was only meant as a quick fix for the horrible workload.
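
Steps 2.1 and 2.2, plus one possible guard against overlapping runs, could
be sketched like this in Python. The function name, the `render` callback
standing in for the gitweb call, and the ".lock" file are assumptions for
illustration; fcntl.flock is Unix-only.

```python
import fcntl
import os


def regenerate(index_path, render, lock_path=None):
    """Rebuild index_path via render(); return False if another run holds the lock."""
    lock_path = lock_path or index_path + ".lock"
    with open(lock_path, "w") as lock:
        try:
            # Non-blocking: do not queue up behind a run that is still busy.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False  # previous run still in progress; skip this round
        tmp = index_path + ".new"
        with open(tmp, "w", encoding="utf-8") as fh:
            fh.write(render())
        # Atomic rename: readers see either the old page or the new one.
        os.replace(tmp, index_path)
        return True
    # The lock is released when the lock file is closed.
```

With the lock, a cron entry firing every minute cannot stack up instances
when one generation takes longer than a minute; it does nothing to make
the generation itself cheaper.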

Just a thought, feel free to ignore me,
Dscho

-

From: H. Peter Anvin
Date: Thursday, February 1, 2007 - 9:56 am

Yes, it does matter, because it drives the load up further.  If you 
start having this going on in overlapping instances, then you're soon on 

And we have already experimented with it.  It unfortunately doesn't help 
much, it only makes matters worse.

	-hpa
-

From: Matthias Lederhofer
Date: Thursday, February 1, 2007 - 10:32 am

The gitweb overview page has less than one hit per minute?  Otherwise
this should help.
-

From: H. Peter Anvin
Date: Thursday, February 1, 2007 - 10:51 am

We already cache it with a forced duration of some 15 minutes.  The end 
result is exactly the same.

	-hpa
-
