I really don't see what gitweb could do that would be somehow better than
apache doing the caching in front of it.. Is there some apache reason why
that isn't sufficient (ie limitations on its cache size or timeouts?)

Maybe the cacheability hints from gitweb could be tweaked (a lot of it
should be "infinitely cacheable", but the stuff that depends on refs and
thus can change, could be set to some fixed host-wide value - preferably
some that depends on how old the ref is).

Having gitweb be potentially up to an hour out of date is better than
causing mirroring problems due to excessive load.

For example, if the git "refs/heads/" (or tags) directory hasn't changed
in the last two months, we should probably set any ref-relative gitweb
pages to have a caching timeout of a day or two. In contrast, if it's
changed in the last hour, maybe we should only cache it for five minutes.

Jakub: any way to make gitweb set the "expires" fields _much_ more
aggressively? I think we should at least have the ability to set some basic
rules like

 - a _minimum_ of five minutes regardless of anything else

   We might even tweak this based on loadaverage, and it might be
   worthwhile to add a randomization, to make sure that you don't get into
   situations where every webpage needs to be recalculated at once.

 - if refs/ directories are old, raise the minimum by the age of the refs

   If it's more than an hour old, raise it to ten minutes. If it's more
   than a day, raise it to an hour. If it's more than a month old, raise
   it to a day. And if it's more than half a year, it's some historical
   archive like linux-history, and should probably default to a week or
   more.

 - infinite for stuff that isn't ref-related.

Hmm?

Linus
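
The age-based rules above can be sketched as a small lookup. This is only an illustration in Python, not gitweb code: the function name, the jitter parameter, and the approximation of "a month" and "half a year" as 30 and 182 days are all assumptions.

```python
import random

MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR

# (ref age threshold, cache lifetime) pairs, oldest first, following the
# rules in the message above. 30 and 182 days are assumed stand-ins for
# "a month" and "half a year".
AGE_RULES = [
    (182 * DAY, 7 * DAY),
    (30 * DAY, DAY),
    (DAY, HOUR),
    (HOUR, 10 * MINUTE),
]
MINIMUM = 5 * MINUTE


def ref_page_max_age(ref_age, jitter=0.0, rng=random):
    """Return a cache lifetime in seconds for a ref-relative page.

    ref_age is the number of seconds since the refs last changed.
    jitter (hypothetical knob) adds up to that fraction of the lifetime
    at random, so pages do not all expire at the same moment.
    """
    lifetime = MINIMUM
    for threshold, value in AGE_RULES:
        if ref_age > threshold:
            lifetime = value
            break
    if jitter:
        lifetime += int(lifetime * jitter * rng.random())
    return lifetime
```

A caller would emit the result as "Cache-Control: max-age=N" or convert it to an Expires: date; pages addressed purely by SHA1 would bypass this and get the "infinite" treatment.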
-
Linus Torvalds wrote:
I think the minimum expires (or minimum _additional_ expires: as of now
gitweb only does expires +1d for explicit hash requests) should depend on

What about packed refs?

We can certainly raise expires for tags (tag objects), as they should not

As sha1 is not changeable, everything that is accessed by explicit
sha1 (hash), or by explicit sha1 (hash_base) plus pathname (file_name)
should have effectively infinite expires.

Any caching would need some temporary memory, or temporary disk space.
And perhaps mod_perl specific caching would be useful here...

P.S. I have added Pasky to Cc:, as he manages the http://repo.or.cz public
git repository hosting (much smaller than kernel.org and I think under less
load, but also I think without kernel.org resources).
--
Jakub Narebski
Poland
-
What it could do better is it could prevent multiple identical queries
from being launched in parallel. That's the real problem we see; under
high load, Apache times out so the git query never gets into the cache;
but in the meantime, the common queries might easily have been launched
20 times in parallel. Unfortunately, the most common queries are also
extremely expensive.

-hpa
-
Ahh. I'd have expected that apache itself had some serialization facility,
that would kind of go hand-in-hand with any caching.

It really would make more sense to have anything that does caching
serialize the address that gets cached (think "page cache" layer in the
kernel: the _cache_ is also the serialization point, and is what
guarantees that we don't do stupid multiple reads to the same address).

I'm surprised that Apache can't do that. Or maybe it can, and it just
needs some configuration entry? I don't know apache.. I realize that
because Apache doesn't know before-hand whether something is cacheable or
not, it must probably _default_ to running the CGI scripts to the same
address in parallel, but it would be stupid to not have the option to
serialize.

That said, from some of the other horrors I've heard about, "stupid" may
be just scratching at the surface.

Linus
-
If I understand correctly, kernel.org is still running the
version of gitweb Kay last installed there (I am too busy to
take over the gitweb installation maintenance at kernel.org, and
I did not ask the $DOCUMENTROOT/git/ directory to be transferred
to me when I rolled gitweb into the git.git repository).

I do not know what queries are most popular, but I think a newer
gitweb is more efficient in the summary page (getting list of
branches and tags). It might be worth a try.
-
That's correct. I can transfer that directory to you if you want; I
can't realistically track gitweb well enough to do this myself (in fact,
it was pretty much a condition of having it up there that Kay would keep

How do you want to handle it?
-hpa
-
Well, the reason I haven't asked to is because I don't have
enough time myself, so....
-
AFAIK it doesn't have such an option, for basically the reason
you describe. I worked on a project whose queries were much more
difficult to answer than gitweb's and were also very popular. Yes,
the system died under any load, no matter how much money was thrown

It is. :-)
--
Shawn.
-
Gaah. That's just stupid. This is such a _basic_ issue for caching ("if
concurrent requests come in, only handle _one_ and give everybody the same
result") that I claim that any cache that doesn't handle it isn't a cache
at all, but a total disaster written by incompetent people.

Sure, you may want to disable it for certain kinds of truly dynamic
content, but that doesn't mean you shouldn't be able to do it at all.

Does anybody who is web-server clueful know if there is some simple
front-end (squid?) that is easy to set up and can just act as a caching
proxy in front of such an incompetent server?

Or maybe there is some competent Apache module, not just the default
mod_cache (which is what I assume kernel.org uses now)?

Linus
-
Squid in "transparent reverse proxy" mode isn't a bad choice, although
I don't know offhand whether it queues/clusters concurrent requests
for the same URL in the way you want. I suggest the "transparent"
deployment (netfilter/netlink integration) because you can slap it in
with no changes to the origin server and yank it out again if you have
a problem. The challenge is in getting conntrack to scale to a
zillion concurrent sessions, but you could probably find someone in
your crowd who knows something about that. :-)

Ignore any documentation that talks about httpd_accel_*. Configuring
transparent mode is a great deal simpler and saner in squid 2.6 than
it used to be; you just add a "transparent" parameter to the http_port
tag. With or without this tag, you set up what used to be called
"accelerator mode" using some parameters to http_port and cache_peer,
as described in
http://www.squid-cache.org/mail-archive/squid-users/200607/0162.html.

If transparent mode looks like the right thing for kernel.org, you
might be interested in some netfilter hackery to offload part of the
conntrack session lookup load to a front-end box that blocks DDoS and
acts more or less as an L4 switch plus session context cache. I've
been banging on a proof of concept implementation for a while, and am
currently working on integrating against 2.6.19 by splitting
nf_conntrack into front and back halves that interact via a sort of
Layer 2+ header. I have no idea yet whether it will have any
scalability benefit on dual-x86_64 class hardware (it was originally
conceived for rigid cache architectures where the random access
patterns of session lookups have drastic cache effects).

Cheers,
- Michael
-
You certainly can be smarter about it when you know the nature of the
query, though. I do that with the patch viewer scripts.

-hpa
-
Do you have a top-ten of queries ? That would be the ones to optimize
for.

OG.
-
The front page, summary page of each project, and the RSS feed for each
project.

-hpa
-
How about extending gitweb to check to see if there already exists a
cached version of these pages, before recreating them?

e.g. structure the temp dir in such a way that each project has a place
for cached pages. Then, before performing expensive operations, check to
see if a file corresponding to the requested page already exists. If it
does, simply return the contents of the file, otherwise go ahead and
create the page dynamically, and return it to the user. Do not create
cached pages in gitweb dynamically.

Then, in a post-update hook, for each of the expensive pages, invoke
something like:

    # delete the cached copy of the file, to force gitweb to recreate it
    rm -f "$git_temp/$project/rss"
    # get gitweb to recreate the page appropriately
    # use a tmp file to prevent gitweb from getting confused
    # (the URL is quoted so the shell does not split it at ';')
    wget -O "$git_temp/$project/rss.tmp" \
        "http://kernel.org/gitweb.cgi?p=$project;a=rss"
    # move the tmp file into place
    mv "$git_temp/$project/rss.tmp" "$git_temp/$project/rss"

This way, we get the exact output returned from the usual gitweb
invocation, but we can now cache the result, and only update it when
there is a new commit that would affect the page output.

This would also not affect those who do not wish to use this mechanism.
If the file does not exist, gitweb.cgi will simply revert to its usual
behaviour.

Possible complications are the content-type headers, etc, but you could
use the -s flag to wget, and store the server headers as well in the
file, and get the necessary headers from the file as you stream it.

i.e. read the headers looking for ones that are "interesting"
(Content-Type, charset, expires) until you get a blank line, print out
the interesting headers using $cgi->header(), then just dump the
remainder of the file to the caller via stdout.

Rogan
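
The read side of this scheme could look roughly like the following. It is a Python sketch rather than gitweb's Perl, and it assumes the cached file holds the saved response headers, a blank line, then the body; the function name and the choice of "interesting" headers are illustrative only (charset travels inside Content-Type).

```python
import re

# Headers worth replaying to the client; an assumption based on the
# list in the message above.
INTERESTING = ("content-type", "expires")


def read_cached_page(path):
    """Return (headers, body) from a cached response file, or None if absent."""
    try:
        with open(path, "rb") as fh:
            raw = fh.read()
    except FileNotFoundError:
        return None  # no cached copy: the caller generates the page as usual
    # Headers end at the first blank line; accept LF or CRLF line endings.
    parts = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)
    body = parts[1] if len(parts) > 1 else b""
    headers = {}
    for line in parts[0].decode("latin-1").splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() in INTERESTING:
            headers[name.strip()] = value.strip()
    return headers, body
```

The status line and uninteresting headers are dropped, and everything after the first blank line is returned untouched so it can be streamed to stdout.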
-
This goes back to the "gitweb needs native caching" again.
-hpa
-
It should be fairly easy to add a caching layer, but I wouldn't do it
inside gitweb itself - it gets too mixed up. It would be better to have
it as a separate front-end, that just calls gitweb for anything it doesn't
find in the cache.

I could write a simple C caching thing that just hashes the CGI arguments
and uses a hash to create a cache (and proper lock-files etc to serialize
access to a particular cache object while it's being created) fairly
easily, but I'm pretty sure people would much prefer a mod_perl thing just
to avoid the fork/exec overhead with Apache (I think mod_perl allows
Apache to run perl scripts without it), and that means I'm not the right
person any more.

Not that I'm the right person anyway, since I don't have a web server set
up on my machine to even test with ;)

Linus
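
The idea of hashing the CGI arguments into a cache object guarded by a lock file might be sketched as below. The message proposes C; this uses Python purely for illustration, and the helper names and the ".lock" suffix are assumptions.

```python
import hashlib
import os


def cache_path(cache_dir, query_string):
    """Map a CGI query string to a cache file name via its SHA-1 hash."""
    digest = hashlib.sha1(query_string.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


def try_lock(path):
    """Atomically create path + '.lock'; return True if we now own it."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so exactly
        # one process wins the race to create the cache object.
        fd = os.open(path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def unlock(path):
    """Release the lock taken by try_lock()."""
    os.unlink(path + ".lock")
```

A front-end would call try_lock() on a cache miss, run gitweb only if it returns True, and otherwise wait for the winner to publish the cache file.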
-
This is quite nice and easy, if memory-only caching works for the
situation: http://www.danga.com/memcached/

There are APIs for C, Perl, and plenty of other languages.
Jeff
-
Actually, just looking at the examples, it looks like memcached is
fundamentally flawed, exactly the same way Apache mod_cache is
fundamentally flawed.

Exactly like mod_perl, it appears that if something isn't cached, the
memcached server will just return "not cached" to everybody, and all the
clients will, like a stampeding herd, all do the uncached access. Even if
they have the exact same query. And you're back to square one: your server
load went through the roof.

You can't have a cache architecture where the client just does a "get",
like memcached does. You need to have a "read-for-fill" operation, which
says:

 - get this cache entry

 - if this cache entry does not exist, get an exclusive lock

 - if you get that exclusive lock, return NULL, and the client promises
   that it will fill it (inside the kernel, see for example
   "find_get_page()" vs "grab_cache_page()" - the latter will return a
   locked page whether it exists or not, and if it didn't exist, it will
   have inserted it into the cache datastructures so that you don't have
   multiple concurrent readers trying to all create different pages)

 - if you block on the exclusive lock, that means that some other client
   is busy fulfilling it. When you unblock, do a regular "read" operation
   (not a "repeat": we only block once, and if that fails, that's it).

 - any cachefill operation will release the lock (and allow pending
   cache queries to succeed)

 - the locking client going away will release the lock (and allow pending
   cache queries to fail, and hopefully cause a "set cache" operation)

 - a timeout (settable by some method) will also force-release a lock in
   the case of buggy clients that do "read-for-modify" but never do the
   "modify".

The "timeout" thing is to handle the case of buggy clients that crash
after trying to get - it will slow down things _enormously_ if that
happens, but hey, it's a buggy client. And it will still continue to work.

Looking at...
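
The "read-for-fill" steps listed above can be illustrated with an in-process sketch. It is a Python toy using threads, not a network cache server; the class and method names are invented for the example, and "client going away" is modelled by an explicit abandon() call.

```python
import threading


class FillCache:
    """In-process cache where only one caller fills a missing entry."""

    def __init__(self, lock_timeout=30.0):
        self._mutex = threading.Lock()
        self._values = {}
        self._pending = {}  # key -> Event, set when a fill ends or is abandoned
        self._lock_timeout = lock_timeout

    def read_for_fill(self, key):
        """Return (value, must_fill); must_fill means the caller owns the fill."""
        with self._mutex:
            if key in self._values:
                return self._values[key], False
            event = self._pending.get(key)
            if event is None:
                # Nobody is filling this entry yet: take the exclusive lock.
                self._pending[key] = threading.Event()
                return None, True
        # Someone else holds the lock: block once, then do a plain read.
        if not event.wait(self._lock_timeout):
            with self._mutex:
                if self._pending.get(key) is event:
                    del self._pending[key]  # force-release a stale lock
        with self._mutex:
            return self._values.get(key), False

    def fill(self, key, value):
        """Store the value and wake everybody waiting on this key."""
        with self._mutex:
            self._values[key] = value
            event = self._pending.pop(key, None)
        if event is not None:
            event.set()

    def abandon(self, key):
        """Release the lock without a value; waiters will see a miss."""
        with self._mutex:
            event = self._pending.pop(key, None)
        if event is not None:
            event.set()
```

A caller that gets (None, True) generates the page and calls fill(); a caller that gets (None, False) was blocked on a fill that failed or timed out, and does not block again.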
Actually, memcached does support an operation that would work for this:
the "add" request, which creates a new cache entry if and only if the
key is not already in the cache. If the key is already present, the
request fails. You can use that to implement a simple named mutex, and
it supports a client-specified timeout. The one thing it doesn't support
that you described is a notion of deleting a key when a particular
client disconnects, but as you say, that should only happen in the case
of buggy clients anyway.

Mind you, I'm not convinced memcached is necessarily the right answer
for this problem, but it does provide a way to implement the required
locking semantics.

BTW, I'm one of the main contributors to memcached, so if it does end up
looking like a good choice except for some minor issue or another, I may
be able to tweak it to cover whatever is missing. For example, the
"delete a key on disconnect" thing would be fairly straightforward, if
it's actually necessary in practice.

-Steve
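
The "add"-as-mutex idea can be shown without a running memcached. The FakeStore class below is a tiny in-memory stand-in written for this example (it is not a memcached client); only the semantics of "add" (succeed if and only if the key is absent, with an optional expiry) are taken from the message above. The "lock:" key prefix and the function names are assumptions.

```python
import time


class FakeStore:
    """In-memory stand-in with memcached-like add/get/set/delete semantics."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def _live(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        if item[1] is not None and self._clock() >= item[1]:
            del self._data[key]  # entry has expired
            return None
        return item

    def add(self, key, value, ttl=None):
        """Store the value only if the key is absent; report success."""
        if self._live(key) is not None:
            return False
        self.set(key, value, ttl)
        return True

    def set(self, key, value, ttl=None):
        expires = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires)

    def get(self, key):
        item = self._live(key)
        return None if item is None else item[0]

    def delete(self, key):
        self._data.pop(key, None)


def get_or_claim(store, key, lock_ttl=30):
    """Return (value, must_fill), using an "add"-based named mutex."""
    value = store.get(key)
    if value is not None:
        return value, False
    # Only one client's add() succeeds; the lock expires after lock_ttl
    # seconds, which covers clients that die before filling the entry.
    return None, store.add("lock:" + key, 1, lock_ttl)
```

A result of (None, False) means another client holds the lock, so the caller should wait briefly and read again rather than regenerate the page.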
-
I don't know if fundamentally flawed but (having used memcached) I
don't think it's a big win for this at all.

We can make gitweb detect mod_perl and do a few smarter things if it
is running inside of it. In fact, we can (ab)use mod_perl and perl
facilities a bit to do some serialization which will be a big win for
some pages. What we need for that is to set a sensible ETag and
use some IPC to announce/check if other apache/modperl processes are
preparing content for the same ETag. The first process to announce a
given ETag can then write it to a common temp directory (atomically -
write to a temp-name and move to the expected name) while other
processes wait, polling for the file. Once the file is in place the
latecomers can just serve the content of the file and exit.

(I am calling the "state we are serving" identifier ETag because I
think we should also set it as the ETag in the HTTP headers, so we'll
be able to check the ETag of future requests for staleness - all we
need is a ref lookup, and if the SHA1 matches, we are sorted). So
having this 'unique request identifier' doubles up nicely...

The ETag should probably be:

 - SHA1+displaytype+args for pages that display an object identified by SHA1
 - refname+SHA1+displaytype+args for pages that display something
   identified by a ref

You _could_ make do with a convention of polling for "entryname" and
"workingon-entryname" and if "workingon-entryname" is set to 1, you
can expect entryname to be filled real soon now. However, memcached is
completely memorybound, so it is only nice for really small stuff or
for a large server farm which has gobs of spare ram.

(Note that memcached does have timeouts which means that the
'workingon' value could have a short timeout in case the request is
cancelled or the process dies - the nasty bit in the above plan would

Apache doesn't do it because most web applications don't use the HTTP
protocol correctly - especially when it comes to the idempotency of GET.
So in 99% of the cases, we...
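
The temp-file-and-rename scheme described earlier in this message (the first process writes, the others poll) could be sketched like this. It is a Python illustration, not mod_perl code; the function names, the polling interval and the timeout are assumptions.

```python
import os
import tempfile
import time


def publish(path, data):
    """Write data to a temp name in the same directory, then rename into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    # The rename is atomic on POSIX within one filesystem, so readers see
    # either no file or the complete file, never a partial one.
    os.replace(tmp, path)


def wait_for(path, timeout=5.0, interval=0.05):
    """Poll until path exists and return its content, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path, "rb") as fh:
                return fh.read()
        except FileNotFoundError:
            if time.monotonic() >= deadline:
                return None
            time.sleep(interval)
```

The file name would be derived from the ETag; a latecomer that gets None back can fall back to generating the page itself.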
First, it would (and could) work only for serving gitweb over mod_perl.
I'm not sure if the overhead of IPC and the complications of implementing
it are worth it: this would perhaps be better solved by a caching engine.

But let us put aside for a while actual caching (writing HTML version
of the page to a common temp directory, and serving this static page
if possible), and talk a bit about what gitweb can do with respect to
cache validation.

In addition to setting either Expires: header or Cache-Control: max-age
gitweb should also set Last-Modified: and ETag headers, and also
probably respond to If-Modified-Since: and If-None-Match: requests.

For some pages ETag is natural; for other Last-Modified: would be more
What uniquely identifies contents in "object" views ("commit", "tag",
"tree", "blob") is either h=SHA1, or hb=SHA1;f=FILENAME (with absence
of h=SHA1). If both h=SHA1 and hb=SHA1 is present, hb=SHA1 serves as
backlink. The "diff" views ("commitdiff", "blobdiff") are uniquely
identified by pair of object identifiers (pairs of SHA1, or pairs of
hb SHA1 + FILENAME).

Three of those views ("blob", "commitdiff", "blobdiff") have their
"plain" version; so ETag should include displaytype (action, 'a'
parameter).

The hb=SHA1;f=FILENAME identifier can be converted at cost of one
call to git command (but which is a bit expensive as it recurses
trees), namely to git-ls-tree.

ETag can be simply args (query), if all h/hb/hbp parameters are SHA1.
Or ETag can be SHA1 of an object (or pair of SHA1 in the case of diff),
but this is little more costly to verify. Although we usually (always?)
convert hb=SHA1;f=FILENAME to h=SHA1 anyway when displaying/generating
page.

For objects views we can simply convert refname to SHA1. I'm not sure if
it is worth it. In the cases when for view we have to calculate SHA1 of
object anyway, we can return (and validate) ETag with SHA1 as above.

 - ETag and/or Last-Modified headers for "log" views: "log",
"shortlog" (is part of summary view), "history", ...
It is. At least for kernel.org, the issue isn't that CGI is expensive,
IMO yes, since most major browsers, caches, and spiders support these
That would be a good start, and suffice for many cases. If the CGI can
simply stat(2) files rather than executing git-* programs, that would
increase efficiency quite a bit.

A core problem with cache hints via HTTP headers (last-modified, etc.)
is that you don't achieve caching across multiple clients, just across
repeated queries from the same client (or caching proxy).

At least for the RSS/Atom feeds and the git main page, it makes no sense
to regenerate that data repeatedly.

Internally, gitweb would need to do a stat() on key files, and return
pre-generated XML for the feeds if the stat() reveals no changes. Ditto

CGI version of gitweb.
But again, mod_perl vs. CGI isn't the issue.
Jeff
-
IO is the issue, and the CGI startup of Perl is quite IO & CPU
intensive. Even if the caching headers, thundering herds and planet
collisions are resolved, I don't think you'll ever be happy with IO
and CPU load on kernel.org running gitweb as CGI.

cheers,
martin
-
I/O - nonexistent; that stuff will be in memory.
CPU - we have more CPU than you can shake a stick at, and it's 95+% idle.
*NOT AN ISSUE*.
-hpa
-
By the way, setting Last-Modified: and ETag: and checking for
If-Modified-Since: and If-None-Match: is easy only for log-like views:
"shortlog", "log", "history", "rss"/"atom". With "shortlog" and
"history" we have the additional difficulty of using relative dates there.
And even for those views we need reverse proxy / caching engine
(e.g. Squid in "HTTP accelerator" mode) in front.

It would be easier to pre-generate the most commonly accessed views:
"projects_list", "summary" and "rss"/"atom" main for each project, and
just serve static pages. I don't know if we need to modify gitweb for
that.

BTW. for single client (rather stupid benchmark, I know) mod_perl is
about twice as fast in keepalive mode as the CGI version of gitweb for
the git.git summary page.
--
Jakub Narebski
Poland
-
Note that if we had a new gitweb, we could also use the packed refs.
Those help CPU usage, but they actually help IO patterns more, exactly
because they avoid all the seeking around in the filesystem.

So with packed refs, there's no need to go from directory lookup to inode
lookup to data lookup to object lookup for *each* ref - you can do the
"packed-refs" lookup _once_ (which obviously does the dir->inode->data),
and you don't need to do the object lookup at all.

Of course, gitweb will then end up doing the object lookup anyway (because
of getting the dates etc for refs), but if you have packed-refs and a
reasonably packed repository, that should still really cut down on IO in a
big way.

So there's probably tons of room for making this more efficient: using a
newer gitweb, packing refs, using the cgi cache thing.. It sounds like
what it really needs is just somebody with the competence and time to be
willing to step up and maintain gitweb on kernel.org...

Linus
-
Indeed. We have a lot of projects on kernel.org which are like this:
not at all conceptually hard, but a huge time commitment for Doing It
Right[TM]. This is why I sometimes think that it would be a Good Thing
to get paid staff for kernel.org, although I was hoping to defer the
need for that until at least we have our 501(c)3 paperwork done, which
looks like mid-2007 at this point (assuming no further delays.)

-hpa
-
Sending Last-Modified: should be easy; sending ETag needs some consensus
on the contents: mainly about validation. Responding to If-Modified-Since:
and If-None-Match: should cut at least _some_ of the page generating time.
If ETag can be calculated on URL alone, then we can cut If-None-Match:

As I said, I'm not talking (at least now) about saving generated HTML
output. This I think is better solved in a caching engine like Squid.
Although even here some git specifics can be of help: we can invalidate
the cache on push, and we know that some results don't ever change (well,

I'm not sure if it is worth implementing in gitweb, or is it better left
to caching engine. With the projects list page and summary page there is
additional problem with relative dates, although this can be solved using
Jonas Fonseca's idea of using absolute dates in the page and using ECMAScript
(JavaScript) to convert them to relative: on load, and perhaps on timer ;-)

What can be _easily_ done:
 * Use post 1.4.4 gitweb, which uses git-for-each-ref to generate summary
   page; this leads to around 3 times faster summary page.
 * Perhaps using projects list file (which can be now generated by gitweb)
   instead of scanning directories and stat()-ing for owner would help
   with time to generate projects list page

What can be quite easily incorporated into gitweb:
 * For immutable pages set Expires: or Cache-Control: max-age (or both)
   to infinity
 * Calculate hash+action based ETag at least for those actions where it is
   easy, and respond with 304 Not Modified as soon as it can.
   This might require some code reorganization to not begin writing output
   before calculating ETag and ETag comparison (If-Match, If-None-Match).
 * Generate Last-Modified: for those views where it can be calculated,
   and respond with 304 Not Modified as soon as it can.

What can be easily done using caching engine:
 * Select top 10 of common queries, and cache them, invalidating cache on push
   (depending on query: ...
Indeed. Let me add myself to the pileup agreeing that a combination of
setting Last-Modified and checking for If-Modified-Since for
ref-centric pages (log, shortlog, RSS, and summary) is the smartest

Indeed - gitweb should not be saving HTML around but giving the best
possible hints to squid and friends. And improving our ability to

Great plan. :-)
cheers,
martin
-
Sometimes it is easier to use ETags, sometimes it is easier to use
Last-Modified:. Usually you can check ETag earlier (after calling
git-rev-list) than Last-Modified (after parsing first commit). But
some pages don't have a natural ETag...

Besides, because ETag is HTTP/1.1 we should provide and validate
both.

P.S. Any hints to how to do this with CGI Perl module?
--
Jakub Narebski
Poland
-
It's impossible, Apache doesn't supply e-tag info to CGI programs. (it
does supply HTTP_CACHE_CONTROL though apparently)

You could probably do it via mod_perl.
Jeff
-
By ETag info you mean access to HTTP headers sent by browser
If-Modified-Since:, If-Match:, If-None-Match: do you?

So the cache verification should be wrapped in if ($ENV{MOD_PERL}) ?
--
Jakub Narebski
Poland
-
You can use this attached shell script as a CGI script, to see precisely
what information Apache gives you. You can even experiment with passing
back headers other than Content-type (such as E-tag), to see what sort
of results are produced. The script currently passes back both E-Tag
and Last-Modified of a sample file; modify or delete those lines to suit

Sorry, I was /assuming/ mod_perl would make this available. The HTTP
header info is available to all Apache modules, but I confess I have no
idea how mod_perl passes that info to scripts.

Also, an interesting thing while I was testing the attached shell
script: even though repeated hits to the script generate a proper 304
response to the browser, the CGI script and its output run to completion.
So, it didn't save work on the CGI side; the savings were solely in not
transmitting the document from server to client. The server still went
through the work of generating the document (by running the CGI), as one
would expect.

Jeff
>>> It's impossible, Apache doesn't supply e-tag info to CGI programs.
The CGI spec does not at all guarantee that the CGI environment will
contain all the HTTP headers sent by the client. That was the point of
the environment dump script -- you can see exactly which headers are,
and are not, passed through to CGI.

CGI only /guarantees/ a bare minimum (things like QUERY_STRING,
PATH_INFO, etc.)

It's not meant to output the sample file. It outputs the server
metadata sent to the CGI script (the environment variables). The sample

Certainly. That should help cut down on I/O. FWIW though the projects
list is particularly painful, with its File::Find call, which you'll

This wanders into the realm of mod_cache configuration, I think. (which
I have tried to get working as reverse proxy, and failed several times)
If you are not using mod_*_cache, then Apache must execute the CGI
script every time AFAICS, regardless of etag/[if-]last-mod headers.

Jeff
-
I have checked that at least Apache 2.0.54 passes HTTP_IF_MODIFIED_SINCE
First, it is better to use $projects_list which is projects index file
in the format:
<project path> SPC <project owner>
where <project path> is relative to $projectroot and is URI encoded; well
at least SPC has to be URI (percent) encoded. <project owner> is owner
of given project, and is also URI encoded (one would usually use '+' in
the place of SPC here).

Gitweb now can generate projects list in above format, by using
"project_index" action ("a=project_index" query string), or by clicking
'TXT' link at the bottom of the projects list page in new gitweb: see
http://repo.or.cz by Petr Baudis. The problem is that it generates
projects list from the list of projects it sees, so to generate it from
scratch from the filesystem you need $projects_list to be a directory
when generating "project_index" (changing it to something that
evals to false, e.g. undef or "" makes gitweb use $projectroot for
$projects_list). I have posted how to do this.

The project list changes rarely, only on addition/removal of project,
and on changing owner of project; so it can be generated on demand.

Second, even with $projects_list being set to projects index file
as of now gitweb runs git-for-each-ref (which scans refs and accesses
pack file for commit date), checks for description file and reads it;
for $projects_list being directory it also checks project directory
owner. I plan to make it configurable to read last activity from
all heads (all branches) as it is now, from HEAD (current branch)
as it was before, or given branch (for example 'master').

Assuming that gitweb is configured to read last activity from single
defined branch, generating ETag = checksum(sha1 of heads of projects)

No, it wanders into realm of header parsing by Apache, and NPH (No Parse
Headers) option.

Even if Apache does execute CGI script to completion every time, it might
not send the output of the script, but HT...
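
The projects index format described earlier in this message (one "<project path> SPC <project owner>" pair per line, URI encoded, with '+' commonly standing for a space in the owner) can be read with a few lines of code. This is a Python sketch, not gitweb's own parser, and the function name is made up for the example.

```python
from urllib.parse import unquote, unquote_plus


def parse_projects_index(text):
    """Return a list of (project_path, owner) pairs from a projects index."""
    projects = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The first literal space separates path from owner; spaces inside
        # either field are percent-encoded (or '+' in the owner).
        path, _, owner = line.partition(" ")
        projects.append((unquote(path), unquote_plus(owner)))
    return projects
```

Reading such a file avoids scanning the filesystem and stat()-ing each project directory for its owner when building the projects list page.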
It is up to the script (CGI or via mod_perl) to set the status to 304
and finish execution. Just setting the status to 304 does not
forcefully end execution as you may want to cleanup, log, etc.

cheers,
martin
-
I was thinking not about ending execution, but about not sending script
output but sending HTTP 304 Not Modified reply by Apache.

I meant the following sequence of events:
 1. Script sends headers, among those Last-Modified and/or ETag
 2. Apache scans headers (e.g. to add its own), notices that Last-Modified
    is earlier or equal to If-Modified-Since: sent by browser or reverse
    proxy, or ETag matches If-None-Match:, and sends 304 instead of script
    output
 3. Script finishes execution, its output sent to /dev/null

Again, I don't know if Apache (or any other web server) does that.
--
Jakub Narebski
Poland
-
It doesn't. You want to take the decision to send a 304, cleanup and
exit _inside_ the CGI. If it was up to apache, then the CGI script
would end up creating the (potentially expensive to produce) content
just to see it sent to /dev/null OR if apache was to terminate
execution of the CGI more violently, the CGI wouldn't have a chance to
cleanup and release resources.

So it's a matter of setting the header to 304 and exiting.
cheers,
martin
step 3 includes creating the content that is expensive to create
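
Deciding on the 304 inside the script, before any expensive work, might look like the sketch below. It is Python rather than Perl, it only handles strong ETags in If-None-Match, and the function name and return shape are assumptions. Whether the server passes HTTP_IF_NONE_MATCH to a CGI script is also an assumption here; the thread only reports Apache passing HTTP_IF_MODIFIED_SINCE.

```python
def conditional_response(environ, etag, render):
    """Return (status, headers, body), calling render() only when needed.

    environ is the CGI environment, etag is the quoted ETag for the
    requested state, and render is a callable producing the page body.
    """
    if_none_match = environ.get("HTTP_IF_NONE_MATCH", "")
    candidates = [tag.strip() for tag in if_none_match.split(",")]
    if etag in candidates or "*" in candidates:
        # The client's copy is current: skip page generation entirely.
        return "304 Not Modified", [("ETag", etag)], b""
    return "200 OK", [("ETag", etag)], render()
```

The point is that render(), the expensive part, is never called on a match, which is the saving that a server-side 304 substitution could not provide.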
-
Guys, you're missing something fairly fundamental.

It helps almost _nothing_ to support client-side caching with all these
fancy "If-Modified-Since:" etc crap.

That's not the _problem_.

It's usually not one client asking for the gitweb pages: the load comes
from just lots of people independently asking for it. So client-side
caching may help a tiny tiny bit, but it's not actually fixing the
fundamental problem at all.

So forget about "If-Modified-Since:" etc. It may help in benchmarks when
you try it yourself, and use "refresh" on the client side. But the basic
problem is all about lots of clients that do NOT have things cached,
because all the client caches are all filled up with pr0n, not with gitweb
data from yesterday.

So the thing to help is server-side caching with good access patterns, so
that the server won't have to seek all over the disk when clients that
_don't_ have things in their caches want to see the "git projects" summary
overview (that currently lists something like 200+ projects).

So to get that list of 200+ projects, right now gitweb will literally walk
them all, look at their refs, their descriptions, their ages (which
requires looking up the refs, and the objects behind the refs), and if
they aren't cached, you're going to have several disk seeks for each
project.

At 200+ projects, the thing that makes it slow is those disk seeks. Even
with a fast disk and RAID array, the seeks are all basically going to be
interdependent, so there's no room for disk arm movement optimization, and
in the absence of any other load it's still going to be several seconds
just for the seeks (say 10ms per seek, four or five seeks per project,
you've got 10 seconds _just_ for the seeks to generate the top-level
summary page, and quite frankly, five seeks is probably optimistic).

Now, hopefully some of it will be in the disk cache, but when the
mirroring happens, it will basically blow the disk caches away totally
(when using the "--checks...
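
The seek estimate above is simple enough to check directly. The helper below is a throwaway Python sketch; the numbers are the ones quoted in the message (200 projects, four or five seeks each, 10ms per seek).

```python
def seek_time_seconds(projects, seeks_per_project, seek_ms):
    """Total time spent seeking, assuming fully serialized seeks."""
    return projects * seeks_per_project * seek_ms / 1000.0


# Five seeks per project gives the 10 seconds quoted above;
# four seeks per project still costs 8 seconds.
five = seek_time_seconds(200, 5, 10)
four = seek_time_seconds(200, 4, 10)
```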
If that was the only time that happened, it would be a non-issue, since
that only happens once every 96 hours. However, the problem is that we
now have lots of large datasets that blow out the caches on a much more
frequent basis.

-hpa
-
Well, the idea (perhaps a stupid idea: I don't know how caching engines
/ reverse proxies work) was that a caching engine / reverse proxy in
front (Squid for example) would cache results and serve them
to rampaging hordes. But this caching engine has to ask gitweb if the
cache is valid using "If-Modified-Since:" and "If-None-Match:" headers.

What about the other idea, the one with raising expires to infinity for
immutable pages like "commit" view for commit given by SHA-1? Even if
the clients won't cache it, the proxies and caches between gitweb and
client might cache it...

Talking about most accessed gitweb pages, the project list page changes
on every push, the project summary page and project main RSS feed
(now in both RSS and Atom formats) changes on every push to given project.
With a help of hooks they can be static pages, generated by push...
...with the exception that projects list and summary pages have _relative_
dates.--
Jakub Narebski
Poland
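The "expires to infinity for immutable pages" idea above can be sketched as a small policy function. This is an illustration in Python, not gitweb code: the function names, the list of actions treated as immutable, the one-year stand-in for "infinity" and the 300 second default for ref-dependent pages are all assumptions. Only the `a=` and `h=` query parameters follow gitweb's URL style.

```python
import re

FULL_SHA1 = re.compile(r"^[0-9a-f]{40}$")
# Assumed set of views whose content is fixed once "h" is a full SHA-1.
IMMUTABLE_ACTIONS = {"commit", "commitdiff", "blob", "tree"}
FAR_FUTURE = 365 * 24 * 3600  # stand-in for "infinity", in seconds


def parse_query(query_string):
    """Split a gitweb-style query string ("p=...;a=...;h=...") into a dict."""
    params = {}
    for pair in re.split(r"[;&]", query_string):
        if "=" in pair:
            key, value = pair.split("=", 1)
            params[key] = value
    return params


def cache_control_for(query_string, ref_max_age=300):
    """Pick a Cache-Control value: long for SHA-1 pinned views, short otherwise."""
    params = parse_query(query_string)
    pinned = (params.get("a") in IMMUTABLE_ACTIONS
              and FULL_SHA1.match(params.get("h", "")) is not None)
    max_age = FAR_FUTURE if pinned else ref_max_age
    return "public, max-age=%d" % max_age
```

A request such as `a=commit;h=<40 hex digits>` would get the long lifetime, while `a=summary` or `h=HEAD` would fall back to the short, ref-dependent one.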
-
Sure, if the proxies actually do the right thing (which they may or may
I agree, but as mentioned, I think the _real_ problem tends to be the
pages that don't act that way (ie summary pages, both at the individual
project level and the top "all projects" level).

Linus
-
squid seems to work well as an HTTP accelerator (reverse proxy).
Apache's mem|disk cache stuff fails miserably.

Unfortunately squid development seems to have slowed in recent years.
Jeff
-
For a high-traffic setup like kernel.org, you can set up a local
reverse proxy -- it's a pretty standard practice. That allows you to
control a well-behaved and locally tuned caching engine just by
emitting good headers.

It beats writing and maintaining an internal caching mechanism for
each CGI script out there by a long mile. It means there'll be no
further tunables or complexity for administrators of other gitweb
installs.

cheers,
martin
-
If gitweb produced cache-friendly headers, squid could definitely serve
as an HTTP front-end ("HTTP accelerator" mode in squid talk).

In fact, given kernel.org's slave1/slave2<->master setup, that's a
pretty natural fit for caching files and/or cache-aware CGI output.

You could even replace rsync to the slaves, if squid was serving as the
front-end accelerator running on the slaves, communicating to the master.

squid is smart enough to hold off a thundering herd, and only pulls
single cacheable copies of files as needed.

Jeff
-
It depends on how creatively you think ;-)
Consider generating static HTML files on each push, via a hook, for many
of the toplevel files. The static HTML would then link to the CGI for
This re-opens the question mentioned earlier, is Kay (or anyone?) still
This could be statically generated by a robot. I think everybody would
Or simply generate regular filesystem files into the webspace, as
triggered by a hook. Let the standard filesystem mirroring/caching work
its magic.

Jeff
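A hook that writes "regular filesystem files into the webspace", as suggested above, would want to replace each page atomically so that a mirror run or web request never sees a half-written file. A minimal Python sketch follows; the function name and the file mode are assumptions, and producing the HTML itself is left to whatever the hook calls.

```python
import os
import tempfile


def publish_static_page(html, dest_path):
    """Write html to dest_path so that readers never see a partial file."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temporary file in the destination directory, so the final
    # rename stays on one filesystem (a requirement for it to be atomic).
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(html)
        # mkstemp creates the file owner-only; make it readable by the web server.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Leave no stray temporary file behind if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A post-update hook could call this once per pre-generated page (project list, per-project summary), leaving all other URLs to the CGI.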
-
You mean that the links in this pre-generated HTML would be to CGI
By the way, thanks to Martin Waitz it is much easier to install gitweb.
I for example use the following script to test changes I have made to gitweb:

-- >8 --
#!/bin/bash

BINDIR="/home/local/git"

function make_gitweb()
{
    pushd "/home/jnareb/git/"

    make GITWEB_PROJECTROOT="/home/local/scm" \
         GITWEB_CSS="/gitweb/gitweb.css" \
         GITWEB_LOGO="/gitweb/git-logo.png" \
         GITWEB_FAVICON="/gitweb/git-favicon.png" \
         bindir="$BINDIR" \
         gitweb/gitweb.cgi

    popd
}

function copy_gitweb()
{
    cp -fv /home/jnareb/git/gitweb/gitweb.{cgi,css} /home/local/gitweb/
}

make_gitweb
copy_gitweb

# end of gitweb-update.sh
-- >8 --
Gitweb can generate this file. The problem is that one would have to
temporarily turn off use of the index file. This can be done by having the
following gitweb_list_projects.perl file:

-- >8 --
#!/usr/bin/perl

$projects_list = "";
-- >8 --

then use the following invocation to generate the project index file:

$ GATEWAY_INTERFACE="CGI/1.1" HTTP_ACCEPT="*/*" REQUEST_METHOD="GET" \
  GITWEB_CONFIG=gitweb_list_projects.perl QUERY_STRING="a=project_index" \
  gitweb.cgi

--
Jakub Narebski
Poland
-
Yes, they must be. Otherwise, the gitweb interface changes.
You don't want to pre-generate HTML for every possible git query, that
would cause an explosion of data.

Both the HTML generator and CGI would need to know which pages were
pre-generated and which are not.

Jeff
-
In Squid 2.6:
collapsed_forwarding on
refresh_stale_window <seconds>
(apply the latter only to stanzas where you want "readahead" of
about-to-expire cache entries)

Brief design description at http://devel.squid-cache.org/collapsed_forwarding/.
(I didn't write this code, everything I know about squid leaked
through the Google-shaped pinhole in my tinfoil hat, etc. But if you
go this way I'd like to be in the loop to understand the scalability
issues around netfilter-assisted transparent proxying.)

Cheers,
- Michael
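To make the idea behind collapsing concurrent requests concrete, here is a minimal Python sketch of the general technique: the first request for a key does the work, and requests arriving while it is in flight wait for and share its result. This is not Squid's implementation; the class and method names are invented for illustration.

```python
import threading


class SingleFlight:
    """Collapse concurrent requests for the same key into one backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> shared state of the request in progress

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "value": None, "error": None}
                self._inflight[key] = entry
        if leader:
            # Only the first caller for this key runs the expensive function.
            try:
                entry["value"] = fn()
            except Exception as exc:
                entry["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                entry["done"].set()
        else:
            # Followers block until the leader has a result, then share it.
            entry["done"].wait()
        if entry["error"] is not None:
            raise entry["error"]
        return entry["value"]
```

Note that this only merges requests that overlap in time; it does not cache the result afterwards, which is the job of the cache in front of it.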
-
Yeah, those look like the Right Thing (tm) to do.
That said, I'm not personally convinced that there is much point to using
netfilter for transparent proxying. Why not just use separate ports for
squid and for apache?

Linus
-
That's what most people using squid in "http accelerator" mode do. They
put Apache on port 8080 or somesuch.

Jeff
-
Just a question of whether you want to be able to yank the squid box
out if it goes pear-shaped, without touching configs on the apache
box. Some people like to stick the proxy in as a no-op at first, then
tell netfilter to divert 1% of sessions to squid and see how it holds
up, retune, ease it in, ease it out, figure out how much operational
flexibility you will have as demand continues to scale. If the squid
and apache are on the same box it's probably less of an issue.

Cheers,
- Michael
-
Yeah, this is pretty trivial since one can just do redirects. However,
I still think a backend cache is better, since it can detach itself from
Apache when appropriate (e.g. the background refresh scenario, or timeout.)

-hpa
-
There is another thing that probably will be required, and I'm not sure
if something in front of Apache (like Squid) rather than behind it can
easily deal with: on timeout, the process needs to continue in order to
feed the cache. Otherwise, you're still in a failure scenario as soon
as timeout happens.

-hpa
-
I would think this would be a great deal easier to handle in an
arm's-length "accelerator" than in the origin server. Only restart
the hit to the origin server if you think that something has actually
gone wrong there. Serve stale data to the client if you have to.
From the page I quoted:

"In addition an option to shortcut the cache revalidation of
frequently accessed objects is added, making further requests
immediately return as a cache hit while a cache revalidation is
pending. This may temporarily give slightly stale information to the
clients, but at the same time allows for optimal response time while a
frequently accessed object is being revalidated. This too is an
optimization only intended for accelerators, and only for accelerators
where minimizing request latency is more important than freshness."

I don't know how sophisticated this logic is currently, but I would
think that it wouldn't be that hard to tune up.

Cheers,
- Michael
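The behaviour described in the quoted passage (answer from the cache at once, even if slightly stale, while a refresh runs in the background) can be sketched in Python as below. This is an illustration of the general "serve stale while revalidating" technique, not Squid's logic; the class name, the per-cache TTL and the injectable clock are assumptions. Because the refresh runs in its own thread, it keeps going and feeds the cache even after the request that triggered it has been answered.

```python
import threading
import time


class StaleWhileRevalidateCache:
    """Serve cached values immediately; refresh expired ones in the background."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}        # key -> (value, time stored)
        self._refreshing = set()  # keys with a background refresh in progress

    def get(self, key, compute):
        start_refresh = False
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                expired = self._clock() - entry[1] >= self._ttl
                if expired and key not in self._refreshing:
                    self._refreshing.add(key)
                    start_refresh = True
        if entry is None:
            # Cold miss: nothing to serve, so compute inline. Concurrent cold
            # misses are not merged here; that would need request collapsing.
            value = compute()
            with self._lock:
                self._entries[key] = (value, self._clock())
            return value
        if start_refresh:
            threading.Thread(target=self._refresh, args=(key, compute),
                             daemon=True).start()
        return entry[0]  # possibly stale, but returned without waiting

    def _refresh(self, key, compute):
        try:
            value = compute()
        except Exception:
            # Keep serving the old value; a later request will retry.
            with self._lock:
                self._refreshing.discard(key)
            return
        with self._lock:
            self._entries[key] = (value, self._clock())
            self._refreshing.discard(key)
```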
-
True, but it needs to run behind Apache rather than in front of it.
-hpa
-
Memory-only caching is kind of nasty. Memory is a premium resource on
kernel.org.

-hpa
-
hmmm. Well, I have been wondering why nobody ever came up with a
system-wide local (==disk) cache for remote and/or calculated objects.
Maybe it's time to do something about that.

I've been in a daemon-writing mood lately.
Jeff
-
If you want to do side effect generation of cache contents, it might not
be possible to do it that way. At the very least gitweb needs to be
aware of how to explicitly enter things into the cache.

All of this isn't really all that hard; I have implemented all that
stuff for diffview, for example (when generating a single diff hunk, you
naturally end up producing all of them, so you want to have them
True about mod_perl. Haven't messed with that myself, either.
Heh :)
-hpa
-
In the case of Perl scripts, it's not really the fork/exec overhead,
but the Perl startup overhead that you want to try to optimize. But
given your later statement (lots of spare cpu), this ends up just
being a bit of a latency hit. In general, I think mod_perl has a
much bigger impact when you have a database to connect to at startup.
-
I've been playing around with a "native git" cgi thingy the last week
(I call it cgit), and I've been thinking about adding exactly this
kind of caching to it. And since it's basically a standard git command
written in C, it should have less overhead than any perl
implementation.

It's far from ready yet, but I'll try to publish some code this
weekend just in case someone finds it interesting.

--
larsh
-
Trust me, perl, or CGI, is not the problem. It's all about I/O traffic
generated by git.

-hpa
-
Yes, I understand. That's why I've been thinking about internal
caching of pages.

It's just a kick doing it in C, playing around with the git internals :-)
--
larsh
-
That's fine, but it does make it harder to maintain.
-hpa
-
That'll be the winning solution. A combination of
- cache SHA1-based requests forever
- cache ref-based requests a longish time, setting an ETag that
contains headname+SHA1
- on 'revalidate', check the ETag vs the ref and only recompute if
things have changed

In the meantime, the code on kernel.org needs to be updated to the
latest gitweb. On our server, I'd say the newer gitweb is 3~4 times
faster serving the "expensive" summary pages. And much smarter in
terms of caching headers.

cheers
martin
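The ETag scheme in the list above (an ETag built from headname+SHA1, and a cheap check on revalidation) can be sketched in Python as follows. This is an illustration only, not gitweb code: the function names and the exact ETag layout are assumptions, the current SHA-1 of the ref is passed in by the caller (looking it up is a cheap ref read), and the handling of `*` and weak validators in "If-None-Match:" is left out.

```python
def make_etag(head_name, sha1):
    """Build a quoted ETag from the ref name and the commit it points to."""
    return '"%s-%s"' % (head_name, sha1)


def respond(if_none_match, head_name, current_sha1, render):
    """Return (status, headers, body); call render() only when the ref moved.

    if_none_match is the raw request header value, or None if absent.
    """
    etag = make_etag(head_name, current_sha1)
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if etag in candidates:
            # Ref unchanged since the cached copy was made: skip the
            # expensive page generation entirely.
            return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, render()
```

The point of the design is that the 304 path costs one ref lookup, while the 200 path pays for the full page generation only after a push has actually moved the ref.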
-
Doesn't solve the thundering herd problem or the timeout problem at all,
though.

-hpa
-
I posted separately about those. And I've been mulling over whether
the thundering herd is really such a big problem that we need to
address it head-on. If we do HTTP caching headers right (that is, a
bit better than now) then the fact that web caches are distributed
means that even a cache restart or cache invalidation won't trigger a
thundering herd.

And gitweb rarely has a "new" URL that gets a ton of hits immediately.
Our real problem is the summary page, and the fact that we aren't
setting an effective ETag there. If we do, a front-end cache plus the
ability to revalidate the ETag cheaply will get us through.

We get 99% of the benefit from ETags and cheap revalidations,
especially if they are coupled with a reverse caching proxy. The
remaining 1% of dealing with the highly infrequent thundering herd can
be addressed with the scheme I've posted 5 minutes ago.

cheers
martin
-
Uhm... yes it is.
-hpa
-
Got some more info, discussion points or links to stuff I should read
to appreciate why that is? I am trying to articulate why I consider it
is not a high-payoff task, as well as describing how to tackle it.

To recap, the reasons it is not high payoff are that:
- the main benefit comes from being cacheable and able to revalidate
the cache cheaply (with the ETags-based strategy discussed above)
- highly distributed caches/proxies means we'll seldom see a true
cold cache situation
- we have a huge set of URLs which are seldom hit, and will never see
a thundering anything
- we have a tiny set of very popular URLs that are the key target for
the thundering herd - (projects page, summary page, shortlog, fulllog)
- but those are in the clear as soon as the caches are populated

Why do we have to take it head-on? :-)
martin
-
Because the primary failure scenario is timeout on the common queries
due to excess parallel invocations under high I/O load resulting in
catastrophic failure.

-hpa
-
Jakub Narebski wrote:
It could be solved using ECMAScript (if that is an option): Include an exact
time stamp or something that browsers not supporting ECMAScript can
show, and other browsers can change the time stamp to make it relative
and do the coloring/highlighting of recent activity. This could also slightly
speed up the script and it might be better to provide an exact time stamp
by default if aggressive caching is applied.

--
Jonas Fonseca
-
Hmmm, maybe you could have the summaries and rss feed generated on
push, which could also generate elementary files with lines of the
front page. That would turn these top offenders into static page serving.

OG.
-
There are a lot of things which "could be done" given the proper cache
infrastructure and gitweb support.

-hpa
-
