[RFC PATCH v7 10/9] gitweb: Background cache generation and progress indicator

Previous thread: Using overlay filesystem for "other" files idea by Evgeniy Ivanov on Wednesday, December 22, 2010 - 4:02 pm. (3 messages)

Next thread: [PATCH 1/3] Makefile: add NO_FNMATCH_CASEFOLD to IRIX sections by Brandon Casey on Wednesday, December 22, 2010 - 4:58 pm. (7 messages)
From: Jakub Narebski
Date: Wednesday, December 22, 2010 - 4:54 pm

This is preliminary (proof of concept) version of shortened series
intended as replacement (rewrite) of "Gitweb caching v8" series from
John 'Warthog9' Hawley (J.H.).

This series shows how one can manage exception handling using
die_error like die, even in the presence of output caching.  The
output caching engine has an option that allows to turn off (default)
or on caching of error pages.

This series is unfinished; it does not include adaptive cache
lifetime, nor support for other caching engines than the one provided
(like Cache::Cache or CHI), nor does it support background cache
generation or progress info indicator.

This is just intended as proof of concept.

---
Jakub Narebski (9):
      gitweb: Add optional output caching
      gitweb/lib - Cache captured output (using compute_fh)
      gitweb/lib - Very simple file based cache
      gitweb/lib - Simple output capture by redirecting STDOUT to file
      t/test-lib.sh: Export also GIT_BUILD_DIR in test_external
      gitweb: Prepare for splitting gitweb
      gitweb: Introduce %actions_info, gathering information about actions
      gitweb: use eval + die for error (exception) handling
      gitweb: Go to DONE_REQUEST rather than DONE_GITWEB in die_error

 gitweb/Makefile                                |   22 +
 gitweb/README                                  |   46 ++
 gitweb/gitweb.perl                             |  280 +++++++++++++--
 gitweb/lib/GitwebCache/CacheOutput.pm          |   84 ++++
 gitweb/lib/GitwebCache/Capture/ToFile.pm       |  109 ++++++
 gitweb/lib/GitwebCache/FileCacheWithLocking.pm |  452 ++++++++++++++++++++++++
 t/gitweb-lib.sh                                |   11 +
 t/t9500-gitweb-standalone-no-errors.sh         |   20 +
 t/t9501-gitweb-standalone-http-status.sh       |   13 +
 t/t9502-gitweb-standalone-parse-output.sh      |   33 ++
 t/t9510-gitweb-capture-interface.sh            |   34 ++
 t/t9510/test_capture_interface.pl              |  132 +++++++
 ...
From: Jakub Narebski
Date: Wednesday, December 22, 2010 - 4:55 pm

End the request after die_error finishes, rather than exiting gitweb
instance (perhaps wrapped like in ModPerl::Registry or gitweb.psgi
case).

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---

 gitweb/gitweb.perl |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)


diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 4779618..724287b 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1169,6 +1169,7 @@ sub run {
 
 		run_request();
 
+	DONE_REQUEST:
 		$post_dispatch_hook->()
 			if $post_dispatch_hook;
 		$first_request = 0;
@@ -3767,7 +3768,7 @@ EOF
 	print "</div>\n";
 
 	git_footer_html();
-	goto DONE_GITWEB
+	goto DONE_REQUEST
 		unless ($opts{'-error_handler'});
 }
 

--

From: Jonathan Nieder
Date: Wednesday, December 22, 2010 - 6:55 pm

[side note: the "@@ EOF" line above would say "@@ sub die_error {" if

This seems to remove the last user of the DONE_GITWEB label.  Why not
delete the label, too?

When die_error is called by CGI::Carp (via handle_errors_html), it
does not rearm the error handler afaict.  Previously that did not
matter because die_error kills gitweb; now should it be set up
again?

die_error gets called when server load is too high; I wonder whether
it is right to go back for another request in that case.

A broken per-request (or other) configuration could potentially leave
a gitweb process in a broken state, and until now the state would be
reset on the first error.  I wonder if escape valve would be needed
--- e.g., does the CGI harness take care of starting a new gitweb
process after every couple hundred requests or so?

Aside from those (minor) worries, this patch seems like a good idea.
--

From: Jakub Narebski
Date: Saturday, December 25, 2010 - 3:14 pm

Hmmm, I thought that git has Perl-specific diff driver (xfuncname), but
I see that it doesn't.  The default funcname works quite well for Perl
code... with exception of here-documents (or rather their ending).


Well, actually this patch is in this series only for the label ;-)

Anyway, I can simply drop this patch, and have next one in series
(adding exception-based error handling, making die_error work like

Thanks, I missed this (but after examining it turns out to be a 
non-issue).  That will teach me to leave code outside of run() 
subroutine; one of reasons behind creating c2394fe (gitweb: Put all 
per-connection code in run() subroutine, 2010-05-07) was to clarify 
code flow.

A note: using set_message inside handle_errors_html was necessary 
because if there was a fatal error in die_error, then 
handle_errors_html would be called recursively - this was fixed in 
CGI.pm 3.45, but we cannot rely on this; we cannot rely on having new 
enough version of CGI::Carp that supports set_die_handler either.

But actually handle_errors_html gets called only from fatalsToBrowser,
which in turn gets called from CGI::Carp::die... which ends calling
CODE::die (aka realdie), which ends CGI process anyway.

That is why die_error ends with

	goto DONE_GITWEB
		unless ($opts{'-error_handler'});

i.e. it doesn't goto DONE_GITWEB nor DONE_REQUEST if called from

If client (web browser) are requesting connection, we have to tell it
something anyway.  Note that each request might serve different client.
But when the die_error(503, "The load average on the server is too 

'die $@ if $@' would call CORE::die, which means it would end gitweb
process.

For CGI server it doesn't matter anyway, as for each request the process
is respawned anyway (together with respawning Perl interpreter), and I
think that ModPerl::Registry and FastCGI servers monitor process that it
 
Thanks a lot for your comments.

-- 
Jakub Narebski
Poland
--

From: Jonathan Nieder
Date: Sunday, December 26, 2010 - 2:07 am

The default function name discovery already works quite well for Perl
code... with the exception of here-documents (or rather their ending).

 sub foo {
	print <<END
 here-document
 END
	return 1;
 }

The default funcname pattern treats the unindented END line as a
function declaration and puts it in the @@ line of diff and "grep
--show-function" output.

With a little knowledge of perl syntax, we can do better.  You can
try it out by adding "*.perl diff=perl" to the gitattributes file.

Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---

Maybe something like this?

 Documentation/gitattributes.txt |    2 ++
 t/t4018-diff-funcname.sh        |    2 +-
 userdiff.c                      |   15 +++++++++++++++
 3 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 5a7f936..e59b878 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -494,6 +494,8 @@ patterns are available:
 
 - `pascal` suitable for source code in the Pascal/Delphi language.
 
+- `perl` suitable for source code in the Perl language.
+
 - `php` suitable for source code in the PHP language.
 
 - `python` suitable for source code in the Python language.
diff --git a/t/t4018-diff-funcname.sh b/t/t4018-diff-funcname.sh
index 0a61b57..3646930 100755
--- a/t/t4018-diff-funcname.sh
+++ b/t/t4018-diff-funcname.sh
@@ -32,7 +32,7 @@ EOF
 
 sed 's/beer\\/beer,\\/' < Beer.java > Beer-correct.java
 
-builtin_patterns="bibtex cpp csharp fortran html java objc pascal php python ruby tex"
+builtin_patterns="bibtex cpp csharp fortran html java objc pascal perl php python ruby tex"
 for p in $builtin_patterns
 do
 	test_expect_success "builtin $p pattern compiles" '
diff --git a/userdiff.c b/userdiff.c
index 2d54536..fc2afe3 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -61,6 +61,21 @@ PATTERNS("pascal",
 	 "|[-+0-9.e]+|0[xXbB]?[0-9a-fA-F]+"
 	 "|<>|<=|>=|:=|\\.\\."
 	 ...
From: Jonathan Nieder
Date: Sunday, December 26, 2010 - 3:54 am

Hmm.  Should it be

	|([\x80-\xff]+[\x00-\x7f])

then, to match exactly one multibyte UTF-8 character?
--

From: Jonathan Nieder
Date: Sunday, December 26, 2010 - 4:22 am

Yes, that's what I was thinking of.  v2 will be a two-part series
starting with that.

BTW, the perl token matcher is pretty half-hearted.  In part this is
because "only perl can parse perl" [1] terrifies me and in part it is
because I am too lazy to write down the state machine implied by
PPI/Token/*.pm.

If some tokenization wizard would like to work on it, something like
the following might produce more pleasant word diffs:

	"[%&$][[:space:]]*[0-9]+"	/* $1 */
	"|[%&$][[:space:]]*([[:alpha:]_']|::)([[:alnum:]_']|::)*"	/* $var1 */
	"|[%&$][[:space:]]*\\$([[:alnum:]_]|::)([[:alnum:]_']|::)*"	/* $$var1 */
	"|[%&$][[:space:]]*\\$\\{"     /* $${ introducing complicated expression */
	"|[%&$][[:space:]]*\\$\\$"     /* $$$ introducing complicated expression */
	"|[%&$][[:space:]]*[^[:alnum:]_:'^$]"	/* $! */
	"|[%&$][[:space:]]*\\^[][A-Z\\^_?]"	/* $^A */
	"|[%&$][[:space:]]*\\{\\^[][A-Z\\^_?]\\}"	/* ${^A} */
	"|[%&$][[:space:]]*\\{\\^[][A-Z\\^_?][[:alnum:]_]*\\}" /* ${^Foo} */
	/* ${var} */
	"|[%&$][[:space:]]*\\{[[:space:]]*([[:alpha:]_']|::)[[:alnum:]_:]*[[:space:]]\\}"
	"|[%&$][[:space:]]*\\{"	/* ${ introducing complicated expression */
	...

though it is an unmaintainable mess. :)

[1] perl::toke.c and http://www.perlmonks.org/?node_id=44722
--

From: Jakub Narebski
Date: Sunday, December 26, 2010 - 4:14 pm

Thanks a lot.


Besides here-doc, there are some tricky things that such code should
be aware about.

1. BEGIN {
   	...
   }

   and similar code blocks (END, CHECK, INIT, ...) which I think should
   be marked as 'BEGIN' in diff chunk.

2. sub foo {
    FOO: while (1) {
   		...
   	}
   }

   which should be marked with 'sub foo {', I think

3. =head1 NAME

   Git - Perl interface to the Git version control system

   =cut

   i.e. POD... which I don't know what to do about.


I have not checked what your code does wrt those.

-- 
Jakub Narebski
Poland
--

From: Junio C Hamano
Date: Monday, December 27, 2010 - 10:18 am

I do not think Jonathan's patterns would be fooled by this; it wants to
catch only "package <anything>;" and "sub <anything> {".

Jonathan's pattern set allows them to be indented, and followed by some
garbage at the end., which we might want to tighten.  How many people
start 'package' and the outermost 'sub' indented?

 userdiff.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/userdiff.c b/userdiff.c
index fc2afe3..79569c4 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -62,8 +62,10 @@ PATTERNS("pascal",
 	 "|<>|<=|>=|:=|\\.\\."
 	 "|[^[:space:]]|[\x80-\xff]+"),
 PATTERNS("perl",
-	 "^[ \t]*package .*;\n"
-	 "^[ \t]*sub .* \\{",
+	 "^package .*;\n"
+	 "^sub .* \\{\n"
+	 "^[A-Z]+ \\{\n"	/* BEGIN, END, ... */
+	 "^=head[0-9] ",	/* POD */
 	 /* -- */
 	 "[[:alpha:]_'][[:alnum:]_']*"
 	 "|0[xb]?[0-9a-fA-F_]*"


--

From: Jakub Narebski
Date: Monday, December 27, 2010 - 3:44 pm

Note that in future Perl 5.14 there would be 'package NAME {' form,
so perhaps it would be better to future-proof and use


Using "sub foo {" is just a recommended programming convention (like e.g.
GNU convention or K&R convention for C code).  I think it would be better
to relax it a bit, either

  +	 "^sub "

or



-- 
Jakub Narebski
Poland
--

From: Jeff King
Date: Monday, December 27, 2010 - 8:52 pm

FWIW, I sometimes do:

{
  my $static;
  sub foo {
    ... use $static ...
  }
}

so perhaps allowing whitespace in front of the keyword is worthwhile
there. I have never indented "package", though.

-Peff
--

From: Jonathan Nieder
Date: Sunday, December 26, 2010 - 2:50 am

I like the current order (first the brief patch to change the
semantics, then the more ambitious change to an eval {} based error

Right, I should have thought a few seconds more.  Respawning


Sorry, that was unclear of me.  I meant that buggy configuration could
leave a gitweb process in buggy but alive state and frequent failing
requests might be a way to notice that.  Contrived example (just to
illustrate what I mean):

	our $version .= ".custom";
	if (length $version >= 1000) {	# untested, buggy code goes here.
		@diff_opts = ("--nonsense");
	}

I think I was not right to worry about this, either.  It is better to
make such unusual and buggy configurations as noticeable as possible
so they can be fixed.


Thanks for a thorough explanation.  For what it's worth, with or
without removal of the DONE_GITWEB: label,

Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>

[1] I can imagine scenarios in which exiting gitweb would help
alleviate the load, involving:

 - large memory footprint for each gitweb process forcing the system
   into swapping (e.g., from a memory leak), or
 - FastCGI-like server noticing the load and choosing to decrease the
   number of gitweb instances.

In the usual case, presumably gitweb memory footprint is small and
FastCGI-like servers limit the number of gitweb instances to a modest
fixed number.
--

From: Jakub Narebski
Date: Sunday, December 26, 2010 - 3:25 pm

I assume that CGI / FastCGI / mod_perl (+ ModPerl::Registry) web server
would know how to regulate number of workers according to the server

I'm sorry I haven't made myself clear.

What I meant here is that gitweb includes the following code

	if (-e $GITWEB_CONFIG) {
		do $GITWEB_CONFIG;
		die $@ if $@;
	}

which means that CGI::Carp::die is called, which might call 
handle_errors_html, and which ends in CORE::die, which ends gitweb
process.  So if there is no way for broken configuration to leave
gitweb in a rboken state _at this point in series_.

Thank you for thinking about this, because it could cause problems
(could because I have not checked if it does or if it doesn't) in the
following patch, when gitweb uses eval / die for error handling.
Then it might happen when $per_request_config is false or CODE that
instead of trying to reread broken config on subsequent requests, we
will run with broken config.  It depends if "die"-ing in 
evaluate_gitweb_config would prevent setting $first_request to false.



Thanks.

-- 
Jakub Narebski
Poland
--

From: Jakub Narebski
Date: Wednesday, December 22, 2010 - 4:55 pm

In gitweb code it is assumed that calling die_error() subroutine would
end request, just like running "die" would.  Up till now it was done by
having die_error() jump to DONE_REQUEST (earlier DONE_GITWEB), or in
earlier version just 'exit' (for mod_perl via ModPerl::Registry it ends
request instead of exiting worker).

Instead of using 'goto DONE_REQUEST' for longjmp-like nonlocal jump, or
using 'exit', gitweb uses now native for Perl exception handlingin the
form of eval / die pair ("eval BLOCK" to trap exceptions, "die LIST" to
raise/throw them).

Up till now the "goto DONE_REQUEST" solution was enough, but with the
coming output caching support and it adding modular structure to gitweb,
it would be difficult to continue to keep using this solution
(e.g. interaction with capturing output).


Because gitweb now traps all exceptions occuring run_request(), the
handle_errors_html handler (set via set_message from CGI::Carp) is not
needed; gitweb can call die_error in -error_handler mode itself.  This
has the advantage that we can now set correct HTTP header (handler from
CGI::Carp::set_message was run after HTTP headers were already sent).

Gitweb assumes here that exceptions thrown by Perl would be simple
strings; die_error() throws hash reference (if not for minimal
extrenal dependencies, it would be probable object of Class::Exception
or Throwable class thrown).

Note: in newer versions of CGI::Carp there is set_die_handler(), where
handler have to set HTTP headers to the browser itself, but we cannot
rely on new enough CGI::Carp version to have been installed.  Also
set_die_handler interferes with fatalsToBrowser.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---

 gitweb/gitweb.perl |   26 ++++++++------------------
 1 files changed, 8 insertions(+), 18 deletions(-)


diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 724287b..c7a1892 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -12,7 +12,7 @@ use strict;
 use warnings;
 use CGI ...
From: Jonathan Nieder
Date: Wednesday, December 22, 2010 - 7:08 pm

The !ref($@) seems overzealous, which is why I am wondering if it

Is the DONE_REQUEST label still needed?

Thanks, I am happy to see the semantics becoming less thorny.
Jonathan
--

From: Jakub Narebski
Date: Saturday, December 25, 2010 - 4:17 pm

First, 'gitweb: Prepare for splitting gitweb' commit is only later in
series... ;-) but that of course is not a serious issue.

Second, more important is that I'd rather gitweb doesn't go "reinvent
the wheel" route.  I'd rather (re)use Exception::Class (like e.g. 
SVN::Web does it) if we go the OO exception handling route.

But if we are going to use Exception::Class, then we can also use



You meant Scalar::Util::blessed here, isn't it? Fortunately Scalar::Util
is core Perl module.

By 'overzealous' do you mean here possibility of catching what we 
shouldn't, i.e. non-gitweb error (not thrown by die_error)?  We can
narrow it to "ref($@) eq 'HASH'", but I don't think it would be ever


Now I should check if this doesn't affect gitweb performance too badly.
IIRC I have chosen 'goto DONE_GITWEB' because I didn't know about 
ModPerl::Registry redefining 'exit' (why it was done), and because of
some microbenchmark showing that it performs better than die/eval (why
this specific solution)...

But I think that the performance hit would be negligible in practice;
making gitweb more maintainable is I think worth the cost.

-- 
Jakub Narebski
Poland
--

From: Jakub Narebski
Date: Monday, January 3, 2011 - 5:35 pm

Unify error handling by treating errors from Perl (and thrown early in
process using 'die STRING'), and errors from gitweb in the same way.
This means that in both cases error page is generated after an error
is caught in run() subroutine.

die_error() subroutine is now split into three: gen_error() which
massages parameters (escaping HTML, turning HTTP status number into
full HTTP status code), die_error() which uses gen_error() and just
throws an error (and does not generate an error page), and
send_error() which catually generate error page based on provided
error / exception.


Sidenote: probably in the future instead of using simple hash for
throwing gitweb exception, gitweb would use some custom error class,
e.g. derivative of Exception::Class (like SVN::Web does it).

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
This is sent early to facilitate early comments.  It passes test suite,
but it was not extensively tested.

Now die_error() functions mode like 'die'...

 gitweb/gitweb.perl |   47 ++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index c7a1892..5854f73 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1153,9 +1154,13 @@ sub run {
 			if $pre_dispatch_hook;
 
 		eval { run_request() };
-		if (defined $@ && !ref($@)) {
+		my $error = $@;
+		if ($error) {
 			# some Perl error, but not one thrown by die_error
-			die_error(undef, undef, $@, -error_handler => 1);
+			$error = gen_error(undef, undef, $error)
+				unless ref($error);
+
+			send_error($error);
 		}
 
 	DONE_REQUEST:
@@ -3730,11 +3735,14 @@ sub git_footer_html {
 #      an unknown error occurred (e.g. the git binary died unexpectedly).
 # 503: The server is currently unavailable (because it is overloaded,
 #      or down for maintenance).  Generally, this is a temporary state.
-sub die_error {
+
+# gen_error()  generates error object from parameters
+# ...
From: Jakub Narebski
Date: Wednesday, December 22, 2010 - 4:55 pm

Currently it only contains information about output format, and is not
used anywhere.  It will be used to check whether current action
produces HTML output, and therefore is displaying HTML-based progress
info about (re)generating cache makes sense.

It can contain information about allowed extra options, whether to
display link to feed (Atom or RSS), etc. in easier and faster way than
listing all matching or all non-matching actions at appropriate place.


Currently not used; will be used in next commit, to check if action
produces HTML output and therefore we can use HTML-specific progress
indicator.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---

 gitweb/gitweb.perl |   57 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 53 insertions(+), 4 deletions(-)


diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index c7a1892..e50654b 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -749,6 +749,54 @@ our %allowed_options = (
 	"--no-merges" => [ qw(rss atom log shortlog history) ],
 );
 
+# action => {
+# 	# what is the output format (content-type) of action
+# 	'output_format' => ('html' | 'text' | 'feed' | 'binary' | undef),
+# 	# does action require $project parameter to work
+# 	'needs_project' => (boolean | undef),
+# 	# log-like action, can start with arbitrary ref or revision
+# 	'log_like' => (boolean | undef),
+# 	# has no specific feed, or should lik to OPML / generic project feed
+# 	'no_feed' => (boolean | undef),
+# 	# allowed options to be passed ussing 'opt' parameter
+# 	'allowed_options' => { 'option_1' => 1 [, ... ] },
+# }
+our %actions_info = ();
+sub evaluate_actions_info {
+	our %actions_info;
+	our (%actions);
+
+	# unless explicitely stated otherwise, default output format is html
+	# most actions needs $project parameter
+	foreach my $action (keys %actions) {
+		$actions_info{$action}{'output_format'} = 'html';
+		$actions_info{$action}{'needs_project'} = 1;
+	}
+	# list all exceptions; undef ...
From: Jakub Narebski
Date: Wednesday, December 22, 2010 - 4:57 pm

Add GitwebCache::Capture::ToFile package, which captures output by
redirecting STDOUT to given file (specified by filename, or given opened
filehandle), earlier saving original STDOUT to restore it when finished
capturing.

GitwebCache::Capture::ToFile preserves PerlIO layers, both those set
before started capturing output, and those set during capture.

No care was taken to handle the following special cases (prior to
starting capture): closed STDOUT, STDOUT reopened to scalar reference,
tied STDOUT.  You shouldn't modify STDOUT during capture.

Includes separate tests for capturing output in
t9510/test_capture_interface.pl which is run as external test from
t9510-gitweb-capture-interface.sh.  It tests capturing of utf8 data


This patch was based on "gitweb: add output buffering and associated
functions" patch by John 'Warthog9' Hawley (J.H.) in "Gitweb caching v7"
series, and on code of Capture::Tiny by David Golden (Apache License 2.0).

Based-on-work-by: John 'Warthog9' Hawley <warthog9@kernel.org>
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---

 gitweb/lib/GitwebCache/Capture/ToFile.pm |  109 +++++++++++++++++++++++++
 t/t9510-gitweb-capture-interface.sh      |   34 ++++++++
 t/t9510/test_capture_interface.pl        |  132 ++++++++++++++++++++++++++++++
 3 files changed, 275 insertions(+), 0 deletions(-)
 create mode 100644 gitweb/lib/GitwebCache/Capture/ToFile.pm
 create mode 100755 t/t9510-gitweb-capture-interface.sh
 create mode 100755 t/t9510/test_capture_interface.pl


diff --git a/gitweb/lib/GitwebCache/Capture/ToFile.pm b/gitweb/lib/GitwebCache/Capture/ToFile.pm
new file mode 100644
index 0000000..d2dbf0f
--- /dev/null
+++ b/gitweb/lib/GitwebCache/Capture/ToFile.pm
@@ -0,0 +1,109 @@
+# gitweb - simple web interface to track changes in git repositories
+#
+# (C) 2010, Jakub Narebski <jnareb@gmail.com>
+#
+# This program is licensed under the GPLv2
+
+#
+# Simple output capturing via redirecting STDOUT to given file.
+#
+
+# This is the ...
From: Jonathan Nieder
Date: Friday, December 24, 2010 - 2:49 am

Micronit: if the license of Capture::Tiny were relevant then we would be
in trouble, I think.  (Apache-2.0 and GPLv2 aren't compatible licenses.)
Luckily


looks trivial enough.  Maybe either avoiding mention of the license or
clarifying that that is not intended to be the sole license for the
stripped-down code would help?
--

From: Jakub Narebski
Date: Sunday, December 26, 2010 - 4:03 pm

Damn, I have thought that Apache-2.0 and GPLv2 are compatibile.  This is
the only reason that I explicitely mentioned the license (that and it is
not usual "licensed like Perl", i.e. dual Artistic Perl License / GPL 

You are right.  I have done similar thing for PerlIO::Util based capture,
though I didn't know about the 'binmode($fh, join(":", ":raw", @unique));'
trick.

So I think we would be in the clear by changing the comment to read:

  +# see also _relayer in Capture::Tiny by David Golden

or something like that.


Or we can try to change gitweb license to GPLv3 / AGPLv3, which is
compatibile (one way only) with Apache-2.0... just kidding :-)
-- 
Jakub Narebski
Poland
--

From: Jakub Narebski
Date: Wednesday, December 22, 2010 - 4:56 pm

This way we can use it in test scripts written in other languages
(e.g. in Perl), and not rely on "$TEST_DIRECTORY/.."

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---

 t/test-lib.sh |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)


diff --git a/t/test-lib.sh b/t/test-lib.sh
index 48fa516..c077fa4 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -552,9 +552,9 @@ test_external () {
 		# Announce the script to reduce confusion about the
 		# test output that follows.
 		say_color "" "# run $test_count: $descr ($*)"
-		# Export TEST_DIRECTORY, TRASH_DIRECTORY and GIT_TEST_LONG
+		# Export TEST_DIRECTORY, GIT_BUILD_DIR, TRASH_DIRECTORY and GIT_TEST_LONG
 		# to be able to use them in script
-		export TEST_DIRECTORY TRASH_DIRECTORY GIT_TEST_LONG
+		export TEST_DIRECTORY GIT_BUILD_DIR TRASH_DIRECTORY GIT_TEST_LONG
 		# Run command; redirect its stderr to &4 as in
 		# test_run_, but keep its stdout on our stdout even in
 		# non-verbose mode.

--

From: Jakub Narebski
Date: Wednesday, December 22, 2010 - 4:56 pm

Prepare gitweb for having been split into modules that are to be
installed alongside gitweb in 'lib/' subdirectory, by adding

  use lib __DIR__.'/lib';

to gitweb.perl (to main gitweb script), and preparing for putting
modules (relative path) in $(GITWEB_MODULES) in gitweb/Makefile.

This preparatory work allows to add new module to gitweb by simply
adding

  GITWEB_MODULES += <module>

to gitweb/Makefile (assuming that the module is in 'gitweb/lib/'
directory).

While at it pass GITWEBLIBDIR in addition to GITWEB_TEST_INSTALLED to
allow testing installed version of gitweb and installed version of
modules (for future tests which would check individual (sub)modules).


Using __DIR__ from Dir::Self module (not in core, that's why currently
gitweb includes excerpt of code from Dir::Self defining __DIR__) was
chosen over using FindBin-based solution (in core since perl 5.00307,
while gitweb itself requires at least perl 5.8.0) because FindBin uses
BEGIN block, which is a problem under mod_perl and other persistent
Perl environments (thought there are workarounds).

At Pavan Kumar Sankara suggestion gitweb/Makefile uses

  install [OPTION]... SOURCE... DIRECTORY

format (2nd format) with single SOURCE rather than

  install [OPTION]... SOURCE DEST

format (1st format) because of security reasons (race conditions).
Modern GNU install has `-T' / `--no-target-directory' option, but we
cannot rely that the $(INSTALL) we are using supports this option.

The install-modules target in gitweb/Makefile uses shell 'for' loop,
instead of make's $(foreach) function, to avoid possible problem with
generating a command line that exceeded the maximum argument list
length.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---

 gitweb/Makefile    |   17 +++++++++++++++--
 gitweb/gitweb.perl |    8 ++++++++
 2 files changed, 23 insertions(+), 2 deletions(-)


diff --git a/gitweb/Makefile b/gitweb/Makefile
index 0a6ac00..e6029e1 100644
--- a/gitweb/Makefile
+++ ...
From: Jonathan Nieder
Date: Friday, December 24, 2010 - 2:29 am

Spelled out, this means modules would typically go in

	/usr/share/gitweb/lib

Is that the right place?  I suspect something like

	/usr/lib/gitweb/

could make sense in some installations for two reasons:

 - even braindamaged webserver configurations would not serve lib/
   as static files in that case;

 - if some modules are implemented in C for speed, they would need
   to go in /usr/lib anyway to follow usual filesystem conventions.


This explanation and the code below leave me nervous that the answer
might be "no". ;-)

--

From: Jakub Narebski
Date: Sunday, December 26, 2010 - 3:54 pm

Yes, it's true.  It is mainly to support situation where one can install
files in (subdirectory of) cgi-bin, but nowehere else.  That is why the
default is to install modules alongside with gitweb.

The additional advantage is that t/gitweb-lib.sh used by gitweb tests
can very simply test source version of gitweb, with gitweb finding
source version of modules.  But it is not a very large obstacle to

Actually it doesn't matter what web server does with those files when
accessed directly, except for the client (user) confusion if he/she
goes where not invited.  Modules are used by Perl (by gitweb), not by

Ugh, XS!  I sincerely hope that when there would be decision to implement
some features in C for speed, we would be able to use Perl version of
ctypes for C-to-Perl interface, not XS.

Anyway most probable to be implemented in C would be Git.pm, or rather
Perl interface to libgit2.  It is probable that at some point gitweb
would be converted to use Git.pm or its successor.  But I guess that
Git Perl module would be installed somewhere in PERL5LIB, so it would

I have thought that I did provide 'gitweblibdir' as configurable knob,
but I see that in the version I have send I don't do this:

  # Shell quote;
  bindir_SQ = $(subst ','\'',$(bindir))#'
  gitwebdir_SQ = $(subst ','\'',$(gitwebdir))#'
  gitwebstaticdir_SQ = $(subst ','\'',$(gitwebdir)/static)#'
  gitweblibdir_SQ = $(subst ','\'',$(gitwebdir)/lib)#'

But if we are to allow custom gitweblibdir, we would have to change the
way gitweb is to find its modules.  One solution would be inetad of
current

  # __DIR__ is taken from Dir::Self __DIR__ fragment
  sub __DIR__ () {
  	File::Spec->rel2abs(join '', (File::Spec->splitpath(__FILE__))[0, 1]);
  }
  use lib __DIR__ . '/lib';

use simply

  use lib $ENV{GITWEBLIBDIR} || "++GITWEBLIBDIR++";

Of course both gitweb/Makefile and t/gitweb-lib.sh would have to be 
updated: gitweb/Makefile to include replacement rule for '++GITWEBLIBDIR++'
in GITWEB_REPLACE, and ...
From: Jakub Narebski
Date: Wednesday, December 22, 2010 - 4:57 pm

This is first step towards implementing file based output (response)
caching layer that is used on such large sites as kernel.org.

This patch introduces GitwebCaching::SimpleFileCache package, which
follows Cache::Cache / CHI interface, although do not implement it
fully.  The intent of following established convention for cache
interface is to be able to replace our simple file based cache,
e.g. by the one using memcached.

The data is stored in the cache as-is, without adding metadata (like
expiration date), and without serialization (which means that one can
store only scalar data).  At this point there is no support for
expiring cache entries.


The code of GitwebCaching::SimpleFileCache package in gitweb/lib
was heavily based on file-based cache in Cache::Cache package, i.e.
on Cache::FileCache, Cache::FileBackend and Cache::BaseCache, and on
file-based cache in CHI, i.e. on CHI::Driver::File and CHI::Driver
(including implementing atomic write, something that original patch
lacks).  It tries to follow more modern CHI architecture, but without
requiring Moose.  It is much simplified compared to both interfaces
and their file-based drivers.

This patch does not yet enable output caching in gitweb (it doesn't
have all required features yet); on the other hand it includes tests
of cache Perl API in t/t9503-gitweb-caching-interface.sh.

Inspired-by-code-by: John 'Warthog9' Hawley <warthog9@kernel.org>
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---

 gitweb/lib/GitwebCache/FileCacheWithLocking.pm |  452 ++++++++++++++++++++++++
 t/t9511-gitweb-caching-interface.sh            |   34 ++
 t/t9511/test_cache_interface.pl                |  381 ++++++++++++++++++++
 3 files changed, 867 insertions(+), 0 deletions(-)
 create mode 100644 gitweb/lib/GitwebCache/FileCacheWithLocking.pm
 create mode 100755 t/t9511-gitweb-caching-interface.sh
 create mode 100755 t/t9511/test_cache_interface.pl


diff --git a/gitweb/lib/GitwebCache/FileCacheWithLocking.pm ...
From: Jakub Narebski
Date: Wednesday, December 22, 2010 - 4:57 pm

Add GitwebCache::CacheOutput package, which introduces cache_output
subroutine.  If data for given key is present in cache, then
cache_output gets data from cache and prints it.  If data is not present
in cache, then cache_output runs provided subroutine (code reference),
captures its output, saves this output in cache, and prints it.

It requires that provided $cache supports ->capture_fh method, like
GitwebCache::FileCacheWithLocking introduced in earlier commit, and that
provided $capture supports capturing to file or filehandle via
->capture($code, $file) method, like GitwebCache::Capture::ToFile
introduced in some earlier commit.

Exceptions in $code should be thrown using 'die' (Perl exception
mechanism); one can choose whether error output (output printed when
exception is raised, before raising it) should be saved to cache or not.
By default error output is not cached.

Gitweb would use cache_output to get page from cache, or to generate
page and save it to cache.  The die_error subroutine throws exception,
which will be caught and by default rethrown; error pages would not be
cached.

It is assumed that data is saved to cache _converted_, and should
therefore be read from cache and printed to STDOUT in ':raw' (binary)
mode.


Add t9512/test_cache_output.pl test, run as external test in
t9512-gitweb-cache.  It checks that cache_output behaves correctly,
namely that it saves and restores action output in cache, and that it
prints generated output or cached output, depending on whether there
exist data in cache.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---

 gitweb/lib/GitwebCache/CacheOutput.pm    |   84 ++++++++++++++++
 t/t9512-gitweb-cache-output-interface.sh |   34 ++++++
 t/t9512/test_cache_output.pl             |  162 ++++++++++++++++++++++++++++++
 3 files changed, 280 insertions(+), 0 deletions(-)
 create mode 100644 gitweb/lib/GitwebCache/CacheOutput.pm
 create mode 100755 t/t9512-gitweb-cache-output-interface.sh
 create mode 100755 ...
From: Jakub Narebski
Date: Wednesday, December 22, 2010 - 4:58 pm

This commit actually adds output caching to gitweb, as we have now
minimal features required for it in GitwebCache::FileCacheWithLocking
(a 'dumb' but fast file-based cache engine).  To enable cache you need
(at least) set $caching_enabled to true in gitweb config, and copy
required modules alongside generated gitweb.cgi - this is described
in more detail in the new "Gitweb caching" section in gitweb/README.
"make install-gitweb" would install all modules alongside gitweb
itself.

Capturing and caching is designed in such way that there is no
behaviour change if $caching_enabled is false.  If caching is not
enabled, then capturing is also turned off.

Enabling caching causes the following additional changes to gitweb
output:
* Disables content-type negotiation (choosing between 'text/html'
  mimetype and 'application/xhtml+xml') when caching, as there is no
  content-type negotiation done when retrieving page from cache.
  Use lowest common denominator of 'text/html' mimetype which can
  be used by all browsers.  This may change in the future.
* Disable optional timing info (how much time it took to generate the
  original page, and how many git commands it took), and in its place show
  unconditionally when page was originally generated (in GMT / UTC
  timezone).
* Disable 'blame_incremental' view, as it doesn't make sense without
  printing data as soon as it is generated (which would require tee-ing
  when capturing output for caching)... and it doesn't work currently
  anyway.  Alternate solution would be to run 'blame_incremental' view
  with caching disabled.


Add basic tests of caching support to t9500-gitweb-standalone-no-errors
test: set $caching_enabled to true and check for errors for first time
run (generating cache) and second time run (retrieving from cache) for a
single view - summary view for a project.

Check in the t9501-gitweb-standalone-http-status test that gitweb at
least correctly handles "404 Not Found" error pages also in the case
when gitweb ...
From: Jakub Narebski
Date: Friday, December 31, 2010 - 11:03 am

This commit removes asymmetry in serving stale data (if stale data exists)
when regenerating cache in GitwebCache::FileCacheWithLocking.  The process
that acquired exclusive (writers) lock, and is therefore selected to
be the one that (re)generates data to fill the cache, can now generate
data in background, while serving stale data.

Those background processes are daemonized, i.e. detached from the main
process (the one returning data or stale data).  Otherwise there might be a
problem when gitweb is running as (part of) long-lived process, for example
from mod_perl or from FastCGI: it would leave unreaped children as zombies
(entries in process table).  We don't want to wait for background process,
and we can't set $SIG{CHLD} to 'IGNORE' in gitweb to automatically reap
child processes, because this interferes with using
  open my $fd, '-|', git_cmd(), 'param', ...
    or die_error(...)
  # read from <$fd>
  close $fd
    or die_error(...)
In the above code "close" for magic "-|" open calls waitpid...  and we
would would die with "No child processes".  Removing 'or die' would
possibly remove ability to react to other errors.

This feature can be enabled or disabled on demand via 'background_cache'
cache parameter.  It is turned on by default.


When there is no stale version suitable to serve the client, currently
we have to wait for the data to be generated in full before showing it.
Add to GitwebCache::FileCacheWithLocking, via 'generating_info' callback,
the ability to show user some activity indicator / progress bar, to
show that we are working on generating data.

Note that without generating data in background, process generating
data wouldn't print progress info, because 'generating_info' can exit
(and in the case of gitweb's git_generating_data_html does exit).

We don't need to daemonize background process in this case, where
there is no stale data to serve, but progress info is on.  This is
because we have to wait for the background process to ...
From: Jakub Narebski
Date: Monday, January 3, 2011 - 2:33 pm

Instead of having gitweb use progress info indicator / throbber to
notify user that data is being generated by current process, gitweb
can now (provided that PerlIO::tee from PerlIO::Util is available)
send page to web browser while simultaneously saving it to cache
(print and capture, i.e. tee), thus having incremental generating of
page serve as a progress indicator.


To do this, the GitwebCache::Capture::ToFile module acquired ->tee()
subroutine, similar to ->capture(), but it prints while capturing
output.  The ->tee() method (and its worker methods ->tee_start() and
->tee_end()) are available only if PerlIO::tee from PerlIO::Util
distribution is present.  Tests checking if this feature works as
expected were added to t9510 test.

An alternative would be to provide two versions of
GitwebCache::Capture::ToFile.  Note also that PerlIO::tee is not
strictly necessary, as Capture::Tiny shows, but in most generic case
(like done in Capture::Tiny) one needs separate process functioning as
multplexer.


Because tee-ing (printing while capturing) can function as a kind of
progress indicator only for process generating the data for cache
entry, and not for the processes waiting for data to be generated,
therefore 'generating_info' got splitinto 'get_progress_info' and
'set_progress_info'.  You can set now in GitwebCache::FIleCacheWithLocking
those two separately.  You are expected to unset 'set_progress_info'
when using tee-ing capturing engine.  Some tests added to t9511 with
tee-like situation.

As a proof of concept gitweb now uses two slightly different versions
of "Generating..." page; if you worry about interaction between progress
indicator and non-cacheable error pages, you can set 'set_progress_info'
separately to undef.


The cache_output subroutine from GitwebCache::CacheOutput got updated
to use ->tee() subroutine if $capture supports it.  If ->tee() is
used, then of course generated data doesn't need to and shouldn't be
printed; also cache_output unsets ...

In general, and particularly for the large sites that caching is
targeted at, teeing is a really bad idea.  I've mentioned this several
times before, and the progress indicator is a *MUCH* better idea.  I'm
not sure how many times I can say that, even if this was added it would
have the potential to exacerbate disk thrashing and overall make things
a lot more complex.

1) Errors may still be generated in flight as the cache is being
generated.  It would be better to let the cache run with a progress
indicator and should an error occur, display the error instead of giving
any output that may have been generated (and thus likely a broken page).

2) Having multiple clients all waiting on the same page (in particular
the index page) can lead to invalid output.  In particular if you are
teeing the output a reading client now must come in, read the current
contents of the file (as written), then pick up on the the tee after
that.  It's actually possible for the reading client to miss data as it
may be in flight to be written and the client is switching from reading
the file to reading the tee.  I don't see anything in your code to
handle that kind of switch over.

3) This makes no allowance for the file to be generated completely in
the background while serving stale data in the interim.  Keep in mind
that it can (as Fedora has experienced) take *HOURS* to generate the
index page, teeing that output just means brokenness and isn't useful.

It's much better to have a simple, lightweight waiting message get
displayed while things happen.  When they are done, output the completed
page to all waiting clients.

- John 'Warthog9' Hawley

P.S. I'm back to work full-time on Wednesday, which I'll be catching up
on gitweb and trying to make forward progress on my gitweb code again.
--

From: Jakub Narebski
Date: Monday, January 3, 2011 - 5:28 pm

It might be true that tee-ing is bad for very large sites, as it
increases load a bit in those (I think) extremly rare cases where
clients concurrently access the very same error page.  But it might
be a solution for those in between cases.  I think that incrementally
generated page is better progress indicator than just "Generating..."
page.

Anyway this proof of concept patch is to show how such thing should
be implemented.  I don't think that it makes things a lot more complex;
in this rewrite everything is quite well modularized, encapsulated, and
isolated.


But the main intent behind this patch was to avoid bad interaction between
'progress info' indicator (in the process that is generating page, see

On the contrary, with tee-ing (and zero size sanity check) you would be
able to see pages even if there are errors saving cache entry.  Though
this wouldn't help very large sites which cannot function without caching,
it could be useful for smaller sites.
 

Err... could you explain what do you mean by "client is switching from
reading the file to reading the tee"?


Hmmm... I thought that the code is clear.  Generating data, whether it
is captured to be displayed later, or tee-ed i.e. printed and captured
to cache, is inside critical section, protected by exclusive lock.  Only
after cache entry is written (in full), the lock is released, and clients
waiting for data can access it; they use shared (readers) lock for sync.
 
Note that in my rewrite (and I think also in _some_ cases in your version)
files are written atomically, by writing to temporary file then renaming

It does make allowance.  cache_output from GitwebCache::CacheOutput uses
capturing and not tee-ing if we are in background process.  When there
is stale data to serve, cache entry is (re)generated in background in
detached process.
 
Moreover by default cache_output has safety in that error pages generated
by such detached process are cached.

Note also that in my rewrite you can simply (by ...
From: Jakub Narebski
Date: Tuesday, January 4, 2011 - 6:20 am

I was not sure how Perl reacts to ENOSPC (No space left on device), 
which I think it is only error that can be generated in flight as
cache is being generated (or as gitweb output is printed i.e. sent
to browser and captured/tee-ed i.e. saved to cache entry file), so
I have checked this (using loopback to create small filesystem).

The outcomes one worry about are the following:
* Perl dies during printing - this leads to broken page send to
  browser, and no cache entry generated
* Perl prints output without dying at all; the page send to browser
  via tee-in is all right, but cache entry is truncated which results
  in broken page shown to other clients.

But what actually happens is actually different, and quite safe:
* Perl prints output without dying, and dies on closing cache entry
  file with ENOSPC.  This means that client generating data gets correct
  output, and cache entry is not generated.  Other clients with my code
  try their hand at generation and also get correct page, but not save
  it to cache.

This means that no error page about problems with cache is shown, which
is bad.  On the other hand, at least for smaller sites, gitweb keeps 
working as if without cache for newer entries.
  
Note that observed behaviour might depend on operating system / filesystem


Errr... now after rereading your email I see that caching error pages
has one problem: errors that come from the caching engine or capturing

Sent as
  [RFC PATCH v7 2.5/9] gitweb: Make die_error just die, and use send_error
    to create error pages
  Message-ID: <201101040135.08638.jnareb@gmail.com>
  http://permalink.gmane.org/gmane.comp.version-control.git/164466

-- 
Jakub Narebski
Poland
--

Previous thread: Using overlay filesystem for "other" files idea by Evgeniy Ivanov on Wednesday, December 22, 2010 - 4:02 pm. (3 messages)

Next thread: [PATCH 1/3] Makefile: add NO_FNMATCH_CASEFOLD to IRIX sections by Brandon Casey on Wednesday, December 22, 2010 - 4:58 pm. (7 messages)