Re: 'git status' is not read-only fs friendly

Previous thread: Re: [sugar] etoys - binary blob in GIT by Joshua N Pritikin on Friday, February 9, 2007 - 11:20 am. (3 messages)

Next thread: Re: git-pull and tag objects by Junio C Hamano on Friday, February 9, 2007 - 4:19 pm. (13 messages)
From: Marco Costalba
Date: Friday, February 9, 2007 - 12:25 pm

In a repository under a mounted Windows directory (ntfs) I get this error:

$ git status
fatal: unable to create '.git/index.lock': Read-only file system
$

Is this correct? there exist a workaround? I just need to know if
current working directory is clean and report back to qgit user, so
read-only access would be ok for me.

All other commands commonly used to browse a repository seems to work
well, without pretending to write stuff.

Thanks
Marco
-

From: Linus Torvalds
Date: Friday, February 9, 2007 - 12:56 pm

"git status" is kind of strange. It's really technically the engine behind 
the messages for "git commit": both in a very real technical sense (it's 
_literally_ the same script:"git-commit.sh" is not just "git-commit", but 
also "git-status"), but also in a very real historical "that is what the 
code was written for".

And you wouldn't think that it really needs write access, and you'd be 
largely correct, EXCEPT for the fact that git status actually does a 
refresh of the index, to make sure that we don't claim that something is 
dirty just because somebody has touched the file.

IOW, there's an implicit "git update-index --refresh" as part of 
calculating the status, and that's the thing that wants to lock the index 

"git status" doesn't "pretend" to write stuff. It really does. 

You *can* just use "git-runstatus" instead. That's the command that 
actually does all the heavy lifting. But you can see the difference by 
doing this:

	touch Makefile
	git runstatus

vs

	touch Makefile
	git status

Notice how the "runstatus" one claims that Makefile is "modified:". That's 
exactly because it doesn't do the index refresh.

			Linus
-

From: Marco Costalba
Date: Friday, February 9, 2007 - 1:19 pm

Sorry, perhaps it is a silly question, but why git index should be
different after just touching a file?

IOW is it not possible that "git update-index --refresh" exists
without modifing the index, just because ther's nothing to modify?

So, finally, could be possible making "git status" taking the lock
only _after_ has checked there's something new to write to the index?
So to avoid write access in most cases ? (expecially with repo mounted
on a read-only fs)

Thanks
Marco
-

From: Junio C Hamano
Date: Friday, February 9, 2007 - 1:27 pm

It uses the information from lstat(2) that is stored in the
index and taken from the filesystem -- if they are different
(touch changes it as you know, but so does "vi that-file") they
are reported as "modified".

	git checkout -- Makefile
        git diff-files Makefile
	touch Makefile
	git diff-files Makefile

The 'refresh' operation is to update the lstat(2) information in
the index for paths whose contents actually match what is in the
index (and it does not do anything else -- most notably paths
whose contents are different between the working tree and the
index are left as-is).

-

From: Junio C Hamano
Date: Friday, February 9, 2007 - 1:22 pm

Running refresh internally in runstatus without writing the
result out _might_ be an option, but that would largely be
a hack to only help qgit.

Other shapes of "git status", such as "git status <filename>"
and "git status -a", still need to perform the same index
manipulation as "git commit" with the same parameters before
calling git-runstatus, and at that point the extra "internal
refresh" in runstatus is an unwelcome extra cycle.

-

From: Morten Welinder
Date: Friday, February 9, 2007 - 1:29 pm

I might be overlooking something, but couldn't that updated index be
saved elsewhere?  And subcommands be pointed at that, of course.

Morten
-

From: Theodore Tso
Date: Friday, February 9, 2007 - 4:27 pm

You should be able to set the GIT_INDEX_FILE environtment variable to
point the index somewhere else than $GIT_DIR/index, so it's not
located on a read-only filesystem.  See the git(7) man page.

						- Ted



-

From: Marco Costalba
Date: Friday, February 9, 2007 - 1:35 pm

Yes, I agree.

If I modify qgit in running 'git runstatus' as a fallback in case 'git
status' exits with an error (without checking what kind of error
exactly) could be an acceptable path or could hide subtle
side-effects? I have no the knowledge to answer this by hand.

Thanks
Marco
-

From: Linus Torvalds
Date: Friday, February 9, 2007 - 1:59 pm

It's probably better for you to just

 - run "git update-index --refresh" and don't care about the exit value
 - run "git runstatus" unconditionally

which should basically get you something working.

HOWEVER, it's also quite possible that "git-commit.sh" should just do this 
on its own. If the update-index fails, we really only care if we literally 
use the index later to *write* something, ie the commit case. For just 
"git status", maybe we should just silently ignore the error..

		Linus
-

From: Johannes Schindelin
Date: Saturday, February 10, 2007 - 7:19 am

Hi,


So, why don't you just do a

	git diff --name-only HEAD

and check for an empty output???

No need for a patch to Git (so late in the -rc phase), or backwards 
incompatibility...

Ciao,
Dscho

-

From: Marco Costalba
Date: Saturday, February 10, 2007 - 7:31 am

It seems to have the same issues of 'git runstatus' in case of ntfs
filesystems, so I would prefer, eventually, use 'git runstatus' that

Well, it's a _new_ option so I fail to see backwards incompatibility.
Perhaps you are referring to qgit backward incompatibility, but in any
case a new version of qgit is due to fix a parsing bug that shows
after a modification of 'git rev-list' output occurred in git 1.5
development.

  Marco
-

From: Johannes Schindelin
Date: Saturday, February 10, 2007 - 7:41 am

Hi,


Which issues? That the lstat data are not equal on Cygwin and Linux? The 
patch does not help here. Maybe a patch to Linux' ntfs driver 
would, but I fail to see how Git could possibly help here.

If git-diff is trying to write files, _that_ would be a bug.

As for your use of git-status: I think it is wrong. You said you want to 
check if the working directory is clean. Then just do that, and do not try 

You need a new version of _Git_ if you use that option.

And if you can do without depending on a newer Git, it is _bad_ to do it 
nevertheless.

Ciao,
Dscho

-

From: Marco Costalba
Date: Saturday, February 10, 2007 - 7:48 am

Well, I tested the patch and indeed it helps a lot ;-)

It's correct that checking Linux lstat against cygwin one (stored in
index) gives different results, but it's now where the patch makes the
difference rechecking all the files (in memory) to see if are really

That's a true point. Altough if git 1.5 ships _without_ '--refresh'
option in 'git runstatus' for a porcelain tool point of view it means
*do forget* that option until next major release. There's no point in
adding the feature one day after git 1.5 is out; qgit will not use
that feature anyway for next months.


   Marco
-

From: Marco Costalba
Date: Saturday, February 10, 2007 - 7:51 am

I could opt for shipping qgit 1.5.5 _without_ using '--refresh' and
then ship, as example in a month, qgit 1.5.6 that uses the feaure. But
I can do this _only_ if git 1.5 has it.
-

From: Johannes Schindelin
Date: Saturday, February 10, 2007 - 7:59 am

Hi,


Not really. The thing is, git-status does a lot more than what you need. 
And what you need is _only_ what "git diff --name-only HEAD" does already!

It _also_ checks the index, it _also_ only checks the files with different 
stat information, but it does _not_ try to update the index and prepare a 
message to be displayed when committing.

So, what is the big problem about accepting that patching git-status for 
one obscure use is wrong, wrong, wrong, when git-diff already does what is 
needed???

Ciao,
Dscho

-

From: Marco Costalba
Date: Saturday, February 10, 2007 - 8:45 am

Probably I'm doing something wrong, but that's how working dir
detection is currently implemented in qgit:

void Git::getDiffIndex() {

	QString status;
	if (!run("git status", &status)) // git status refreshes the index,
run as first
		return;

	if (!run("git diff-index HEAD", &_wd.diffIndex))
		return;

	// check for files already updated in cache, we will
	// save this information in status third field
	if (!run("git diff-index --cached HEAD", &_wd.diffIndexCached))
		return;

          ...... other stuff .....


The first call to git-status has been there for ages and with the only
goal to refesh the index so to avoid stale data in following 'git
diff-index' calls.

If I have understood correctly you suggest to remove that call because
useless? And rely  'git diff-index' info directly. Of course if there
are no side effects I'will be happy to drop the call, but I'm not sure
it's the safest way to go.

Another option would be to accept a broken working dir detection in
these corner cases. It's a realistic option and probably the best.
Indeed also subsitute 'git status' with 'git runstatus' as long as I
get back _all_ the repo files it seems to me a lesser option, IMHO
it's better failing with empty case than have a big flow of incorrect
data.

Marco
-

From: Nicolas Pitre
Date: Saturday, February 10, 2007 - 8:54 am

Because git-status itself is conceptually a read-only operation, and 
having it barf on a read-only file system is justifiably a bug.

But I agree that attempting to fix it now is probably just too risky not 
to compromize the 1.5.0 release.  Better leave it as a known issue for 
v1.5.0 and fix it later... especially if there is a possible workaround 
in the mean time.


Nicolas
-

From: Junio C Hamano
Date: Saturday, February 10, 2007 - 9:27 am

I do not 100% agree that it is conceptually a read-only operation.

-

From: Nicolas Pitre
Date: Saturday, February 10, 2007 - 9:40 am

It is.  It's the technical issue that makes it not so.  But when a user 
asks for a status, he's clearly not expecting to _write_ anything, even 
less for the command to fail if the file system is read-only.  This is 
like if a file system driver refused to open a file when the file system 
is mounted read-only just because it cannot update the file's atime on 
disk.

Just like the atime example, we may refresh the index while at it when 
preparing the status results.  This is a valid technical concern.  But 
it for sure should not be mandatory for the command to succeed if the 
file system is read-only.


Nicolas
-

From: Junio C Hamano
Date: Saturday, February 10, 2007 - 9:46 am

I do not think so.  It is a workflow issue that user indicates
the cache cleanliness information does not matter anymore.

Is it wrong for "git-status" to be losing the cache cleanliness
information?  The intended audience of that program is those who
are about to make a commit in the repository, as they are asking
"what would I be committing?"  Up to that point, they may have
cared about the reminder they get from "git diff" that they
edited a file and then ended up reverting the whole edit they
did to that file (I find that empty diff from "git diff" often
very useful, although I felt "Huh?"  when I was new to git).
But when they ask "git status", they care more about the real
change, and at that point (since they feel they may be ready to
make a commit -- and that is the whole point of running
"git-status") they do want to lose the cache cleanliness
information.  So "git-status" to be read-write application to
discard the cache-cleanliness information is probably a good
thing.

-

From: Nicolas Pitre
Date: Saturday, February 10, 2007 - 10:03 am

You're making assumption about work flows and using that to justify 

I don't dispute that.  But git-status should certainly not be restricted 

It is... when the file system lets you write.  Like I said this is a 
technically good thing to do.

But a command that is called "status" should provide a "status" even if 
the file system is read-only nevertheless.  The index updating business 
that is done behind the scene is and should be an opportunistic 
optimization, but it should not prevent status reporting.

It is pretty expected that a "commit" command would fail if the file 
system is ro, but not a "status" command.  And this is true 
irrespectively of whatever workflow you might be most likely to use the 
"status" command for.


Nicolas
-

From: Linus Torvalds
Date: Saturday, February 10, 2007 - 10:37 am

It really isn't. 

It's not even a "technical issue". It's a fundamental optimization. Sure, 
you can call optimizations just "technical issues", but the fact is, it's 
one of the things that makes git so _usable_ on large archives. At some 
point, an "optimization" is no longer just about making things slightly 
faster, it's about something much bigger, and has real semantic meaning.

So the fact is, "git status" _needs_ to refresh the index. Because if it 
doesn't, you'll see every file that doesn't match the index as "dirty", 
and that is not just a "technical issue".

And yes, doing an "internal" refresh, like Junio's patch does, hides the 
issue, but it hides it BY MAKING THE OPTIMIZATION POINTLESS!

I suspect Marco is testing some reasonably small git archive. With 
something like git itself, with less than a thousand files (and most of 
them fairly small, so rehashing them all is quick), the optimization may 
_feel_ like just a small technical detail.

Now, try the same thing on the Linux kernel or somethign similar, 
especially with cold caches or not a huge amount of memory.

That "technical issue" is what makes "git status" take less than a second 
for me, and only a bit longer if things aren't cached - because we don't 
actually have to read all the file data. 

Now, it so happens that _if_ things are cached, at least under Linux, 
cached IO is so _incredibly_ fast that you won't even realize how 
expensive an operation you missed. I can SHA1 every file in the kernel 
archive (21432 files right now - 8 million LOC, and 230MB of data) in less 
than a couple of seconds. But that's only because it's all cached for me 
anyway, because I tend to run with lots of RAM, and I do things like "git 
grep so-and-so" which brings it all into cache.

But try the same thing without caches.

Here's something you can do under linux:

	sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
	git read-tree HEAD
	time git update-index --refresh

and it takes me *40* seconds. ...
From: Nicolas Pitre
Date: Saturday, February 10, 2007 - 11:51 am

There is just a semantic issue that you seem to overlook completely.

According to the dictionarry, "status" is a synonym to a "state".  It is 
_not_ an action.

So, from a _user_ perspective, the git-status command should give back a 
"status".  Of _course_ the user will benefit from the index updating 
business, but as important as this update might be (and I do agree that 
it is fundamental for GIT's performance), this is still a by-product of 
the "status" command.

Therefore, the fact that the index isn't writable should not prevent 
git-status from providing the very result for which its name was chosen.  
The index might as well be brought up to date on disk the next time the 
file system is writable.


Nicolas
-

From: Junio C Hamano
Date: Saturday, February 10, 2007 - 11:33 pm

I think a one paragraph summary of your argument is:

 - index is a good thing -- it is what makes the difference
   between usable and unusable.

 - git-status needs to refresh the index in order to do its
   thing efficiently and usably _anyway_, so once it spends
   cycles to do so, it is senseless not to write the refreshed
   index out when it can.

I do not think anybody disputes that in a repository with 20k+
paths, it is sensible to leave the index stat-dirty for all
paths.  But I think your example

	read-tree HEAD

misses the point by stressing the importance of index too much.
Index is important for the usability and I do not think anybody
is disputing it.

The thing is, nobody switches the index that way without running
"update-index --refresh" afterwards.  Normal people would use
git-reset to switch to a different tree object, and the command
does that for you.  If you are a hardcore, you would know to use
"read-tree -m HEAD" at least to avoid making paths unnecessarily
stat-dirty.  Your example, while it is valid and demonstrates
why the index is a good thing very well, is simply not part of
a normal workflow and not very relevant when discussing the
performance ramifications of what state "git-status" should
leave the index in.

When I said "calling 'update-index --refresh' in git-status
loses stat-dirtiness information", I was certainly _NOT_ talking
about losing the information that 20k+ paths used to be
stat-dirty because the user did "read-tree HEAD" earlier.

At least for me, it is very normal to do something like this.

 * start from a clean index.

 * edit cache.h, diff.h, and diff-lib.c.

 * stop, think, and realize that my earlier edit to change one
   function prototype in diff.h was not needed, and revert the
   change to that line still in the editor.

 * fix things up further by editing other files.

And then, I would run "git diff" to see where I am.  I still
remember that I touched diff.h and I also remember that I once
changed ...
From: Shawn O. Pearce
Date: Sunday, February 11, 2007 - 12:23 am

Indeed.  Except that `git-update-index --refresh` is itself not
very fast on Cygwin+NTFS and large projects (about the size of
the kernel).  So git-status is a real slouch there.  Not running
`git update-index --refresh` saves at least a couple of seconds.

This is why git-gui lets you disable the refresh, and is part of
the reason why it computes the status on its own by diff-index,

Yes.  Which is why if git-gui finds a file that has an empty diff,
but that was reported as modified by diff-files, it tells the user
its about to go waste a few seconds running `update-index --refresh`,
then does so.

In practice I've found it rare that a file is dirty in the index,
but is not actually modified.  The typical culprit appears to
actually be the virus scanner on a Windows system.  For some reason
it feels a need to modify some random XML 'source' files that are
tracked by Git.  Out of 30,000 files it likes to modify about 100.

Not only that, but I think we can do much better with git-runstatus
than we do now.  If we scan the working directory (to search for
untracked files), and we walk the index in parallel, we can update
the index with new stat data if necessary.

Of course that doesn't matter much on Linux; its VFS operations
don't take hours.

-- 
Shawn.
-

From: Johannes Schindelin
Date: Saturday, February 10, 2007 - 1:40 pm

Hi,


Just to fuel the fire even more: Does it make _sense_ running git-status 
when you cannot write? I mean, the only reasonable use cases to ask 
git-status (even interpreting it in the "state" sense you are proposing), 
is when you are _working_ on the files. Which you cannot do without write 
access.

BTW I was not aware that "git diff --name-only HEAD" would not check if 
the file is differing or not, but even then, it is arguably the right 
thing for qgit to show what the index' idea of the status is.

Ciao,
Dscho

-

Previous thread: Re: [sugar] etoys - binary blob in GIT by Joshua N Pritikin on Friday, February 9, 2007 - 11:20 am. (3 messages)

Next thread: Re: git-pull and tag objects by Junio C Hamano on Friday, February 9, 2007 - 4:19 pm. (13 messages)