Re: What's in a name? Let's use a (uuid,name,email) triplet

From: Michael Witten
Date: Thursday, March 18, 2010 - 6:23 am

[Empty message]
From: Jon Smirl
Date: Thursday, March 18, 2010 - 6:48 am

You can't go back and edit the history in git so a map of the aliases
is needed.  The easy fix is a .mailmap file. However, the .mailmap
entries need a mechanism to track which entries are correct and which
have been fixed. Read this long and painful thread...
http://lkml.org/lkml/2008/7/28/134

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Michael Witten
Date: Thursday, March 18, 2010 - 7:26 am

The addition of a uuid would not only likely decrease future trouble
tremendously, but also allow for a much more efficient remapping of
old (name,email) pairs.
--

From: Linus Torvalds
Date: Thursday, March 18, 2010 - 10:27 am

UUID's are some total crazy shit. It's like XML. If you think you need 
them, you're almost certainly wrong. If it's about identifying a unique 
piece of hardware, ok. If it's about identifying people, no.

How about you walk around with a bar-code tattooed to your forehead? Don't 
like the idea? Then think about having to care about a uuid in your 
projects. Same deal.

Nobody is going to associate themselves with a uuid. It's not how humans 
work. It's degrading, and it's work-for-no-gain to anybody who doesn't 
have OCD.

So in practice, the only thing that would happen is that people make up 
random uuid's and they'd be different for every single machine they have, 
because absolutely NOBODY would ever bother to try to save and move their 
uuids around.

So when you point out that emails aren't unique, or that people change 
their emails over time, please realize that the emails are _more_ stable 
than a uuid would ever be. Because an email actually has some emotional 
attachment to the person in question. Yes, they change. So do real names 
too (which change more seldom, exactly because people are way _more_ 
emotionally attached to their real names).

uuid's? I can pretty much guarantee that for me, it would be different for 
every single machine I have. Because I could just not be bothered to care.

			Linus
--

From: Jon Smirl
Date: Thursday, March 18, 2010 - 12:02 pm

On Thu, Mar 18, 2010 at 1:27 PM, Linus Torvalds

We could hash people emails and then build a .mailmap equivalent thus
hiding their identity.

Several things needed to be combined to build that mailmap.
1) a lot of hand work to identify aliases and misspellings
2) work with google to translate email addresses into human names when
names were missing
3) a list of all of the email addresses that had been checked, to make
it easy to identify new ones.

The trouble with hashing it is that all of the tools that use it will
need to be rewritten.

I'd really like to see a more global database constructed that links
commits, lkml discussions and the various distribution bug databases
but apparently it is too much of a threat to developer privacy. You
can achieve the same effect with a few hours in google throwing out
bunches of false positives.  It would be cool to be looking at a
subroutine, poke a button and then see all of the human oriented



-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Linus Torvalds
Date: Thursday, March 18, 2010 - 12:07 pm

So? Why? What's the advantage?

I literally _only_ see disadvantages to the whole thing. If the uuid has 
some meaning (ie it's related to actual _real_ information), then it is 
nothing but a really inconvenient placeholder for the real information, 
and another source of new problems (like "how do we know they are in 
sync? I edit the .gitconfig file by hand all the time").

And if it doesn't have meaning, then it's just annoying and will never 
ever be attached to anything relevant long-term.

Either way, there are only downsides, no upsides. There is absolutely _no_ 
way that the uuid would ever actually encode any real meaningful 
information that isn't better represented by the name/email.

			Linus
--

From: Linus Torvalds
Date: Thursday, March 18, 2010 - 12:20 pm

I really see that as a bad thing, not a good thing. It's like enabling 
some crazy shit and making it official.

If you don't want to reveal your real name, use a fake address. Just don't 
expect anybody to want to work with you. 

The LAST thing we want is built-in git support for doing f*cking stupid 
things.  You can do stupid things with it on your own without us helping 
and encouraging you.

		Linus
--

From: Jon Smirl
Date: Thursday, March 18, 2010 - 12:37 pm

On Thu, Mar 18, 2010 at 3:20 PM, Linus Torvalds

Go ahead and commit that .mailmap I made. It really cleans up the
statistics by fixing 500 errors in people's names. Just don't point



-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Linus Torvalds
Date: Thursday, March 18, 2010 - 12:47 pm

How hard is it to understand the notion of "people just don't _care_ 
enough"?

Look at CVS. Look at three _decades_ of CVS. Then look at the 
"identifiers" that thing used. 

Git is much better. Git is better for two reasons:

 - We allow/encourage people to use way more meaningful identifiers

 - Exactly _because_ what we use is meaningful to people, most people 
   bother to try.

And you don't seem to understand that whole "meaningful" part. If you 
don't have the social understanding of how people actually _work_, then 
nothing I say can explain it.

Let me try one more time: do the statistics on "committer information" vs 
"author information" on the Linux kernel repository, and count the types 
of errors that happen. I can explain the errors and why they happen, and 
it has everything to do with how _humans_work_ (*).

If you don't understand that, then there's no point in arguing.

			Linus

(*) I'll give you one answer in the next email. But before you read that 
email, try to think about it, and see if you can guess at patterns.
--

From: Linus Torvalds
Date: Thursday, March 18, 2010 - 12:50 pm

Lookie here:

  [torvalds@i5 linux]$ git log --pretty=full | grep '^Commit: ' | sort | uniq -c | sort -n | grep localdomain
      1 Commit: Jeff Garzik <jgarzik@localhost.localdomain>
      2 Commit: Dave Airlie <airlied@ppcg5.localdomain>
      3 Commit: James Bottomley <jejb@sparkweed.localdomain>
      3 Commit: James Morris <jmorris@localhost.localdomain>
      3 Commit: James Morris <jmorris@macbook.localdomain>
      4 Commit: James Bottomley <jejb@hobholes.localdomain>
     32 Commit: Thomas Graf <tgr@axs.localdomain>
    410 Commit: James Bottomley <jejb@mulgrave.localdomain>
  [torvalds@i5 linux]$ git log --pretty=full | grep '^Author: ' | sort | uniq -c | sort -n | grep localdomain
      1 Author: Alex Deucher <alex@hp.localdomain>
      1 Author: Dave Airlie <airlied@ppcg5.localdomain>
      1 Author: Eduardo Habkost <ehabkost@Rawhide-64.localdomain>
      1 Author: Grzegorz Nosek <root@localdomain.pl>
      1 Author: Izik Eidus <izike@localhost.localdomain>
      1 Author: Jeff Garzik <jgarzik@localhost.localdomain>
      2 Author: Esti Kummer <stkumer@localhost.localdomain>
      2 Author: James Bottomley <jejb@mulgrave.localdomain>
      3 Author: Dave Airlie <airlied@optimus.localdomain>
      3 Author: James Bottomley <jejb@hobholes.localdomain>
      3 Author: James Bottomley <jejb@sparkweed.localdomain>
      4 Author: Cindy H Kao <evans@localhost.localdomain>
      4 Author: Kristian Høgsberg <krh@localhost.localdomain>

See? Mistakes happen. But look at what happens to the committer 
information? Think about it. Really _think_ about it. There is absolutely 
zero _technical_ difference between the two fields. The only difference is 
that "git log" by default shows one, and not the other.

So as a human, which one do you think people care about and fix more 
quickly?

And look at the numbers once more.

			Linus
--
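
The two pipelines in the message above differ only in the header they grep for. A minimal sketch that wraps them in one function, assuming `git log --pretty=full` output arrives on stdin; the name `count_idents` and its arguments are illustrative and not part of git:

```shell
#!/bin/sh
# count_idents FIELD PATTERN
# Reads `git log --pretty=full` output on stdin and prints how often each
# identity in the given header (Author or Commit) matches PATTERN.
# Hypothetical helper: it only repackages the pipeline quoted above.
count_idents() {
    field=$1
    pattern=$2
    grep "^${field}: " | grep -- "$pattern" | sort | uniq -c | sort -n
}

# Usage inside a repository (not run here):
#   git log --pretty=full | count_idents Commit localdomain
#   git log --pretty=full | count_idents Author localdomain
```

Running both calls side by side makes it easy to compare how many broken identities survive in the field that "git log" shows by default against the one it hides.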

From: Linus Torvalds
Date: Thursday, March 18, 2010 - 1:01 pm

Btw, one other thing you can take away from it is that even when they 
_are_ shown, and even when they _are_ meaningful, people still don't care. 

There's absolutely tons of "(none)" emails even in the _visible_ parts, 
which is really really sad. But it does tell a lot about humans - they 
won't be noticing even _obvious_ mistakes like that.

(And yes, it does say that git should probably have errored out way more 
aggressively about badly set up host/domain names in the "guess at email 
address" code. My bad. Maybe it's still worth fixing for the future)

			Linus
--

From: Junio C Hamano
Date: Friday, March 19, 2010 - 12:39 pm

We made a small step in that direction in 49ff9a7 (commit: show
interesting ident information in summary, 2010-01-13).  I think what it
does is sufficiently loud (but not annoying).


--

From: Reece Dunn
Date: Thursday, March 18, 2010 - 1:31 pm

So... going back to the original problem, we have:

  1/  people making mistakes in the commit logs for whatever reason
(e.g. re-installation or a new computer);
  2/  people changing name (e.g. getting married) or changing email
(e.g. gmail.com to googlemail.com).

The problem is that it may be beneficial to see all the changes Cindy
H Kao made for example, including the ones made
@localhost.localdomain.

Having (user, email, uuid) will not solve the problem (even if the
uuid is from a memorable string) -- consider case 1. If you forget to
setup git, uuid will be blank or some random data, so this will be
worse than the (user, email) identity. As noted, there is also the
issue that git is used in a lot of places and not all git clone
instances are running the same version (e.g. pushing to an older git
client that does not support this new data).

What would be better is having a concept of identity aliases. That is,
a user can say that (for this git project), (user1,email1) is the same
person as (user2,email2). This would allow someone who has
mis-configured their git instance to say what the (user,email) pair
should have been. It also allows people to say that they used to be
called someone and they are now called somebody.

This information should ideally be in some form of (user,email) ->
(user,email) map that is versioned and tracked by git (in a way that
is also backward compatible, which could be tricky).

It also needs to be changeable and version tracked (i.e. with history)
to allow people to undo this; for example, this system would allow me
to say that Linus' (user,email) id is actually an alias for my
(user,email) which is bad. I don't know of a decent way to prevent
this (or someone using the uuid of someone else in the original
proposal), but this approach would at least allow it to be corrected.

There will need to be the related plumbing and porcelain to access and
manipulate this data/meta-data.

Would this be a better approach? Or is there a fatal ...
From: Linus Torvalds
Date: Thursday, March 18, 2010 - 1:59 pm

Yeah. And that's what '.mailmap' is, really.

Does mailmap get annoying? Yes. Is it going to be incomplete? Yes. Do we 
ever even _bother_ to try to make it perfect? No.

In the kernel, for example, we tend to use it _only_ to fix up the real 
name. It's much more capable than that (ie you can use it to fix up email 
addresses too), but we literally haven't cared enough to bother. So you 
still see the "localhost" emails or the "(none)" domains - even if you use 
one of the formats that ask for a "fixed" name and email.

And git itself only fixes up names for certain commands (git blame, git 
shortlog) and with specific format specifiers (%aN and %aE).

The _default_ pretty log format printouts don't do it, for example. Should 
they? Maybe. Or maybe we should have a flag and/or config option to do so 
by default.

				Linus
--
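
For reference, a sketch of the entry forms a `.mailmap` file accepts, written into a scratch directory. Every name and address below is invented for illustration; none of them comes from the kernel tree:

```shell
#!/bin/sh
# Write an illustrative .mailmap into a temporary directory.
# All identities here are made up for the example.
mailmap_dir=$(mktemp -d)
cat > "$mailmap_dir/.mailmap" <<'EOF'
# Replace only the name recorded with this commit email
Jane Doe <jane@example.org>
# Replace only the email
<jane@example.org> <jane@localhost.localdomain>
# Replace both name and email, matching on the recorded email
Jane Doe <jane@example.org> <jane@laptop.localdomain>
# Replace both, matching on the recorded name and email together
Jane Doe <jane@example.org> Jnae Doe <jdoe@example.net>
EOF
```

With such a file at the top of a working tree, `git shortlog` and the `%aN`/`%aE` format specifiers report the mapped identity, while `%an`/`%ae` keep showing what was actually recorded in the commit.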

From: Jon Smirl
Date: Thursday, March 18, 2010 - 12:16 pm

On Thu, Mar 18, 2010 at 3:07 PM, Linus Torvalds

I happen to think that the concept of privacy and working on an open
source project are fairly incompatible. But apparently there are
people who think otherwise.  The use would be to reconstruct that
mailmap I made, but with the email addresses replaced with SHA1 hashes
of the email. No human would use the SHA1s, they're just there to



-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Michael Witten
Date: Thursday, March 18, 2010 - 12:32 pm

On Thu, Mar 18, 2010 at 14:07, Linus Torvalds

You've actually just described the current name/email system.

What a uuid provides is that very property of long-term attachment; a
git user can change the name/email pair but keep the same uuid.

You see, the problem is that the name/email pair isn't really an
identifier; it's actually just info about the user's current email
account, which is very useful for everyday workflow, but pretty naive
for historical identification over long periods of time.

As previously discussed in my original email, the 'email' portion of
the name/email pair is the most volatile portion, and that's because
it's only tangentially related to identity (and it certainly has

It IS a name/email pair (if you want or if that is enforced); it's
just one that isn't as volatile.

This notion of a uuid is an attempt to adopt a BETTER MODEL for
identity: The user gets to choose a piece of information that he
himself deems a longterm identifier; it's not about what address you
currently use for email, it's solely about who you are over a long
period of time.
--

From: Martin Langhoff
Date: Thursday, March 18, 2010 - 12:42 pm

WTH are you drinking? I have been using my current name and email
accounts for many years.

They are useful for git and for some things that are even more useful
-- like addressing emails! My best CV is googling for my name / email
addresses -- it will show you my professional career. Including the
time that Linus called my patch "idiotic" :-)

So, these things are attached to something meaningful: my long term
personal identity. A git-only "uuid"? Screw that, I hack on too many
physically different machines, I am not going to be carrying around a
magic string.

cheers,


m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
--

From: Linus Torvalds
Date: Thursday, March 18, 2010 - 12:40 pm

I don't think you understand what "attachment" means.

Think about your wife, your kids, or your pet. THAT is attachment.

Random 16-letter letter-jumble? No. People will _never_ care. They'll 
simply not care. 

It's true that people _already_ don't care too much about their emails, 
and that typos and simple job changes (or annoying ISP's) will change 
them. But that would be orders of magnitude _worse_ with something like a 

Don't be an idiot.

Try to think like a HUMAN. Not a computer scientist. And ponder.

It's a _social_ issue, not a "let's tattoo this uuid on everybody".

		Linus
--

From: Michael Witten
Date: Thursday, March 18, 2010 - 12:47 pm

On Thu, Mar 18, 2010 at 14:40, Linus Torvalds

I don't think you've read one word that I've written.
--

From: Linus Torvalds
Date: Thursday, March 18, 2010 - 12:52 pm

Oh, I read them. They make no sense.

If the uuid isn't random, but tied to the email address, then it's 
worthless. 

If you like the random 16-letter jumbles, then for christ sake JUST CHANGE 
"git log" to hash the author name for you. You'll get the uuid's. What I'm 
telling you is that NOBODY SANE WANTS TO EVER SEE THEM.

And if nobody wants them, then nobody will maintain them, and they'll be 
much _less_ useful than the emails we already have.

		Linus
--

From: Michael Witten
Date: Thursday, March 18, 2010 - 1:00 pm

On Thu, Mar 18, 2010 at 14:52, Linus Torvalds

No, I'm reasonably certain you didn't.
--

From: Wincent Colaiuta
Date: Thursday, March 18, 2010 - 12:52 pm

On the contrary I get the impression he has waded through everything you've written, and has even been patient enough to put together (now several) replies explaining exactly why it is a misguided idea.

Now it's time for you to read and actually reflect on what's been said. If you're sane and have a modicum of intelligence you'll come to the conclusion that your idea doesn't solve any problem, and in fact only adds a bunch of new ones.

W



--

From: Wincent Colaiuta
Date: Thursday, March 18, 2010 - 12:40 pm

This whole thing is a stupid idea.

If users can't even be bothered keeping a stable email address, what makes you think that they can be assed "doing the right thing" with respect to a meaningless UUID string?

The idea is complicated, over-engineered, brings no benefit and adds only cruft.

W

--

From: Martin Langhoff
Date: Thursday, March 18, 2010 - 3:36 pm

On Thu, Mar 18, 2010 at 1:27 PM, Linus Torvalds

One thing we all forgot to mention here is that even if it was a good
idea (which it is not), implementing it means a flag day: changing in
the pack format, wire protocol and APIs, messing up with compatibility
with users of pre-flag-day git, and rippling out to all the GUIs,
frontends and integration scripts out there.

A veritable mess that would reverberate for years.

Any proposal that touches the core git datamodel... better implement
something that is outrageously wondrously good and impossible to do
any other way.

My guess is that people that parachute into this list and propose
datamodel changes haven't thought this aspect through.

cheers,


m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
--

From: Nicolas Pitre
Date: Thursday, March 18, 2010 - 4:17 pm

And nobody yet mentioned what should happen when someone sends a patch 
by email.  Most commits in git.git originated from a patch sent via 
email.  Should we start pasting UUIDs in the email body?  What if the 
cut & paste was quickly done and the UUID is missing a character or two?  
Because this does happen.  And because this UUID thing is supposed to be 
a perfect identity representation then we'll need a .uuidmap to correct 
such mistakes of course.

Better improve on the existing .mailmap instead.


Nicolas
--

From: Jon Smirl
Date: Thursday, March 18, 2010 - 4:26 pm

If anyone is interested I can send them a .mailmap that fixes a lot of
the problems in the kernel tree. It's two years old so it will need



-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Nicolas Pitre
Date: Thursday, March 18, 2010 - 4:34 pm

Please just make a patch with it, and post it to lkml and CC Linus and 
Andrew Morton.  Repost a month later if no one picked it up.

I think that 'git log' should really consider the .mailmap by default.  
Otherwise what's the point?   The only time when .mailmap should not be 
considered is when using --pretty=raw or when explicitly told not to.


Nicolas
--

From: Jon Smirl
Date: Thursday, March 18, 2010 - 4:41 pm

Been there, done that. 1000 message flame war ensued about privacy



-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Nicolas Pitre
Date: Thursday, March 18, 2010 - 4:58 pm

Well, you used git itself as the data source to fix up those email 
addresses, right?  If so there is simply no privacy concerns as the data 
is already there and public.  Just don't venture adding emails that are 
not already present in the whole Git history/content at all without 
consent.


Nicolas
From: Jon Smirl
Date: Thursday, March 18, 2010 - 5:16 pm

I'll send you the file and you can commit it. Please take full credit for it.
http://lkml.org/lkml/2008/7/28/134

All of the data came out of git tree.

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Linus Torvalds
Date: Thursday, March 18, 2010 - 5:17 pm

Umm. You do realize that what people complained about was mostly that they 
felt a lot of the entries were totally pointless.

For example, you included names whether they were mistyped or not, and 
claimed that everybody needed to always be in the mailmap if they ever 
made any commit.

So I think 99% of the flames were due to just the patch being stupid.

		Linus
--

From: Linus Torvalds
Date: Thursday, March 18, 2010 - 5:50 pm

The part you keep missing is that NOBODY CARES!

For example, I exist in the current git kernel tree with 11 different 
names for just the authorship information:

     32 Linus Torvalds torvalds@evo.osdl.org
   1522 Linus Torvalds torvalds@g5.osdl.org
   4194 Linus Torvalds torvalds@linux-foundation.org
      7 Linus Torvalds torvalds@macmini.osdl.org
      2 Linus Torvalds torvalds@merom.osdl.org
      8 Linus Torvalds torvalds@osdl.org
    166 Linus Torvalds torvalds@ppc970.osdl.org
      4 Linus Torvalds torvalds@ppc970.osdl.org.(none)
      1 Linus Torvalds torvalds@quad.osdl.org
   1606 Linus Torvalds torvalds@woody.linux-foundation.org
    174 Linus Torvalds torvalds@woody.osdl.org

(that's counts, in case you care). And then if you check signed-off lines, 
you'll find some _additional_ oddities where things just got misspelled, 
like

	Linus Torvalds <tovalds@linux-foundation.org>
	Linus Torvalds <torvalds@akpm@linux-foundation.org>

where in one case there's a missing 'r', and in the other it's some odd 
perverse incestuous relationship between me and Andrew (in reality, it's 
me doing a stupid "search-and-replace" on the emails, adding my own 
sign-off to Andrew's and that got a bit too much copy-paste issues)

There's a few other mistakes like that in the sign-offs.

Does anybody care? Certainly not I. There is absolutely zero reason to 
worry about it. I used to find it convenient to see what machines I had 
worked on, so I actually included that. And one of them was clearly 
mis-configured, or git did something wrong when the hostname was already 
in FQDN format. Whatever.

There is no real _value_ in making a .mailmap for each such buggy entry is 
what I'm trying to tell you. Those things are maybe used for statistics. 
On the whole, they are correct. 

			Linus
--

From: Jon Smirl
Date: Thursday, March 18, 2010 - 6:12 pm

On Thu, Mar 18, 2010 at 8:50 PM, Linus Torvalds

I was trying to track how many real people were working on the kernel.
 If we don't collapse the 13 different versions of you down to one
person the numbers are way off.

-- 
Jon Smirl
jonsmirl@gmail.com
--
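
To see how much duplicate identities inflate a head count, one rough sketch compares the number of distinct `Name <email>` strings with the number of distinct names alone. It assumes one identity per line on stdin, for example from `git log --format='%an <%ae>'`. The function names are illustrative, and collapsing by name is only an approximation: it merges different people who share a name and misses misspelled names.

```shell
#!/bin/sh
# Count distinct "Name <email>" lines read from stdin.
count_pairs() {
    sort -u | wc -l | tr -d ' '
}

# Count distinct names only, dropping the trailing " <email>" part first.
count_names() {
    sed 's/ *<[^<]*>$//' | sort -u | wc -l | tr -d ' '
}

# Usage inside a repository (not run here):
#   git log --format='%an <%ae>' | count_pairs
#   git log --format='%an <%ae>' | count_names
```

The gap between the two numbers gives a rough upper bound on how many entries a cleanup map would have to collapse.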

From: Nicolas Pitre
Date: Thursday, March 18, 2010 - 6:45 pm

If you have a cleaned up .mailmap file which doesn't include unneeded 
entries then just submit it for inclusion.  If someone else eventually 
cares to check and update it then another patch should come forth at 
that point.  That doesn't have to be any more complicated than that.


Nicolas
--

From: Jon Smirl
Date: Thursday, March 18, 2010 - 7:05 pm

I sent you a copy, feel free to do whatever you want with it.  The
academics doing statistics on Linux will love you for submitting it.

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Jon Smirl
Date: Thursday, March 18, 2010 - 5:39 pm

On Thu, Mar 18, 2010 at 8:17 PM, Linus Torvalds

I had all of the names in the list so that I could regenerate the list
and diff it against the old version to know which new names needed to
be checked. Looking back I could have eliminated the names without
errors and then added a comment to the file as to the last date all of
the names were checked.  But that is less reliable than recording
which were checked. The problem is that if you lose track of what has
been checked, you are forced to recheck everything and it takes a long
time to recheck everything.




-- 
Jon Smirl
jonsmirl@gmail.com
--
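
The regenerate-and-diff step described in the message above can be sketched with `comm`. The `new_idents` helper and its file arguments are hypothetical; the sketch assumes each file holds one `Name <email>` per line:

```shell
#!/bin/sh
# new_idents CHECKED CURRENT
# Print identities present in CURRENT but absent from CHECKED, i.e. the
# ones nobody has reviewed yet. Both files are sorted into temporary
# copies first because comm requires sorted input.
new_idents() {
    tmp=$(mktemp -d) || return 1
    LC_ALL=C sort -u "$1" > "$tmp/checked"
    LC_ALL=C sort -u "$2" > "$tmp/current"
    # -1 drops lines unique to CHECKED, -3 drops lines common to both.
    LC_ALL=C comm -13 "$tmp/checked" "$tmp/current"
    rm -rf "$tmp"
}

# Usage inside a repository (not run here):
#   git log --format='%an <%ae>' > current.txt
#   new_idents checked.txt current.txt
```

Keeping the full checked list in its own file, separate from the map of corrections, would let the map itself carry only the entries that actually change something.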

From: Michael Witten
Date: Thursday, March 18, 2010 - 4:34 pm

Actually, those points were touched upon earlier (including my rebuttals).
--

From: Michael Witten
Date: Thursday, March 18, 2010 - 11:42 am

Linus: Don't skim; read.




My anticipation of your response was uncanny:

    >> For instance, the uuid could be... the SHA-1
    >> of some easily remembered, already reasonably
    >> unique information.
    >>
    >> ...
    >>
    >> ...he doesn't want to bother remembering some
    >> long human-hostile string, so he adopts as
    >> his uuid the SHA-1 of some easily remembered
    >> piece of information like the very first
    >> (name,email) pair that he used for git
    >> (Junio C Hamano <junkio@cox.net>)

So, forget the original generality and let's
define the uuid as a SHA-1 of some EASILY
REMEMBERED, already reasonably unique piece of
information such as an old (name,email) pair.

To make life easier on people, git tools could automate
that process; to Junio, his uuid is just an old,
unchanging (name,email) pair:

    $ git config --global user.name  "Junio C Hamano"
    $ git config --global user.email "gitster@pobox.com"
    $ git config --global --uuid "Junio C Hamano <junkio@cox.net>"

which produces something like:

    [user]
        name  = Junio C Hamano
        email = gitster@pobox.com
        uuid  = 6e99d26860f0b87ef4843fa838df2a918b85d1f7

In fact those three steps should probably be
further automated anyway:

    $ git config --global --init
    Full Name? Junio C Hamano
    Email? gitster@pobox.com
    UUID [Junio C Hamano <gitster@pobox.com>]? Junio C Hamano <junkio@cox.net>

Set it and forget it in a completely human way.

Could people still bungle the uuid or enter trash?
Sure, but that's essentially no different than the
current situation. This would be an improvement,
because at least some people would take advantage
of it; in fact, I bet most people would use it
properly because:

    * The information required is easily remembered
      and reproduced; it has that emotional aspect.

    * People have an emotional attachment to getting
      proper attribution for their work, and this
      ...
From: Matthieu Moy
Date: Thursday, March 18, 2010 - 11:47 am

What's the added value of the "SHA-1" thing, here? A hash of a pair
(a, b) is exactly as unique as the pair itself (well, actually even a
bit less if you consider collisions).

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/
--

From: Michael Witten
Date: Thursday, March 18, 2010 - 11:57 am

On Thu, Mar 18, 2010 at 13:47, Matthieu Moy

Your observation is correct, but I'm pushing for the SHA-1 string
because it could be efficiently parsed, stored, and used; it's
essentially an optimization (or a preparation for an optimization).

If that's not a good way to approach it, then I'd be satisfied with
just a straight (name,email) pair or any other reasonably unique
string.

On a more general note, the idea of a uuid is to distribute the
process of canonicalizing identities. Does that not make perfect
sense?
--

From: Nicolas Pitre
Date: Thursday, March 18, 2010 - 12:12 pm

Even with _that_, I bet many people will simply not bother.  You may as 
well just use your current name and email address.  Oh wait, Git is 

Even then, some people _will_ manage to screw up some of their UUID 
configs.  And you'll inevitably end up in the same situation that we 
have today i.e. different identification credentials that have to be 

[...]

Most people _already_ use their name/email configuration properly.  And 
those who really care are managing a stable email address already.  so 

I don't buy that either.  If anything, it is way better to fix the 
current .mailmap mechanism to cater for changing email addresses.  
That's what people use to contact people anyway as I doubt you could 
send any congratulations or job offers solely by using the Git's UUID.  
So you must link back to some form of email address in the end, and 
preferably the current one, otherwise the UUID is useless.  In that case 
then why not simply using that email address in the first place?

The real solution is actually to improve the .mailmap so that any 
individual could decide that for this or that name/email pair to be 
found in the repository then here's the current email that should be 
displayed instead.  Currently this applies partially and only to 
git-shortlog.


Nicolas
--

From: tytso
Date: Thursday, March 18, 2010 - 1:44 pm

The problem is that people don't get emotionally attached to a UUID.
And even if the UUID is generated algorithmically, they need to
remember, gee, was my UUID generated using:

	Theodore Y. Ts'o <tytso@mit.edu>
	Theodore Tso <tytso@mit.edu>
	Theodore T'so <tytso@valinux.com>  (*) 
	Theodore Y Tso <theotso@us.ibm.com>
	Ted Tso <tytso@google.com>
	Theodore Tso <tytso@google.com>
	<etc.>

(*) The VA Linux folks screwed up where the apostrophe goes in some
press release, and the misspelling of my last name has followed me for
the last ten years since then.

More importantly, there's a lot more to someone's reputation than just
Git.  What about reviews of other people's patches on LKML?  Can you
**honestly** expect people to say,

   Hi, I'm <dd1b51a1-ce2a-41fd-ae89-f68b7f0ace85> and here are the things
   that you need to fix with your patch....

People who give thoughtful reviews of other people's code count for a
lot, and people are not going to track that sort of thing by UUID.
They are going to track it by name and e-mail address.

Or what about papers?  Can you honestly expect that it would matter
even one iota if someone put in a bibliography of a paper

R. Card (14a8da4b-0231-497b-aa66-1809cc9727f9), T. Y. Ts'o
(dd1b51a1-ce2a-41fd-ae89-f68b7f0ace85), and S. Tweedie
(9052e458-32cc-11df-93b8-0016eb0fac40), "Design and implementation of
the second extended filesystem," in Proceedings of the 1994 Amsterdam
Linux Conference, 1994.

Is that going to contribute to my identity any?   I don't think so.


Finally, if someone misses one of my commits in a git changelog, so
what?  People don't gauge impact by the number of commits.  There are
some people who have huge numbers of commits, but they are all spelling
corrections.  A developer's reputation is developed over many months
or years of contributions; of interactions over e-mail; interactions
in hallway discussions at conferences; papers which they author; etc.
It's not just about git commits.

   	      	 	       	  ...
From: Michael Witten
Date: Thursday, March 18, 2010 - 2:12 pm

Look, there is a huge misunderstanding.

This is all that I'm saying: Keep git exactly the way it is, but add
one extra piece of identifying information for each person.

That's it.

Nothing is being taken away.

You can still see/grep/access the full names and email addresses just
as before, only now there will be another piece of information on
which to filter (or ignore it if you want).

In the most general form of my proposal, the idea is to let the user
choose some piece of information that he himself deems to be uniquely
identifying over a long period of time. However, I think it would be
smart to reduce that information to a SHA-1 (at least when it's
recorded in, say, a commit).

Essentially, the goal is to distribute the task of maintaining aliases.
--

From: Martin Langhoff
Date: Thursday, March 18, 2010 - 2:19 pm

But something is added.

Good design is not when there's nothing more to add, it's when there's
nothing left _to remove_.

Git is what it is thanks to removing superfluous crud from its core
datamodel. Don't be surprised that there is a very strong resistance

Already achieved with mailmap. No need to mess with the secret of
git's success (the tight datamodel).

cheers,



m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
--

From: Michael Witten
Date: Thursday, March 18, 2010 - 2:29 pm

On Thu, Mar 18, 2010 at 16:19, Martin Langhoff

Well, that's rather egotistical considering you're probably not the
only Martin Langhoff in this world. I'd advocate something like
"Martin Langhoff <martin.langhoff@gmail.com>".

At worst, things will be just like they have always been.

Most likely, all that will happen is identification entropy won't
increase nearly so rapidly and there might be other benefits such as
shortlog speed improvements.
--

From: Martin Langhoff
Date: Thursday, March 18, 2010 - 2:39 pm

So you are saying we should change the core datamodel of git to say...

No, we'll have another way to have data mismatches. There are _more_
moving parts in your model. That's what Linus is pointing out.

This is a case where an ancillary "fixup table", in the form of
mailmap, works best. Don't move the fixup table to the core of the
datamodel, it just doesn't belong there.

Here's a hint: using your "uuid" model, I'll get some commits into a
project with the wrong uuid. Because I made a typo, or changed
machines (and a random uuid got created), whatever reason. So now in
my project I appear under 2 uuids.

What should we do in that case? Use mailmap to map the stray uuid to
the "real" one?... Have we done a lot of work to get back to square 0?

cheers,


m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
--

From: Michael Witten
Date: Thursday, March 18, 2010 - 2:46 pm

On Thu, Mar 18, 2010 at 16:39, Martin Langhoff

--

From: Michael Witten
Date: Thursday, March 18, 2010 - 3:02 pm

On Thu, Mar 18, 2010 at 16:55, Martin Langhoff

You missed the other line (probably gmail's fault):

Most likely, all that will happen is identification entropy won't
increase nearly so rapidly and there might be other benefits
such as shortlog speed improvements.
--

From: Nicolas Pitre
Date: Thursday, March 18, 2010 - 4:37 pm

The shortlog speed improvement is certainly not going to compensate for 
all the added human time needed to process the extra piece of 
information.


Nicolas
--

From: Michael Witten
Date: Thursday, March 18, 2010 - 4:44 pm

What added human time?
--

From: Nicolas Pitre
Date: Thursday, March 18, 2010 - 5:03 pm

The time that humans will have to spend on this UUID 
setup/fixing/whatnot.


Nicolas
--

From: Michael Witten
Date: Thursday, March 18, 2010 - 5:27 pm

Compatibility concerns aside, there is virtually no overhead. Indeed,
there would be less overhead than there is now in terms of fixing.
--

From: Nicolas Pitre
Date: Thursday, March 18, 2010 - 5:32 pm

In a perfect world maybe.  Let's talk about it again when we get there.


Nicolas
--

From: Reece Dunn
Date: Thursday, March 18, 2010 - 3:06 pm

You have 3 pieces of information that can change by adding uuid instead of 2.

Are people going to remember that they need to set a uuid when
checking things into git? Different uuids? Forgetting the key string
to generate the hash for the uuid?

The uuid is another source of permutations that will see an increase
in identity triples. It is also another thing that needs to be stored
in a commit on disk and in memory, printed out in the shortlog and
checked by people.

Even if you generate a SHA-1 hash from a memorable bit of data, the
resulting hash is not readable. It is something that could cause
collisions with partial hashes in treeish queries (does 12ab34 refer
to a commit, or to a person's uuid?). It is also meaningless to the
user: I want to find Ted Ts'o's (I hope I've got the apostrophe in the
correct place) commits - how do I know what uuid refers to his
commits? How can I find it out?

It is just adding more resistance, whereas with a well-configured
.mailmap I could use one of his known email addresses, something that
is easy to find and remember.

From what Linus and others have said, .mailmap is the way to fix name
and/or email changes. It may need more work to expose it to more
commands, but that is the simplest, cleanest and most elegant approach
to fixing the problem you specified.

What about .mailmap does not solve your problem? Is it that it does
not work for `git log`? If so, then write a patch to allow `git log`
to use that information when you specify a certain flag (or pretty
format string).

NOTE: It is not just the author/committer that needs to remember/use
the uuid - it is people doing analysis on commits, curious people,
automated scripts and many others.

- Reece
--

From: Martin Langhoff
Date: Thursday, March 18, 2010 - 2:55 pm

Of course we all read that line. You are proposing a change that will
mean a flag day -- that is, old versions of git won't be able to read
"new" repositories (and "new" git will have to be backwards compat for
X releases...). This is major breakage.

Inflict a painful change on our userbase for... what exactly? Ah, "At
worst, things will be just like they have always been."

I don't think you understand what you've been proposing.

Is it clearer now why you get a clear "no" from all quarters? Huge
cost, no upside?



m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
--

From: Michael Witten
Date: Thursday, March 18, 2010 - 2:57 pm

On Thu, Mar 18, 2010 at 16:39, Martin Langhoff

You see, Martin, you might want/need to stop using "Martin Langhoff
<martin.langhoff@gmail.com>" as your email account, but there's no

Mismatches in UUIDs will be the only thing worth worrying about;
fortunately, UUIDs won't change as frequently because they would be
rarely typed by git users and they are not subject to changing email
systems or changing names.
--

From: Paolo Bonzini
Date: Friday, March 19, 2010 - 5:34 am

While a gnu.org or gmail.com will (most likely) stay with some person 
forever, hindsight is 20/20 and many people may generate their UUID from a 
work email.  So, suppose I make my UUID based on <pbonzini@redhat.com> 
what will guarantee that in 20 years I won't find a new career as a 
bartender, and Red Hat wouldn't hire someone with my same name, and give 
him the same email address?

Heck, some people use gmail only for their personal email, and they 
rightly cannot be bothered to create another account to solve a problem 
they don't understand and they probably do not have.

For the UUID to make sense, it would need to be what the acronym says: 
universally unique.  An SHA-1 value is _not_ universally unique, it is 
just a one-way function.  There are tons of git repos out there with a 
blob hashing to e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 or 
257cc5642cb1a054f08cc83f2d943e56fd3ebe99.
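
The point that a SHA-1 identifies content rather than a person or a repository can be checked by hand. The following is a minimal Python sketch, assuming git's usual blob naming scheme (a "blob <size>" header, a NUL byte, then the raw content); the function name git_blob_sha1 is made up for illustration and is not part of git:

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the object name git gives a blob with this content.

    The hashed bytes are the header "blob <size>", a NUL byte,
    and then the raw content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# An empty file hashes to the same name in every repository that has one.
print(git_blob_sha1(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Any two people who hash the same input get the same value, which is why a hash of some chosen string is no more unique than the string itself.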

I have an idea.  Start your own website uuidemail.com.  One registers 
and gets an alias for their email, something like 
8aacc35ffca0d34fccf8a750e84e3a81bdcb940b@uuidemail.com.  Then people can 
start using 
8aacc35ffca0d34fccf8a750e84e3a81bdcb940b+pbonzini--redhat.com@uuidemail.com 
as their git user.email.  I bet nobody will.

Paolo

ps: Yes, in a perfect world it would be nice for people to know that I 
am the same person independent of whether I contribute as 
bonzini@gnu.org or pbonzini@redhat.com.  But we're not in a perfect 
world, so amen.
--

From: Michael Witten
Date: Friday, March 19, 2010 - 5:43 am

Firstly, the UUID need not be a name/email pair.

Secondly, you're being ridiculous; even if that ridiculous scenario
played out not-infrequently, there would still be less identity
confusion in git repos over time, because changing real life names,
and changing email accounts do happen frequently and are not

This doesn't make any sense. Why does anybody need to create another

The SHA-1 is supposed to be an optimization; it's not essential, as
I've already explained; I also get the feeling that you're being

This is nonsense that betrays your misunderstanding.
--

From: Paolo Bonzini
Date: Friday, March 19, 2010 - 5:53 am

It's not a matter of frequency.  If you want a "UU" identification,

Why?  What does (name, email, uuid) provide over (name, concat(uuid, 
email))?  Nothing.

But the point is, neither really provides anything over (name, email).

Paolo
--

From: Michael Witten
Date: Friday, March 19, 2010 - 6:03 am

I've got news for you. The UUIDs generated by uuidgen CAN collide:

    The new UUID can reasonably be considered unique
    among all UUIDs created on the local system, and
    among UUIDs created on other systems in the past
    and in the future.

You're creating a straw man argument; conceptually, what I propose is
better than what the current system provides because it would decrease

Go read the thread until you understand.
--

From: Paolo Bonzini
Date: Friday, March 19, 2010 - 6:08 am

Maybe you have to define entropy.  For human consumers, "Paolo Bonzini 
<pbonzini@redhat.com>" has considerably less "entropy" than 
8aacc35ffca0d34fccf8a750e84e3a81bdcb940b, as does even "Paolo Bonzini 
<bonzini@gnu.org, pbonzini@redhat.com>".  For non-human consumers, a 

I am not alone.

Paolo
--

From: Michael Witten
Date: Friday, March 19, 2010 - 6:13 am

As I've stated before many times, the SHA-1 is not necessary to the proposal.

Please go read.
--

From: Wincent Colaiuta
Date: Friday, March 19, 2010 - 6:41 am

Stop telling people to go read your idiotic proposal. It has _already_ been read with great attention, and multiple people have shown immense patience repeatedly explaining to you why the idea is stupid. Your continued trolling is really starting to grate.

The overwhelming, sustained opposition to your idea should already be enough indication that such a proposal will _never_ be accepted into the Git codebase, so right now you're just wasting people's time.

w

--

From: Michael Witten
Date: Friday, March 19, 2010 - 6:59 am

I've shown immense patience repeatedly explaining why these
'explanations' are strawmen or based on misunderstandings and bad
assumptions.

It's true that I have been receiving perfectly valid complaints. The
problem is that almost all of them have nothing to do with what I've
been saying because people see 'uuid' and a few examples with hex

I long ago gave up the notion that it would be included in the git codebase.

Instead, I've been defending the idea, which is a simple but vast
improvement over the current system; had it been in place since the
beginning, a lot of trouble could have been reduced.

Indeed, the only thing that makes this great idea a bad idea is
COMPATIBILITY CONCERNS; that's it.
--

From: Martin Langhoff
Date: Friday, March 19, 2010 - 7:13 am

No, you haven't. _You_ are misunderstanding.

We have what you want: email + name, and a mapping mechanism (mailmap)

Good... at last! But don't put ALL CAPS when you are in the wrong,
mate. And wasting a lot of people's time.



m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
--

From: Linus Torvalds
Date: Thursday, March 18, 2010 - 2:27 pm

The thing is, you don't seem to realize that most authorship is over 
email.

Let's take some numbers from the kernel archive, for example. Here's _one_ 
trivial way to count it:

 - number of commits where author/committer email matches (presumably 
   _not_ emailed, although sometimes people commit their own patches that 
   were emailed around):

	[torvalds@i5 linux]$ git log --no-merges "--pretty=format:%h-%ae%n%h-%ce" | uniq -d | wc
	  33473   33473  959167

 - total number of commits:

	[torvalds@i5 linux]$ git rev-list --no-merges HEAD | wc
	 176415  176415 7233015

IOW, less than a fifth of the patches were done by the person who actually 
committed things. 80%+ of all changes were committed by somebody else than 
the author.
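
As a quick check of the arithmetic, the two counts above give the ratio directly. This is a minimal Python sketch; the variable names are made up for illustration, and the numbers are copied from the first column of the wc output:

```python
# Counts copied from the wc output quoted above (first column = line count).
same_author_and_committer = 33473  # commits where author and committer email match
total_non_merge_commits = 176415   # git rev-list --no-merges HEAD | wc

fraction_self_committed = same_author_and_committer / total_non_merge_commits
fraction_committed_by_other = 1 - fraction_self_committed

print(f"self-committed:      {fraction_self_committed:.1%}")
print(f"committed by others: {fraction_committed_by_other:.1%}")
```

That comes out at roughly 19% and 81%, in line with "less than a fifth" and "80%+".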

How do you think the authorship information can be transferred sanely, 
considering that the author didn't even use git in the first place? 
Really?

That's where the typos/mistakes/missing-info really happen. And it often 
starts out with incomplete information, because the person has a bad email 
setup, and the thing only has an email address to begin with, ie the 
"From:" might literally say just "tytso@mit.edu" or something (to pick an 
example from the Cc list in this discussion - when Ted sends real emails, 
they tend to have proper naming).

Sometimes we'll edit the messages to have the "From: xyz <abc>" thing at 
the top, fixing up the incomplete thing then. Typos happen there. Or the 
patch will simply come in two different ways, so there's no typo, yet 
there are two different emails that get author attribution.

The thing is, development really is about human interaction. Yes, there's 
a tool involved (git), and once the data is in the tool we won't lose it 
any more, but this is about getting the data _into_ the tool in the first 
place.

And the data you want to add simply DOES NOT EXIST. And we can't make it 
exist. The fact that even the trivial and obvious data that git _does_ ask 
for gets to be incomplete ...
From: Michael Witten
Date: Thursday, March 18, 2010 - 2:44 pm

On Thu, Mar 18, 2010 at 16:27, Linus Torvalds

That is a really good point, and something I'll have to consider more
thoroughly.

However, I do NOT claim that my proposal will add information where
there is none, only that it will reduce the rate at which entropy
increases.
--

From: Jon Smirl
Date: Thursday, March 18, 2010 - 4:12 pm

On Thu, Mar 18, 2010 at 5:27 PM, Linus Torvalds

If I recall correctly the top source of errors is variations in the
domain name of the email address. Second place was mangling of names
from non-ASCII charsets. Third place was human typos. Fourth was
inconsistency in the human name, like Ted's example.

A really simple check would be for git to say - I've never seen this
name/email combo before, are you sure it is correct before I commit
it.
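
A check of this kind could be sketched as follows. This is only an illustration in Python, not an existing git feature: the function name unseen_identity_warning and the sample identities are hypothetical, and the set of known identities is assumed to have been collected beforehand (for example from the author lines of the existing history):

```python
from typing import Optional, Set, Tuple

Identity = Tuple[str, str]  # (name, email)


def unseen_identity_warning(name: str, email: str,
                            known: Set[Identity]) -> Optional[str]:
    """Return a warning if (name, email) was never seen before, else None."""
    if (name, email) in known:
        return None
    # Offer a hint when only one half of the pair is new.
    names_for_email = sorted(n for n, e in known if e == email)
    emails_for_name = sorted(e for n, e in known if n == name)
    if names_for_email:
        hint = f" (this email was used before as '{names_for_email[0]}')"
    elif emails_for_name:
        hint = f" (this name was used before with <{emails_for_name[0]}>)"
    else:
        hint = ""
    return f"never seen '{name} <{email}>' before{hint}; is it correct?"


# Hypothetical history containing a single known identity.
known_identities = {("Jon Smirl", "jonsmirl@gmail.com")}

print(unseen_identity_warning("Jon Smirl", "jonsmirl@gmail.com", known_identities))
print(unseen_identity_warning("Jon Smril", "jonsmirl@gmail.com", known_identities))
```

The first call returns None because the pair is already known; the second returns a warning that points at the earlier spelling of the name.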




-- 
Jon Smirl
jonsmirl@gmail.com
--

From: A Large Angry SCM
Date: Thursday, March 18, 2010 - 3:17 pm

[Much text deleted]

The formatting of the information in the author & committer fields is a 
_social_ convention (with a little help from the tools). You can actually 
use this proposed "feature" now for your own commits by appending the 
UUID string to your name config setting, environment variable and/or GCOS 
field today and everything will work. You can even make it a requirement 
for projects that you control. But don't expect all other projects to do 
so also as they may not care.
--

From: Sitaram Chamarty
Date: Thursday, March 18, 2010 - 7:47 pm

[all snipped]

Great Gods above... 50+ emails, including many from Linus himself,
trying to respond to a non-solution to a non-problem...

slow day?
--

From: Nazri Ramliy
Date: Thursday, March 18, 2010 - 10:17 pm

Nah.. just wait until someone mentions either Hitler or Nazis.

nazri.
--

From: Michael Haggerty
Date: Friday, March 19, 2010 - 1:41 am

A UUID doesn't need to be a big hex number.  All it has to be is a
"Universally Unique Identifier".  Like, oh, for example, your

                   *** EMAIL ADDRESS ***

[1].  There is even already a way to fix up mistakes or unavoidable
email address changes, namely the .mailmap file.

So if you are exercised about having a persistent identity, simply find
an email provider that is unlikely to ever give your email address to
somebody else, and use that address consistently.  Encourage other
people to do the same and to keep their .mailmap entries up to date.

(Not that it's likely to happen, but having people maintain opaque UUIDs
is even *less* likely.)

Michael

[1] The only non-UUID property of legitimate email addresses is that the
username part or even the domain name part of an email address can be
recycled.  But with a reputable email provider this shouldn't be a
problem.  For the purpose of the UUID it is not even a problem if the
email address becomes defunct, as long as it is not taken over by
somebody else.
--

From: Michael Witten
Date: Friday, March 19, 2010 - 4:39 am

*facepalm*

You've just repeated everything that I've said; go look at the rest of
the thread, where I spend plenty of time correcting the same hangups
about my choice of the word UUID and my use of hex digits.

I'm only observing that the current name/email pair conflates
an individual with his current email system and that it would be
worthwhile to ALLOW an individual to FURTHER describe himself by
including another piece of information that is solely meant as
identification within git. That piece of information could be whatever
a user deems to be uniquely identifying for himself. You could use
"Michael Haggerty <mhagger@alum.mit.edu>" as your uuid, and you could
still use it after you change the `email' config variable to something
else.

There is MUCH LESS CHANCE of such a uuid getting trashed by typos,
changing names, and changing email addresses; of course it can still
get messed up, but the rate at which something like .mailmap would
need to be updated would likely be greatly decreased and it would make
gathering statistics easier (especially for the individuals who take
advantage of such a uuid for describing themselves---and it only
requires setting one config variable to something easily remembered by
that person).

I cover all of this numerous times in numerous rebuttals; don't
contribute to a thread with more than 60 emails without having read at
least some of them. If you don't care to read so much, then perhaps
jump here:

    http://marc.info/?l=git&m=126894679711600&w=2

In the end, there is probably only one legitimate problem with my
proposal: It might break compatibility with older repo formats/tools.
I'm not sure about that.

Sincerely,
Michael Witten
--

From: david
Date: Friday, March 19, 2010 - 4:45 am

here is where you are missing the point.

no, there is not 'much less chance' of it getting messed up.

you seem to assume that people would never need to set the UUID on 
multiple machines.

if they don't need to set it on multiple machines, then the e-mail/userid 
is going to be reliable anyway

if they do need to set it on multiple machines and can't be bothered to 
keep their e-mail consistent, why would they bother keeping this 
additional thing consistent? Linus is pointing out that people don't care 
now about their e-mail and name, and will care even less about some 
abstract UUID

people who care will already make their e-mail consistent.

From: Mike Hommey
Date: Friday, March 19, 2010 - 4:54 am

While I don't agree with the need for that uuid thing, I'd like to
point out that people who care can't necessarily keep their e-mail
consistent. For example, Linus used to use an @osdl.org address, and
he now uses an @linux-foundation.org address. It's still the same Linus,
but the (name, email) pair has legitimately changed.

Mike
--

From: Reece Dunn
Date: Friday, March 19, 2010 - 5:09 am

So create an aliases list that maps one (name,email) to another that
is from the same person. There is no need for an additional item (a
uuid) to solve this problem. It also means that searching on any
(name,email) pair will find the others, so you only need to
remember/find one of the identities for the person you are interested
in finding the commits for.

AFAICS, mailmap is about correcting mistakes (primarily in the
reported name for a given email address). In this case, mailmap and
this aliases-map will work in conjunction with each other to give what
the original poster wanted. However, I haven't seen any of his replies
that answer this (or sufficiently address why mailmap does not solve
his problem).

- Reece
--

From: Michael Witten
Date: Friday, March 19, 2010 - 5:16 am

See:

    http://marc.info/?l=git&m=126900051102958&w=2

The idea is to distribute the responsibility for maintaining a
consistent identity AND to make that responsibility EASY.

The extra uuid `field' can only suffer from typos, while the
name/email pair can suffer from typos, changing email accounts, and
changing real life names. If the uuid `field' does get bungled by a
typo or is not used, then we're no worse off than we were before.
--

From: Michael Witten
Date: Friday, March 19, 2010 - 5:18 am

I should add that because the uuid `field' would be typed pretty much
only as a config variable and then used by git tools from thenceforth,
the rate at which typos can occur is much less than for the name/email
pair.
--

From: Reece Dunn
Date: Friday, March 19, 2010 - 7:57 am

What specific problem(s) are you trying to solve?

The main issue is identifying who made what changes to a repository
(e.g. by a script, or database/statistics algorithms). The mailmap
file allows for corrections to a canonical (name,email) pair for a
specified repository.

For identifying the same person working across multiple projects,
ideally they should keep the canonical (name,email) pair consistent
across all projects, with mailmap files in the respective projects to
keep the canonical form correct.

This canonical (name,email) pair is then a unique identifier for that
person and then effectively becomes a uuid. There is no need to add an
extra uuid field that needs *more* work fixing up errors and making
consistent.

If you change email address or name, *and* care enough about it being
consistent, there is no reason why you cannot update the mailmap file
to use the new canonical (name,email) pair.

Oh, and you are expressing it wrong (if I understand you correctly)...

What you are after is a string U (the uuid) that is used to identify a
person irrespective of their name and email. At the moment
   U = (name,email)
is used to achieve that, with mailmap to normalise the variations.

What you are trying to express is:
    U <=> (name,email)
where U can be any unique string. This is different from using a
(name,email,uuid) triple to identify someone.

So, let's say that I choose U=abc to identify myself uniquely, so that:
    "abc" <=> "Reece Dunn <msclrhd@gmail.com>"
    "abc" <=> "Reece Dunn <msclrhd@googlemail.com>"
    "abc" <=> "Reece Dunn <msclrhd@hotmail.com>"
    "abc" <=> "Reece H. Dunn <msclrhd@gmail.com>"
    "abc" <=> "Reece H Dunn <msclrhd@gmail.com>"

I would still need to define all these variations when and as they
occur in a repository to fixup any typos and email address changes
that occur, so why not just pick U = "Reece H. Dunn
<msclrhd@gmail.com>" as the canonical form instead of "abc" or some
other string?
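
The argument can be made concrete with a small alias table. This is a minimal Python sketch, assuming the variants listed above and a made-up list of author lines; the names ALIASES, canonical and commits_per_person are illustrative only:

```python
from collections import Counter
from typing import Iterable

CANONICAL = "Reece H. Dunn <msclrhd@gmail.com>"

# The variants from the message above, each mapped to the canonical form.
ALIASES = {
    "Reece Dunn <msclrhd@gmail.com>": CANONICAL,
    "Reece Dunn <msclrhd@googlemail.com>": CANONICAL,
    "Reece Dunn <msclrhd@hotmail.com>": CANONICAL,
    "Reece H Dunn <msclrhd@gmail.com>": CANONICAL,
}


def canonical(identity: str) -> str:
    """Map a known variant to the canonical identity; leave others alone."""
    return ALIASES.get(identity, identity)


def commits_per_person(authors: Iterable[str]) -> Counter:
    """Count commits per canonical identity."""
    return Counter(canonical(author) for author in authors)


# Made-up author lines, one per commit.
log_authors = [
    "Reece Dunn <msclrhd@gmail.com>",
    "Reece H Dunn <msclrhd@gmail.com>",
    "Reece H. Dunn <msclrhd@gmail.com>",
    "Jon Smirl <jonsmirl@gmail.com>",
]

print(commits_per_person(log_authors))
```

The table needs one entry per variant whether the target is "abc" or the canonical name/email string, so picking the readable string as U costs nothing extra.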

As has been said, mailmap ...
From: Michael J Gruber
Date: Friday, March 19, 2010 - 8:26 am

Reece Dunn venit, vidit, dixit 19.03.2010 15:57:


[Attention, conspiracy theories below!]

The problem seems to be that some people are interested in statistics,
so some are interested in consistent author information, but this
requires others (the authors) to maintain this information, at least on
large projects where this information cannot be kept consistent by a few
people. So, some people are looking for a way to enforce this on the
others... Of course, one could also rephrase this as "help authors
maintain their authorship information in a consistent way" ;)

Michael
--

From: david
Date: Friday, March 19, 2010 - 9:05 am

but a UUID doesn't help you.

if you can force people to have a consistent UUID, you can force them to 
have a consistent e-mail address (and submit mapping updates if it 
changes)

if you can't force people to maintain a consistent e-mail, why do you 
think they would maintain a consistent UUID?

David Lang
--

From: Michael Witten
Date: Friday, March 19, 2010 - 10:16 am

Firstly, please note that a UUID is defined in this context as any
string that the user deems for himself to be uniquely identifying of
himself; a UUID allows a user to determine his canonical
representation from the very start.

There's no forcing; there can't be. This is meant to help users manage
their own identities.

A UUID is basically only subject to change due to:

    * typos when configuring

A name/email pair (as in the user.name and user.email variables) is
subject to change due to:

    * typos when configuring
    * legal name changes
    * email account switching

Naturally, older commits and wrong UUIDs would need mappings, but
that's no different than the current situation except for the fact
that UUIDs would not change as frequently.

That aside, an alternative solution that is not as powerful but that
is less invasive would be to allow users to transmit authorship
information as part of the patch payload separate from the usual email
headers (or something like this). Erik Faye-Lund suggests this is
already easily done, but I'm not so sure.
--

From: Jon Smirl
Date: Friday, March 19, 2010 - 5:25 am

git already supports aliases via the .mailmap file. Pick one
name/address pair that you like and then use .mailmap to map all of
the variations into the primary one. Granted some git tools don't
process .mailmap, but it is easier to fix the tools than create a new
ID system.

Look at the .mailmap in the current kernel tree. It fixes a few
problems. I have a much larger one that fixes most address issues.




-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Reece Dunn
Date: Friday, March 19, 2010 - 5:40 am

Indeed. I wasn't aware that mailmap catered for this as well.

- Reece
--

From: Michael Witten
Date: Friday, March 19, 2010 - 5:09 am

Indeed.

This is because the name/email pair (as in the 'name' and 'email'
config variables) CONFLATES the idea of identity and current email
account.
--

From: Mark Brown
Date: Monday, March 22, 2010 - 5:06 am

You're assuming they aren't conflated - for example, when people do work
both personally and for their employer they often use distinct e-mail
addresses to identify how the work was funded.
--

From: Michael Witten
Date: Monday, March 22, 2010 - 7:38 am

Indeed.

The model I propose handles this case much better, as I explain here:

    http://marc.info/?l=git&m=126900051102958&w=2

Specifically:

    > if they do need to set it on multiple machines and
    > can't be bothered to keep their e-mail consistent,
    > why would they bother keeping this additional thing
    > consistent? Linus is pointing out that people don't
    > care now about their e-mail and name, and will care
    > even less about some abstract UUID
    
    The user doesn't have a damn choice!

    [These first few paragraphs aren't completely correct;
     there's an explanation below them. It's mainly just
     setting up for the important part below.]
    
    The email can't be kept consistent over time because
    the tools expect it to be and/or use the actual
    physical email used to send/receive stuff. It's
    information that CONFLATES identity with whatever
    tool/system you're using.
    
    For instance, Michael Haggerty cannot reasonably use
    
        [user]
            name  = Michael Haggerty
            email = mhagger@MIT.EDU
    
    because he likely no longer has that email account
    to use. He is forced to change it and therefore
    forced to make his identity confused.

    [The above isn't quite true; my mistake. Michael
     could actually keep "mhagger@MIT.EDU" but inform
     tools like "git send-email" to send patches from
     another email address; this way, send-email will
     emit the necessary information to carry that
     authorship identity ("mhagger@MIT.EDU") along
     with the patch.
    
     However, it's still the case that Michael Haggerty
     is essentially stuck with "mhagger@MIT.EDU" for
     his identification---a problem that my proposal
     essentially fixes, as described now:]
    
    I'm proposing ALLOWING him to say:
    
        [user]
            uuid  = Michael Haggerty <mhagger@MIT.EDU>
            name  = Michael Haggerty
            email = ...
From: Erik Faye-Lund
Date: Wednesday, March 24, 2010 - 12:18 pm

...which is the exact same situation as above, where he's "stuck"
using "mhagger@MIT.EDU" for identification. I don't see how this
changes anything (except allowing to distribute an updated
contact-email... But let's face it, git-repos aren't Facebook)

-- 
Erik "kusma" Faye-Lund
--

From: Michael Witten
Date: Wednesday, March 24, 2010 - 12:23 pm

I don't see how you can't see it.

Oh well.
--

From: Michael Witten
Date: Friday, March 19, 2010 - 5:08 am

I covered that in the first email, highlighting the importance of
using an easily remembered, already reasonably unique piece of

The problem is that the name/email pair (as in the 'name' and 'email'
config variables) is NOT ONLY subject to typos, but it is ALSO subject
to changing email accounts and changing real life names.

If you don't use the uuid `field' that I propose, then everything
would be just like it was before. If you do use it, then you can
easily identify all of your own contributions regardless of what your

The user doesn't have a damn choice!

The email can't be kept consistent over time because the tools expect
it to be and/or use the actual physical email used to send/receive
stuff. It's information that CONFLATES identity with whatever
tool/system you're using.

For instance, Michael Haggerty cannot reasonably use

    [user]
        name  = Michael Haggerty
        email = mhagger@MIT.EDU

because he likely no longer has that email account to use. He is
forced to change it and therefore forced to make his identity
confused.

I'm proposing ALLOWING him to say:

    [user]
        uuid  = Michael Haggerty <mhagger@MIT.EDU>
        name  = Michael Haggerty
        email = mhagger@ALUM.mit.edu

Heck, let's say he works at Red Hat as well; he might make some
commits under this config AT WORK:

    [user]
        uuid  = Michael Haggerty <mhagger@MIT.EDU>
        name  = Michael Haggerty
        email = mhagger@redhat.com

Then, he can make, say, commits to the Linux kernel repo for both work
and hobby related issues and still be recognized as the same person.
That is, he can have some commits under "Michael Haggerty
<mhagger@ALUM.mit.edu>" and other commits under "Michael Haggerty
<mhagger@redhat.com>" and still link them all together as the same
identity with just the uuid "Michael Haggerty <mhagger@MIT.EDU>".

Sincerely,
Michael Witten
--

From: Michael Haggerty
Date: Friday, March 19, 2010 - 7:08 am

No, my point is to use the *existing* email address as the UUID

Give me a break.  It's not so damn hard to keep an email address over
time.  And if it changes, I can update the .mailmap file to map my old
email address to the new one and *presto* I have a new, equally valid

Wrong.  I've read the whole idiotic thread.  To prove it I'll summarize
it for you: you argue the same point over and over again while ignoring
the legitimate objections of just about every other participant.

Adding a new UUID field is obviously a non-starter, so I suggested a way
to get the same (very marginal) benefit from the fields that are already
present in every git repository.

Michael
--

From: david
Date: Friday, March 19, 2010 - 10:02 am

if you are now proposing using the e-mail address, that already 
exists and is supported by the tools, it sounds like you are just 
withdrawing your proposal (other than possibly proposing that the e-mail 
field gets renamed to UUID????)

David Lang

--

From: Michael Witten
Date: Friday, March 19, 2010 - 10:06 am

You're responding to a different Michael.
--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 11:50 am

I guess he should have checked the UUID.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Jakub Narebski
Date: Friday, March 19, 2010 - 7:08 am

This is a non-solution to a non-problem.

First, the user.name and user.email do not need to be the name and email
from some email account.  They might be some "canonical name" and 
"canonical email".

Second, there are (I think) two main sources of 'instability' in
(name,email) pairs, namely A) misconfigured git (when fetching/pushing
using git itself), B) wrong name in email etc. (when sending patches
via email, 80% of patches in Linux kernel case).

In the case of misconfigured git (case A) using a UUID wouldn't help,
and would only make it worse (you would have to configure the same UUID
on each machine).  What would help here is for git to be more strict and
perhaps forbid (some of) the autogenerated names and emails.

In the case of sending patches via email, you can use an in-body 'From:'
to provide a (name,email) pair that differs from the account used to
send the email.  In the case of a UUID you would need the same: some way
to provide the UUID in the patch (in the email).  A UUID has the
disadvantage of being required even when the (name,email) in the From:
email header is a good user ID.  So a UUID wouldn't help there either.


What could help in both cases is .mailmap being used (perhaps on
demand) in more git commands.  See Documentation/mailmap.txt
or e.g. the git-shortlog(1) manpage.  It is quite an advanced tool for
correcting mistakes (it can correct not only the user name, which is
the most common usage, but also the email address).
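
As an illustration of how such a map can correct both the name and the
email, here is a minimal Python sketch of a .mailmap lookup.  It assumes
the entry forms described in Documentation/mailmap.txt and
case-insensitive matching; the helper names parse_mailmap and canonical,
and every address in SAMPLE, are made up for the example and are not
git's own implementation.

```python
import re

# One or two "Name <email>" groups per line; either name may be empty.
_ENTRY = re.compile(r"^\s*([^<]*?)\s*<([^>]*)>(?:\s*([^<]*?)\s*<([^>]*)>)?\s*$")


def parse_mailmap(text):
    """Build lookup tables from .mailmap text (hypothetical helper)."""
    by_pair = {}   # (commit name, commit email) -> (proper name, proper email)
    by_email = {}  # commit email -> (proper name or None, proper email or None)
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        m = _ENTRY.match(line)
        if m is None:
            continue  # ignore lines this sketch does not understand
        name1, email1, name2, email2 = m.groups()
        if email2 is None:
            # "Proper Name <commit@email>": corrects the name only
            by_email[email1.lower()] = (name1 or None, None)
        elif name2:
            # "Proper Name <proper@email> Commit Name <commit@email>"
            by_pair[(name2.lower(), email2.lower())] = (name1 or None, email1)
        else:
            # "[Proper Name] <proper@email> <commit@email>"
            by_email[email2.lower()] = (name1 or None, email1)
    return by_pair, by_email


def canonical(name, email, by_pair, by_email):
    """Return the corrected (name, email), or the input if nothing matches."""
    hit = by_pair.get((name.lower(), email.lower())) or by_email.get(email.lower())
    if hit is None:
        return name, email
    new_name, new_email = hit
    return new_name or name, new_email or email


# Invented sample data, for illustration only.
SAMPLE = """\
# name-only fix, email remap, and (name,email) remap
Jane Doe <jane@example.com>
Jane Doe <jane@example.com> <jdoe@old.example.org>
Jane Doe <jane@example.com> J. Doe <jd@laptop.(none)>
"""

if __name__ == "__main__":
    pairs, emails = parse_mailmap(SAMPLE)
    for who in [("jane", "jane@example.com"),
                ("jdoe", "jdoe@old.example.org"),
                ("J. Doe", "jd@laptop.(none)")]:
        print(who, "->", canonical(*who, pairs, emails))
```

Because the file lives in the tree, every clone gets the same mapping
without any per-machine configuration.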

-- 
Jakub Narebski
Poland
ShadeHawk on #git
--

From: Jon Smirl
Date: Friday, March 19, 2010 - 7:33 am

Another top source is mangling of non-ASCII charsets when they go
through the email system. Are the git workflow tools safe for
alternative charsets? Do the email tools look at the charset header of
the email message? Check people's names in the kernel commits and
you'll find lots of examples of this type of mangling.

Or people not using UTF-8. There are files in the kernel where
people's names are in conflicting codepages. Should git try to look



-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Michael J Gruber
Date: Friday, March 19, 2010 - 7:52 am

Or even the quoting of quotes for nick names, appearing as 'nick',

You and others are proving a very important point here: This is really
an lkml proxy fight being taken to the git list, after the futile
mailmap-ification there.

People may disagree on the best approach in general, but this thread
clearly shows:

- The Git community is happy with mailmap for git.git.
- The Git community does not see any need for amending the mailmap
mechanism.
- How you actually use mailmap (leniently or enforcing) is a per-project
decision, just like the patch workflow, the meaning and use of s-o-b
lines, the requirement for full names and many other things.

But since the git list is hosted on kernel.org we can't really complain
about providing room for an lkml discussion ;)

Michael
--

From: Michael Witten
Date: Friday, March 19, 2010 - 7:40 am

The vast majority of patches come in through email; the git tools
expect the user.name and user.email to reflect physical email account
information.

You would be correct if it were not for the fact that git currently

The uuid string would be typed pretty much only during configuration;
from there, it's basically just handled by the git tools. Hence, the
uuid can indeed suffer from typos, but the name/email pair can suffer
from not only typos but also real life name changing and email account
switching.

There would still be the same problem of variations in uuid for one
person, but the problem would very likely be greatly reduced; if a
person doesn't use the uuid properly or at all, then we're in the
exact same situation we were before. Those who do use it, though, will
be much better off.

Strictness about names and emails is difficult, and keeping something
like the current .mailmap file up-to-date is a centralized process.
The uuid field would distribute the responsibility of maintaining
identity and make that responsibility easy because the user-chosen
string is easy for that user to remember and is typed only very

That's a good solution that I've considered, except for 2 reasons:

    * It involves many more opportunities for typos and/or the
      configuration of a non-git tool for a git-specific purpose.

    * Many if not most email services will refuse to send messages

Yes, but that's automated by tools like git's format-patch. Not using
something like format-patch or some other git interface is an
'out-of-band' communication and that author has essentially chosen not
to care about his identity.

The use of the uuid field and allowing git tools to handle it is just
a way to let a person who does care about his identity keep it


The disadvantage here is that it centralizes identity management and
it is more demanding because the name/email pair is quite unstable.

On the other hand, something like a uuid field would distribute that
management to ...
--

From: Erik Faye-Lund
Date: Friday, March 19, 2010 - 7:56 am

What git tools would those be? The only one I know of that does
anything near assuming that is git send-email, and it only uses
user.email if neither sendemail.from is configured nor the --from option
is specified. And even when it does, it prompts the user so it can be
changed if called from a terminal. So I wouldn't say that it assumes
anything about the "physicalness" of user.email; it just uses it as
the most sane default unless anything else has been specified.
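
The fallback order described above can be sketched in a few lines of
Python.  This is only an illustration of the precedence, not the actual
send-email code: choose_sender is a hypothetical helper, and letting the
command-line option win over the configured value is an assumption based
on the usual option-over-config convention.

```python
def choose_sender(user_name, user_email, sendemail_from=None, cli_from=None):
    """Pick the address to send from (hypothetical helper, not git's code)."""
    if cli_from:
        # --from given on the command line (assumed to take priority)
        return cli_from
    if sendemail_from:
        # sendemail.from set in the configuration
        return sendemail_from
    # Fall back to the commit identity.  The real tool also prompts here
    # when run from a terminal, so this is only a default.
    return f"{user_name} <{user_email}>"


if __name__ == "__main__":
    # Invented identities, for illustration only.
    print(choose_sender("A U Thor", "author@example.com"))
    print(choose_sender("A U Thor", "author@example.com",
                        sendemail_from="A U Thor <work@example.org>"))
```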

-- 
Erik "kusma" Faye-Lund
--

From: Michael Witten
Date: Friday, March 19, 2010 - 8:05 am

It's useless to spoof the From field because many email services won't
send it, a point I already covered in the email you quoted.

When a patch is finally emailed, it's the From field that is used for
Author attribution.

You see? Your identity has been tied to whatever email service you
happen to use at any given time rather than to something with more
long term stability.
--

From: Michael Witten
Date: Friday, March 19, 2010 - 8:12 am

A lot of trouble could probably be avoided if the Authorship
information could be sent as something separate from the From field. I
don't think it would be quite as powerful as having a uuid, but it
would be less invasive and probably practically as effective.
--

From: Erik Faye-Lund
Date: Friday, March 19, 2010 - 8:25 am

The From-field isn't assumed to be a physical-address, but the
From-header is. If the From-field and the From-header are identical,
the From-field doesn't get emitted. This is the same mechanism that is
used when people forward patches from other authors, and there are no
attempts to validate the From-field, only the From-header.

So no, the author-email shouldn't need to be a physical address as far
as send-email is concerned.

-- 
Erik "kusma" Faye-Lund
--

From: Reece Dunn
Date: Friday, March 19, 2010 - 8:12 am

I don't get this - it is the other way around.

For the mailmap file, you check that file into the git repository
itself. Therefore, by implication, mailmap *is* distributed. It is
therefore kept locally and accessed locally. It also does not suffer
from configuration issues, as you don't need to re-enter it if you
change your computer.

For a uuid to work the way you intend it, there would need to be some
universal central server that would be queried to look up and resolve
the uuid so you can get consistent user identification information for
every git command by every person/script from every git repository.
This is never going to fly for all the reasons distributed VCSs were
created in the first place.

Unless by distributed you mean in the .git/config file, which is
always local and never distributed to others. However, the uuid data
in the repository will be distributed in the repositories, so how is
this any better than what git has now?

- Reece
--

From: Jakub Narebski
Date: Friday, March 19, 2010 - 5:21 pm

It is not true.  From the git-config(1) manpage, the description (meaning)
of user.name and user.email is:

  user.email::
        Your email address to be recorded in any newly created commits.
        Can be overridden by the 'GIT_AUTHOR_EMAIL', 'GIT_COMMITTER_EMAIL', and
        'EMAIL' environment variables.  See linkgit:git-commit-tree[1].

  user.name::
        Your full name to be recorded in any newly created commits.
        Can be overridden by the 'GIT_AUTHOR_NAME' and 'GIT_COMMITTER_NAME'
        environment variables.  See linkgit:git-commit-tree[1].
 
As you can see, there is nothing about a physical email account.

It is true that git-send-email asks about the "From" email address to
send email from with user.name + user.email as default value...
unless either sendemail.from or --from option is used.  

You do not need (in theory at least) to change user.name or user.email
after a real-life name change (like marriage or adoption) or an email
account switch.


Actually git-send-email would automatically add an in-body "From:" header
if the author is different from the "From:" address of the email, and
git-am would automatically prefer the in-body "From:" over the sender
(in-header "From:") for authorship information.

Sender can be different from author of the patch, there is no problem
with that.
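
To make the sender/author split concrete, here is a rough Python sketch
of both halves: the sending side adds an in-body "From:" only when the
author differs from the sender, and the applying side prefers that line
over the header.  The functions make_body and author_of are hypothetical
stand-ins for what git-send-email and git-am do, and the identities are
invented; real mail parsing is considerably more involved.

```python
def make_body(author, sender, message):
    """Sending side: prepend an in-body From: only if author != sender."""
    if author != sender:
        return f"From: {author}\n\n{message}"
    return message


def author_of(header_from, body):
    """Applying side: prefer an in-body From: line over the header."""
    first_line, _, _ = body.partition("\n")
    if first_line.startswith("From: "):
        return first_line[len("From: "):].strip()
    return header_from


if __name__ == "__main__":
    # Invented identities, for illustration only.
    author = "A U Thor <author@example.com>"
    sender = "Some Maintainer <maint@example.org>"
    body = make_body(author, sender, "Fix typo in README\n")
    print(author_of(sender, body))
```

With this split the account used to send the mail never has to match
the identity recorded in the commit.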

What git can improve here (and perhaps already does it) is handling of
non-ASCII characters in name (e.g. when commit message does not contain
non US-ASCII letters, but user.name does).  Perhaps it got corrected
(improved) already.


P.S. Backward compatibility (older git-am) would probably require
UUID in the form of canonical name+email, and use of in-body "From:"

git-send-email *already* automatically deals with sender != author.


How is an in-tree .mailmap file (in-tree like .gitignore and .gitattributes)
*centralized identity management*?  It is as distributed as git
repositories are.

On the other hand user.uuid is not distributed; for security ...