Re: 463 kernel developers missing!

From: Jon Smirl
Date: Monday, July 28, 2008 - 7:45 am

Here's a new .mailmap file for the kernel that cleans up the horrible
mess of names and email addresses in the log.  To use it put it at the
root of your kernel tree and type 'git shortlog'. Before the clean up
there were 4,284 developers, after 3,821. There are 5,051 unique
emails.

The mailmap file contains all email addresses that have been used to
submit patches to the kernel. Don't freak out about your email address
being in the file; if it is in the file it is already in Google since
the kernel log is already in Google.

Putting all the email addresses and names into this file allows it to
be used as a basis for future validation. Since I don't know perl, can
someone whip up a patch to checkpatch.pl that validates the emails in
new patches against the ones in mailmap? Then, if you aren't in mailmap,
part of your commit needs to include a new entry for mailmap.
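
The check requested above is asked for as a checkpatch.pl (Perl) patch;
what follows is only a minimal Python sketch of the same idea, not the
checkpatch implementation. It assumes the simple "Name <email>" mailmap
line shape and compares addresses exactly, without case folding. The
function names are hypothetical.

```python
import re

# Assumed ".mailmap" line shape: "Proper Name <email>"; '#' starts a comment.
# Lines in any other shape are ignored by this sketch.
MAILMAP_LINE = re.compile(r"^[^<#]*<([^>]+)>\s*$")
# Lines in a patch that name a person: "From:" and any "...-by:" tag.
PERSON_LINE = re.compile(r"^(?:From|[\w-]+-by):.*<([^>]+)>", re.IGNORECASE)


def load_known_emails(mailmap_text):
    """Return the set of email addresses listed in the mailmap text."""
    known = set()
    for line in mailmap_text.splitlines():
        match = MAILMAP_LINE.match(line.strip())
        if match:
            known.add(match.group(1))
    return known


def unknown_emails(patch_text, known):
    """Return addresses from From:/...-by: lines that are not in `known`."""
    missing = []
    for line in patch_text.splitlines():
        match = PERSON_LINE.match(line.strip())
        if match and match.group(1) not in known and match.group(1) not in missing:
            missing.append(match.group(1))
    return missing
```

A non-empty result from unknown_emails() would be the cue to ask the
submitter for a new mailmap entry.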

Another useful script would take the output of "git log | grep ^Author
| sort -u" and diff the list of email addresses against the mailmap
file.  Any new emails found are new people that need to be added to
mailmap. Only the emails should be checked, not the names.
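
A rough Python sketch of that second script, assuming the Author lines
and the mailmap lines have already been read into lists of strings.
Only the text between angle brackets is compared, never the names, and
the helper names are made up for illustration.

```python
import re

EMAIL = re.compile(r"<([^>]+)>")


def emails_in(lines):
    """Collect every <email> from an iterable of lines, skipping '#' comments."""
    found = set()
    for line in lines:
        if not line.lstrip().startswith("#"):
            found.update(EMAIL.findall(line))
    return found


def new_emails(author_lines, mailmap_lines):
    """Addresses that appear in the Author lines but not in the mailmap."""
    return sorted(emails_in(author_lines) - emails_in(mailmap_lines))
```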

Please excuse any errors I made in the clean up process, a large
portion of it was done manually. After the base file is in we can
patch it to fix the errors. For those of you using a dozen aliases,
you might want to order them so that your current email is the last
one in the list. James Bottomley has the most aliases, 13.

PS It's not a diff because it would be too big to post.

-- 
Jon Smirl
jonsmirl@gmail.com
From: Jon Smirl
Date: Monday, July 28, 2008 - 8:20 am

Some stats using the new mailmap file:

Patches, number of developers
1, 1591
2, 484
3, 254
4, 152
5, 118
more, 1242
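
A table like the one above could be derived from per-author patch
counts roughly as follows. This is a sketch only: the input dictionary
and the function name are assumptions, not the script actually used.

```python
from collections import Counter


def bucket_counts(patches_per_author, cap=5):
    """Count authors with 1, 2, ... `cap` patches; the rest go under "more"."""
    buckets = Counter()
    for count in patches_per_author.values():
        buckets[count if count <= cap else "more"] += 1
    return buckets
```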

Top twenty developers:

Linus Torvalds (4350):
Andrew Morton (1789):
Adrian Bunk (1774):
Al Viro (1735):
Ingo Molnar (1393):
Ralf Baechle (1367):
Jeff Garzik (1291):
Takashi Iwai (1196):
Tejun Heo (1092):
Bartlomiej Zolnierkiewicz (1071):
David S. Miller (1069):
Patrick McHardy (1031):
Stephen Hemminger (1017):
Russell King (985):
Andi Kleen (973):
Thomas Gleixner (953):
Alan Cox (822):
Paul Mundt (813):
Dave Miller (787):

Make your own list:
   git shortlog | grep -v '^[[:space:]]' | grep -v '^$' | sort -t "(" -g -k 2 -r


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Adrian Bunk
Date: Monday, July 28, 2008 - 8:35 am

Dave Miller = David S. Miller


cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Jon Smirl
Date: Monday, July 28, 2008 - 8:45 am

Easily fixed, I just missed one of his aliases.
12% of all name/email pairs have errors in them, I didn't catch them all.

Linus Torvalds (4350):
David S. Miller (1856):
Andrew Morton (1789):
Adrian Bunk (1774):
Al Viro (1735):
Ingo Molnar (1393):
Ralf Baechle (1367):
Jeff Garzik (1291):
Takashi Iwai (1196):
Tejun Heo (1092):
Bartlomiej Zolnierkiewicz (1071):
Patrick McHardy (1031):
Stephen Hemminger (1017):
Russell King (985):
Andi Kleen (973):
Thomas Gleixner (953):
Alan Cox (822):
Paul Mundt (813):


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Adrian Bunk
Date: Monday, July 28, 2008 - 8:52 am

I use "cg-log --summary", but there's most likely also some easy way 

There were two:
Dave Miller <davem@davemloft.net>

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Jon Smirl
Date: Monday, July 28, 2008 - 8:58 am

Added that one, now he has 11 aliases.
Two more developers disappear.

Linus Torvalds (4350):
David S. Miller (1858):
Andrew Morton (1789):
Adrian Bunk (1774):
Al Viro (1735):
Ingo Molnar (1393):
Ralf Baechle (1367):
Jeff Garzik (1291):
Takashi Iwai (1196):
Tejun Heo (1092):
Bartlomiej Zolnierkiewicz (1071):
Patrick McHardy (1031):
Stephen Hemminger (1017):
Russell King (985):
Andi Kleen (973):
Thomas Gleixner (953):
Alan Cox (822):
Paul Mundt (813):


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 9:12 am

Andrew was getting credit for Len Brown's patches, new top 20.

Of course the list has errors in it, but they are easy to find and
fix. Once they are fixed they will never change since they reflect the
existing history in the tree.

Let's get the main patch in and then everyone can send in little
patches to fix the errors. I've removed all of the errors I can via
scripts, now it is a manual process to spot the aliases.

Linus Torvalds (4350):
David S. Miller (1858):
Adrian Bunk (1774):
Al Viro (1735):
Ingo Molnar (1393):
Ralf Baechle (1367):
Jeff Garzik (1291):
Andrew Morton (1280):
Takashi Iwai (1196):
Tejun Heo (1092):
Bartlomiej Zolnierkiewicz (1071):
Patrick McHardy (1031):
Stephen Hemminger (1017):
Russell King (985):
Andi Kleen (973):
Thomas Gleixner (953):
Alan Cox (822):
Paul Mundt (813):
Jean Delvare (771):


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Christian Borntraeger
Date: Monday, July 28, 2008 - 10:15 am

git-shortlog --no-merges

You can also avoid the grepping:

git-shortlog -n -s --no-merges
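
For further processing, the summary output can be parsed with a few
lines of Python. This sketch assumes the "count<TAB>name" layout that
the -s option prints; the function names are hypothetical.

```python
def parse_summary(text):
    """Turn 'count<whitespace>name' lines into (count, name) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            rows.append((int(parts[0]), parts[1].strip()))
    return rows


def top(rows, n):
    """Return the n rows with the highest counts."""
    return sorted(rows, key=lambda row: row[0], reverse=True)[:n]
```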
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 10:25 am

Top 20 without the merges.
Apparently Linus really doesn't write code.

  1770	Adrian Bunk
  1735	Al Viro
  1716	David S. Miller
  1367	Ralf Baechle
  1280	Andrew Morton
  1277	Ingo Molnar
  1196	Takashi Iwai
  1091	Tejun Heo
  1071	Bartlomiej Zolnierkiewicz
  1031	Patrick McHardy
  1016	Stephen Hemminger
   969	Andi Kleen
   945	Thomas Gleixner
   904	Russell King
   822	Alan Cox
   809	Paul Mundt
   771	Jean Delvare
   727	Trond Myklebust
   721	Michael Krufky



-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Willy Tarreau
Date: Monday, July 28, 2008 - 10:19 pm

git shortlog --no-merges

Willy

--

From: Adrian Bunk
Date: Monday, July 28, 2008 - 8:45 am

The charset of the names is pretty random - that should be fixed at some 

200 kB would be OK for linux-kernel (AFAIR the current limit
is 400 kB). But to prevent charset problems a compressed attachment 

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Jon Smirl
Date: Monday, July 28, 2008 - 8:55 am

Follow on patches can fix the charset issues, right now they are
simply copied from the log messages.  I've tried to preserve them as

It also saved the mail server from sending out a couple hundred GB of mail.

The main change is including every email in the mailmap and not just
the exceptions. By putting all emails into the file it becomes
possible to use the file for validation. And we need validation, the
current log has a 12% error rate.

I'll send it in patch form to whoever is going to send it upstream.


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Joe Perches
Date: Monday, July 28, 2008 - 9:20 am

Why?

--

From: Jon Smirl
Date: Monday, July 28, 2008 - 9:47 am

So that we can validate against future misspellings of your name and
email address. It's all about doing validation and stopping the
errors. The current log has over 1,000 typos in the names and emails.
Getting rid of the typos makes it possible to generate clean
statistics.

If you're paranoid get a new gmail address for each commit. From
reading the list of authors I'd say we have about 10 paranoid people


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Stefan Richter
Date: Monday, July 28, 2008 - 10:07 am

It actually doesn't.
-- 
Stefan Richter
-=====-==--- -=== ===--
http://arcgraph.de/sr/
--

From: Simon Arlott
Date: Monday, July 28, 2008 - 9:54 am

Just because anyone can grep the kernel log for email addresses [to 
send spam to], doesn't mean that you need to do it for them.

Please read git-shortlog(1) and then remove me from this file because 
it won't change anything.

-- 
Simon Arlott
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 10:05 am

From: Simon Arlott
Date: Monday, July 28, 2008 - 10:10 am

No, I've submitted patches using three email addresses (well, two - one is 

Try running "git shortlog" too, you'll see I only appear once using the 
existing 99-line .mailmap file.

-- 
Simon Arlott
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 10:22 am

That's the whole point of this list. When you submit patches in the
future we can check your name/email against the list and flag it if it
isn't there. That will alert you that you've made a typo.

A later version of this list could separate the valid current
names/addresses from the entries that are fixing typos or that have
old emails. That would improve the validation. But I don't have an
automated way to tell me which alias is the current one. Access to the
current LKML subscriber list would supply the needed info as to which


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Michael Krufky
Date: Monday, July 28, 2008 - 10:58 am

Please don't use LKML subscriptions as the authority of one's preferred
email address.

I, for instance, am *only* subscribed to LKML using my gmail account.
I prefer that nobody ever email my gmail account directly -- I use my
gmail account as a filter -- gmail filters my mails and fwd's specific
mails to my other specific email addresses -- I rarely read gmail
directly, and I am unlikely to ever read an email addressed to my
gmail box.

I favor my "at linuxtv dot org" account for my kernel work, and I hope
that is the email that shows up as primary for me (I would only guess
that I have one or two aliases in this .mailmap file.)

Regards,

Mike
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 11:10 am

I was only going to use it to help decide which alias was the right



-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Simon Arlott
Date: Monday, July 28, 2008 - 11:11 am

Validation of what?
The current alias is the most recent active one.

I'm not subscribed to the LKML with a public address.
Please Cc: me if you submit a patch for this so I can add a Nacked-By: 
and/or clean up the list by removing redundant entries. Even if git-shortlog 
is changed to distinguish between people who share the same name, it would 
still be a list of exceptions rather than everyone.

-- 
Simon Arlott
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 11:19 am

Comparing to LKML subscribers would not be used to generate new email
addresses. It would only help to identify which existing alias is your
current one.  But we don't have to do it that way, I can use the email
of your most recent commit as your current address and you can fix it
if it is wrong. Only email addresses that appear in the kernel log


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Simon Arlott
Date: Monday, July 28, 2008 - 11:34 am

No, all I said was that one is a typo. I did not make it.

-- 
Simon Arlott
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 12:00 pm

Other people aren't perfect, I've found over 1,000 typos in those
names and emails. We need a validation mechanism.

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Theodore Tso
Date: Monday, July 28, 2008 - 1:22 pm

You keep using the word "need"; I do not think it means what you think
it does.  :-)

Seriously, why is it so important?  It's a nice to have, and I
recognize that you've spent a bunch of time on it.  But if the goal is
to get better statistics, and in exchange we forcibly map all Mark
Browns to one e-mail address, and/or force them to all adopt middle
initials (what if there are two Dan Smith's that don't have middle
initials) just for the convenience of your statistics gathering, I
would gently suggest to you that you've forgotten which is the tail,
and which is the dog.

Regards,

						- Ted
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 1:38 pm

There are over 1,000 typos in the logs. No validation being done on
the names/addresses in the logs. Many email addresses aren't
syntactically valid. Why not put some checks in place to try and clean
this up? Signed-off-by is worthless if it is full of garbage.
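
As an illustration of the kind of check meant here, a deliberately
loose syntactic test could look like the sketch below. The pattern is
an assumption chosen for simplicity, not a full RFC 5322 validator, and
it will reject some unusual but legal addresses.

```python
import re

# Loose shape: local part, "@", dotted host name ending in an alphabetic label.
PLAUSIBLE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")


def looks_valid(email):
    """Return True if the whole string has a plausible address shape."""
    return PLAUSIBLE_EMAIL.fullmatch(email) is not None
```

With this pattern an address whose host part ends in "(none)", or
whose local part contains a backslash, is flagged as invalid.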

There are two Mark Browns in the file:
Mark Brown <broonie@opensource.wolfsonmicro.com>
Mark Brown <broonie@sirena.org.uk>

I don't know if these are two different people or one person with two
emails. But the file doesn't force that decision. It's git shortlog
that is combining them.

The file serves two purposes:
1. Map people using multiple email aliases to a single human name. It
can be any name they choose. The existing file already does this, but
the list is not complete.
2. Enumerate all email addresses used in the log so that it is possible
to tell when a new address is encountered. This allows simple validation
to be implemented.

In its current form it doesn't indicate which alias is the
developer's currently active one.
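
The first purpose can be sketched in Python as a grouping of addresses
by name. This assumes every entry has the plain "Name <email>" shape
shown above; the function names are hypothetical. Note that grouping by
name alone merges two different people who share a name, which is the
Mark Brown question raised in this thread.

```python
import re
from collections import defaultdict

# "Name <email>"; lines starting with '#' do not match.
ENTRY = re.compile(r"^\s*([^<#]+?)\s*<([^>]+)>\s*$")


def aliases_by_name(mailmap_text):
    """Group the listed addresses under the name they are mapped to."""
    groups = defaultdict(list)
    for line in mailmap_text.splitlines():
        match = ENTRY.match(line)
        if match:
            groups[match.group(1)].append(match.group(2))
    return dict(groups)


def canonical_name(email, groups):
    """Return the name an address maps to, or None if it is not listed."""
    for name, emails in groups.items():
        if email in emails:
            return name
    return None
```

Counting len(emails) per name gives the alias totals quoted earlier in
the thread.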

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Dave Jones
Date: Monday, July 28, 2008 - 1:46 pm

On Mon, Jul 28, 2008 at 04:22:36PM -0400, Theodore Tso wrote:
 > On Mon, Jul 28, 2008 at 03:00:13PM -0400, Jon Smirl wrote:
 > > Other people aren't perfect, I've found over 1,000 typos in the those
 > > names and emails. We need a validation mechanism.
 > > 
 > 
 > You keep using the word "need"; I do not think it means what you think
 > it does.  :-)
 > 
 > Seriously, why is it so important?  It's a nice to have, and I
 > recognize that you've spent a bunch of time on it.  But if the goal is
 > to get better statistics, and in exchange we forcibly map all Mark
 > Browns to one e-mail address, and/or force them to all adopt middle
 > initials (what if there are two Dan Smith's that don't have middle
 > initials) just for the convenience of your statistics gathering, I
 > would gently suggest to you that you've forgotten which is the tail,
 > and which is the dog.

I'm beginning to question just how useful the continued measuring
of things like Signed-off-by's is.   Last week at OLS, I overheard
a conversation where someone was talking about the "top 10" lists
that Greg has been talking about at various conferences.
The conversation went along the lines of "my manager really wants
to see us on that list, at any cost".
Whilst the naive may think 'more patches == more better', this isn't
necessarily the case given we have nowhere near enough review bandwidth
*now*, and flooding with a zillion trivial patches really isn't going
to make that job any easier.

Getting patches into the tree is easy, we've proven that.
As things stand now, it's also fairly easy to 'game' the system
by committing something in 10 changesets when it could be done
just as easily in 2-3.

How about we start measuring things that actually matter, like..

"How many patches were reviewed before they went in"
"How many patches were directly responsible for a bug"
"How many patches actually fixed something anyone cares about"
"How many patches are responsible for just 'churn'"

	Dave

-- ...
From: Randy Dunlap
Date: Monday, July 28, 2008 - 2:14 pm

It would be Good if we could give more value to Reviewed-by: tag lines also...

IOW, we "need" to do this.  :)


---
~Randy
Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
http://linuxplumbersconf.org/
--

From: James Morris
Date: Monday, July 28, 2008 - 3:01 pm

Also, Tested-by:, to encourage and recognize people who may not be 
confident in reviewing code to at least test it, which is immensely 
useful if done thoughtfully.

  "Measuring programming progress by lines of code is like measuring 
   aircraft building progress by weight."

If you know who said this, award yourself a cookie :-)


- James
-- 
James Morris
<jmorris@namei.org>
--

From: Paul Mundt
Date: Monday, July 28, 2008 - 4:41 pm

Or just filter on "-by:", which seems to get anything relevant, including
people that shamelessly make up their own tags. Converting something from
a Cc: to a *-by: requires manual effort at least, which
ought to be sufficient for recognition.
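
Filtering on "-by:" could be sketched in Python like this, assuming the
commit messages have been dumped to text (for example from git log).
The function name is made up for illustration; Cc: lines are ignored
because they do not end in "-by".

```python
import re
from collections import Counter

# Any tag that ends in "-by", e.g. Signed-off-by, Acked-by, Tested-by.
BY_TAG = re.compile(r"^([A-Za-z][A-Za-z-]*-by):\s*(.+)$", re.IGNORECASE)


def count_by_tags(log_text):
    """Count (tag, person) pairs for every '*-by:' line in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = BY_TAG.match(line.strip())
        if match:
            counts[(match.group(1).lower(), match.group(2))] += 1
    return counts
```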

If someone was really bored they could probably make a table of tags with
various points to try and balance things slightly more objectively.
Though it seems we now at least have totally different metrics on LWN,
for the kernel summit selection process, and Jon's new script. ;-)

Trying to map all of the names seems pretty pointless though, most
regular contributors contribute in a fairly consistent and sane manner,
with the odd mismatch or typo here or there. It might make sense for
anyone where there's a significant difference, but those are going to be
corner cases.
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 5:14 pm

12% of the name/email pairs are messed up. It's not all simple typos.
There is significant mangling of non-ASCII charsets by people's tools
in the maintainer's chain of processing. Half of the time I don't
believe what the author is submitting is what is ending up in the log
due to mangling. It's a larger source of noise than typos.

All of these variations on email names are in the log. Humans can
identify these problems, it is much harder for a machine.

For example, where are these backslashes coming from?
Auke-Jan H Kok <auke-jan.h.kok@intel.com>
Auke-Jan H Kok <auke\-jan.h.kok@intel.com>
Auke-Jan H Kok <auke\\-jan.h.kok@intel.com>
Auke-Jan H Kok <auke\\\-jan.h.kok@intel.com>
Auke-Jan H Kok <sofar@foo-projects.org>

Are the tools case sensitive or insensitive on email addresses? Some
are, some aren't, so I need these cases...
Al Viro <viro@zeniv.linux.org.uk>
Al Viro <viro@zenIV.linux.org.uk>
Al Viro <viro@ZenIV.linux.org.uk>
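
Two of the variations above, stray backslashes and differing case in
the host part, could be folded before comparison. The sketch below is
one possible normalization, not what the mailmap file does (the file
lists each variant explicitly). It lowercases only the domain, on the
assumption that the local part may be case sensitive.

```python
def normalize_email(email):
    """Drop backslashes and lowercase the domain part of an address."""
    cleaned = email.replace("\\", "")
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        # No "@" at all: return the cleaned string unchanged.
        return cleaned
    return local + "@" + domain.lower()
```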

Another problem is internal machine names...
David S. Miller <davem@sunset.davemloft.net>
David S. Miller <davem@davemloft.net>
David S. Miller <davem@huronp11.davemloft.net>
David S. Miller <davem@hutch.davemloft.net>
David S. Miller <davem@bnsf.davemloft.net>
David S. Miller <davem@t1000.davemloft.net>
David S. Miller <davem@ultra5.davemloft.net>
David S. Miller <davem@goma.davemloft.net>

Or varying the email name...
Alexey Starikovskiy <alexey.y.starikovskiy@intel.com>
Alexey Starikovskiy <alexey_y_starikovskiy@linux.intel.com>
Alexey Starikovskiy <alexey.y.starikovskiy@linux.intel.com>

Why do these all end in (none)?
Craig Hughes <craig@com.rmk.(none)>
Dave Neuer <dneuer@org.rmk.(none)>
David Brownell <david-b@net.rmk.(none)>
David Woodhouse <dwmw2@org.rmk.(none)>
Deepak Saxena <dsaxena@net.rmk.(none)>
Enrico Scholz <enrico.scholz@de.rmk.(none)>

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Rene Herman
Date: Monday, July 28, 2008 - 5:29 pm

Because rmk rewrites addresses to comply with privacy laws. Another good 
example of why this nonsense of yours is exactly that.

I checked and am personally in there three times, once even without any 
valid email address listed. And any time there's anything other than my 
gmail address in some submission it at least recently means that someone 
_else_ took my from: address and stuck it on there and while I don't 
terribly mind that generally, I find it really annoying to see even 
those mistakes harvested into your hugely google-accessible resource.

This is just yet another example of the senseless robotic crap people 
just insist is "needed" and "valuable", but which is neither.

Nonsense it is.

Rene.
--

From: Paul Mundt
Date: Monday, July 28, 2008 - 5:33 pm

Speaking of which, lk-changelog did the same sort of thing back in the BK
days, which was at least useful for generating a pretty short log.
Perhaps it makes more sense to start from that if someone really wants to
waste their time on this. I'm still not sure what the point is though.
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 5:50 pm

The emails in the list are extracted from the commit log. I did not
touch the emails. If your email is in there wrong it is in a log
message wrong. That doesn't necessarily mean you are the person who
put it into the log wrong, patches can get mangled when being passed
along the maintainer chain. The point of this file is to turn the
mistake back into something useful. Think of these as reverse
mappings, they convert errors back to usable names.

As for privacy, if you don't want your email address in a file like
this don't put it into a GPL'd public project. Generate a random name
and email for each patch you submit. Of course I'm having trouble with
a Signed-off-by: that can't be turned back into a person.
Signed-off-by is there to track the responsibility chain for a patch


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Rene Herman
Date: Tuesday, July 29, 2008 - 3:39 am

Like I told you, I don't. Others do. And while that's not a huge issue 
in itself, you harvesting it into your nicely formatted google and 
spam-base MAKES it an issue. Just stop this crap. Be away.

Rene.
--

From: Jon Smirl
Date: Tuesday, July 29, 2008 - 6:44 am

Google got the list the second it was mailed on LKML.  Why haven't you
told Google to remove the 1,054 pages that contain your email?

http://www.google.com/support/webmasters/bin/answer.py?answer=508&topic=13511

If you really want to spam kernel developers there is a much easier
way, just send the message to LKML.


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Pekka Enberg
Date: Tuesday, July 29, 2008 - 7:22 am

Hi Jon,


Why does any of this matter? Rene asked you to drop his email from
your list and refusing to do so is somewhat rude, isn't it?

                                    Pekka
--

From: Jon Smirl
Date: Tuesday, July 29, 2008 - 7:27 am

Rene used his email in the immutable log of a public GPL'd project.
It has become part of the public domain and can't be removed.  So new
users of the log are supposed to start editing history to remove
actions from the past?

If you want your email kept private don't use it to submit patches to
a GPL'd project.

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Pekka Enberg
Date: Tuesday, July 29, 2008 - 7:34 am

Hi Jon,


OK, I'm not interested in arguing about this. I just don't understand
what you're trying to accomplish with pissing off kernel contributors,
that's all. (Not that I'm happy about being on your list either, I
just don't care enough to argue it.)

                                   Pekka
--

From: Rene Herman
Date: Tuesday, July 29, 2008 - 7:40 am

Jon, fuck off. I told you three times now -- I DO NOT, OTHERS DO. And it 
is only your bureaucrat attitude which is turning it into a problem. Go 
apply for a job at IBM if you love IT bureaucracy.

Rene.
--

From: Rene Herman
Date: Tuesday, July 29, 2008 - 7:34 am

Right, so you say that google got it the first time you fucked it up. 
How exactly do you consider that to be a reason for continuing to fuck 
it up and putting it in a few hundred nicely fully indexed linux kernel 
trees out there on the web making the fuck up rank at number 1 in the 
results?

Now fortunately, from the discussion it seems that most sensible people 
will be ignoring you anyway so I guess I can and should stop bothering 
with this but please...

That which is not white is not black and my keyaccess.nl address being 
public already anyway is NOT the same as it being veryveryvery public.

Rene.
--

From: Adrian Bunk
Date: Wednesday, July 30, 2008 - 12:24 am

Whether Jon's patch is a good idea one might discuss, but as soon as 
someone puts an email address into a kernel commit Google will anyway 
find it:

The ChangeLog-* files at http://ftp.kernel.org/pub/linux/kernel/v2.6/ 
also contain all addresses in Jon's list, and Google harvests them.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Stefan Richter
Date: Wednesday, July 30, 2008 - 1:37 am

There isn't a lot to discuss.  From a purely technical standpoint,
duplicating SCM metadata into a source file and aiming to be

This doesn't justify what Jon did though.

Jon created a new database out of formerly disparate datasets, even
though we didn't provide him these datasets for this purpose.  The fact
that the means to create this database are rather trivial and cheap does
not mean that we implicitly agreed to what he did or that it wouldn't
matter whether we agree to it or not.

Jon even suggested that his database is then used to combine with
further databases (bugzilla accounts, mailinglist archives).  Again, the
fact that something like this is possible without great difficulties
doesn't make it right.
-- 
Stefan Richter
-=====-==--- -=== ====-
http://arcgraph.de/sr/
--

From: Adrian Bunk
Date: Wednesday, July 30, 2008 - 5:46 am

You certified:

  I understand and agree that this project and the contribution
  are public and that a record of the contribution (including all
  personal information I submit with it, including my sign-off) is
  maintained indefinitely and may be redistributed consistent with
  this project or the open source license(s) involved.


And if you think this doesn't cover Jon's patch you should also 
complain to LWN and the Linux Foundation who published data

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Rene Herman
Date: Wednesday, July 30, 2008 - 5:54 am

You only certify anything when _you_ put your address in. Given that 
it's a very common occurrence that not you but _others_ do, this does not 
mean a _single_ thing. Tested-by, Bisected-by, what have you...

But let us leave this discussion be. It's not going anywhere anyway.

Rene.

--

From: Adrian Bunk
Date: Wednesday, July 30, 2008 - 8:32 am

There's one thing where it might actually go further:

You actually have a good point here, and I'm not disagreeing with it.

I've added Linus to the recipients since stuff like e.g. Tested-by or 
Bisected-by tags actually undermine what the DCO 1.1 update should
have accomplished. So if DCO 1.1 (d) is considered to be legally 
required for public indefinite storage of name and email address

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Rene Herman
Date: Wednesday, July 30, 2008 - 9:32 am

I was afraid you'd conclude that... Me, I conclude that we should just 
not do the "harvest all these addresses into a big file" thing that 
might make it more than a theoretical problem...

But that's just me.

Rene.
--

From: Linus Torvalds
Date: Wednesday, July 30, 2008 - 9:56 am

Quite frankly, since the patches are public anyway, and the code is open 
source, I personally think that worry is just silly fear-mongering by 
people who take lawyers not just too seriously, but then think that judges 
and lawyers are too stupid to think for themselves.

We added the lines to the DCO-1.1 because we wanted to make it _obvious_ 
that the legal requirements for the sign-off would never clash with any 
possible insane reading of things, but it was a "dot the i's" kind of 
thing.

The fact is, people who are involved in Linux know it's public. People 
make public bug-reports, and they _expect_ to get attributed. I think any 
worries about indefinite storage should be the other way around: we should 
strive to make sure that the attributions are consistent and correct.

If somebody really doesn't want their name and email known, they can say 
that. We won't accept patches from them, but it's certainly no problem to 
suppress "tested-by" etc things on request. Not that I have ever seen such 
a request that I can remember, nor do I really expect to ever see one 
(unless it's as a perverse reaction to this email where people just want 
to be silly).

Anyway, normal people talking about obscure and insane readings of some 
random law is stupid. You should worry about "doing the right thing", not 
about trying to read law as if it was some mindless machine that acted 
like the computers you're used to.

Let's face it, _everybody_ breaks laws if you think about them as some 
inflexible and absolute rules. Probably every day.

You roll through a STOP-sign (in California, it was almost as if that's 
what the sign _meant_). Maybe you take a shortcut when crossing the street 
and you don't walk _exactly_ on the zebra-crossing (or against a red light 
just because there were obviously no cars within _miles_ of you). Maybe 
you drive 58mph in a 55 zone. Maybe you walk around and spit out the 
cherry-pits on the street rather than in a garbage can.

Only insane ...
From: Rene Herman
Date: Wednesday, July 30, 2008 - 12:41 pm

The problem here is just the _scale_ of publicness. Yes, Adrian's worry 
can be shrugged off I'd say but this thread is about Jon Smirl collecting 
addresses into a hugely public (because in tree) and hugely accessible 
format and while your statement above might be true for 95% of cases 
(99, I don't care) the use of people's personalia is just something you 
cannot decide on yourself ever. It's theirs.

I'm in this thread because the from address on this message is in Jon's 
file and while I've used it myself in the past, any time it's been part 
of some Fooed-by tag recently it's because someone else put it there. 
While it's the best address I have for these uses (and so I still use 
it) it shouldn't work anymore even today, so I've been careful to put a 
future proof relay address in when I advertise a contact myself.

As said before, I'm also not going to whine about it when others do put 
it in because they shouldn't need to concern themselves with my odd 
needs and wants and it's not a real problem anyway as long as the future 
proof one is much _more_ public. I am, therefore, just not glad that 
it's now being put into a file in the root of your highly publicized 
tree of files.

Just a silly example, I know, but it doesn't really matter -- even if 
someone tells me he fears cosmic channeling will get the better of him 
if his personalia are in some resource I maintain, I jump to attention, 
salute, shout "SIR YES SIR!" and remove it. It's his.

So now for example I'm debugging a problem with an ALSA driver with a 
few users at least one of which has used different email addresses 
during it and if I'm going to attribute any of their testing and effort, 
I'm going to have to ask for permission and which address was meant to 
be the public one. And sure, sure, I'd probably do that even today 
anyway but right now it's mostly a principled thing while with the 
addresses in the tree I'd sort of insist that anyone would, what with 
them being top google hits for ever ...
From: Ray Lee
Date: Wednesday, July 30, 2008 - 12:47 pm

Er, what? Are you saying that a mailmap file inside a .gz or .bz2 or a
git repository is *more* public than a mailing list? or the already
existing gitweb history of the main tree?

I've noticed correlated (lagged) spikes in my spam volume to the email
address I use for this list whenever I post from it, so please
consider that you are perhaps being penny-wise and pound-foolish here.
--

From: Rene Herman
Date: Wednesday, July 30, 2008 - 1:00 pm

Inside a .gz or .bz2? But yes, definitely. Have you ever noticed exactly 
how many fully indexed linux source trees there are out there on the 
web? And how not any mailinglist archive does _not_ take the trouble to 

I'm not talking about spam. Spammers will get anything that's not 
private. As said, I'm talking about scale of publicness.

Rene.

--

From: Rene Herman
Date: Wednesday, July 30, 2008 - 1:24 pm

This, Jon, by the way also suggests something which I would consider 
much better; _keep_ it as SCM metadata in some less-accessible format 
under .git/

I doubt anyone's going to come up with an objection then. It's already 
in that exact same spot after all.

Rene.
--

From: Jon Smirl
Date: Wednesday, July 30, 2008 - 1:27 pm

The decision to keep it in .mailmap format was made before I was
involved. A smaller .mailmap has been in the tree since 2007. Existing


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Stefan Richter
Date: Wednesday, July 30, 2008 - 1:38 pm

You were talking about new uses all the time, hence new tools.
-- 
Stefan Richter
-=====-==--- -=== ====-
http://arcgraph.de/sr/
--

From: Stefan Richter
Date: Wednesday, July 30, 2008 - 10:05 am

[...]

Last time when I read a discussion about these tags, people at least at 
this side of the pond seemed to come to the conclusion that we ask 
testers for their consent if in doubt, before adding such a tag if they 
didn't do so themselves.  (That's different from how we handle Acked-by: 
from fellow developers which we often imply from an informally given OK.)
-- 
Stefan Richter
-=====-==--- -=== ====-
http://arcgraph.de/sr/
--

From: Stefan Richter
Date: Wednesday, July 30, 2008 - 8:02 am

Yes.

Copyright doesn't have a lot to do with personality rights though.
And then there is also ethics besides laws.
-- 
Stefan Richter
-=====-==--- -=== ====-
http://arcgraph.de/sr/
--

From: David Schwartz
Date: Wednesday, July 30, 2008 - 8:04 am

No, but that all the submissions were made under the GPL, whose explicit purpose is to allow information to be changed, processed, and reused for other purposes does.

If you don't want your submissions to be in the public record for all eternity to be used for any lawful purpose, don't make them to a GPL project.

You have no right whatsoever to look at how one person chooses to use them and say "I didn't agree to that". Yes, you did. You gave up the right to approve or reject each use when you made the submission. If you don't like it, submit under some other license.

DS


--

From: Stefan Richter
Date: Wednesday, July 30, 2008 - 8:09 am

GPL is merely about copyright.
-- 
Stefan Richter
-=====-==--- -=== ====-
http://arcgraph.de/sr/
--

From: Alan Cox
Date: Wednesday, July 30, 2008 - 7:53 am

> No, but that all the submissions were made under the GPL, whose explicit purpose is to allow information to be changed, processed, and reused for other purposes does.

So why hasn't Jon included a copy of the GPL and the sources with his new


Disagree - firstly national law trumps licences, secondly there is the
(regrettably increasingly) small matter of manners.

Alan
--

From: Jon Smirl
Date: Wednesday, July 30, 2008 - 8:23 am

Bug, obviously the file needs it, it is derived from GPL'd files.
Please edit your local copy, no need to send another couple hundred GB

By making a submission to a GPL'd project didn't you grant a license
for your data to be used? That was Ted's point when he posted the


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Alan Cox
Date: Wednesday, July 30, 2008 - 8:14 am

Data protection law trumps the GPL. The fact my address is public does
not give you the rights globally to process it.


--

From: Jon Smirl
Date: Wednesday, July 30, 2008 - 8:51 am

There are a lot of companies (including Google's code database)
indexing the kernel source and processing it into new form. What is
their standing?

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Alan Cox
Date: Wednesday, July 30, 2008 - 8:44 am

That would depend on their location and activities.

Alan
--

From: Theodore Tso
Date: Wednesday, July 30, 2008 - 9:40 am

Yes, but when you submit patches using the required Signed-off-by:,
you are agreeing to the following from the Developer's Certificate
of Origin:

	(d) I understand and agree that this project and the contribution
	    are public and that a record of the contribution (including all
	    personal information I submit with it, including my sign-off) is
	    maintained indefinitely and may be redistributed consistent with
	    this project or the open source license(s) involved.

How this interacts with Europe's Data Protection Law, and whether
correcting spelling errors in e-mail addresses to make it easier to
canonicalize the list is consistent with what is allowed by the GPL,
Europe's Data Protection Law, and the permission given when a
developer signs the DCO, is probably not worth debating on LKML.
Let someone file a complaint with the EU who can try to arrest Jon
Smirl the next time he enters Europe, or get him extradited to the
Hague for violations against international law if they really think
they can justify it, argue about whether they can do so in other
forums; if you put a beer in my hand, maybe I'd even be willing to
debate it in a bar at some future conference.  But does it really make
sense to argue about it here?

						- Ted

--

From: Alan Cox
Date: Wednesday, July 30, 2008 - 9:49 am

No but perhaps Jon could simply show some manners when people request him
politely not to do that. He doesn't seem to want to debate manners, just
law.

Alan
--

From: Jon Smirl
Date: Wednesday, July 30, 2008 - 11:58 am

I didn't handle the removing the people from the list issue very well.
I got caught in the fact the log immutably records history and I
objected to editing history. I viewed this along the lines of George
Washington asking to be removed from text books. He was the first
president; we can't change that and we have to include him in the list
of presidents.

I still don't have a good solution for how to track the people who
don't want their names to appear without creating yet another list.

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Simon Arlott
Date: Wednesday, July 30, 2008 - 2:05 pm

It's really simple - if there is actually some benefit to them being 
in .mailmap (because their name is spelled differently in some commits), 
then contact them. Otherwise, you can assume they don't want to be on it.

That should cut down the number of people considerably... I can only see 
one example in the A section (Auke Kok).

-- 
Simon Arlott
--

From: David Schwartz
Date: Wednesday, July 30, 2008 - 11:03 am

I would assume that was either an error or because he believes the
information contains insufficient creative content to be covered by
copyright. It seems to me more like a functional, factual report. But I can

No. But the GPL can be used to show the intent to consent to the use of the
information by others. I don't know the data protection law in your country,

I think it's terribly bad manners to submit something to a GPL project and
then complain when someone else uses it the way they want to. If you want
the benefit of using and modifying GPL software, you have to let others do
what they want with your contributions. If you don't find that deal fair,
don't make it. But then don't make the deal and then claim others are being
rude when they take what the deal gave them.

As for GPL only being about copyright, I don't think that's true. The GPL is
a copyright license. It grants you rights that the author would otherwise
hold exclusively under copyright. But it doesn't follow that the rights you
give up are only rights under copyright. See, for example, section 7.

If you want the rights GPL grants you under copyright, you have to give up
certain things, and not just copyright. One of the things you have to give
up is *any* legal mechanism that would permit you to restrict other people's
GPL rights.

GPL section 6 clearly prohibits you from using any data protection laws in
your jurisdiction to prevent someone else from modifying and redistributing
information you submitted under the GPL.

The "GPL only affects copyright" argument would mean that I could
redistribute modified GPL'd work with an EULA. Obviously, I can't do that.
Enforcing data protection laws to restrict rights granted under the GPL is
no different from enforcing an EULA to do the same thing.

DS


--

From: Stefan Richter
Date: Wednesday, July 30, 2008 - 12:49 pm

It's not the same thing, by far.  EULA = "end user license agreement"; 
while "data protection law" is... law.  Obviously, licenses (contracts) 
must not be unlawful.

PS:  You may use a GPL'd program any way you want --- although not for 
unlawful purposes.  But that's not a matter between you and the 
copyright holder, it's between you and the law.

PPS:  SCM metadata are not part of the program.  The DCoO states that 
the personal data submitted along with the contribution may be 
redistributed "consistent... with the open source license(s) involved", 
but it isn't discussed whether other terms of the licenses, notably 
those on modification and derivatives, apply to the data supplied for 
the certificate of origin.
-- 
Stefan Richter
-=====-==--- -=== ====-
http://arcgraph.de/sr/
--

From: David Schwartz
Date: Wednesday, July 30, 2008 - 3:25 pm

I'll try to make this my last response on the issue, if possible.


An EULA itself is not law, but neither would someone's request to be removed
from such a list be itself a law. EULA's operate under law, and so would a
request for data confidentiality. This difference is no difference. Both are
attempts to invoke a law other than copyright to restrict rights guaranteed
by the GPL. You may not use any law or provision to restrict another
person's GPL rights. That's what the GPL says, and it means it.

If a law, any law, permits you to impose restrictions on something the GPL
allows, then you give up the right to use that law in exchange for the
license the GPL grants. This obviously applies to your copyright in
derivative works. But it would also apply to any attempt to use any law to
encumber GPL rights.

As the GPL states, the license grants you permission to copy, distribute,
and/or modify the covered work. This is against any rights the authors might

Precisely. And others who wish to exercise rights under the GPL forfeit any
legal mechanism (whether copyright, DMCA, contract, data privacy laws, or
whatever theory) to impose "further restrictions" on those who wish to
similarly use GPL works.

Copyright is the carrot the GPL uses to get you to agree to the stick. The
stick is no "further restrictions" of any kind, imposed by any law.
Obviously, you aren't responsible for an operation of law over which you
have no control. But you cannot invoke copyright -- or any other law -- to
restrict someone else's exercise of rights granted by the GPL. You get

When you submit a unit to a GPL project, you place that unit under the GPL.
That is what the DCoO is trying to say. There cannot be some things that
some parts of the GPL apply to and some don't. There is no "sort of GPL,
sort of not" that applies to some parts of some submissions. If something is
part of or all of a submission made under the GPL, then all of the GPL
applies to it.

DS


--

From: Alan Cox
Date: Wednesday, July 30, 2008 - 3:32 pm

I don't know where you get that particular idea from. Try sending GPL code
from the USA to Cuba. Seems the US government is using GPL code but

Some rights in laws are absolute. I cannot "give up" my right to be
identified as the author of a work I create in many countries. Its an

The metadata licensing isn't clear in my view.

I think what you are more likely to get sensible results with is arguing
estoppel ? That was always the intent of that DCO wording. To ensure that
rights or otherwise you couldn't turn around and say "hey you published
my name and I didn't expect that implied by my actions".

However publishing a name and performing data processing on personal data
databases for other purposes is not the same thing at least in some
jurisdictions. In the EU you collect data "for a purpose".

Alan
--

From: David Schwartz
Date: Wednesday, July 30, 2008 - 4:33 pm

I'm not sure how you think this is relevant. I could go to the effort of
explaining in detail why it's irrelevant, but I can't imagine you intended
this comment as a genuine response in good faith.

For one thing, even if this was a violation of the GPL, there would be no
recourse. The only conceivable recourse would be a suit by an author for
copyright infringement. The government has sovereign immunity against such a
claim. The government is immune because it has sovereign immunity. Jon is
immune because the GPL grants him the right he is exercising. (Of course, it
can't make him immune from any laws he violates, but my argument is that

Yes, but you can give up your right to pursue that right. And certainly some
terms of the GPL might be unenforceable in some jurisdictions. But the GPL
says Jon can do what he's doing, and it means what it says. As I said, I
don't know the data privacy laws in your jurisdiction, but I do know the GPL
made you give up your right to use them to impose restrictions on Jon's
imposition of his GPL rights.

You may or may not be able to stop some operation of law from happening. You
are not responsible for things outside your control. And some jurisdictions

Perhaps you can invent some other meaning it might have and then claim it's
unclear because it can mean that. But I don't think it matters. The GPL is
really what matters here, at least in my opinion. The GPL is clearly all of
a piece -- it either applies to something or it doesn't. And if you want to
argue that people must parse GPL submissions to figure out what's really
covered by the GPL and what's not, you can certainly argue that. I find that

GPL submissions are for the purposes specified in the GPL -- so that other
people may freely redistribute, copy, and modify them. You forfeit the right
to claim you made GPL submissions "for a purpose" as the GPL specifically
requires you to consent to their use for any purpose (save those the GPL
itself prohibits, of course).

DS


--

From: Kyle Moffett
Date: Wednesday, July 30, 2008 - 9:20 pm

STOP!!!  This is seriously just getting silly...

If people *really* care about the privacy of information they placed
in publicly accessible databases via agreement with the DCO, then
there is a workaround:

Instead of a "mailmap" file, use a "mailhash" file like this:

[...lines...]
4db83f457ca750b3ed0bb7db2375cfd41846fb43 Kyle Moffett <kyle@moffetthome.net>
[...more lines...]


That SHA1 checksum is of the name-and-email you are mapping *from* and
the value on the right is the string to replace it with.  For all the
people who don't like their emails being displayed when somebody looks
at logs, you can just get your entries in that file changed to
"anonymous".  Then the people who want useful statistics will ignore
your commits and people who want to look at logs will just use the
newly-added --no-mailhash option to see the
"<jranodmuser@my-typod-domian.cmo>" that you happened to put in the
Signed-off-by.

Alternatively people could realize it's not worth it and just go write
real code or something.

PS.  Just to show how easy it was, I converted the mailmap file that
was sent out into the above mailhash file with a perl one-liner
(WARNING: probably linewrapped):

perl -MDigest::SHA1=sha1_hex -n -e 'chomp; s/\s+/ /g; s/^ //; s/ $//;
print sha1_hex($_)." $_\n";' <mailmap >mailhash

Once you have the mailhash file, to convert from a "Name <email>" to
the actual desired representation you can run:

value="Name <email>"
sha1="$(echo -n "${value}" | sha1sum - | awk '{print $1}')"
line="$(sed -ne "/^${sha1} /{ s/^${sha1} //; p }" mailhash | head -n 1)"

Cheers,
Kyle Moffett
--
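
For readers who find the shell lines above a little terse, here is a minimal Python sketch of the same "mailhash" idea. It is not from Kyle's post: the helper names (sha1_key, build_mailhash, resolve), the in-memory dict and the example.org entries are all made up for illustration, and the whitespace normalisation only loosely mirrors what the perl one-liner does.

```python
import hashlib


def sha1_key(identity):
    """Hex SHA-1 of a 'Name <email>' string after collapsing whitespace."""
    normalised = " ".join(identity.split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def build_mailhash(mapping):
    """Turn {original identity: replacement} into {sha1(original): replacement}."""
    return {sha1_key(original): replacement
            for original, replacement in mapping.items()}


def resolve(identity, mailhash):
    """Return the replacement for an identity, or the identity itself if unlisted."""
    return mailhash.get(sha1_key(identity), identity)


# Hypothetical entries: one typo fix and one opt-out.
mailhash = build_mailhash({
    "J. Random User <jranodmuser@my-typod-domian.cmo>":
        "J. Random User <jrandomuser@example.org>",
    "Shy Developer <shy@example.org>": "anonymous",
})

print(resolve("Shy Developer <shy@example.org>", mailhash))  # anonymous
```

The file then holds only hashes and replacement strings, so an opted-out address is not spelled out in it. An identity that is not listed falls through unchanged.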

From: Stefan Richter
Date: Thursday, July 31, 2008 - 12:01 am

Not.  See below.  Also remember that there are sometimes tags added to 
the changelogs without having ensured that the respective person agrees 

The metadata (authorship, committership, changelog including sign-off 
tags...) are not part of the submitted program source code.

The fact that I agreed to have aspects of my participation in the open 
source project documented in the SCM does not imply an agreement that 
these data may be copied into databases which serve other purposes.
-- 
Stefan Richter
-=====-==--- -=== =====
http://arcgraph.de/sr/
--

From: Willy Tarreau
Date: Thursday, July 31, 2008 - 1:59 pm

And in some countries (at least France) you need to declare the existence
of a list you constitute from personal data (names, addresses, etc...)
and the persons referenced in your list are always granted a right to
be removed upon a simple request. This right is scrupulously respected,
and I can say that I've successfully used it many times to be removed
from advertisers' lists.

Anyway, I don't care much about Jon's list right now.

Willy

--

From: Jon Smirl
Date: Wednesday, July 30, 2008 - 8:08 am

I noticed that the log was full of errors and thought that it might be
nice to have a mechanism to correct them. Since the log is immutable,
error correction needs to be external. It is a different discussion as
to whether we should try and fix the errors in the log.

Assuming that we wanted the data clean I came up with this solution.
Maybe there is a better way.

Kernel log is immutable.
Kernel log contains about 1,000 errors of various classes.
.mailmap file format was preexisting, it maps email addresses to
people's names. It can be used to map the other direction, but none of
the kernel tools use it that way.

I observed that the unique key in the log is the email address, but
many of those email keys have errors in them. The data item we are
actually interested in is the developer's name.

I then generated a .mailmap file containing all of the unique email
addresses in the log and a guess from the log as to which developer
was associated with the email.

I then used various tools and hand editing to correct the ~1,000
errors and assign the correct developer name to the email in the log.
Correcting all these errors was a lot of work. It exposed the fact that
tools in the maintainer's chain may be the largest source of errors.
Of course the file can be patched as more errors are found.

This new mailmap file now has two types of entries, ones fixing errors
and ones that are just copies of the data from the log.

I chose to leave both types of records in the file to make maintenance
easier. The complete set of email keys from the log is in the mailmap
file. To do maintenance, regenerate the email keys from the log and
diff them against mailmap. Now you only have to inspect the diff for
errors. After the diff is clean, add the new entries to the mailmap.

If you remove entries from the mailmap file they will get flagged in
every maintenance sweep and need to be removed again. Of course this
will lead you to build a list of people who don't want to be in ...
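
The maintenance sweep described above (regenerate the email keys from the log, compare them against mailmap, inspect only what is new) could look roughly like this in Python. This is a sketch, not the tooling Jon used: emails_in and new_emails are hypothetical helpers, the input is assumed to be "Author: Name <email>" lines as printed by git log plus mailmap lines in the simple "Name <email>" form, and only the addresses are compared, never the names.

```python
import re

# Anything between angle brackets is treated as the address.
EMAIL_RE = re.compile(r"<([^<>]*)>")


def emails_in(lines):
    """Collect the address inside <...> from each non-comment line."""
    found = set()
    for line in lines:
        if line.lstrip().startswith("#"):
            continue  # skip comment lines in the mailmap
        match = EMAIL_RE.search(line)
        if match:
            found.add(match.group(1))
    return found


def new_emails(author_lines, mailmap_lines):
    """Addresses that appear in the log but are not yet in the mailmap."""
    return sorted(emails_in(author_lines) - emails_in(mailmap_lines))


# Made-up sample data standing in for real log output and a real mailmap.
author_lines = [
    "Author: A Developer <a.dev@example.org>",
    "Author: A Developer <a.dev@example.org>",
    "Author: New Person <new@example.net>",
]
mailmap_lines = [
    "# hypothetical mailmap",
    "A Developer <a.dev@example.org>",
]

print(new_emails(author_lines, mailmap_lines))  # ['new@example.net']
```

With every known address kept in the file, the result of each sweep is just the handful of addresses nobody has looked at yet.
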
From: Rene Herman
Date: Wednesday, July 30, 2008 - 5:43 am

It will and note this is not a privacy issue "as such" at least for me 
(for rmk rewriting addresses is a privacy issue, directly or via law, 
and whether or not needed in this specific example or not)

Google finds lots of things, most of which do not end up at the top of 
the search results. This address I'm now posting with is definitely 
public (or I wouldn't be posting with it) but given that it shouldn't 
even exist at the moment I have been careful for some time to put a 
relay address into anything which I intend to be long lived.

Since outside its non-existence it's the best address I have available I 
do still use it though. This is not a problem, since all mailing list 
archives go to great trouble to obscure addresses anyway and my gmail 
address will feature as the "most public" from it being in _content_. 
Sometimes others use this address in content as well but given that they 
can't be expected to know about any of my peculiar mail fetishes I'm not 
going to whine about it and it's not a practical problem anyway.

Then Jon comes along, puts _all_ addresses in content inside a hugely 
publicized, widely web-indexed tree and fucks it up.

Anyways... yesterday I had to turn the fan on my monitor to keep it from 
damage in this bloody furnace while today it's some 5 degrees cooler and 
the fan's aimed at me again so I'll stop cursing and shouting now. But 
still a damn bad idea.

Rene.
--

From: Paul Rolland
Date: Wednesday, July 30, 2008 - 5:58 am

Hello,

First, please note that my name and addresses are _in_ the list published
by Jon.

On Wed, 30 Jul 2008 14:43:12 +0200


Sorry, I don't agree. First, because using Google to collect a list of emails
is damn easy, and whether this list is handy or not is not changing for people
using it for Spam.
Second, because it takes just a few seconds to extract it nearly as complete
as Jon's version from git : git log | grep Author: | sort | uniq -c 
gives something very useful : about 5800 emails. So let's not consider Jon is
saving a complicated job for people searching for this list.

Linux is an open project. Everything that's related to it is open, and public.
If you don't want your name/email to be associated with it, that's another
issue. 

We could blame Jon for publishing his list on the list without prior 
information, but not for creating it. 
And I certainly would like to see the .mailmap appear at my next git pull ;)

Regards,
Paul

-- 
Paul Rolland                                E-Mail : rol(at)witbe.net
CTO - Witbe.net SA                          Tel. +33 (0)1 47 67 77 77
Les Collines de l'Arche                     Fax. +33 (0)1 47 67 77 99
F-92057 Paris La Defense                    RIPE : PR12-RIPE

Please no HTML, I'm not a browser - Pas d'HTML, je ne suis pas un navigateur 
"Some people dream of success... while others wake up and work hard at it" 

"I worry about my child and the Internet all the time, even though she's too 
young to have logged on yet. Here's what I worry about. I worry that 10 or 15 
years from now, she will come to me and say 'Daddy, where were you when they 
took freedom of the press away from the Internet?'"
--Mike Godwin, Electronic Frontier Foundation 
--

From: Rene Herman
Date: Wednesday, July 30, 2008 - 6:05 am

*plonk*

Rene
--

From: Simon Arlott
Date: Wednesday, July 30, 2008 - 10:53 am

The fact still remains that most of the entries in Jon's .mailmap file are 
redundant. I may have three email addresses referred to in git, but my name 
is the same in all cases. Until "git shortlog" is changed to link an author's 
commits without using their name there is no need to include me and several 
other people in there. Especially not anyone with only one email address 
in the commit log.

(It would also be easy to sha1 hash all the email addresses in the shortlog 
mailmap, using a hash lookup to find the name, although it would make it 
a bit difficult to find what any one address is without grepping the log.)

The ChangeLogs do include all email addresses contributing to that release, 
but in general I think a big list of them which people can use with no 
effort to spam every developer is a bad thing. On the other hand maybe 
those people don't know what .files are ;)

-- 
Simon Arlott
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 3:08 pm

I didn't do this to measure statistics, I did it because I was writing
a script and the script was getting garbage for input. It just had the

These are good topics for the Plumbers conference. But to ask these
questions we need to get the data into a format where a computer can
process it. Syntax checking, validation, etc are needed on the log
messages. I'm not going to hunt through 100,000 commits trying to
answer these by hand.

Another fun experiment would be to load an archive of LKML, kernel
bugzilla and the kernel source history into git and then try to link
everything together. The cleaner the data is, the easier it will be to
link things. How about a GUI where each patch is annotated with a link
to the email thread discussing it?

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Theodore Tso
Date: Monday, July 28, 2008 - 3:32 pm

Out of curiosity, what is your script trying to do?

       		       	       	      	     	- Ted
--

From: Randy Dunlap
Date: Monday, July 28, 2008 - 3:38 pm

Speaking of missing developers, I'd be more interested in whatever
happened to Michal Piotrowski, Satyam Sharma, et al...


---
~Randy
Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
http://linuxplumbersconf.org/
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 3:52 pm

I was trying to locate my patches in other private trees that were
ready for deletion. I wanted to make sure there wasn't something good
that I had forgotten about. I processed the output from 'git log' and
got tripped up matching the author field because it was full of junk.
My database background kicked in and I found myself on a tangent
cleaning up the data.

I have since learned about the existence of 'git shortlog' which
solved my problem. But I had already cleaned up the data before


-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Stefan Richter
Date: Monday, July 28, 2008 - 3:38 pm

Another fun experiment:  Fetch 10 open bugs in bugzilla which may affect 
your hardware, try to reproduce one of them, fix it.
-- 
Stefan Richter
-=====-==--- -=== ===-=
http://arcgraph.de/sr/
--

From: Nick Piggin
Date: Tuesday, July 29, 2008 - 2:58 am

This is one way of looking at "the problem". The other way to look at
it is that things are merged too quickly / without enough review, etc.

That is the problem kernel maintainers can actually do something about.
Or, they can just whine about "not enough review bandwidth".

There has been this complaining from lots of people about not enough
review bandwidth for quite a few years now. So I doubt it is going to
magically get better by making more noise.

Consider that there is probably virtually limitless amount of crap that
people want to try to merge, so there is always going to be a lack of
review bandwidth if the aim is to merge as much as we possibly can as
fast as we can.

The answer is to not make the problem worse by merging stuff faster
than can be reviewed. When that happens, developers and companies
should eventually assign a higher value to patch review.
--

From: Adrian Bunk
Date: Wednesday, July 30, 2008 - 12:42 am

How do you want to measure such stuff?

And with measuring I'm not talking about estimates but about exact data.

Authorship information was already available in the commits, which is 
why people were able to develop scripts to harvest them.

For getting any meaningful statistics you have to either enforce the 
usage of additional tags in the commits or someone has to work full-time 

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Al Viro
Date: Monday, July 28, 2008 - 6:15 pm

Who's "we", luser, and why would I possibly give a damn for your needs?
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 6:25 pm

Let's drop the whole Signed-off-by mechanism. If we can't be bothered to
clean up the junk in Signed-off-by why should we bother recording
them? Sign every patch Mickey Mouse, it has the same effect.

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Al Viro
Date: Monday, July 28, 2008 - 6:36 pm

That still doesn't answer either of my questions.  As for your question, the
point is to have them good enough to make an individual changeset feasible
to track.
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 7:01 pm

The file lets you convert the mess that exists in the log file xx-by:
fields back into something reasonable. The messed up email addresses
are verbatim extracted from the log. There is one entry in the file
for each email address that appears in the log. The real names have
been fixed by script and hand to correspond a real name with the
extracted emails.

Now we will differ on the definition of feasible and whether we should
work to prevent more messed up emails/names from getting into the log.
That's the central question here, how much are you allowed to
obfuscate (on purpose or accidentally) your identity in an xx-by?

I should also point out that external information (Google) was needed
to identify several hundred names, there was insufficient information
in the log or kernel source. If we have to reconstruct this mapping
ten years from now for some random lawsuit, the external information
may not be there.

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Theodore Tso
Date: Monday, July 28, 2008 - 7:50 pm

Jon,

	The reality is ten years from now, many e-mail addresses won't
be accurate anyway.  We will have to track people down by hand, if it
ever comes down to that.  The signed-off-by needs to be enough so we
can track down someone (very likely only a few set of people); via a
manual method is quite acceptable.  I don't think it is really
necessary to try force fit the signed-off-by just so we can collect
better mode.

       It should also be noted that the Developer's Certificate of
Origin 1.1 has language that was designed to make it legal to collect
the DCO lines even in the European Union.  So what rmk is doing is
strictly speaking not necessary.

							- Ted
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 8:23 pm

The kernel already has a mailmap file, but it is not complete. So I
should just take this work that makes the mailmap file a lot better
and throw it away? The policy is that the log file should be messed up
enough so that a computer can't process it and that a human can
recover it only with several day's effort? That's a really hard line
to define and we'll probably lose the identity of a bunch of
contributors. I'll follow up with a patch that deletes the current
.mailmap

-- 
Jon Smirl
jonsmirl@gmail.com
--

From: Theodore Tso
Date: Monday, July 28, 2008 - 9:13 pm

Personally, I have no objection to the mailmap file as it's on the
whole an improvement; if it's been automatically generated and it
falsely maps multiple people to a single person, that would be highly
unfortunate, but maybe it fixes more problems than it creates.

I think the part most people are seriously objecting to is that the
supposition that Linus and some of his top lieutenants should be
enforcing some arbitrary rule that rejects commits if they come from
addresses outside of your .mailmap file (unless they first send a
patch to add their e-mail address to the .mailmap file), in some kind
of misguided attempt to enforce validation, which apparently the main
justification for which is so that you and others can run some
statistical analysis, of which there seems to be some dispute whether
or not encouraging people to compete to get into the top 20
signed-off-by by splitting up commits into 100 different micro-patches
should be considered a desirable side effect of said statistical
analysis.

As I said earlier, the moment you started advocating enforcing
validation, you may have started to confuse which is the tail and
which is the dog.  People should be supplying patches to improve the
kernel; not to provide accurate fodder for statistical analysis.

						- Ted
--

From: Theodore Tso
Date: Monday, July 28, 2008 - 9:15 pm

Typo correction.  The first part of that sentence should read:

"Personally, I have no objection to the mailmap file IF on the
--

From: Jon Smirl
Date: Monday, July 28, 2008 - 10:05 pm

The mapping multiple people to a single person problem was always
there, the new mailmap file doesn't alter it. There simply isn't
enough information in the kernel source to tell if there are two or
one Mark Browns. The file would need to be extended to encode more
information.

Mark Brown <broonie@opensource.wolfsonmicro.com>
Mark Brown <broonie@sirena.org.uk>

If the Marks want to separate themselves they will need to alter the
mailmap. With the new mailmap this is easily done. With the old one
you would have needed to identify all of the aliases first.
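
To make the same-name problem concrete, here is a small Python sketch of how a name-keyed summary merges two addresses that map to one name, and how editing the names in mailmap separates them again. It is an illustration only: parse_mailmap and commits_by_name are hypothetical helpers, it handles just the simple "Name <email>" mailmap form, it is not how git shortlog itself is implemented, and the "Pat Smith" identities are invented.

```python
import re
from collections import Counter

# Simple mailmap form only: canonical name followed by <email>.
ENTRY_RE = re.compile(r"^\s*(.*?)\s*<([^<>]*)>")


def parse_mailmap(lines):
    """Map each email address to the canonical name given in the mailmap."""
    canonical = {}
    for line in lines:
        if line.lstrip().startswith("#"):
            continue
        match = ENTRY_RE.match(line)
        if match and match.group(1):
            canonical[match.group(2)] = match.group(1)
    return canonical


def commits_by_name(authors, canonical):
    """Count commits per name, replacing the name when the email is mapped."""
    counts = Counter()
    for name, email in authors:
        counts[canonical.get(email, name)] += 1
    return counts


# Invented (name, email) pairs standing in for commit authors.
authors = [
    ("Pat Smith", "pat@work.example"),
    ("Pat Smith", "pat@home.example"),
    ("P. Smith", "pat@home.example"),
]

shared = parse_mailmap([
    "Pat Smith <pat@work.example>",
    "Pat Smith <pat@home.example>",
])
print(dict(commits_by_name(authors, shared)))  # {'Pat Smith': 3}

split = parse_mailmap([
    "Pat Smith (work) <pat@work.example>",
    "Pat Smith (home) <pat@home.example>",
])
print(dict(commits_by_name(authors, split)))
```

Nothing in the first mailmap says whether the two addresses belong to one person or two. Only the people involved can settle that, by changing the names.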


That whole thread was pointless, the scripts for doing validation
don't exist. The stat tools are helpful in finding errors in the
mailmap file. I never cared about the stat results, I already know who
the top developers are. Let's drop the whole validation concept too
since it is obviously upsetting people.

There are two types of entries in the file. Ones that alter the names
associated with an email and ones that don't. You could argue that the
ones that don't alter the names aren't needed. They're in there to
make maintenance on the file easier.

Putting all emails in the file lets you do maintenance by extracting
the complete list of emails from the log and then removing the ones
already in the file. Now you only have to manually check these new
emails. If the unchanged entries were removed from the file they'd get
mixed in with the new emails. Each time you updated mailmap you'd have
a couple thousand emails to check.

Putting the unchanged entries in the file also makes it very easy for
people who want to alter their name entry. Just edit the mailmap file.
Everything is there and sorted by name. Change the name for all of
your aliases to whatever you want. Just make sure the names are all

These addresses have more purposes than statistical analysis. They
also record the responsibility chain of who submitted the patch. It
seems prudent to me that we should make some effort to attempt to keep
that chain in a ...