git clone http://github.com/mirrors/linux-2.6.git
philippp master /tmp/linux-2.6 $ git blame init/main.c
^1da177e (Linus Torvalds 2005-04-16 15:20:36 -0700 1) /*
^1da177e (Linus Torvalds 2005-04-16 15:20:36 -0700 2) * linux/init/main.c
^1da177e (Linus Torvalds 2005-04-16 15:20:36 -0700 3) *
...
We see that the last editor of each line is shown at the start of the line, after the commit hash and before the timestamp of the edit. We will count how many lines belong to each editor, which will tell us who owns which parts of the file. We will use the cut command to isolate the name from the rest of the line. Once we have our list of names, we will use sort and uniq to tally the contributions.
The cut command isolates columns of data on arbitrary delimiters, allowing us to split our lines on any symbol we choose. We will take advantage of the parentheses around the name and timestamp. The git annotation is separated from the code by splitting on the closing parenthesis ")" and taking the first column: options -d ")" and -f 1. After this, we split on the opening parenthesis "(" and keep the second column with -d "(" and -f 2 to retain the name and timestamp.
git blame init/main.c | cut -d ")" -f 1 | cut -d "(" -f 2
Linus Torvalds 2005-04-16 15:20:36 -0700 1
Linus Torvalds 2005-04-16 15:20:36 -0700 2
Linus Torvalds 2005-04-16 15:20:36 -0700 3
...
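To see what each cut contributes, the two steps can be tried on a single line without a repository. This is a small sketch that uses the first blame line shown above as a fixed string; the variable names are only for illustration.

```shell
# One line in git blame's default format, copied from the output above.
line='^1da177e (Linus Torvalds 2005-04-16 15:20:36 -0700 1) /*'

# Step 1: keep everything before the first ")".
before_paren=$(printf '%s\n' "$line" | cut -d ")" -f 1)

# Step 2: of that, keep what follows the first "(".
annotation=$(printf '%s\n' "$before_paren" | cut -d "(" -f 2)

printf '%s\n' "$annotation"
```

The result is the name, the timestamp and the line number, which is what the next step works on.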
We now have to be a little bit creative to isolate the name from the timestamp: splitting on spaces is not going to work, since names can be any number of words and can have an arbitrary number of spaces. We will take advantage of the fact that years between 1000 and 2999 start with a 1 or a 2: we will simply cut columns on the character '2' and take the first column (if '2' does not occur, the whole line is the first column), then cut on '1' and again keep the first column. This shortcut assumes that no name contains either digit; such a name would be truncated.
git blame init/main.c | cut -d ")" -f 1 | cut -d "(" -f 2 | cut -d "2" -f 1 | cut -d "1" -f 1
Linus Torvalds
Linus Torvalds
Linus Torvalds
...
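The digit trick has one weak spot: a name that itself contains a 1 or a 2. The sketch below compares it with an alternative that is not part of the original pipeline, namely a sed expression that strips everything from the YYYY-MM-DD date onwards. Both input lines are sample data, and the second author name (dev42) is hypothetical.

```shell
# Annotations as produced by the two cut commands above (sample data).
annotations='Linus Torvalds 2005-04-16 15:20:36 -0700 1
dev42          2009-01-05 12:00:00 +0100 2'

# Original approach: cut at the first "2", then at the first "1".
by_cut=$(printf '%s\n' "$annotations" | cut -d "2" -f 1 | cut -d "1" -f 1)

# Alternative: delete the spaces before the date and everything after them.
by_sed=$(printf '%s\n' "$annotations" \
    | sed -E 's/ +[0-9]{4}-[0-9]{2}-[0-9]{2} .*$//')

printf 'cut: [%s]\n' "$by_cut"
printf 'sed: [%s]\n' "$by_sed"
```

With cut, the hypothetical name is shortened to "dev4"; with sed it survives intact, and the trailing padding after the name is removed as well.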
Great, so now we just need to count each editor's lines and we're good to go! The uniq tool will squash duplicate lines into one, and its -c option prefixes every remaining line with the number of times it occurred. Trying the obvious leads to a non-obvious result, though:
git blame init/main.c | cut -d ")" -f 1 | cut -d "(" -f 2 | cut -d "2" -f 1 | cut -d "1" -f 1 | uniq -c
16 Linus Torvalds
1 Ingo Molnar
8 Linus Torvalds
1 Len Brown
...
Why does Linus Torvalds appear twice here? Uniq only considers equivalent lines in direct sequence to be duplicates -- snaps! Luckily we can sort to get our names into groups of the same name, and then invoke uniq to get a total count.
git blame init/main.c | cut -d ")" -f 1 | cut -d "(" -f 2 | cut -d "2" -f 1 | cut -d "1" -f 1 | sort | uniq -c
4 Adrian Bunk
2 Akinobu Mita
2 Alex Riesen
22 Alon Bar-Lev
...
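The difference between the two commands is easy to reproduce on three lines of sample input. This is only a sketch; the names are taken from the output above, but their order is made up.

```shell
# The same name appears twice, but not on adjacent lines.
names='Linus Torvalds
Ingo Molnar
Linus Torvalds'

# Without sort, uniq sees no adjacent duplicates and reports three groups.
groups_unsorted=$(printf '%s\n' "$names" | uniq -c | wc -l | tr -d ' ')

# After sort, the duplicates are adjacent and collapse into two groups.
groups_sorted=$(printf '%s\n' "$names" | sort | uniq -c | wc -l | tr -d ' ')

echo "unsorted: $groups_unsorted groups, sorted: $groups_sorted groups"
```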
Great! Now we call sort for a final time, this time with a numeric and reverse sort (options -n and -r) to show us the count from highest to lowest:
git blame init/main.c | cut -d ")" -f 1 | cut -d "(" -f 2 | cut -d "2" -f 1 | cut -d "1" -f 1 | sort | uniq -c | sort -n -r
436 Linus Torvalds
50 Ingo Molnar
46 Vivek Goyal
27 Pekka Enberg
...
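Since the sort, uniq, sort tail of the pipeline does not depend on git at all, it can be wrapped in a small shell function and reused. The function name tally is hypothetical, and the input below is made-up sample data.

```shell
# tally: count identical input lines, most frequent first (hypothetical helper).
tally() {
    sort | uniq -c | sort -n -r
}

# Sample input standing in for the list of names; strip uniq's left padding.
top=$(printf '%s\n' 'Ingo Molnar' 'Linus Torvalds' 'Vivek Goyal' 'Linus Torvalds' \
    | tally | head -n 1 | sed -E 's/^ +//')

echo "$top"
```

With the function defined, the name-extraction part of the pipeline can simply be piped into tally.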
Armed with a command that shows us the total number of contributions for a single file, we simply have to run this command on every applicable file. We will generate a list of applicable files using the find command, and use a bash shell for loop to apply our command to each of them.
Since git only keeps history on regular files, we will constrain find with the -type f parameter. We should also exclude git's internal files, which are stored in the ./.git directory, since these will have no version history.
philippp master /tmp/linux-2.6 $ find . -type f | head -3
./.git/config
./.git/description
./.git/HEAD
We will use grep's -v option to exclude all lines matching a pattern. Piping through grep -v "^./.git/" excludes everything in the .git directory by excluding any line that starts with "./.git/": the leading caret (^) indicates that the pattern has to match at the start of the line. If we were to match using grep -v ".git" we could accidentally exclude awesome.gitsupport.cpp or another file with .git in the name. Strictly speaking, an unescaped dot in a regular expression matches any character, which is harmless here but can be avoided by writing "^\./\.git/".
philippp master /tmp/linux-2.6 $ find . -type f | grep -v "^./.git" | head -3
./.mailmap
./arch/alpha/boot/bootloader.lds
./arch/alpha/boot/bootp.c
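The effect of anchoring the pattern can be checked on a short list of paths without touching the file system. A minimal sketch: the first two paths come from the listings above, and awesome.gitsupport.cpp is the hypothetical file name from the text.

```shell
paths='./.git/config
./.mailmap
./awesome.gitsupport.cpp'

# Anchored pattern with escaped dots: drops only what lives under ./.git/.
anchored=$(printf '%s\n' "$paths" | grep -v '^\./\.git/')

# Unanchored pattern: also drops the unrelated source file.
loose=$(printf '%s\n' "$paths" | grep -v '.git')

printf 'anchored:\n%s\n' "$anchored"
printf 'loose:\n%s\n' "$loose"
```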
We will run our command from part 1.a on every file in the list of files in our repository. We will use a for loop to run through the list, and execute our command for every item. Our list of items is returned by the find command, so we will need to pass the list of files found to the for loop. The easiest way is to execute the find command in the for loop, using the `` notation. When the `` statement is evaluated, any command inside the `` quotes as well as the `` quotes are replaced by the command's output.
The for loop syntax is for VARIABLENAME in LIST; do command1; command2; command3; done;, where VARIABLENAME takes on each value of LIST and is accessible using the $VARIABLENAME notation within the for loop.
When we run through our list of git repository files, we will want to print the name of each file above the contributing authors, and probably a blank line between the file / author lists. A simple echo $fi will provide the filename, and an echo "" will give us the blank line.
for fi in `find . -type f | grep -v "^./.git/"`;
do
    echo "$fi";
    git blame "$fi" | cut -d ")" -f 1 | cut -d "(" -f 2 | cut -d "2" -f 1 | cut -d "1" -f 1 | sort | uniq -c | sort -n -r;
    echo "";
done;
./.mailmap
96 Nicolas Pitre
4 Uwe Kleine-König
2 Peter Oruba
2 Michael Buesch
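One caveat of feeding find's output to for through backticks is that the shell splits the list on whitespace, so a file name containing a space arrives as two items. The sketch below shows the effect in a scratch directory and contrasts it with a swapped-in technique, a while read loop that consumes find's output line by line. The directory and file name are made up for the demonstration, and the sketch assumes the temporary directory's own path contains no spaces.

```shell
# Scratch directory holding a single file whose name contains a space.
demo_dir=$(mktemp -d)
: > "$demo_dir/two words.c"

# Backtick substitution: the single path is split into two loop items.
split_count=0
for fi in `find "$demo_dir" -type f`; do
    split_count=$((split_count + 1))
done

# Reading one line at a time keeps the path in one piece.
line_count=$(find "$demo_dir" -type f \
    | while IFS= read -r fi; do echo "$fi"; done \
    | wc -l | tr -d ' ')

rm -rf "$demo_dir"
echo "for loop items: $split_count, while read items: $line_count"
```

For a tree whose file names contain no whitespace, the for loop above behaves the same as the while read variant.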
For this exercise, we will focus on a small subset of the code -- the selinux module, located in ./security/selinux. We are interested in which files different organizations, identified by the committing editors' email domains, have edited, and how many edits these organizations made overall.
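The git log command shows us a list of commits for any file or directory, along with the name and email address of each editor. To get the list of files for each editor's domain, we will: create an empty temporary directory; extract the email domain of every commit author from each file's git log; append the file's name to a file named after that domain; and finally tally the file names collected for each domain.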
This is all possible using the techniques we already know, except we are working with files for the first time. Easy enough, we can use mkdir -p to attempt to create a directory and not fail if it exists. Removing all files from a directory that is already empty will regrettably result in an error, so we will route all output from rm to /dev/null. A note on routing: > routes standard output (stdout), 2> routes standard error output (stderr), and in bash &> routes both stdout and stderr. /dev/null is a black hole device. To allow rm to fail silently, we will invoke it as rm pathname/* &> /dev/null.
The > operator we use to route rm's output to /dev/null will also route any command's output to any device or file. To append to an existing file, we use the >> variant of the > operator: a single > empties the file before writing, discarding existing data. To append a Linux file name to the list of files edited by the author, we will do something analogous to: echo "file_name" >> /author_file.
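The behaviour of the redirection operators can be observed in a scratch file and directory. This is a minimal sketch: the paths come from mktemp, the variable names are illustrative, and it uses the portable spelling > /dev/null 2>&1, which has the same effect as bash's &> /dev/null.

```shell
demo_file=$(mktemp)

echo "first"  >  "$demo_file"   # creates or empties the file, then writes
echo "second" >  "$demo_file"   # empties it again: "first" is gone
echo "third"  >> "$demo_file"   # appends after "second"
contents=$(cat "$demo_file")
rm -f "$demo_file"

# rm fails on an empty directory because the glob matches nothing;
# both output streams go to /dev/null, so the failure stays silent.
empty_dir=$(mktemp -d)
rm_failed=no
rm "$empty_dir"/* > /dev/null 2>&1 || rm_failed=yes
rmdir "$empty_dir"

printf '%s\n' "$contents"
echo "rm failed silently: $rm_failed"
```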
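#!/bin/bash
# Collect, per author email domain, the files below the current directory
# that the domain committed to, with one entry per commit.
tmp_dir="/tmp/domain_contributors";
# Start from an empty scratch directory; it will hold one file per domain.
rm -rf "$tmp_dir" &> /dev/null;
mkdir -p "$tmp_dir";
linux_files=`find . -type f`;
for f in $linux_files;
do
    # Reduce every "Author: Name <user@domain>" line of the log to the domain.
    author_domains=`git log "$f" | grep "^Author: " | cut -d ">" -f 1 | cut -d "<" -f 2 | cut -d "@" -f 2`;
    for d in $author_domains;
    do
        echo "$f" >> "$tmp_dir/$d";
    done;
done;
out_file="domain_contributors.list";
# Remove an old report so that a second run does not append to it.
rm -f "$out_file";
for f in `ls "$tmp_dir"`;
do
    # Domain name first, then its files with commit counts, highest first.
    echo "$f" >> "$out_file";
    sort "$tmp_dir/$f" | uniq -c | sort -r -n >> "$out_file";
done;
echo "Tallied per-domain commits for `pwd` in $out_file";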
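The chain of cut commands that turns an Author line into an email domain can be tried on its own. A minimal sketch, using a made-up author line in git log's default format; the name and address are hypothetical.

```shell
# A hypothetical "Author:" line as printed by git log.
author_line='Author: Jane Doe <jane@example.com>'

# Drop the closing ">", keep what follows "<", then keep what follows "@".
domain=$(printf '%s\n' "$author_line" \
    | cut -d ">" -f 1 | cut -d "<" -f 2 | cut -d "@" -f 2)

echo "$domain"
```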
The script above gives us a list of contributions per organization -- but it is sorted in alphabetical order. As our final step of this task, we will use awk to tally up the contributions of each organization, and a final sort -r -n to give us a list of most to least active organizations. Awk is a powerful but simple and concise programming language that allows us to perform operations on streams of text. To tally up each organization's contribution we will initialize a counter at every line listing an organization, iterate over the lines listing contributions, and print the organization's name and sum of contributions when we reach the next organization's name.
The BEGIN block of our awk program initializes the organization name and counter. All other blocks execute when the regular expression to the left of the block matches the current line. We know that a domain name cannot contain a space and precedes its counts, and that all of the lines with counts contain spaces.
We use this distinction to print out the current count and domain name when we reach a new domain name, and then reset the counter and update the domain name. We do this when a line begins (caret at the start of the regex: /^/) with something that is not a space (caret at the start of the character class: [^ ]), matching /^[^ ]/. Matching the number of commits and filename is easier -- we simply have to match a space anywhere, using the regex / /.
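# Tally the lines listed under each domain heading.
BEGIN {
    c_count = 0;
    c_name = "";
}
# Lines with a count contain a space: one more file for the current domain.
# (Use c_count += $1 instead to sum the commit counts.)
/ / {
    c_count += 1
}
# A line that starts with a non-space is a new domain: report the previous one.
/^[^ ]/ {
    if (c_name != "") print c_count " " c_name;
    c_name = $0;
    c_count = 0;
}
# Report the last domain, which has no following heading to trigger the print.
END {
    if (c_name != "") print c_count " " c_name;
}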
We save our awk program as domain_summary.awk and execute it with awk -f domain_summary.awk domain_contributors.list. After sorting, we have a breakdown of the top committers by email domain:
philippp master /usr/src/linux-2.6/security $ awk -f domain_summary.awk domain_contributors.list | sort -r -n | head -10
74 redhat.com
55 namei.org
50 ppc970.osdl.org
41 canonical.com
36 gmail.com
35 kernel.org
32 hp.com
27 tycho.nsa.gov
23 linux.vnet.ibm.com
21 us.ibm.com
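The tallying logic can be tried on a few lines that mimic the layout of domain_contributors.list. This sketch inlines the awk program instead of reading it from a file and includes an END block so that the final domain is flushed too. The domains, file names and counts are made up.

```shell
# Sample input: a domain line, followed by "count filename" lines from uniq -c.
sample='example.com
      3 ./hooks.c
      1 ./avc.c
example.org
      2 ./hooks.c'

summary=$(printf '%s\n' "$sample" | awk '
/ /     { c_count += 1 }                      # one more file for this domain
/^[^ ]/ { if (c_name != "") print c_count " " c_name
          c_name = $0; c_count = 0 }          # new domain: report the old one
END     { if (c_name != "") print c_count " " c_name }
' | sort -r -n)

printf '%s\n' "$summary"
```

Each number is the count of files listed under the domain, not the sum of the per-file commit counts.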