By default grep is going through all of the .git directly, which is the part `gi...

colin_mccabe · on Feb 2, 2014

You need to take the effect of the page cache into account. Since you are not clearing the page cache between tests, each test benefits from whatever the previous one left in the cache. So running 'git grep' first puts it at a disadvantage compared to everything that runs after it.

I ran a test on a large repository and here are my results. The repository was Hadoop, and is available from git://github.com/apache/hadoop-common.git.

  cmccabe@keter:~/hadoop4> du -cksh .
  375M    .
  375M    total
  sudo -- sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches'
  cmccabe@keter:~/hadoop4> /usr/bin/time git grep 'class TestDefaultNameNodePort'
  hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDefaultNameNodePort.java:public class TestDefaultNameNodePort {
  0.11user 0.34system 0:00.74elapsed 61%CPU (0avgtext+0avgdata 60512maxresident)k
  260256inputs+0outputs (19major+9718minor)pagefaults 0swaps

  sudo -- sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches'
  cmccabe@keter:~/hadoop4> /usr/bin/time grep -rI --exclude .git 'class TestDefaultNameNodePort' *
  hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDefaultNameNodePort.java:public class TestDefaultNameNodePort {
  0.13user 0.56system 0:02.40elapsed 29%CPU (0avgtext+0avgdata 5584maxresident)k
  252792inputs+0outputs (2major+414minor)pagefaults 0swaps

So you can see that git grep is faster from a cold cache (0.74s elapsed versus 2.40s for grep), even when grep excludes the .git directory.

Running grep a second time without clearing the cache gives a bogus result:

  cmccabe@keter:~/hadoop4> /usr/bin/time grep -rI --exclude .git 'class TestDefaultNameNodePort' *
  hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDefaultNameNodePort.java:public class TestDefaultNameNodePort {
  0.03user 0.04system 0:00.08elapsed 97%CPU (0avgtext+0avgdata 5584maxresident)k
  0inputs+0outputs (0major+416minor)pagefaults 0swaps

This is because the data is all in the page cache at that point, so we're not actually accessing the disk.
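If you want to repeat the comparison without one run warming the cache for the next, something like the following would do it. This is only a sketch: `cold_time`, `DROP_CACHES` and `TIMER` are names I made up for illustration, and the cache drop is Linux-only and needs root.

```shell
#!/bin/sh
# Sketch: run a command under a timer, starting from a cold page cache.
# cold_time, DROP_CACHES and TIMER are hypothetical names, not part of any tool.
#
#   DROP_CACHES=0  skips the cache drop (for measuring the warm-cache case)
#   TIMER=         (set but empty) runs the command with no timer at all

cold_time() {
    if [ "${DROP_CACHES:-1}" = 1 ]; then
        # Write out dirty pages, then drop the page cache, dentries and inodes.
        sudo -- sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
    fi
    # Left unquoted on purpose: an empty TIMER expands to nothing.
    ${TIMER-/usr/bin/time} "$@"
}
```

Usage would look like `cold_time git grep 'class TestDefaultNameNodePort'` followed by `cold_time grep -rI --exclude-dir=.git 'class TestDefaultNameNodePort' .`, so that every run pays the same disk cost. Note that with GNU grep, `--exclude-dir` is the option that skips a directory when recursing; `--exclude` only matches file names.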

I was curious about the true source of the speedup, so I checked the output of the 'perf' tool. git had 1,922 CPU migrations, whereas grep had 52. Following up on this with strace, you can see that git is spawning a bunch of threads, whereas grep only has one thread.

  cmccabe@keter:~/hadoop4> strace -f -e trace=clone git grep 'class TestDefaultNameNodePort' 2>&1 | grep -c '] +++ exited with '
  8
  cmccabe@keter:~/hadoop4> strace -f -e trace=clone grep -rI --exclude=.git 'class TestDefaultNameNodePort' *  2>&1 | grep -c '] +++ exited with '
  0
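The thread count above can be wrapped in a small helper if you want to try it on other commands. A rough sketch, assuming the usual `strace -f` output format where each child's exit shows up as `[pid N] +++ exited with ...`; the function name `count_thread_exits` is made up for this example.

```shell
#!/bin/sh
# Sketch: count child thread/process exits in `strace -f` output read from stdin.
# count_thread_exits is a hypothetical helper name.
# Only lines with a "[pid N]" prefix are counted, so the main process's own
# "+++ exited with" line is left out, same as in the pipeline above.

count_thread_exits() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps that from being treated as a failure.
    grep -c '^\[pid  *[0-9][0-9]*] +++ exited with ' || true
}
```

You would feed it the combined output, e.g. `strace -f -e trace=clone git grep 'class TestDefaultNameNodePort' 2>&1 | count_thread_exits`. For the CPU migration numbers, I would guess something like `perf stat -e cpu-migrations <command>` reproduces them, but that is an assumption; the exact perf invocation is not given above.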

I also think git might be cheating and using a simpler regex engine than grep, but at this point I got bored. Case closed.