Git tip of the day: show the hottest files in a repo

Submitted by Frederic Marand on

When auditing or reviewing an unknown code base, I often have to decide which files to examine first. Beyond the usual heuristics for Drupal projects (hint: look at templates in D7), how can one find the parts most likely to contain problems? This simple set of commands can help pinpoint troublemaking files quickly.

The trick

The idea is simply that if a file contains problems, they are likely to have been identified and worked on more than those in other files, leading to a higher number of commits. Git can help us here by listing the most often changed files. It turns out this is really simple to do by massaging the git log output with some shell commands.
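A minimal sketch of such a pipeline, assuming a bash script named code_heat.bash (the name used in the example run below) that takes an optional revision range. The exact options, the code_heat function name and the cutoff of 20 files are assumptions for illustration, not necessarily the original command:

```shell
#!/usr/bin/env bash
# code_heat.bash: list the most frequently changed files in a Git repository.
# Usage: code_heat.bash [revision-range]   e.g. code_heat.bash 7.x-1.x..8.x-1.x
# Hypothetical sketch: function name and the limit of 20 files are assumptions.

code_heat() {
  # --name-only lists the files touched by each commit;
  # --pretty=format: (empty format) suppresses the commit header itself.
  git log --name-only --pretty=format: "$@" |
    grep -v '^$' |   # drop the blank lines separating commits
    sort |           # group identical paths together for uniq
    uniq -c |        # prefix each distinct path with its number of occurrences
    sort -n |        # ascending by count, so the hottest files come last
    tail -n 20       # keep only the 20 most changed files
}

code_heat "$@"
```

Sorting in ascending order keeps the hottest files at the bottom of the terminal, where they remain visible even when the list is long. Note that a renamed file is counted separately under its old and new paths.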

This can give us something like the following example, taken from the recently created Beanstalkd module for Drupal 8:

$ code_heat.bash 7.x-1.x..8.x-1.x
      4 runqueue.php
      4 src/Queue/QueueBeanstalkd.php
      4 src/Runner.php
      5 beanstalkd.info.yml
      5 beanstalkd.routing.yml
      5 .scrutinizer.yml
      5 src/Queue/BeanstalkdQueue.php
      5 src/Server/Item.php
      5 src/Tests/BeanstalkdQueueTest.php
      6 beanstalkd.drush.inc
      6 beanstalkd.drush.yml
      7 src/Plugin/QueueWorker/SampleWorker.php
      8 README.md
      9 beanstalkd.services.yml
     10 src/Server/BeanstalkdServerFactory.php
     11 beanstalkd.install
     15 beanstalkd.module
     15 src/Tests/BeanstalkdServerTest.php
     17 beanstalkd.drush.php
     26 src/Server/BeanstalkdServer.php

Results quality

The results are interesting: the most-modified class (BeanstalkdServer) just recently crossed the cyclomatic complexity warning threshold, and is on my list as the next refactoring target. The Drush plugin comes next. So, at least in this case, the results are definitely relevant.

To check for relevancy, I also tried this on an unpublished enterprise-class project with a dozen developers, and found the results on thousands of commits, instead of a few dozen, to be just as relevant. This seems like a good zero-cost trick to keep in mind.

Going further, assuming your development group sticks to a formalized commit message format that includes ticket references with issue types, it is possible to group commits by ticket using a slightly more complex version of this log filtering. The result can then be fed to Brandon Carson's Defect Density Heatmap to get a two-dimensional heat map showing the files to analyze as a cloud, with size representing the number of commits and color the type of commits (e.g. features vs. bug fixes).