Very simple Bash script to get source code stats

May 11, 2008 · 4 min read

The other day I wanted an overview of the size of a software project I'm working on. The project is relatively big and involves quite a few languages and technologies spread across different tiers. Because I just joined the team, there are tons of lines of code already written that I have not even seen. So I felt the need to get at least a rough idea of the size of the codebase for each language, and to see how the programming languages compare with each other within the project.

I came up with this very simple Bash script, getSizeStats.sh. It expects one or more directories as parameters and finds all regular files in the trees that hang from those directories. It then adds up the lines of code (LOC) for all files, grouped by filename extension. The script assumes that the arguments are, or might be, local copies of Subversion repos (that's why it excludes .svn directories). While it runs, it shows a counter of the files processed so far on a single output line. Once it has finished, a list of LOC for each language (i.e. file extension), sorted by LOC, is written to /tmp/getSizeStats.sh.out.

#!/bin/sh

# Working directory and output files, all named after the script itself
TMP_DIR=/tmp/`basename $0`
TMP_OUTPUT_FILENAME=/tmp/`basename $0`.tmp
OUTPUT_FILENAME=/tmp/`basename $0`.out

# Start from an empty working directory
if [ -d $TMP_DIR ]; then
    rm $TMP_DIR/* 2> /dev/null
else
    mkdir $TMP_DIR
fi

COUNTER=0

# Append every file that has an extension to one big file per extension
find "$@" -name "*.*" | grep -v \.svn | while read j; do

    if [ -f "$j" ]; then
        cat "$j" >> $TMP_DIR/`echo $j | rev | cut -s -d "." -f 1 | cut -d "/" -f 1 | rev`
        COUNTER=$((COUNTER + 1))
        echo -en "\r$COUNTER files "
    fi

done

if [ -f $TMP_OUTPUT_FILENAME ]; then
    rm $TMP_OUTPUT_FILENAME
fi

# One line per extension, prefixed with its line count so that sort can order it
for i in `ls $TMP_DIR`; do
    echo -e "`wc -l $TMP_DIR/$i | cut -d " " -f 1` $i\t`wc -l $TMP_DIR/$i | cut -d " " -f 1`" >> $TMP_OUTPUT_FILENAME
done

# Sort by line count, then drop the sort key
sort -nr $TMP_OUTPUT_FILENAME | cut -d " " -f 2- > $OUTPUT_FILENAME
rm $TMP_OUTPUT_FILENAME

echo processed -- see $OUTPUT_FILENAME

(I think that for some versions of echo you'll have to remove the option -e or it won't work properly.)

The only serious problem I found was filenames containing blanks and other characters that usually need to be escaped (bad naming, I know; it wasn't my idea). I played with different types of quoting and tried to find a workaround for that. Real-time help about that from @enlavin and @nauj27 was much appreciated. I tried that find -print0 | xargs -0 … thing but couldn't make it work as I needed. Eventually the while read j; do … approach worked. (I confess that I still get confused easily by the subtle differences between quoting variants and how variables get expanded in each case. I ought to find some time to learn that well, once and for all.)

Now, there are so many things to improve here. First of all, the script does not tell binary files from text files, i.e. you will also get counts of “lines of code” for all binary assets within your project, such as object files and images. It should also discard all text files that are not source code, e.g. a CHANGELOG. It should be robust to case variations, i.e. group .java and .JAVA files together. It should rely on something more sophisticated than filename extensions to tell programming languages apart, because you probably want to count your .cpp and .c++ files as a whole.

I was planning to improve it by adding those fixes and others. Then Golan told me about sloccount, a command that does just what I need. But properly.

Here is an example of how getSizeStats.sh works. After re-inventing the wheel, I couldn't help but run my own script… against the source code of sloccount itself.

$ svn co https://sloccount.svn.sourceforge.net/svnroot/sloccount sloccount-src > /dev/null
$ ./getSizeStats.sh sloccount-src/
38 files processed -- see /tmp/getSizeStats.sh.out
$ cat /tmp/getSizeStats.sh.out
c       4493
orig    3899
html    3032
1       235
dat     197
l       171
rb      152
lhs     59
spec    56
h       50
CBL     31
php     27
inc     23
pas     21
hs      19
gz      10
f       10
cs      8
f90     7
cbl     4
tar     1
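For what it's worth, a few of the improvements listed above could be sketched roughly like this. It is not a replacement for sloccount, just an illustration under some assumptions: Bash (for read -d ''), a find that supports -print0, and a grep that supports -I to skip binary files (GNU and BSD grep both do). The function names ext_key and count_loc_by_ext, and the small alias table that folds .c++ into cpp, are made up for this example and are not part of getSizeStats.sh.

```shell
#!/bin/bash
# Sketch only: ext_key and count_loc_by_ext are hypothetical helper names.

# Print a lower-cased grouping key for a filename, or nothing if it has no extension.
ext_key() {
    local name ext
    name=${1##*/}                   # drop the directory part, so dots in directories don't count
    case "$name" in
        *.*) ;;                     # there is an extension
        *) return 0 ;;              # no dot: print nothing
    esac
    ext=${name##*.}                 # text after the last dot
    [ -n "$ext" ] || return 0       # name ends in a dot
    ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        c++|cc|cxx) ext=cpp ;;      # assumed alias table; extend as needed
    esac
    printf '%s\n' "$ext"
}

# Print "extension<TAB>lines" for the given directories, biggest first.
count_loc_by_ext() {
    local f key lines
    # -prune skips .svn trees; -print0 keeps blanks and odd characters in names intact
    find "$@" -name .svn -prune -o -type f -print0 |
    while IFS= read -r -d '' f; do
        key=$(ext_key "$f")
        [ -n "$key" ] || continue
        # With -I, grep treats binary files as non-matching; empty files fail this test too
        grep -Iq . "$f" || continue
        lines=$(wc -l < "$f")
        printf '%s\t%d\n' "$key" "$((lines))"
    done |
    awk -F '\t' '{ total[$1] += $2 } END { for (k in total) printf "%s\t%d\n", k, total[k] }' |
    sort -t $'\t' -k2,2nr
}
```

After sourcing the file, something like count_loc_by_ext sloccount-src/ prints the totals straight to standard output instead of writing them to /tmp. Non-source text files such as a CHANGELOG are still counted if they happen to have an extension, so this only covers part of the list.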