Very simple Bash script to get source code stats

The other day I wanted an overview of the size of a software project I’m working on. The project is relatively big and involves quite a few languages and technologies spread across different tiers. Because I just joined the team, there are tons of lines of code already written that I have not even seen. So I felt the need to get at least a rough idea of the size of the codebase for each language, and to see how the programming languages compare with one another within the project.

I came up with this very simple Bash script, getSizeStats.sh. It expects one or more directories as parameters. It finds all regular files contained in the trees that hang from those directories. It then adds up the lines of code (LOC, here simply the raw line count reported by wc -l) for all files, grouping by filename extension. The script assumes that the arguments are, or might be, local copies of Subversion repos (that’s why it excludes .svn directories). While it runs, a single output line shows a running count of the files processed. Once it finishes, a list of LOC for each language (i.e. file extension), sorted by LOC, is written to /tmp/getSizeStats.sh.out.

#!/bin/bash
 
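# Scratch locations are derived from the script name, e.g. /tmp/getSizeStats.sh
TMP_DIR=/tmp/$(basename "$0")
TMP_OUTPUT_FILENAME=/tmp/$(basename "$0").tmp
OUTPUT_FILENAME=/tmp/$(basename "$0").out
 
# Start from an empty scratch directory
if [ -d "$TMP_DIR" ]; then
    rm -f "$TMP_DIR"/* 2> /dev/null
else
    mkdir "$TMP_DIR"
fi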
 
COUNTER=0
 
# Prune .svn trees inside find itself; IFS= and -r keep each path intact
find "$@" -name .svn -prune -o -name "*.*" -print | while IFS= read -r j; do
 
    if [ -f "$j" ]; then
        # Append the file to a per-extension bucket; ${j##*.} is the text after the last dot
        cat "$j" >> "$TMP_DIR/${j##*.}"
        COUNTER=$((COUNTER + 1))
        echo -en "\r$COUNTER files "
    fi
 
done
 
if [ -f $TMP_OUTPUT_FILENAME ]; then
    rm $TMP_OUTPUT_FILENAME
fi
 
for i in "$TMP_DIR"/*; do
    [ -f "$i" ] || continue
    # "LOC extension<TAB>LOC": the leading LOC is only a sort key, cut away below
    loc=$(wc -l < "$i" | tr -d ' ')
    echo -e "$loc ${i##*/}\t$loc" >> $TMP_OUTPUT_FILENAME
done
 
sort -nr $TMP_OUTPUT_FILENAME | cut -d " " -f 2- > $OUTPUT_FILENAME
rm $TMP_OUTPUT_FILENAME
 
echo processed -- see $OUTPUT_FILENAME
 
# EOF

(I think that for some versions of echo you’ll have to remove the -e option, or it won’t work properly.)

The only serious problem I ran into was filenames containing blanks and other characters that usually need to be escaped (bad naming, I know; it wasn’t my idea). I played with different types of quoting and tried to find a workaround for that. Real-time help about that from @enlavin and @nauj27 was much appreciated. I tried that find -print0 | xargs -0 … thing but couldn’t make it work as I needed. Eventually the while read j; do … approach worked. (I confess that I still get confused easily by the subtle differences between quoting variants and how variables get expanded in each case. I ought to find some time to learn that well, once and for all.)
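To make the quoting differences concrete, here is a minimal sketch. It is not part of getSizeStats.sh: count_args and list_extensions are hypothetical helpers made up for illustration, and the sketch assumes the default IFS and that no filename contains a newline.

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

name="my file.c"

count_args $name      # unquoted: expanded, then split on the blank -> 2
count_args "$name"    # double quotes: expanded, kept as one word -> 1
count_args '$name'    # single quotes: no expansion, literal text $name -> 1

# Hypothetical helper: prints the extension of every file found under the
# given directories, one per line, skipping .svn trees.
list_extensions() {
    find "$@" -name .svn -prune -o -type f -name "*.*" -print |
    while IFS= read -r path; do
        # IFS= keeps leading and trailing blanks, -r keeps backslashes literal.
        # ${path##*.} removes everything up to and including the last dot.
        printf '%s\n' "${path##*.}"
    done
}
```

Something like list_extensions "dir with blanks" | sort | uniq -c would then give a file count per extension. With Bash and a find that supports -print0, replacing -print with -print0 and read -r with read -r -d '' should also cope with newlines in filenames.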

Now, there are many things to improve here. First of all, the script does not tell binary files from text files, i.e. you will also get “lines of code” counts for all binary assets within your project, such as object files and images. It should also discard text files that are not source code, e.g. a CHANGELOG. It should be robust to case variations, i.e. group .java and .JAVA files together. And it should rely on something more sophisticated than filename extensions to tell programming languages apart, because you probably want to count your .cpp and .c++ files together.
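A rough sketch of how some of those improvements might look follows. It is an illustration, not the original script: loc_by_extension is a hypothetical name, the alias list is made up, and it assumes a grep that supports -I (GNU and BSD grep do) and filenames without newlines. It still relies on extensions and does not filter out non-source text files.

```shell
# Hypothetical sketch: prints "extension<TAB>LOC" to stdout, largest first,
# without leaving a scratch directory behind.
loc_by_extension() {
    find "$@" -name .svn -prune -o -type f -name "*.*" -print |
    while IFS= read -r path; do
        # grep -I treats binary files as non-matching, so they are skipped
        # (as are empty files and files made only of blank lines).
        grep -Iq . "$path" || continue
        # Lower-case the extension so .java and .JAVA land in one group.
        ext=$(printf '%s' "${path##*.}" | tr '[:upper:]' '[:lower:]')
        case $ext in
            c++|cc|cxx) ext=cpp ;;   # illustrative alias list, extend as needed
        esac
        printf '%s\t%s\n' "$ext" "$(wc -l < "$path" | tr -d ' ')"
    done |
    # Sum the per-file counts for each extension.
    awk -F '\t' '{ loc[$1] += $2 } END { for (e in loc) printf "%s\t%d\n", e, loc[e] }' |
    sort -t "$(printf '\t')" -k2,2nr
}
```

It would be called the same way as the original, e.g. loc_by_extension sloccount-src/, with the report going to stdout instead of a file in /tmp.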

I was planning to improve it by adding those fixes and others. Then Golan told me about sloccount, a command that does just what I need. But properly.

Here is an example of how getSizeStats.sh works. After re-inventing the wheel, I couldn’t help but run my own script… against the source code of sloccount itself.

$ svn co https://sloccount.svn.sourceforge.net/svnroot/sloccount sloccount-src > /dev/null
$ ./getSizeStats.sh sloccount-src/
38 files processed -- see /tmp/getSizeStats.sh.out
$ cat /tmp/getSizeStats.sh.out 
c       4493
orig    3899
html    3032
1       235
dat     197
l       171
rb      152
lhs     59
spec    56
h       50
CBL     31
php     27
inc     23
pas     21
hs      19
gz      10
f       10
cs      8
f90     7
cbl     4
tar     1

11 May 2008 Computers

2 comments so far

  1. ragnarol, 11 May 2008 9:15

    .CBL… yeeeekk

  2. golan, 12 May 2008 14:30

    Here you have a reduced version I wrote which tries to be more unix friendly, as in following the unix paradigm (writing to stdout) and not leaving behind unwanted directories. Also, it tries to filter binary files as well.
