Very simple Bash script to get source code stats
The other day I
wanted to have an overview of the size of a software project I'm working
on. The project is relatively big and involves quite a few languages and
technologies spread among different tiers. Because I just joined the
team there are tons of lines of code already written that I have not
even seen. So I felt the need to get at least a rough idea of the size of
the codebase for each language and to see how the programming languages
compare with one another within the project. I came up with this very
simple Bash script, getSizeStats.sh.
It expects one or more directories as parameters. It finds all regular
files contained in the trees that hang from those directories. It then
adds up LOC for all files, grouping by the extension of the filenames.
The script assumes that the arguments are, or might be, local copies of
Subversion repos (that's why it excludes .svn directories). While
running, the output line shows a counter for the number of files as they
are processed. Once finished, a list of LOC for each language (i.e. file
extension), sorted by LOC, is created in /tmp/getSizeStats.sh.out.

    #!/bin/sh

    TMP_DIR=/tmp/`basename $0`
    TMP_OUTPUT_FILENAME=/tmp/`basename $0`.tmp
    OUTPUT_FILENAME=/tmp/`basename $0`.out

    # Start from an empty scratch directory (it will hold one file per extension).
    if [ -d $TMP_DIR ]; then
        rm $TMP_DIR/* 2> /dev/null
    else
        mkdir $TMP_DIR
    fi

    COUNTER=0
    # Append every regular file to the scratch file named after its extension.
    find "$@" -name "*.*" | grep -v \.svn | while read j; do
        if [ -f "$j" ]; then
            cat "$j" >> $TMP_DIR/`echo $j | rev | cut -s -d "." -f 1 | cut -d "/" -f 1 | rev`
            COUNTER=$((COUNTER + 1))
            echo -en "\r$COUNTER files "
        fi
    done

    if [ -f $TMP_OUTPUT_FILENAME ]; then
        rm $TMP_OUTPUT_FILENAME
    fi

    # One line per extension: "<LOC> <ext>\t<LOC>". The leading LOC is only a sort key.
    for i in `ls $TMP_DIR`; do
        echo -e "`wc -l $TMP_DIR/$i | cut -d " " -f 1` $i\t`wc -l $TMP_DIR/$i | cut -d " " -f 1`" >> $TMP_OUTPUT_FILENAME
    done

    # Sort by LOC, largest first, then drop the sort key.
    sort -nr $TMP_OUTPUT_FILENAME | cut -d " " -f 2- > $OUTPUT_FILENAME
    rm $TMP_OUTPUT_FILENAME

    echo processed -- see $OUTPUT_FILENAME

(I think that for some versions of echo you'll have to remove the -e
option or it won't work properly.) The only serious problem I found
was filenames containing blanks and other characters that usually need
to be escaped (bad naming, I know; it wasn't my idea). I played with
different types of quoting and tried to find a workaround for that.
Real-time help about that from @enlavin and @nauj27 was much
appreciated. I tried that find -print0 | xargs -0 … thing but
couldn't make it work as I needed. Eventually the while read j; do …
approach worked. (I confess that I still get confused easily by the
subtle differences between quoting variants and how variables get
expanded in each case. I ought to find some time to learn that well,
once and for all.)

Now there are so many things to improve here. First
of all, the script does not tell binary files from text files, i.e. you
will also get counts of “lines of code” for all binary assets within
your project, such as object files and images. It should also discard
all text files that are not source code, e.g. a CHANGELOG. It should
be robust to case variations, i.e. group .java and .JAVA files
together. It should rely on something more sophisticated than filename
extensions to tell programming languages apart, because you probably
want to count your .cpp and .c++ files as a whole. I was planning to
improve it by adding those fixes and others. Then Golan told me about
sloccount, a command that does just what I need, but properly.

Here you have an
example of how getSizeStats.sh works. After re-inventing the wheel, I
couldn't help but run my own script… against the source code of
sloccount itself.

    $ svn co https://sloccount.svn.sourceforge.net/svnroot/sloccount sloccount-src > /dev/null
    $ ./getSizeStats.sh sloccount-src/
    38 files processed -- see /tmp/getSizeStats.sh.out
    $ cat /tmp/getSizeStats.sh.out
    c       4493
    orig    3899
    html    3032
    1       235
    dat     197
    l       171
    rb      152
    lhs     59
    spec    56
    h       50
    CBL     31
    php     27
    inc     23
    pas     21
    hs      19
    gz      10
    f       10
    cs      8
    f90     7
    cbl     4
    tar     1
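
For the record, here is a rough sketch of how a few of the improvements listed above might look: skipping binary files, folding .java and .JAVA together, and coping with awkward filenames. It is not the script I actually ran. The function name loc_by_extension is made up for this example, and it swaps the while read loop for find's -exec … {} + form, which hands filenames straight to an inline script so blanks never need escaping. It also assumes a grep that supports the -I option (GNU and BSD grep do) to treat binary files as non-matching.

```shell
#!/bin/sh
# Sketch only: per-extension line counts, printed as "<ext>\t<LOC>" sorted by LOC.
# loc_by_extension is a hypothetical helper, not part of getSizeStats.sh.

loc_by_extension() {
    # Prune .svn directories; pass remaining dotted regular files to an inline script.
    find "$@" -name .svn -prune -o -type f -name '*.*' -exec sh -c '
        for file do
            # Assumption: grep -I skips binary files (empty files are skipped too).
            grep -Iq "" "$file" || continue
            name=${file##*/}                 # drop the directory part
            # Lowercase the extension so .JAVA and .java are grouped together.
            ext=$(printf "%s" "${name##*.}" | tr "[:upper:]" "[:lower:]")
            [ -n "$ext" ] || continue        # ignore names that end in a dot
            printf "%s\t%s\n" "$ext" "$(wc -l < "$file")"
        done
    ' sh {} + |
    # Add up the per-file counts for each extension.
    awk -F '\t' '{ loc[$1] += $2 } END { for (e in loc) printf "%s\t%d\n", e, loc[e] }' |
    # Largest count first; ties are ordered by extension name.
    sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Usage (prints to stdout instead of writing to /tmp):
#   loc_by_extension sloccount-src/
```

This still relies on filename extensions to tell languages apart, so .cpp and .c++ remain separate, and a CHANGELOG.txt would still be counted. For anything serious, sloccount is the better choice.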