The other day I wanted an overview of the size of a software project
I'm working on. The project is relatively big and involves quite a few
languages and technologies spread across different tiers. Because I just
joined the team, there are tons of lines of code already written that I
have not even seen. So I felt the need to get at least a rough idea of
the size of the codebase for each language, and to see how the
programming languages compare with one another within the project. I
came up with this very simple Bash script.
It expects one or more directories as parameters. It finds all regular
files in the directory trees rooted at those directories. It then adds
up the lines of code (LOC) of all files, grouping by filename extension.
The script assumes that the arguments are, or might be, local copies of
Subversion repos (that's why it excludes .svn directories). While
running, it keeps a counter of the files processed so far on a single
output line. Once finished, a list of LOC for each language (i.e. file
extension), sorted by LOC, is created under /tmp/.

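#!/bin/bash
# NOTE: the three suffixes are assumed; any distinct paths under /tmp will do.
TMP_DIR=/tmp/$(basename "$0").d
TMP_OUTPUT_FILENAME=/tmp/$(basename "$0").tmp
OUTPUT_FILENAME=/tmp/$(basename "$0").out
# Start from an empty scratch directory.
if [ -d "$TMP_DIR" ]; then
  rm "$TMP_DIR"/* 2> /dev/null
else
  mkdir "$TMP_DIR"
fi
COUNTER=0
# Concatenate every file into a per-extension bucket, skipping .svn paths.
find "$@" -name "*.*" | grep -v '\.svn' | while read -r j; do
  if [ -f "$j" ]; then
    cat "$j" >> "$TMP_DIR/$(echo "$j" | rev | cut -s -d "." -f 1 | cut -d "/" -f 1 | rev)"
    COUNTER=$((COUNTER + 1))
    echo -en "\r$COUNTER files "
  fi
done
if [ -f "$TMP_OUTPUT_FILENAME" ]; then
  rm "$TMP_OUTPUT_FILENAME"
fi
# One line per bucket: "<count> <ext>\t<count>"; the leading count is the sort key.
for f in "$TMP_DIR"/*; do
  [ -f "$f" ] || continue
  n=$(wc -l "$f" | cut -d " " -f 1)
  echo -e "$n $(basename "$f")\t$n" >> "$TMP_OUTPUT_FILENAME"
done
sort -nr "$TMP_OUTPUT_FILENAME" | cut -d " " -f 2- > "$OUTPUT_FILENAME"
echo processed -- see "$OUTPUT_FILENAME"
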
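The least obvious line in the script above is probably the one that derives the bucket name from a path, so here is that step in isolation. This is only a sketch: `ext_pipeline` wraps the same rev/cut pipeline the script uses, while `ext_expansion` is a hypothetical alternative (not part of the original script) that relies on Bash parameter expansion instead of external commands.

```shell
#!/bin/bash
# ext_pipeline PATH: the rev/cut pipeline from the script, wrapped in a function.
# Reversing the path puts the extension first, so "field 1" is the extension.
ext_pipeline() {
  printf '%s\n' "$1" | rev | cut -s -d "." -f 1 | cut -d "/" -f 1 | rev
}

# ext_expansion PATH: hypothetical alternative using parameter expansion only.
ext_expansion() {
  local name=${1##*/}                     # strip the directory part
  case $name in
    *.*) printf '%s\n' "${name##*.}" ;;   # keep the text after the last dot
    *)   return 1 ;;                      # no dot in the file name: no extension
  esac
}

ext_pipeline "src/My File.java"        # prints java
ext_expansion "src/My File.java"       # prints java
ext_pipeline "lib.d/archive.tar.gz"    # prints gz
```

Both variants are fine with blanks in the path as long as the argument is quoted. The parameter-expansion version avoids spawning four processes per file, which may matter on a large tree.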
One caveat: I think that with some versions of echo you'll have to
remove the -e option or the script won't work properly. The only serious
problem I found was filenames containing blanks and other characters
that usually need to be escaped (bad naming, I know; it wasn't my idea).
I played with different types of quoting and tried to find a workaround
for that. The real-time help I got with that was much appreciated. I
tried the find -print0 | xargs -0 … approach but couldn't make it work
the way I needed. Eventually the while read j; do …
approach worked. (I confess that I still get confused easily by the
subtle differences between quoting variants and how variables get
expanded in each case. I ought to find some time to learn that well,
once and for all). Now there are so many things to improve here. First
of all, the script does not tell binary files from text files, i.e. you
will also get counts of “lines of code” for all binary assets within
your project, such as object files and images. It should also discard
all text files that are not source code, e.g. a CHANGELOG. It should be
robust to case variations, i.e. group .java and .JAVA together. It
should rely on something more sophisticated than filename extensions to
tell programming languages apart, because you probably want to count
your .cpp and .c++ files together. I was planning to improve it with
those fixes and others. Then
Golan told me of sloccount, a command that does just what I need, but
properly. Here you have an example of how sloccount works. After
re-inventing the wheel, I couldn't help but run my own script… against
the source code of sloccount:
$ svn co sloccount-src > /dev/null
$ ./ sloccount-src/
38 files processed -- see /tmp/
$ cat /tmp/
c 4493
orig 3899
html 3032
1 235
dat 197
l 171
rb 152
lhs 59
spec 56
h 50
CBL 31
php 27
inc 23
pas 21
hs 19
gz 10
f 10
cs 8
f90 7
cbl 4
tar 1
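
For what it's worth, here is a rough sketch of how some of the fixes listed earlier might look. It assumes Bash plus GNU find and grep (the `-print0` and `-I` options are not POSIX), and `count_loc` is a name made up for this example. It reads NUL-delimited paths so that blanks in filenames are harmless, skips files that grep considers binary, and folds extensions to lower case so that .java and .JAVA end up together. It still relies on extensions alone, so it does not merge .cpp with .c++ or discard non-source text files.

```shell
#!/bin/bash
# count_loc DIR...: print "extension<TAB>lines", sorted by line count.
# Hypothetical helper; assumes GNU find and grep.
count_loc() {
  # Prune .svn directories; emit NUL-terminated paths of regular files.
  find "$@" -name .svn -prune -o -type f -name '*.*' -print0 |
  while IFS= read -r -d '' path; do
    # Skip binary files (and files without any non-empty line).
    grep -Iq . "$path" || continue
    name=${path##*/}
    # Lower-case the extension so .java and .JAVA share a bucket.
    ext=$(printf '%s' "${name##*.}" | tr '[:upper:]' '[:lower:]')
    printf '%s\t%s\n' "$ext" "$(wc -l < "$path")"
  done |
  # Sum the per-file counts for each extension.
  awk -F '\t' '{ loc[$1] += $2 } END { for (e in loc) printf "%s\t%d\n", e, loc[e] }' |
  sort -t $'\t' -k2,2nr
}

# Run only when directories are given on the command line.
if [ "$#" -gt 0 ]; then
  count_loc "$@"
fi
```

Summing in awk instead of concatenating files into a scratch directory means nothing has to be cleaned up afterwards, at the cost of one wc call per file.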