I Smell Duplication, Can You?

Want to know how to find duplicate code in your code base quickly and easily? Want to be able to sniff out the most pungent of code smells in double-quick time? You’ve come to the right place.


If you’re not familiar with the idea of code smells, then be sure to check out Martin Fowler’s excellent book Refactoring: Improving the Design of Existing Code. Code smells are symptoms in your source code that can indicate problems. Arguably the worst is code duplication, which violates the DRY (Don’t Repeat Yourself) principle.

The maintenance problems that duplication causes are well known, so I won’t revisit them here. What I want to do is talk about a tool I found the other day that helps find duplicate code.

Copy/Paste Detector (CPD) is a great little program that can detect duplicate code in a code base. It’s available under a BSD-style licence and ships as part of PMD, the static code analyzer for Java. Although PMD is targeted at Java, CPD works using string matching, so it can be used on any language. Java, JSP, C, C++, Fortran and PHP are supported out of the box. It is also possible to add further languages.
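To give a feel for what matching on tokens rather than whole lines means, here is a rough Python sketch. It is my own simplified illustration, not CPD’s actual implementation: the names `tokenize` and `find_duplicates`, the regex tokenizer and the brute-force window comparison are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split source text into (token, line_number) pairs, ignoring whitespace."""
    tokens = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # A token is either a run of word characters or a single symbol.
        for match in re.finditer(r"\w+|[^\w\s]", line):
            tokens.append((match.group(), line_no))
    return tokens


def find_duplicates(sources, min_tokens=100):
    """Group locations that share an identical run of min_tokens tokens.

    sources maps a file name to its text. Each returned group is a list of
    (file_name, start_line) tuples with at least two entries.
    """
    windows = defaultdict(list)
    for name, text in sources.items():
        tokens = tokenize(text)
        # Slide a fixed-size window over the token stream and index it.
        for i in range(len(tokens) - min_tokens + 1):
            key = tuple(tok for tok, _ in tokens[i:i + min_tokens])
            windows[key].append((name, tokens[i][1]))
    return [locations for locations in windows.values() if len(locations) > 1]
```

Because whitespace and line breaks are thrown away before comparing, reformatted copies still match. A real tool would also merge overlapping windows into one report entry and use hashing rather than storing every window, which this sketch does not attempt.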

How to use it

Running CPD is very simple:

./cpd.sh ~/path/to/source/

And that’s it. By default the output is in text format; this can be changed to XML or CSV. Example output of processing the JDK (reported to take only 4 seconds) can be seen here. The number of duplicate tokens required for code to be considered copied and pasted can also be configured; it defaults to 100.

My findings

I had to increase the heap size available to Java to get the code base I’m working on parsed. It’s about a million lines of C/C++ code. The results were fascinating. Sure enough, copied and pasted code was found, comments and all. Worse still, it found code that had been copied and pasted but not quite kept in sync, which in most cases meant straight bugs.

The only real false positives I found were in auto-generated code. By default CPD recursively parses the directories given on the command line (you can supply as many as you like), with no way to ignore certain files (e.g. *_autogen.cpp). As these files are produced as part of the build process, I’m now running CPD on a clean checkout, without any build artifacts lying about.
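If a clean checkout isn’t an option, another workaround would be to copy the tree to a scratch directory, leaving the generated files behind, and point CPD at the copy. Here is a minimal Python sketch of that idea. The function name `copy_without_generated` and the default pattern are hypothetical, chosen to match the example file name above.

```python
import shutil
from pathlib import Path


def copy_without_generated(src, dst, patterns=("*_autogen.cpp",)):
    """Copy the tree at src to dst, skipping names that match any glob pattern.

    dst must not exist yet. Returns the sorted relative paths of copied files.
    """
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns(*patterns))
    return sorted(
        p.relative_to(dst).as_posix() for p in Path(dst).rglob("*") if p.is_file()
    )
```

The copy costs some disk space and time on a big code base, but it keeps the filtering outside the tool, so it works regardless of which options the analyzer itself offers.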

What next?

As always with these things, I’m left with a bunch of open questions:

I can see this tool offering some real value, but how do I integrate it into my team’s workflow? It’s a command line tool only, so there is no administration interface to allow the results of various runs to be compared and analysed.

There are plenty of other static code analyzers that do much more than just check for duplication. What are people’s experiences with these?
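On the first question, one low-tech option might be to keep the XML report from each run and diff the reports with a small script. The Python sketch below assumes the report consists of `duplication` elements carrying a `tokens` attribute, each containing `file` elements with a `path` attribute. That layout is my assumption and should be checked against the output of your CPD version. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def load_duplications(xml_text):
    """Return a set of (tokens, frozenset_of_paths), one per duplication element.

    Line numbers are deliberately ignored so that code which merely moves
    within a file is not reported as a change.
    """
    found = set()
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "duplication":
            continue
        paths = frozenset(
            child.get("path") for child in elem if local_name(child.tag) == "file"
        )
        found.add((int(elem.get("tokens")), paths))
    return found


def compare_runs(old_xml, new_xml):
    """Classify duplications as fixed, introduced or unchanged between two runs."""
    old, new = load_duplications(old_xml), load_duplications(new_xml)
    return {"fixed": old - new, "introduced": new - old, "unchanged": old & new}
```

Running something like this from a nightly build and failing when `introduced` is non-empty would at least stop new duplication creeping in, even without a proper interface for browsing results.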
