A Clojure library for finding Atoms of Confusion in C projects.
Contains facilities for:
- Parsing C/C++ with Eclipse CDT library
- Finding specific patterns in an AST
- Traversing version histories through git
- Parsing commit logs for bug/patch IDs
Output from this work formed the basis of our paper at the Mining Software Repositories 2018 conference: Prevalence of Confusing Code in Software Projects - Atoms of Confusion in the Wild.
If you would like to use this project as-is to find all occurrences of the 15 atoms of confusion described in our 2018 MSR paper, you can run our code from the command line:
lein run dir1 dir2 > atoms.csv
This command will loop over each of the directories provided (in the example above: dir1 dir2) and print a CSV to the file atoms.csv with one row for each atom, in the following shape:
atom,file,line,offset,code
operator-precedence,nginx/src/misc/ngx_google_perftools_module.c,109,2669,&gptcf->profiles
post-increment,nginx/src/core/ngx_thread_pool.c,254,6422,task->id = ngx_thread_pool_task_id++
operator-precedence,nginx/src/core/ngx_thread_pool.c,122,3452,&tp->mtx
It is best to redirect this output to a file for further post-processing.
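As one possible post-processing step, the rows can be tallied per atom type from a REPL. This is a minimal sketch, not part of the project: the names parse-atom-row and atom-counts are hypothetical, and it assumes the first line is the header shown above, that fields are not quoted, and that file paths contain no commas. The code column can itself contain commas, so each row is split into at most five fields.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn parse-atom-row
  "Hypothetical helper: turn one CSV row into a map.
   Splits into at most 5 fields so commas inside the code column survive."
  [row]
  (zipmap [:atom :file :line :offset :code]
          (str/split row #"," 5)))

(defn atom-counts
  "Hypothetical helper: count rows per atom type, skipping the header line."
  [lines]
  (->> lines
       rest
       (map parse-atom-row)
       (map :atom)
       frequencies))

;; Sample rows copied from the output shape shown above.
(def sample-lines
  ["atom,file,line,offset,code"
   "operator-precedence,nginx/src/misc/ngx_google_perftools_module.c,109,2669,&gptcf->profiles"
   "post-increment,nginx/src/core/ngx_thread_pool.c,254,6422,task->id = ngx_thread_pool_task_id++"
   "operator-precedence,nginx/src/core/ngx_thread_pool.c,122,3452,&tp->mtx"])

(atom-counts sample-lines)
;; => {"operator-precedence" 2, "post-increment" 1}

(comment
  ;; Reading the real file, assuming atoms.csv is in the working directory.
  (with-open [r (io/reader "atoms.csv")]
    (atom-counts (line-seq r))))
```

If the output turns out to quote fields or escape commas, a proper CSV reader would be a safer choice than splitting on commas.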
To dump every AST node (independent of atom of confusion or not) run:
lein with-profile dump-asts run
The majority of interesting files in this project are located in the top-level src directory. Secondarily, test files are located in test and jars in resources. Under src there are several important directories:
- atom_finder - Clojure files for parsing C/C++ files and searching for atoms
- analysis - R files for statistically analyzing the results of the code mining
- conf - Configuration variables to customize each runtime environment
The most important and complicated directory is src/atom_finder, which contains all the code to analyze source code and repositories. The top level of atom-finder holds code which is specific to this project but also reusable between different analyses. Below the top level are several other useful directories:
- classifier - Every file in this directory is used to determine whether an individual AST node is a particular atom of confusion
- questions - Every file in this directory corresponds to one of our (published, or potential) research hypotheses. These files implicitly use the classifier infrastructure to observe patterns.
- tree_diff - Tree diffing was a difficult enough problem that it took several iterations to get working. Each evolution is its own sub-namespace in this directory. Ultimately only difflib ended up being used.
- util - The most reusable and general functions. Most of these files are potentially useful in other projects outside this one.
This project is primarily written in Clojure (JVM), and uses many Java libraries. In order to run this project you should install:
- Leiningen - The Clojure build manager. This tool will automatically download the right version of Clojure, resolve all the necessary libraries, run tests, and execute the program.
- One of Emacs/CIDER, Sublime/SublimeREPL, Nightcode, or anything that offers you a Clojure-centric workflow. The way one writes Clojure (and Lisp in general) is a bit more interactive than traditional development. It's important to be able to evaluate code as you write it.
After you've installed these tools, first run lein test to make sure everything is up and running.
Then you should be able to develop in your editor, executing snippets of code as you go.
The first thing you might want to do is parse some C code. There are three main functions for doing this: parse-file, parse-source and parse-frag. All three take a String as an argument and return an IASTNode. parse-file and parse-source both require whole programs, the former accepting a filename as its argument and the latter a string containing the full code. parse-frag, on the other hand, can take any (read "many") partial program. For example:
(parse-file "gcc/testsuite/c-c++-common/wdate-time.c") ;; => CPPASTTranslationUnit
(parse-source "int main() { 1 + 1; }") ;; => CPPASTTranslationUnit
(parse-frag "1 + 1") ;; => CPPASTBinaryExpression
After you've parsed some code, you might reasonably want to see what it looks like:
(->> "gcc/testsuite/c-c++-common/wdate-time.c"
     parse-file
     (get-in-tree [2])
     print-tree)
Which should output:
[] <SimpleDeclaration> {:line 6, :off 238, :len 39}
[0] <SimpleDeclSpecifier> {:line 6, :off 238, :len 10}
[1] <ArrayDeclarator> {:line 6, :off 249, :len 27}
[1 0] <Name> {:line 6, :off 249, :len 9}
[1 1] <ArrayModifier> {:line 6, :off 258, :len 2}
[1 2] <EqualsInitializer> {:line 6, :off 261, :len 15}
[1 2 0] <IdExpression> {:line 6, :off 263, :len 13}
[1 2 0 0] <Name> {:line 6, :off 263, :len 13}
Some other useful functions are:
print-tree -> Prints a debug view of the tree structure of an AST plus metadata
write-tree -> Takes an AST and returns the code that generated it (inverse parsing)
get-in-tree -> Digs down into an AST to get at nested children
default-finder -> Takes a function that returns true/false for a single AST node, and runs it over an entire AST
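The idea of lifting a single-node predicate into a whole-tree search can be illustrated without the parser. The following is only a sketch of the concept, not the project's default-finder: sketch-finder is a hypothetical name, and the nodes are plain maps with a :children vector standing in for real IASTNode objects.

```clojure
(defn sketch-finder
  "Hypothetical stand-in for a finder: given a predicate on one node,
   return a function that collects every matching node in a tree.
   Nodes are plain maps; a node's children live under :children."
  [pred]
  (fn [root]
    ;; tree-seq walks the tree depth-first, parents before children.
    (filter pred (tree-seq :children :children root))))

;; A toy tree shaped roughly like the expression 1 + (2 + x).
(def toy-tree
  {:type :binary-expr
   :children [{:type :literal :value 1}
              {:type :binary-expr
               :children [{:type :literal :value 2}
                          {:type :id :name "x"}]}]})

(->> toy-tree
     ((sketch-finder #(= :literal (:type %))))
     (map :value))
;; => (1 2)
```

Returning a function rather than the matches themselves is what makes the double parentheses necessary at the call site, the same shape used with default-finder in the examples in this document.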
You may also be interested in finding where in software projects atoms of confusion live.
In the classifier namespace there are several functions for finding atoms. First, every type of atom has a classifier which can be applied to an AST node to determine whether it represents an atom of confusion.
(->> "x++" parse-expr post-*crement-atom?) ;; => false
(->> "y = x++" parse-expr post-*crement-atom?) ;; => true
Further, by applying the default-finder function, each classifier can be adapted to find every example of an atom in a piece of code.
(->> "x = (1, 2) && y = (3, 4)"
     parse-expr
     ((default-finder comma-operator-atom?))
     (map write-tree))
;; => ("1, 2" "3, 4")
If you would like to find every atom in a piece of code you can use the helper function find-all-atoms in classifier.clj.
(->> "11 && 12 & 013"
     parse-expr
     find-all-atoms
     (map-values (partial map write-tree))
     (remove (comp empty? last))
     (into {}))
;; => {:operator-precedence ("11 && 12 & 013"), :literal-encoding ("12 & 013")}
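The last three steps of that pipeline are ordinary map manipulation and can be tried on plain data. The sketch below assumes map-values applies a function to every value of a map while keeping the keys; map-values-sketch is a hypothetical re-implementation for illustration, and the toy maps with a :code key stand in for AST nodes and write-tree.

```clojure
(defn map-values-sketch
  "Hypothetical helper: apply f to every value of m, keeping the keys."
  [f m]
  (into {} (map (fn [[k v]] [k (f v)])) m))

;; Toy result: atom name -> matching nodes (here, maps with a :code key).
(def toy-results
  {:post-increment      []
   :literal-encoding    [{:code "12 & 013"}]
   :operator-precedence [{:code "11 && 12 & 013"}]})

(->> toy-results
     (map-values-sketch (partial map :code)) ; nodes -> code strings
     (remove (comp empty? last))             ; drop atoms with no matches
     (into {}))
;; => {:literal-encoding ("12 & 013"), :operator-precedence ("11 && 12 & 013")}
```

The remove step works on map entries, so last picks out the value of each key/value pair; atoms with no matches are dropped before the entries are collected back into a map.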
Beyond simply finding atoms of confusion, there's a fair amount of code to answer specific questions about how atoms of confusion relate to a codebase. Much of this code lives in the questions directory, and is very poorly documented. Sorry in advance.