This is a tool for detecting atoms of confusion in the Java language as showcased in this paper by Langhout and Aniche.
Usage: tool [OPTIONS] COMMAND [ARGS]...
Analyze Java source code for the presence of atoms of confusion
Options:
-d, --disabled TEXT Space separated list of disabled atoms
-v, -V, --verbose Print the results of its analysis on the console
-l, -L, --log Save the progress of the analysis to a log file
-h, --help Show this message and exit
Commands:
files Analyze the provided files for atoms of confusion
pr Analyze the provided github pull request for atoms of confusion
Usage: tool files [OPTIONS] FILES...
Analyze the provided files for atoms of confusion
Options:
-r, -R, --recursive Recursively search any input directory for Java files
-h, --help Show this message and exit
Arguments:
FILES Space separated list of files/directories to analyze
For example:
# run the detector on File1.java, File2.java and all of the files in dir
tool files -r File1.java File2.java ./dir/
Usage: tool pr [OPTIONS] URL
Analyze the provided github pull request for atoms of confusion
Options:
-dl, -DL, --download Download all of the affected files in the pull request
both before and after the merge
-t, --token TEXT Github API key you can obtain one at
https://github.com/settings/tokens
-h, --help Show this message and exit
Arguments:
URL The github pr URL
For example:
# analyze pull request 1926 of the mockito project
tool pr https://github.com/mockito/mockito/pull/1926
# you can provide a token; this raises the GitHub API limit from 60 to 5000 requests per hour
tool pr --token <token> <url>
# passing the -dl flag will download the analyzed files both before and after the merge, to make manually finding the detected atoms simpler
tool pr -dl <url>
In this section you can find information about the implementation of the different parts of the tool. For more details, feel free to also check the documentation of the classes and the methods in the source code.
The tool is a command-line application, and its arguments are parsed with the Clikt library. You can run the tool on
local files or, alternatively, pass a GitHub pull request and analyze the code both before and after the merge. All of
the CLI logic is implemented in the file Cli.kt.
When running the detector on files, the InputParser class is responsible for retrieving the individual files provided
by the user and parsing them. Next, the detector is run and the results are reported to the user.
To run the tool on pull requests, the GitHub API is used to find the commit SHAs for the code before and after
applying the PR. Next, the .diff file for the PR is downloaded and parsed to get the affected filenames before and
after the merge, as well as the ranges of line numbers that were added or deleted. Lastly, the before and after files
are downloaded and the detector is executed on them. This produces two sets of atoms. Each atom in the before set whose
line number was deleted is marked as "removed" in the PR. Likewise, each atom in the after set whose line number was
added is marked as "added" in the PR. All other atoms in the after set are those that remain.
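The classification step described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the DetectedAtom and PrClassifier classes and their method names are hypothetical, and the added and deleted lines are assumed to be available as a set of line numbers per file name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of one detected atom: its type, its file, and its line.
final class DetectedAtom {
    final String type;
    final String file;
    final int line;

    DetectedAtom(String type, String file, int line) {
        this.type = type;
        this.file = file;
        this.line = line;
    }
}

// Hypothetical helper that splits atoms into removed, added, and remaining.
final class PrClassifier {
    /** Atoms of the "before" set that sit on a line the PR deleted. */
    static List<DetectedAtom> removed(List<DetectedAtom> before,
                                      Map<String, Set<Integer>> deletedLines) {
        return select(before, deletedLines, true);
    }

    /** Atoms of the "after" set that sit on a line the PR added. */
    static List<DetectedAtom> added(List<DetectedAtom> after,
                                    Map<String, Set<Integer>> addedLines) {
        return select(after, addedLines, true);
    }

    /** Atoms of the "after" set that do not sit on an added line. */
    static List<DetectedAtom> remaining(List<DetectedAtom> after,
                                        Map<String, Set<Integer>> addedLines) {
        return select(after, addedLines, false);
    }

    // Keeps the atoms whose "is on a changed line" status equals onChangedLine.
    private static List<DetectedAtom> select(List<DetectedAtom> atoms,
                                             Map<String, Set<Integer>> changedLines,
                                             boolean onChangedLine) {
        List<DetectedAtom> result = new ArrayList<>();
        for (DetectedAtom atom : atoms) {
            Set<Integer> lines = changedLines.getOrDefault(atom.file, Set.of());
            if (lines.contains(atom.line) == onChangedLine) {
                result.add(atom);
            }
        }
        return result;
    }
}
```

Note that "before" atoms are matched against line numbers of the old file, while "after" atoms are matched against line numbers of the new file, so the two sets of line numbers are never mixed.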
Here you can find high-level descriptions of the different parts of the analysis pipeline of the tool.
To parse the code, the tool uses a parser generated with ANTLR v4. The grammar we used, as well as the generated parser
and lexer, can be found under src/main/java.
To detect the atoms in the code, the tool relies heavily on the listener infrastructure provided by ANTLR. Using this,
we implemented the AtomsListener class, which can be found in the parsing package of the code base. This listener is
responsible for traversing the parse tree generated by the parser. During the traversal, the listener passes certain
nodes of the tree to different Detectors to check for atoms.
Detectors are the classes responsible for actually analysing a part of the source code for atoms. In general, each
detector corresponds to one specific atom. All Detectors can be found in the parsing.detectors package. Each detector
is annotated with the Visit annotation, which specifies the parse tree nodes on which the detector should be called.
The detectors are then registered with the AtomsListener, which uses the annotation to know when to call a specific
detector.
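The registration mechanism described above can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the real Visit annotation and AtomsListener may look different, the node classes stand in for ANTLR parse tree contexts, and the names carry a Sketch suffix to make clear they are not the tool's own classes.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for ANTLR parse tree node types.
class ExpressionNode {}
class StatementNode {}

// Assumed shape of the annotation: it lists the node types a detector handles.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface VisitSketch {
    Class<?>[] value();
}

interface DetectorSketch {
    void detect(Object node, List<String> findings);
}

// A detector that only wants to see expression nodes.
@VisitSketch(ExpressionNode.class)
class ExpressionDetector implements DetectorSketch {
    public void detect(Object node, List<String> findings) {
        findings.add("expression checked");
    }
}

class ListenerSketch {
    private final Map<Class<?>, List<DetectorSketch>> byNodeType = new HashMap<>();
    final List<String> findings = new ArrayList<>();

    // Reads the annotation once, at registration time, to index the detector.
    void register(DetectorSketch detector) {
        VisitSketch visit = detector.getClass().getAnnotation(VisitSketch.class);
        if (visit == null) {
            throw new IllegalArgumentException("detector lacks @VisitSketch");
        }
        for (Class<?> nodeType : visit.value()) {
            byNodeType.computeIfAbsent(nodeType, k -> new ArrayList<>()).add(detector);
        }
    }

    // Called during traversal; forwards the node to the matching detectors only.
    void enter(Object node) {
        for (DetectorSketch d : byNodeType.getOrDefault(node.getClass(), List.of())) {
            d.detect(node, findings);
        }
    }
}
```

Reading the annotation at registration time rather than on every node keeps the per-node cost of the traversal down to a single map lookup.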
Detecting some of the atoms requires identifier and symbol resolution. That is why the tool also keeps scoping
information about the code being analysed. To implement this, we extended the symtab library provided by the ANTLR
team. The classes that we added to extend the library's functionality can be found in the parsing.symtab package. The
logic related to scoping is implemented by the AtomsListener.
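To show why scoping information is needed for symbol resolution, here is a minimal sketch of nested scopes. It does not use the symtab library and does not reflect its API; the ScopeSketch class and its methods are hypothetical, and symbols are reduced to a name mapped to a declared type.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal scope; the symtab library offers a much richer model.
final class ScopeSketch {
    private final ScopeSketch parent; // null for the outermost scope
    private final Map<String, String> symbols = new HashMap<>(); // name -> declared type

    ScopeSketch(ScopeSketch parent) {
        this.parent = parent;
    }

    void define(String name, String type) {
        symbols.put(name, type);
    }

    /** Looks a name up in this scope, then in each enclosing scope; null if unknown. */
    String resolve(String name) {
        for (ScopeSketch scope = this; scope != null; scope = scope.parent) {
            String type = scope.symbols.get(name);
            if (type != null) {
                return type;
            }
        }
        return null;
    }
}
```

With this structure, a local variable that shadows a field resolves to the local declaration inside the method scope and to the field everywhere else, which is the kind of information a detector needs to decide the actual type of an identifier.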
In this section you can find information on how the tool internally represents the results of the analysis, as well as how the tool outputs them.
The confusion graph is a specialised data structure developed for this tool that allows for quickly storing the atoms
found, as well as for efficient queries. The main idea behind it is that there are two types of nodes (Atom nodes, each
representing a type of atom, and Source nodes, each representing an input file), connected to each other with edges
that record where the atom appears. For example, if the Type Conversion atom exists on lines 10, 32 and 50 of the file
Hello.java, then the graph contains the Type Conversion atom node connected to the Hello.java source node with an edge
containing the set {10, 32, 50}. Keep in mind that Atom nodes can only be connected to Source nodes and vice versa.
This constraint is enforced by the code, and exceptions are thrown if you try to connect two nodes of the same type.
Finally, because the graph is implemented with hash maps and some duplication of information, most operations run in
O(1) time. This allows the tool to remain fast even when analysing large sets of files. The code of the graph can be
found in the output.graph package.
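The idea of a two-sided graph backed by hash maps can be sketched as follows. This is an illustration under assumptions, not the code in output.graph: the ConfusionGraphSketch class and its methods are hypothetical, nodes are identified by plain strings, and the rule that atoms only connect to sources is expressed through the method signature instead of runtime exceptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ConfusionGraphSketch {
    // The edges are indexed from both sides, so either lookup is a hash map access.
    private final Map<String, Map<String, Set<Integer>>> byAtom = new HashMap<>();
    private final Map<String, Map<String, Set<Integer>>> bySource = new HashMap<>();

    /** Records that the given atom appears in the given source file on the given line. */
    void addAppearance(String atom, String source, int line) {
        Set<Integer> lines = byAtom
            .computeIfAbsent(atom, k -> new HashMap<>())
            .computeIfAbsent(source, k -> new HashSet<>());
        lines.add(line);
        // Both indexes share the same set instance, which is the duplicated edge reference.
        bySource.computeIfAbsent(source, k -> new HashMap<>()).put(atom, lines);
    }

    /** Line numbers stored on the edge between an atom and a source file. */
    Set<Integer> linesOf(String atom, String source) {
        Set<Integer> lines = byAtom.getOrDefault(atom, Map.of()).getOrDefault(source, Set.of());
        return Collections.unmodifiableSet(lines);
    }

    /** All atom types found in a source file. */
    Set<String> atomsIn(String source) {
        return Collections.unmodifiableSet(bySource.getOrDefault(source, Map.of()).keySet());
    }

    /** All source files in which an atom type was found. */
    Set<String> sourcesWith(String atom) {
        return Collections.unmodifiableSet(byAtom.getOrDefault(atom, Map.of()).keySet());
    }
}
```

Keeping a second index costs some memory, but it means that "which atoms are in this file" and "which files contain this atom" are both answered without scanning the whole graph.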
The tool also provides support for seeing how atoms change in a pull request. This is implemented in two steps.
Firstly, the diff file associated with the pull request is parsed to get information about which lines have been
removed and added. This is implemented by the DiffParser class found in the github package. Secondly, the information
retrieved is compared with the information in the graphs generated by analysing the "from" and "to" branches of the pull
request to see which atoms have been removed, which have been added, and which remain. The logic for this is
implemented in the PRDelta class under the aforementioned package.
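The first step, extracting removed and added line numbers from a diff, can be sketched as below. This is not the DiffParser class: the HunkSketch name is hypothetical, the sketch assumes a unified diff that covers a single file, and it ignores details such as renames and binary files.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HunkSketch {
    // Matches unified diff hunk headers such as "@@ -3,4 +3,5 @@ class Hello {".
    private static final Pattern HEADER =
        Pattern.compile("^@@ -(\\d+)(?:,\\d+)? \\+(\\d+)(?:,\\d+)? @@.*");

    final Set<Integer> removed = new LinkedHashSet<>(); // line numbers in the old file
    final Set<Integer> added = new LinkedHashSet<>();   // line numbers in the new file

    /** Parses the hunks of a unified diff that covers a single file. */
    static HunkSketch parse(String diff) {
        HunkSketch result = new HunkSketch();
        int oldLine = 0;
        int newLine = 0;
        boolean inHunk = false;
        for (String line : diff.split("\n")) {
            Matcher header = HEADER.matcher(line);
            if (header.matches()) {
                // A hunk header resets both counters to the hunk's start lines.
                oldLine = Integer.parseInt(header.group(1));
                newLine = Integer.parseInt(header.group(2));
                inHunk = true;
            } else if (!inHunk) {
                continue; // file headers such as "--- a/..." and "+++ b/..."
            } else if (line.startsWith("-")) {
                result.removed.add(oldLine++); // exists only in the old file
            } else if (line.startsWith("+")) {
                result.added.add(newLine++);   // exists only in the new file
            } else if (line.startsWith(" ")) {
                oldLine++;                     // context line, present in both files
                newLine++;
            }
        }
        return result;
    }
}
```

The two counters advance independently, which is why a removed line is reported with its number in the old file and an added line with its number in the new file.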
Finally, to write the output to CSV files, we used the kotlin-csv library to implement the CsvWriter, which provides
methods for writing both the graph and the PRDelta to CSV files. The code for this class can be found in the
output.writers package.