
\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote. If that is unneeded, please comment it out.
%Template version as of 6/27/2024

\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{url}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
    T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\begin{document}

\newif\ifrev
%%%%%%%%%%%%%%%%%%
% COMMENT OUT NEXT LINE TO HIDE COMMENTS
%\revtrue
%%%%%%%%%%%%%%%%%%
\ifrev
  \newcommand{\yanyan}[1]{{\color{blue} [Yanyan: #1]}}
  \newcommand{\red}[1]{\textcolor{red}{#1}}
  \newcommand{\todo}[1]{\textcolor{red}{\textbf{Anyone todo}: #1}}
\else
  \newcommand{\yanyan}[1]{}
  \newcommand{\red}[1]{}
  \newcommand{\todo}[1]{}
\fi

\title{Encourage, Engage, and
Enhance a Community for Atoms of Confusion}

\author{Justin Cappos${}^{\rm ny}$ \qquad Dan Gopstein${}^{\rm ms}$ \qquad Renata Vaderna${}^{\rm ny}$ \qquad Yanyan Zhuang${}^{\rm co}$\\
${}^{\rm ny}$\,New York University \quad 
${}^{\rm co}$\,University of Colorado, Colorado Springs \quad  ${}^{\rm ms}$\,Microsoft %\thanks{This work is supported in part by \yanyan{Justin complete}.}
}

\maketitle

\begin{abstract}
Recent work by our team has empirically 
validated the existence of small patterns in C code
that can interfere with program comprehension.
These small code patterns, named atoms of confusion 
(or atoms for short), are correlated with many software bugs. This finding has been replicated by numerous 
research teams: atoms have been validated in other programming languages, including Java and Python, and through various methods, such as EEG measurements.

To engage a growing community, we have set out to put our work 
into practice by bridging the gap between academia and 
industry. We have been working to deploy atom parsers in the 
Linux kernel and to develop data sharing and visualization 
tools with the Linux Foundation. We plan to build a broader 
community through academia-industry collaboration, 
encouraging, engaging, and enhancing community connections to 
grow into a large open-source ecosystem. We invite you to join 
us!

\end{abstract}

\begin{IEEEkeywords}
Atoms of Confusion, Code Comprehension, Community Effort, Open Source
\end{IEEEkeywords}

\section{Introduction}

Software bugs represent a significant challenge worldwide: 
poor software quality cost approximately \$2.08 trillion in the 
US alone in 2020, as reported by the Consortium for 
Information \& Software Quality (CISQ) \cite{CPSQ2020}. 
Addressing these bugs efficiently before they cause damage is 
crucial in both industry and academia.
%
Code review serves as a cornerstone of quality assurance in 
both open source and proprietary projects. Research highlights 
that inadequate code reviews are linked to diminished software 
quality, the emergence of anti-patterns, and increased 
post-release defects. %\yanyan{Should have a citation here?} 
Ensuring that software is understandable to 
humans is crucial yet challenging, and it is often approached ad hoc.

Our research has delved into the causes of code confusion for 
human developers. We have empirically validated 
small C code patterns, termed \emph{atoms of confusion}, which 
programmers often misinterpret~\cite{gopstein2017understanding}. Analyzing 14 large C projects 
revealed hundreds of thousands of these atoms, highly 
correlated with both comments and bug fixes, underscoring 
their profound impact on software reliability 
\cite{gopstein2018prevalence}.
%
This body of work has inspired a breadth of subsequent studies 
across various programming languages, employing methodologies 
like eye-tracking and EEG to probe deeper into how programmers 
interact with these confusing elements 
\cite{langhout2021atoms, 
torres2023investigation, dacosta2023seeing, 
Manor2018AtomsConfusionSwift, yeh2017detecting}. Over a dozen research teams, e.g., from the Netherlands and Brazil, have replicated experiments from our prior work and published their work about Java~\cite{langhout2021atoms, mendes2022dazed}, JavaScript~\cite{oliveira2019impact},  Swift~\cite{castor2018identifying}, Python~\cite{da2023seeing}, and the Android platform~\cite{tabosa2024dataset}. A Senior Scientist from the University of Klagenfurt, Austria collaborated with us and published a recent discovery that used the datasets from our work~\cite{gopstein2017understanding, zhuang2023developer} and designed a high-quality C programming ability assessment tool~\cite{glasauer2024c}.

As this research gained momentum, we initiated 
efforts to enhance community collaboration. By fostering 
discussions, sharing datasets, and initiating research into 
a unified standard for what constitutes an atom of 
confusion, we aim to streamline the process for researchers to 
conduct future atoms-related experiments. Additionally, we are 
committed to expanding awareness of and fostering interest in 
this novel approach to examining the causes of code confusion 
and bugs, not just within academia but also across 
industry.


\section{Enhancing Community Engagement}

With the community's enthusiasm, however, have come differing definitions and methodologies across these works, leading to divergent uses of our original concept. In this section, we outline our efforts to unify the research community and to apply our work in collaboration with industry.

\subsection{Community Meetings}

In our experience, organizing community meetings has 
proven to be effective in gathering individuals who are 
passionate about a research project. Since February 2024, we 
have held bimonthly meetings to cultivate a strong sense 
of community among researchers. Our goal is to establish a 
supportive platform where participants can share their work 
and engage in constructive feedback exchanges within the 
community.
One challenge the community faces is the varied interpretation of what 
constitutes an atom. We organized meetings focused specifically 
on this issue, presenting multiple code snippets that 
some consider atoms of confusion while others do not find 
confusing. These discussions have sparked significant 
interest and motivated further exploration into understanding 
these divisive elements.

\subsection{A Framework for Parsing Atoms of Confusion}

While our previous parsers \cite{gopstein2018prevalence} 
have shown correlations between atoms, bugs, and 
code comments, they have not succeeded in providing clear 
definitions. Our goal is to develop a precise, easily 
extensible, and reproducible methodology for identifying
specific code patterns as atoms. To this end, we are actively 
developing a framework that includes guidelines 
to determine what is and is not an atom, based on our findings that increased occurrences of comments and bug fixes are often 
correlated with the presence of atoms. 

First, we count the occurrences of all Abstract Syntax Tree 
(AST) subtrees within a large open-source project, as well as 
those in commented code snippets and in code removed by bug-fix commits. Next, we 
look for patterns of code that are correlated (or not 
correlated) with high frequencies of bug fixes and comments. 
The correlated patterns are identified as atoms. To test the 
significance of these correlations, we will apply a formal 
statistical method, such as Hotelling's $T^2$ test, to 
distinguish atoms from non-atoms.
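
As an illustrative sketch of how these steps could be formalized (the notation and the two-sample setup below are assumptions made for exposition, not a finalized design), let $c(s)$ denote the number of occurrences of an AST subtree pattern $s$ in the project, and let $c_{\mathrm{com}}(s)$ and $c_{\mathrm{fix}}(s)$ denote its occurrences in commented code and in code removed by bug-fix commits, respectively. Each pattern can then be summarized by the ratio vector
\begin{equation*}
\mathbf{x}(s) = \left( \frac{c_{\mathrm{com}}(s)}{c(s)},\; \frac{c_{\mathrm{fix}}(s)}{c(s)} \right)^{\top}.
\end{equation*}
Suppose these vectors (of dimension $p = 2$ here) are measured on $n_1$ samples for a candidate pattern and on $n_2$ samples for a baseline of ordinary code, for example one sample per project, with sample means $\bar{\mathbf{x}}_1, \bar{\mathbf{x}}_2$ and sample covariance matrices $\mathbf{S}_1, \mathbf{S}_2$. The standard two-sample Hotelling statistic is
\begin{align*}
\mathbf{S}_{\mathrm{pool}} &= \frac{(n_1 - 1)\,\mathbf{S}_1 + (n_2 - 1)\,\mathbf{S}_2}{n_1 + n_2 - 2}, \\
T^2 &= \frac{n_1 n_2}{n_1 + n_2}\, \mathbf{d}^{\top} \mathbf{S}_{\mathrm{pool}}^{-1}\, \mathbf{d}, \qquad \mathbf{d} = \bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2,
\end{align*}
and, under the null hypothesis of equal means (assuming multivariate normality and a common covariance matrix),
\begin{equation*}
\frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\, T^2 \sim F_{p,\; n_1 + n_2 - p - 1}.
\end{equation*}
A large value of $T^2$ would then be evidence that the candidate pattern differs from the baseline in its comment and bug-fix ratios.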

This approach aims to provide a precise and objective method 
for identifying atoms, potentially uncovering additional 
confusing patterns that were previously unrecognized. 
Researchers will also be able to apply different metrics 
beyond comments and bug-fix frequencies by adding new 
dimensions or substituting the measures we have chosen. This 
framework is designed to support hypothesis testing among 
community members, assist future projects in exploring atoms 
in various programming languages, and enhance software 
reliability.


\subsection{Collaboration with the Industry}

We aim for our research to go beyond academic borders and have 
a greater impact in industry. Our previous research helped 
identify and fix several Linux kernel bugs caused by macros, 
which often arose when developers incorrectly parenthesized one 
or more arguments. This success demonstrates the potential 
practical impact of our work and highlights the importance of 
ongoing collaboration with the industry. Below, we provide two 
more examples of such collaboration.

\textbf{Coccinelle} is a tool widely used to detect and fix issues,
as well as perform refactoring, in the Linux kernel code.
It allows users to specify patterns of interest in any C code and,
optionally, their replacements using the Semantic Patch Language (SmPL).
Unlike traditional patches, Coccinelle's semantic patches can modify
thousands of code sites across hundreds of files by abstracting away specific details
and variations \cite{coccinelle}.

Given its extensive use and our collaboration with the 
maintainers of Coccinelle, we have implemented atom parsers 
using this tool. While our primary focus is on the Linux 
kernel, Coccinelle can be used to process any C code. 
Consequently, the parsers we have developed for detecting 
atoms can also be applied to other C codebases. We have 
implemented parsers for the detection of 14 out of 15 
identified atoms, excluding only one that is beyond 
Coccinelle's capabilities. All these parsers are available in 
our GitHub repository \cite{githubcocci}.

Additionally, to further encourage collaboration between 
academia and industry, we dedicated one of our community 
meetings to Coccinelle. We invited a key contributor to 
Coccinelle to present the tool to our community, with a 
specific focus on detecting atoms. This initiative brings us a 
step closer to bridging the gap between academic research and 
practical industry applications.

\textbf{OpenSSF and the Linux Foundation.}
%
To further solidify our efforts and expand the impact of our 
research, we are planning to establish a collaboration with 
the Open Source Security Foundation (OpenSSF) and the Linux 
Foundation. These organizations provide a collaborative and 
structured environment that encourages innovation and 
standardization in open-source security. An important aspect 
of this collaboration is the sharing of datasets, which allows 
teams to replicate and build upon our and each other's work. 
To facilitate the sharing of various large datasets, we are 
working with the Linux Foundation to host them in a neutral, 
industry-facing manner. We have successfully collaborated with 
these organizations on other projects in the past, and, building on 
this track record, we aim to further promote the 
adoption of our methodologies and tools. Aligning with OpenSSF 
and the Linux Foundation will help ensure that our research on 
atoms of confusion can benefit a wider range of software 
projects across the industry.

{\scriptsize \bibliographystyle{abbrv} \bibliography{bibdata}}


\end{document}