\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote. If that is unneeded, please comment it out.
%Template version as of 6/27/2024
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{url}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\begin{document}
\newif\ifrev
%%%%%%%%%%%%%%%%%%
% COMMENT OUT NEXT LINE TO HIDE COMMENTS
%\revtrue
%%%%%%%%%%%%%%%%%%
\ifrev
\newcommand{\yanyan}[1]{{\color{blue} [Yanyan: #1]}}
\newcommand{\red}[1]{\textcolor{red}{#1}}
\newcommand{\todo}[1]{\textcolor{red}{\textbf{Anyone todo}: #1}}
\else
\newcommand{\yanyan}[1]{}
\newcommand{\red}[1]{}
\newcommand{\todo}[1]{}
\fi
\title{Encourage, Engage, and
Enhance a Community for Atoms of Confusion}
\author{Justin Cappos${}^{\rm ny}$ \qquad Dan Gopstein${}^{\rm ms}$ \qquad Renata Vaderna${}^{\rm ny}$ \qquad Yanyan Zhuang${}^{\rm co}$\\
${}^{\rm ny}$\,New York University \quad
${}^{\rm co}$\,University of Colorado, Colorado Springs \quad ${}^{\rm ms}$\,Microsoft %\thanks{This work is supported in part by \yanyan{Justin complete}.}
}
\maketitle
\begin{abstract}
Recent work by our team has empirically
validated the existence of small patterns in C code
that can interfere with program comprehension.
These small code patterns, named atoms of confusion
(or atoms for short), are correlated with many software bugs. This finding has been replicated by numerous
research teams. For example, atoms have been validated in other programming languages, including Java and Python, and through various methods, such as EEG measurements.
To engage a growing community, we have set out to put our work
into practice by bridging the gap between academia and
industry. We have been working to deploy atom parsers in the
Linux kernel and developing data sharing and visualization
tools with the Linux Foundation. We plan to build a broader
community through academia-industry collaboration,
encouraging, engaging, and enhancing community connections to
grow into a large open-source ecosystem. We invite you to join
us!
\end{abstract}
\begin{IEEEkeywords}
Atoms of Confusion, Code Comprehension, Community Effort, Open Source
\end{IEEEkeywords}
\section{Introduction}
Software bugs represent a significant challenge worldwide:
poor software quality cost approximately \$2.08 trillion in the
US alone in 2020, as reported by the Consortium for
Information \& Software Quality (CISQ)~\cite{CPSQ2020}.
Addressing these bugs efficiently before they cause damage is
crucial in both industry and academia.
%
Code review serves as a cornerstone of quality assurance in
both open source and proprietary projects. Research highlights
that inadequate code reviews are linked to diminished software
quality, the emergence of anti-patterns, and increased
post-release defects. %\yanyan{Should have a citation here?}
Ensuring that software is understandable to
humans is crucial yet challenging, and it is often approached ad hoc.
Our research has delved into the causes of code confusion for
human developers. We have empirically validated
small C code patterns, termed \emph{atoms of confusion}, which
programmers often misinterpret~\cite{gopstein2017understanding}. Our analysis of 14 large C projects
revealed hundreds of thousands of these atoms, whose presence is highly
correlated with both comments and bug fixes, underscoring
their impact on software reliability~\cite{gopstein2018prevalence}.
%
This body of work has inspired a breadth of subsequent studies
across various programming languages, employing methodologies
like eye-tracking and EEG to probe deeper into how programmers
interact with these confusing elements
\cite{langhout2021atoms,
torres2023investigation, dacosta2023seeing,
Manor2018AtomsConfusionSwift, yeh2017detecting}. Over a dozen research teams, e.g., from the Netherlands and Brazil, have replicated experiments from our prior work and published results for Java~\cite{langhout2021atoms, mendes2022dazed}, JavaScript~\cite{oliveira2019impact}, Swift~\cite{castor2018identifying}, Python~\cite{da2023seeing}, and the Android platform~\cite{tabosa2024dataset}. In addition, a senior scientist from the University of Klagenfurt, Austria, collaborated with us to design a high-quality assessment tool for C programming ability~\cite{glasauer2024c}, using the datasets from our work~\cite{gopstein2017understanding, zhuang2023developer}.
As this research gained momentum, we initiated
efforts to enhance community collaboration. By fostering
discussions, sharing datasets, and initiating research into
a unified standard for what constitutes an atom of
confusion, we aim to streamline the process for researchers to
conduct future atoms-related experiments. Additionally, we are
committed to expanding awareness of, and fostering interest in,
this novel approach to examining the causes of code confusion
and bugs, not just within academia but also across
industry.
\section{Enhancing Community Engagement}
The community's enthusiasm, however, has also produced differing definitions and experimental procedures, leading to divergent uses of our original concept. In this section, we outline our efforts to unify the research community and to apply our work in collaboration with industry.
\subsection{Community Meetings}
In our experience, organizing community meetings has
proven to be effective in gathering individuals who are
passionate about a research project. Since February 2024, we
have held bimonthly meetings to cultivate a strong sense
of community among researchers. Our goal is to establish a
supportive platform where participants can share their work
and engage in constructive feedback exchanges within the
community.

Another challenge is the varying interpretations of what
constitutes an atom. We organized meetings specifically
focused on this issue, presenting multiple code snippets that
some participants consider atoms of confusion but others do not find
confusing. These discussions have sparked significant
interest and motivated further exploration into understanding
these divisive elements.
\subsection{A Framework for Parsing Atoms of Confusion}
While our previous parsers~\cite{gopstein2018prevalence}
have shown correlations between atoms, bugs, and
code comments, they have not provided a clear
definition of what constitutes an atom. Our goal is to develop a precise, easily
extensible, and reproducible methodology for identifying
specific code patterns as atoms. To this end, we are actively
developing a framework that includes guidelines
to determine what is and is not an atom, based on our findings that increased occurrences of comments and bug fixes are often
correlated with the presence of atoms.
First, we count the occurrences of all Abstract Syntax Tree
(AST) subtrees within a large open-source project, as well as
the AST subtrees in commented code snippets and in code removed by bug-fix commits. Next, we
look for code patterns that are correlated (or not
correlated) with high frequencies of bug fixes and comments;
the correlated patterns are identified as atoms. To test the
significance of these correlations, we will apply a formal
statistical method, such as Hotelling's $T^2$ test, to
distinguish atoms from non-atoms.
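As one possible instantiation of such a test (a sketch under our own assumptions, not a finalized design), consider the two-sample Hotelling's $T^2$ statistic. The notation here is introduced only for illustration: each observation is a $p$-dimensional vector of measures (e.g., $p = 2$ for comment frequency and bug-fix frequency), $\bar{\mathbf{x}}_1$ and $\bar{\mathbf{x}}_2$ are the sample mean vectors of $n_1$ observations of a candidate pattern and $n_2$ observations of baseline code, and $\mathbf{S}_1$ and $\mathbf{S}_2$ are the corresponding sample covariance matrices. Assuming multivariate normality and a common covariance matrix,
\begin{align}
\mathbf{S}_{\mathrm{pool}} &= \frac{(n_1 - 1)\,\mathbf{S}_1 + (n_2 - 1)\,\mathbf{S}_2}{n_1 + n_2 - 2}, \\
T^2 &= \frac{n_1 n_2}{n_1 + n_2}\,
(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)^{\top}
\mathbf{S}_{\mathrm{pool}}^{-1}
(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2), \\
F &= \frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\,T^2 \sim F_{p,\;n_1 + n_2 - p - 1}.
\end{align}
Under the null hypothesis that the two mean vectors are equal, the scaled statistic follows an $F$ distribution with $p$ and $n_1 + n_2 - p - 1$ degrees of freedom, so a sufficiently large value would suggest that the candidate pattern differs from the baseline in its comment and bug-fix profile.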
This approach aims to provide a precise and objective method
for identifying atoms, potentially uncovering additional
confusing patterns that were previously unrecognized.
Researchers will also be able to apply different metrics
beyond comments and bug-fix frequencies by adding new
dimensions or substituting the measures we have chosen. This
framework is designed to support hypothesis testing among
community members, assist future projects in exploring atoms
in various programming languages, and enhance software
reliability.
\subsection{Collaboration with the Industry}
We aim for our research to go beyond academic borders and have
a greater impact in industry. Our previous research helped
identify and fix several Linux kernel bugs caused by macros,
often occurring when developers incorrectly parenthesized one
or more arguments (as a hypothetical illustration, \texttt{\#define SQ(x) x*x} expands \texttt{SQ(a+1)} to \texttt{a+1*a+1}). This success demonstrates the potential
practical impact of our work and highlights the importance of
ongoing collaboration with the industry. Below, we provide two
more examples of such collaboration.

\textbf{Coccinelle} is a tool widely used to detect and fix issues,
as well as perform refactoring, in the Linux kernel code.
It allows users to specify patterns of interest in any C code and,
optionally, their replacements using the Semantic Patch Language.
Unlike traditional patches, Coccinelle's semantic patches can modify thousands
of code sites across hundreds of files by abstracting away specific details
and variations~\cite{coccinelle}.
Given its extensive use and our collaboration with the
maintainers of Coccinelle, we have implemented atom parsers
using this tool. While our primary focus is on the Linux
kernel, Coccinelle can be used to process any C code.
Consequently, the parsers we have developed for detecting
atoms can also be applied to other C codebases. We have
implemented parsers for the detection of 14 out of 15
identified atoms, excluding only one that is beyond
Coccinelle's capabilities. All these parsers are available in
our GitHub repository \cite{githubcocci}.
Additionally, to further encourage collaboration between
academia and industry, we dedicated one of our community
meetings to Coccinelle. We invited a key contributor to
Coccinelle to present the tool to our community, with a
specific focus on detecting atoms. This initiative brings us a
step closer to bridging the gap between academic research and
practical industry applications.

\textbf{OpenSSF and the Linux Foundation.}
%
To further solidify our efforts and expand the impact of our
research, we are planning to establish a collaboration with
the Open Source Security Foundation (OpenSSF) and the Linux
Foundation. These organizations provide a collaborative and
structured environment that encourages innovation and
standardization in open-source security. An important aspect
of this collaboration is the sharing of datasets, which allows
teams to replicate and build upon our and each other's work.
To facilitate the sharing of various large datasets, we are
working with the Linux Foundation to host them in a neutral,
industry-facing manner. We have successfully collaborated with
these organizations on other projects in the past, and with
this established reputation, we aim to further promote the
adoption of our methodologies and tools. Aligning with OpenSSF
and the Linux Foundation will help ensure that our research on
atoms of confusion can benefit a wider range of software
projects across the industry.
{\scriptsize \bibliographystyle{abbrv} \bibliography{bibdata}}
\end{document}