markconsume.tex

\documentclass[letterpaper,10pt]{article}
\usepackage{epsfig,endnotes}
%\usepackage{usenix,epsfig}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{subfig}
\usepackage{fixltx2e}
\usepackage{url}        % \url{} command with good linebreaks
\usepackage{amssymb,amsmath}
\usepackage{graphicx}
\usepackage{enumerate}
\usepackage{listings}
\usepackage{xspace}
\usepackage{xcolor}
\lstset{basicstyle=\ttfamily}
\usepackage[bookmarks=true,bookmarksnumbered=true,pdfborder={0 0 0}]{hyperref}

% Avoid widows and orphans
\widowpenalty=500
\clubpenalty=500

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{document}

\lstset{
	literate={\_}{}{0\discretionary{\_}{}{\_}}%
}
\newcommand{\co}[1]{\lstinline[breaklines=yes,breakatwhitespace=yes]{#1}}

\title{P0462R1:\\ Marking {\tt memory\_order\_consume} Dependency Chains}

\author{
{\bf Doc. No.: } WG21/P0462R1 \\
{\bf Date: } 2017-02-07 \\
{\bf Reply to: } Paul E. McKenney, Torvald Riegel, Jeff Preshing, \\
	Hans Boehm, Clark Nelson, Olivier Giroux, Lawrence Crowl, \\
	JF Bastien, and Micheal Wong \\
{\bf Email: } paulmck@linux.vnet.ibm.com, triegel@redhat.com, \\
jeff@preshing.com, boehm@acm.org, clark.nelson@intel.com, \\
OGiroux@nvidia.com, Lawrence@Crowl.org, jfbastien@apple.com, \\
and fraggamuffin@gmail.com \\ ~ \\
{\bf Other contributors: }
	Alec Teal,
	Alisdair Meredith, \\
	David Howells,
	David Lang,
	George Spelvin,
	Jeff Law, \\
	Joseph S. Myers,
	Linus Torvalds,
	Mark Batty,
	Michael Matz, \\
	Peter Sewell,
	Peter Zijlstra,
	Ramana Radhakrishnan, \\
	Richard Biener,
	Will Deacon,
	Faisal Vali,
	Behan Webster, \\
	Tony Tye,
	Thomas Koeppe,
	Jens Maurer,
	... \\
} % end author

% Use the following at camera-ready time to suppress page numbers.
% Comment it out when you first submit the paper for review.
%\thispagestyle{empty}

\pagestyle{myheadings}
\markright{WG21/P0462R1}

\maketitle

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This document is based in part on WG21/D0098R1, extracting the
alternatives for marking dependency chains (each headed by a
\co{memory_order_consume}~\cite{RichardSmith2015N4527} load).
It also adds a few additional alternatives based on discussions
at the March 2016 meeting in Jacksonville, Florida,
It contains updates based on feedback from the 2016 meeting in
Issaquah, Washington, and has also been reorganized to place the
alternatives not taken into an appendix.
This document does not define the behavior of dependency chains, but instead
only the syntax used to call the compiler's attention to them.
Please see WG21/P0190R3 for detailed information on dependency-chain behavior.

A detailed change log starts on
page~\pageref{sec:Change Log}.

\section{Introduction}
\label{sec:Introduction}

Spirited discussions of \co{memory_order_consume}
at the Jacksonville meeting resulted in a few of items of agreement:

\begin{enumerate}
\item	Dependency chains should be restricted to pointers.
	Please note that this excludes not only the troublesome objects
	of integral type, but also accesses to static members of classes.
\item	Unmarked code can be handled by having the implementation
	behave as if markings had been supplied in all locations that
	could reasonably be marked.
	This allows natural handling of dependencies in unmarked code.
	This behavior should be controlled by a compiler flag.
	Such a flag is of course outside of the standard.
\item	Software artifacts that are built standalone (such as the Linux
	kernel and numerous embedded projects) can reasonably use
	unmarked dependency chains.
	In contrast, software artifacts that are expected to dynamically link
	against standard libraries seem likely to need to mark their
	dependency chains.
\item	Discussions involving marking of library APIs have been
	set aside for the moment, and so this document does not address
	this point.
\end{enumerate}

These points result in three known valid ways of handling
\co{memory_order_consume}:

\begin{enumerate}
\item	Ignore the markings and promote \co{memory_order_consume}
	to \co{memory_order_acquire}, as is current practice.
\item	On all known platforms other than DEC Alpha,\footnote{
		DEC Alpha systems require that each \co{memory_order_consume}
		loads be followed by full memory-barrier instructions
		if there are any loads that depend on the
		\co{memory_order_consume} load~\cite{Compaq01,ALPHA95}.
		Therefore, on DEC Alpha we recommend promoting
		\co{memory_order_consume} loads to \co{memory_order_acquire}.}
	ignore the markings, emit the same code for
	\co{memory_order_consume} as is emitted for
	\co{memory_order_relaxed}, and suppress troublesome
	optimizations of pointers.
	However, there was some difficulty in arriving at a precise
	definition of ``troublesome''.
\item	On all known platforms other than DEC Alpha, emit the same code for
	\co{memory_order_consume} as is emitted for \co{memory_order_relaxed}
	and suppress troublesome optimizations of marked pointers.
	The fact that such optimizations need not be suppressed
	for unmarked pointers means that a much more conservative
	definition of ``troublesome'' is feasible, thus reducing
	the need for precision.
	Note that pointer comparisons will still break dependency chains
	in some cases, unless the comparisons were carried out using
	proposed dependency-preserving pointer-comparison intrinsics.
	Note further that the template-based method described in
	Section~\ref{sec:Template}
	uses operator overloading so that the usual relational
	operators invoke these intrinsics.
\end{enumerate}

The key requirement enabling \co{memory_order_relaxed} code to be
emitted for \co{memory_order_consume} loads is the preservation
of address and data dependencies through certain operations on
pointers, as detailed in WG21/P0190R2.
All known systems other than DEC Alpha preserve dependencies as required.
``Troublesome optimizations'' can be roughly characterized as user-visible
data speculation.
Note that hardware speculation and leveraging trans\-ac\-tional-mem\-o\-ry
hardware to carry out software speculation are not
user-visible~\cite{MarkDHill2007DeconstructHTM},
and thus are consistent with emitting \co{memory_order_relaxed} code
for \co{memory_order_consume} loads.

However, a number of ways of marking dependency chains have been
proposed and there was nothing resembling any sort of agreement
on which should be used.
This paper therefore catalogs approaches to marking dependency chains,
and evaluates each against a set of representative use cases.

\section{Representative Use Cases}
\label{sec:Representative Use Cases}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 struct rcutest {
 2   int a;
 3   int b;
 4   int c;
 5   spinlock_t lock;
 6 };
 7
 8 struct rcutest1 {
 9   int a;
10   struct rcutest rt;
11 };
12
13 std::atomic<rcutest *> gp;
14 std::atomic<rcutest1 *> g1p;
15 std::atomic<int *> gip;
16 struct rcutest *gslp; /* Global scope, local usage. */
17 std::atomic<rcutest *> gsgp;
18
19 template<typename T>
20 T *rcu_consume(std::atomic<T*> *p)
21 {
22   volatile std::atomic<T> *q = p;
23   // Change to memory_order_consume once it is fixed
24   depending_ptr<T> temp(q->load(std::memory_order_relaxed));
25   return temp;
26 }
27
28 template<typename T>
29 T *rcu_consume(T *p)
30 {
31   // Alternatively, could cast p to volatile atomic...
32   T *temp(*(T *volatile *)&p);
33   return temp;
34 }
35
36 template<typename T>
37 T* rcu_store_release(std::atomic<T*> *p, T *v)
38 {
39   p->store(v, std::memory_order_release);
40   return v;
41 }
42
43 template<typename T>
44 T* rcu_store_release(T **p, T *v)
45 {
46   // Alternatively, could cast p to volatile atomic...
47   atomic_thread_fence(std::memory_order_release);
48   *((volatile T **)p) = v;
49   return v;
50 }
51
52 // Linux-kernel compatibility macros, not for atomics
53 #define rcu_dereference(p) rcu_consume(p)
54 #define rcu_assign_pointer(p, v) rcu_store_release(&(p), v)
\end{verbatim}
}
\caption{Common Definitions}
\label{fig:Common Definitions}
\end{figure}

This section uses the common definitions shown in
Figure~\ref{fig:Common Definitions}
to discuss the use cases in the following list:

\begin{enumerate}
\item	Simple case.
\item	Function in via parameter.
\item	Function out via return value.
\item	Function both in and out, but different chains.
\item	Dependency chain fanning out.
\item	Dependency chain fanning in.
\item	Dependency chain fanning both in and out.
\item	Conditional compilation of endpoint accesses.
\item	Examples involving handoff to locking.
\end{enumerate}

Each of the above use cases is covered in one of the following sections,
followed by a discussion of evaluation criteria.

\subsection{Simple Case}
\label{sec:Simple Case}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 void *thread0(void *unused)
 2 {
 3   rcutest *p;
 4
 5   p = new rcutest();
 6   assert(p);
 7   p->a = 42;
 8   assert(p->a != 43);
 9   rcu_store_release(&gp, p);
10   return nullptr;
11 }
12
13 void *thread1(void *unused)
14 {
15   rcutest p;
16
17   p = rcu_consume(&gp);
18   if (p)
19     p->a = 43;
20   return nullptr;
21 }
\end{verbatim}
}
\caption{Simple Case}
\label{fig:Simple Case}
\end{figure}

The simple case is shown in
Figure~\ref{fig:Simple Case}.
Here, the dependency chain extends from line~16 through line~18,
where it terminates.
Given the simplicity and compactness of this example, any reasonable
proposal should handle this example simply and naturally.

\subsection{In via Function Parameter}
\label{sec:In via Function Parameter}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 void thread0(void)
 2 {
 3   struct rcutest *p;
 4
 5   p = new rcutest;
 6   assert(p);
 7   p->a = 42;
 8   rcu_assign_pointer(gp, p);
 9 }
10
11 void
12 thread1_help(struct rcutest *q)
13 {
14   if (q)
15     assert(q->a == 42);
16 }
17
18 void thread1(void)
19 {
20   struct rcutest *p;
21
22   p = rcu_dereference(gp);
23   thread1_help(p);
24 }
\end{verbatim}
}
\caption{In via Function Parameter}
\label{fig:In via Function Parameter}
\end{figure}

Figure~\ref{fig:In via Function Parameter}
shows an example dependency chain that begins at line~22,
enters function \co{thread1_help()} at line~23,
and then extending from line~12 to line~15 in the called function.
This is a common encapsulation technique.

\subsection{Out via Function Return}
\label{sec:Out via Function Return}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 void thread0(void)
 2 {
 3   struct rcutest *p;
 4
 5   p = new rcutest;
 6   assert(p);
 7   p->a = 42;
 8   rcu_assign_pointer(gp, p);
 9 }
10
11 struct rcutest *thread1_help(void)
12 {
13   return rcu_dereference(gp);
14 }
15
16 void thread1(void)
17 {
18   struct rcutest *p;
19
20   p = thread1_help();
21   if (p)
22     assert(p->a == 42);
23 }
\end{verbatim}
}
\caption{Out via Function Return}
\label{fig:Out via Function Return}
\end{figure}

Figure~\ref{fig:Out via Function Return}
shows a dependency chain exiting a function.
It starts at line~21, is returned to line~20, and
terminates on line~22.
This is also a common encapsulation technique.

\subsection{In and Out, But Different Chains}
\label{sec:In and Out, But Different Chains}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 void thread0(void)
 2 {
 3   struct rcutest *p;
 4
 5   p = new rcutest;
 6   assert(p);
 7   p->a = 42;
 8   rcu_assign_pointer(gp, p);
 9
10   p = new rcutest;
11   assert(p);
12   p->a = 43;
13   rcu_assign_pointer(gsgp, p);
14 }
15
16 struct rcutest *thread1_help(struct rcutest *p)
17 {
18   if (p)
19     assert(p->a == 42);
20   return rcu_dereference(gsgp);
21 }
22
23 void thread1(void)
24 {
25   struct rcutest *p;
26
27   p = rcu_dereference(gp);
28   p = thread1_help(p);
29   if (p)
30     assert(p->a == 43);
31 }
\end{verbatim}
}
\caption{In and Out, But Different Chains}
\label{fig:In and Out, But Different Chains}
\end{figure}

Figure~\ref{fig:In and Out, But Different Chains}
shows an example where a dependency chain enters a function
(\co{thread1_help()} on lines~16-21)
and a dependency chain leaves that same function,
but where they are different chains.

\subsection{Chain Fanning Out}
\label{sec:Chain Fanning Out}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 void thread0(void)
 2 {
 3   struct rcutest *p;
 4
 5   p = new rcutest;
 6   assert(p);
 7   p->a = 42;
 8   rcu_assign_pointer(gp, p);
 9 }
10
11 void
12 thread1_help1(struct rcutest *q)
13 {
14   if (q)
15     assert(q->a == 42);
16 }
17
18 void
19 thread1_help2(struct rcutest *q)
20 {
21   if (q)
22     assert(q->a != 43);
23 }
24
25 void thread1(void)
26 {
27   struct rcutest *p;
28
29   p = rcu_dereference(gp);
30   thread1_help1(p);
31   thread1_help2(p);
32 }
\end{verbatim}
}
\caption{Chain Fanning Out}
\label{fig:Chain Fanning Out}
\end{figure}

Figure~\ref{fig:Chain Fanning Out}
shows a dependency chain fanning out, courtesy of the \co{thread1()}
function's calls to \co{thread1_help1()} and \co{thread1_help2()}
on lines~30 and~31.
This is a common pattern in the Linux kernel, as it supports
abstraction of data structures, for example, allowing common RCU-protected
data structures to be aggregated into a larger RCU-protected data
structure.
In this scenario, \co{thread1_help1()} might implement one type of
RCU-protected structure and \co{thread1_help2()} might implement
another.

\subsection{Chain Fanning In}
\label{sec:Chain Fanning In}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 void thread0(void)
 2 {
 3   struct rcutest *p;
 4   struct rcutest1 *p1;
 5
 6   p = new rcutest;
 7   assert(p);
 8   p->a = 42;
 9   rcu_assign_pointer(gp, p);
10
11   p1 = new rcutest;
12   assert(p1);
13   p1->a = 41;
14   p1->rt.a = 42;
15   rcu_assign_pointer(g1p, p1);
16 }
17
18 void
19 thread1_help(struct rcutest *q)
20 {
21   if (q)
22     assert(q->a == 42);
23 }
24
25 void thread1(void)
26 {
27   struct rcutest *p;
28
29   p = rcu_dereference(gp);
30   thread1_help(p);
31 }
32
33 void thread2(void)
34 {
35   struct rcutest1 *p1;
36
37   p1 = rcu_dereference(g1p);
38   thread1_help(&p1->rt);
39 }
\end{verbatim}
}
\caption{Chain Fanning In}
\label{fig:Chain Fanning In}
\end{figure}

Figure~\ref{fig:Chain Fanning In}
demonstrates different dependency chains fanning into the same function,
in this case \co{thread1_help()}, from lines~29 and~37.
This fanning-in is also used to support abstraction, for example,
allowing a given implementation of an RCU-protected data structure
to be aggregated into several different data structures.

\subsection{Chain Fanning In and Out}
\label{sec:Chain Fanning In and Out}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 void thread0(void)
 2 {
 3   struct rcutest *p;
 4   struct rcutest1 *p1;
 5
 6   p = new rcutest;
 7   assert(p);
 8   p->a = 42;
 9   p->b = 43;
10   rcu_assign_pointer(gp, p);
11
12   p1 = new rcutest;
13   assert(p1);
14   p1->a = 41;
15   p1->rt.a = 42;
16   p1->rt.b = 43;
17   rcu_assign_pointer(g1p, p1);
18 }
19
20 void
21 thread1a_help(struct rcutest *q)
22 {
23   assert(q->a == 42);
24 }
25
26 void
27 thread1b_help(struct rcutest *q)
28 {
29   assert(q->b == 43);
30 }
31
32 void
33 thread1_help(struct rcutest *q)
34 {
35   if (q) {
36     thread1a_help(q);
37     thread1b_help(q);
38   }
39 }
40
41 void thread1(void)
42 {
43   struct rcutest *p;
44
45   p = rcu_dereference(gp);
46   thread1_help(p);
47 }
48
49 void thread2(void)
50 {
51   struct rcutest1 *p1;
52
53   p1 = rcu_dereference(g1p);
54   thread1_help(&p1->rt);
55 }
\end{verbatim}
}
\caption{Chain Fanning In and Out}
\label{fig:Chain Fanning In and Out}
\end{figure}

Figure~\ref{fig:Chain Fanning In and Out}
shows dependency chains fanning both in and out, starting
at lines~45 and~53, fanning into \co{thread1_help()}, and
fanning out again at the call to \co{thread1a_help()} on
line~36 and to \co{thread1b_help()} on line~37.
This combination permits composition of the types of abstraction
described in
Sections~\ref{sec:Chain Fanning Out}
and~\ref{sec:Chain Fanning In}.

\subsection{Conditional Compilation of Chain Endpoints}
\label{sec:Conditional Compilation of Chain Endpoints}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 void thread0(void)
 2 {
 3   struct rcutest *p;
 4   struct rcutest1 *p1;
 5
 6   p = new rcutest;
 7   assert(p);
 8   p->a = 42;
 9   p->b = 43;
10   rcu_assign_pointer(gp, p);
11
12   p1 = new rcutest;
13   assert(p1);
14   p1->a = 41;
15   p1->rt.a = 42;
16   p1->rt.b = 43;
17   rcu_assign_pointer(g1p, p1);
18 }
19
20 #ifdef FOO
21 void
22 thread1a_help(struct rcutest *q)
23 {
24   assert(q->a == 42);
25 }
26 #endif
27
28 void
29 thread1b_help(struct rcutest *q)
30 {
31   assert(q->b == 43);
32 }
33
34 void
35 thread1_help(struct rcutest *q)
36 {
37   if (q) {
38 #ifdef FOO
39     thread1a_help(q);
40 #endif
41     thread1b_help(q);
42   }
43 }
44
45 void thread1(void)
46 {
47   struct rcutest *p;
48
49   p = rcu_dereference(gp);
50   thread1_help(p);
51 }
52
53 void thread2(void)
54 {
55   struct rcutest1 *p1;
56
57   p1 = rcu_dereference(g1p);
58   thread1_help(&p1->rt);
59 }
\end{verbatim}
}
\caption{Conditional Compilation of Chain Endpoints}
\label{fig:Conditional Compilation of Chain Endpoints}
\end{figure}

Although the C preprocessor does not necessarily have the best reputation
among the various aspects of either C or C++, it is true that it is
always there when you need it.
Figure~\ref{fig:Conditional Compilation of Chain Endpoints}
applies conditional compilation to
Figure~\ref{fig:Chain Fanning In and Out},
so that portions of the dependency chain can come and go, depending
on the value of the C-preprocessor macro \co{FOO}.

\subsection{Handoff to Locking}
\label{sec:Handoff to Locking}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 void thread0(void)
 2 {
 3   struct rcutest *p;
 4
 5   p = new rcutest;
 6   assert(p);
 7   p->a = 42;
 8   assert(p->a != 43);
 9   rcu_assign_pointer(gp, p);
10 }
11
12 void thread1(void)
13 {
14   struct rcutest *p;
15
16   p = rcu_dereference(gp);
17   if (p) {
18     assert(p->a == 42);
19     spin_lock(&p->lock);
20     p = std::kill_dependency(p);
21     p->a++;
22     spin_unlock(&p->lock);
23   }
24 }
\end{verbatim}
}
\caption{Handoff to Locking}
\label{fig:Handoff to Locking}
\end{figure}

Figure~\ref{fig:Handoff to Locking}
shows how RCU protection can hand off to other synchronization
primitives, in this case, locking.
The dependency chain starts at line~16 and continues through line~18
and~19.
However, once line~19 has completed, the code is under the protection
of \co{p->lock}, so line~20 explicitly ends the dependency chain.
The lock then protects the increment on line~21.

It is also possible to hand off protection from RCU to reference counting,
explicit memory barriers, transactional memory, and so on.

Note that the \co{std::kill_dependency()} on line~20 will typically have
no effect on code generation.

\subsection{Evaluation Criteria}
\label{sec:Evaluation Criteria}

\begin{enumerate}
\item	Ease of compilation.
\item	Ease of modification of programs.
\item	Precise specification of dependency chains.
\item	Support for cross-function dependency chains.
\item	Support for cross-compilation-unit dependency chains.
\item	Compatibility with C.
\item	Formal Verification Compatibility.
\end{enumerate}

\section{Marking Proposals}
\label{sec:Marking Proposals}

This section presents a pair of marking proposals that (in combination)
appear to meet the needs of dependency-chain users (at least assuming
that compilers provide options that treat all variables that could
carry a dependency as if they carried a dependency).

\subsection{Object Modifier}
\label{sec:Object Modifier}

\input{markobjmod}

\subsection{Template}
\label{sec:Template}

\input{markobjtemplate}

\section{Evaluation}
\label{sec:Evaluation}

\begin{table}
\small
\centering
\begin{tabular}{l|c|c|c|c|c|c|c}
Mark:
	~ ~ ~ ~ ~ ~ ~ ~ ~
	& \begin{picture}(6,185)(0,0)
		\rotatebox{90}{Ease of Compilation}
	  \end{picture}
	& \begin{picture}(6,185)(0,0)
		\rotatebox{90}{Ease of Modification}
	  \end{picture}
	& \begin{picture}(6,185)(0,0)
		\rotatebox{90}{Precise Dependency Chains}
	  \end{picture}
	& \begin{picture}(6,185)(0,0)
		\rotatebox{90}{Cross-Function Dependency Chains}
	  \end{picture}
	& \begin{picture}(6,185)(0,0)
		\rotatebox{90}{Cross-Compilation-Unit Dependency Chains}
	  \end{picture}
	& \begin{picture}(6,185)(0,0)
		\rotatebox{90}{C Compatibility}
	  \end{picture}
	& \begin{picture}(6,185)(0,0)
		\rotatebox{90}{Formal Verification}
	  \end{picture}
	\\
	\hline
%	 EC  EM  PD    CF  CC  CC  FV
\hline
Modifier+Template (Sections~\ref{sec:Object Modifier} and~\ref{sec:Template})
	&   & o &     &   &   &   &    \\
\hline
\hline
Translation Unit (Appendix~\ref{sec:Mark Translation Unit})
	& T &   & N   &   & m &   &  N \\
\hline
Range of Code (Appendix~\ref{sec:Mark Range of Code})
	& t &   & N   & m & m &   &  N \\
\hline
Functions (Appendix~\ref{sec:Mark Functions})
	& t &   & N   & m & m &   &  N \\
\hline
Objects (Appendix~\ref{sec:Mark Objects})
	&   &   &     &   &   &   &    \\
\hline
~~~~Attribute (Appendix~\ref{sec:Attribute})
	&   & o &     & t & t & a &    \\
\hline
~~~~Type Qualifier (Appendix~\ref{sec:Type Qualifier})
	&   & o &     &   &   &   &  N \\
\hline
~~~~Modifier (Section~\ref{sec:Object Modifier})
	&   & o &     & t & t &   &    \\
\hline
~~~~Template (Section~\ref{sec:Template})
	&   & o &     &   &   & N &    \\
\hline
Root/Leaf (Appendix~\ref{sec:Mark Root/Leaf Pairs})
	&   & A &     & ? & ? & ? &  ? \\
\hline
Nothing
	&   &   & N   &   &   &   &  N \\
\end{tabular}
\caption{Dependency-Chain Marking Evaluation}
\label{tab:Dependency-Chain Marking Evaluation}
\end{table}

Table~\ref{tab:Dependency-Chain Marking Evaluation}
provides a rough comparison between the various marking methods, and
also includes the unmarked option for comparison purposes.
The recommended approach, Modifiers+Template, appears at the
beginning of the table, followed by various other proposals.

For ease of compilation, the cells corresponding to methods that
explicitly mark dependency chains or that don't require marking at all
are left blank.
Those that require tracing dependency chains throughout the full
translation unit are marked ``T'' and those that limit the code in
which tracing is required are marked ``t''.

For ease of modification, the cells corresponding to methods that
either require no marking or that mark large-scale entities are
left blank.
Those that require marking the definitions of objects that carry
dependencies are marked ``o'', and those require marking of individual
accesses are marked ``A''.

Cells corresonding to those methods that precisely mark dependency chains are 
left blank, otherwise, they are marked ``N''.

For cross-function dependency chains, those methods that either support
cross-function marking or that do not require such marking are left blank.
Those that require manual consistency checks are marked ``m'', those
that are amenable to consistency-check tooling are marked ``t'', and
those that are not fleshed out sufficiently to tell are marked ``?''.
These same markings are used for cross-compilation-unit dependency
chains.

Cells corresponding to those methods supporting C compatibility are left
blank.
Those that would support C compatibility if C were to provide attributes
are marked ``a''.
Those that do not support C compatibility (at least not unless combined
with some other method) are marked ``N'',
and those that are not fleshed out sufficiently to tell are marked ``?''.

Cells corresponding to methods believed to support formal verification
are left blank, those that are believed not to support formal verification
are marked ``N'',
and those that are not fleshed out sufficiently to tell are marked ``?''.
Note that the object type qualifier could in theory support formal
verification, but the specific proposal rules this out by requiring that
the compiler treat \co{memory_order_consume} loads as potentially
returning any value from the type.

Following the lead of C11 and C++11 atomics, the ``Modifier+Template''
row covers the combination of
marking objects with template classes (for C++) and with an
typed object modifier (for C), a combination that appears to be quite
attractive.
This should be further combined with a totally unmarked option for
use by standalone projects such as the Linux kernel.

The best possible method would have a row of all blank cells.

\section{Summary}
\label{sec:Summary}

This paper reviewed the 2016 discussions of \co{memory_order_consume}
that took place at the Jacksonville meeting, presented several representative
use cases, listed evaluation criteria, presented a number of
marking proposals, and provided a comparative
evaluation.
The paper presents two of the marking proposals in depth, including
code for the representative use cases.

We recommend a combination of typed object modifier (for C compatibility)
and a template class (for C++), which is similar to the approach used
by atomics.
For standalone applications such as the Linux kernel, there should
additionally be an
unmarked option, where the implementation assumes that everything that
could legally marked is so marked.

%\input{acknowledgments}
%\input{legal}

%\bibliographystyle{abbrv}
%\softraggedright

\bibliographystyle{acm}
\bibliography{bib/RCU,bib/WFS,bib/hw,bib/os,bib/parallelsys,bib/patterns,bib/perfmeas,bib/refs,bib/syncrefs,bib/search,bib/swtools,bib/realtime,bib/TM,bib/standards,bib/maze}

\clearpage
\appendix

\section{Roads Not Taken}
\label{sec:Roads Not Taken}

This appendix lists proposals that were never filled out, and thus not
selected.

\subsection{Mark Translation Unit}
\label{sec:Mark Translation Unit}

\input{transunit}

\subsection{Mark Range of Code}
\label{sec:Mark Range of Code}

\input{rangeofcode}

\subsection{Mark Functions}
\label{sec:Mark Functions}

\input{markfunc}

\subsection{Mark Objects}
\label{sec:Mark Objects}

This class of proposals marks the objects that are to carry dependencies.
These objects must be of pointer type.
Note that implementations requiring point-to-point associations between
each \co{memory_order_consume} load and its corresponding dependent
memory references can generate these associations based on the
operations carried out on a given marked object.

\subsubsection{Attribute}
\label{sec:Attribute}

\input{markobjattr}

\subsubsection{Type Qualifier}
\label{sec:Type Qualifier}

\input{markobjtypeq}

\subsection{Mark Root/Leaf Pairs}
\label{sec:Mark Root/Leaf Pairs}

\input{markrootleaf}

\section{Change Log}
\label{sec:Change Log}

This paper first appeared as {\bf WG21/P0462R1} in October of 2016.
Revisions to this document are as follows:

\begin{itemize}
\item	Switch to one-column mode for ease of exposition.
	(November 12, 2016.)
\item	Reword relationship between \co{memory_order_consume}
	and \co{memory_order_relaxed} per Lawrence Crowl feedback.
	(November 16, 2016.)
\item	Wordsmithing and requirements adjustments
	per Lawrence Crowl feedback.
	(December 5, 2016.)
\item	Move proposals not filled out to appendix.
	(January 4, 2017.)
\item	Move ``\co{_Carries_dependency}'' to after the ``*'' per
	Hans Boehm feedback.
	(January 4, 2017.)
\end{itemize}

% At this point, the paper was published as {\bf P0098R1}.

\end{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% for Ispell:
% LocalWords:  workingdraft BCM ednote SubSections xfig SubSection