Grace is a prototype race detector for Golang based on what we call happens-before sets (HB-sets).
Go ships with a race detector built from the ThreadSanitizer library (TSan), which employs the concept of vector clocks (VCs). We built Grace to study the trade-offs between VC-based and HB-set-based race detection.
Grace treats channels as first class: it observes send and receive operations on channels, as opposed to acquire/release operations on the locks protecting channel buffer entries.
For details, check out our paper on arXiv.
Data races can happen when threads cooperate around a pool of shared memory. Two concurrent memory accesses constitute a data race if the accesses reference the same memory location and at least one of the accesses is a write. Here is an example from the Go memory model:
```go
var a string
func main() {
	go func() { a = "hello" }()
	print(a)
}
```
The `main` function invokes an anonymous function; this anonymous function sets the global variable `a` to `"hello"`. Note, however, that the call is prefixed with the keyword `go`, which makes the callee run in a separate goroutine (or thread); meaning, `main` continues without waiting for the anonymous function to return.

Both `main` and the anonymous function access the same shared variable in a conflicting manner: one of the accesses is a write. Since the two functions run in parallel and no synchronization is used, the two accesses are also concurrent. This allows us to conclude that the program has a race.
Because data races can lead to counterintuitive behavior, it is important to detect them.
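For illustration, here is one way to remove the race from the example above: synchronize the two goroutines on a channel, so that the write to `a` happens-before the read.

```go
var a string

func main() {
	done := make(chan bool)
	go func() {
		a = "hello"
		done <- true // signal that the write to a completed
	}()
	<-done // the receive synchronizes with the send above
	print(a)
}
```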
Vector clocks are an efficient way of tracking the relative order of execution of threads (or goroutines). VCs are based on a notion of local time: each thread keeps track of its own clock, and a thread's clock can advance independently of the other threads' clocks.
Clocks are tracked in vectors. A vector clock `V` is a function `V : Tid -> Nat`, meaning that `V(t)` is the value of the clock associated with a thread id `t` according to the vector clock `V`.
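As a concrete illustration, here is a minimal sketch of vector clocks in Go; the names are ours, not Grace's or TSan's:

```go
// Tid identifies a thread (goroutine).
type Tid int

// VC maps thread ids to clock values; a missing entry counts as 0.
type VC map[Tid]int

// leq reports whether v ⊑ w, i.e., every entry of v is at most
// the corresponding entry of w.
func (v VC) leq(w VC) bool {
	for t, c := range v {
		if c > w[t] {
			return false
		}
	}
	return true
}

// join returns the pointwise maximum v ⊔ w.
func (v VC) join(w VC) VC {
	r := VC{}
	for t, c := range v {
		r[t] = c
	}
	for t, c := range w {
		if c > r[t] {
			r[t] = c
		}
	}
	return r
}
```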
Vector clocks can be used for race detection as follows:
Each thread `t` is given a vector clock, `C_t`, where it keeps track of two things:

- `C_t(t)` records the current time at `t`, and
- for another thread `u`, `C_t(u)` records the time of the most recent operation performed by `u` that `t` is aware of. In other words, `u`'s clock can advance without `t` knowing of it; `C_t(u)` only reflects what `t` has learned.
Recall that, intuitively, a race occurs when a thread attempts to access a variable without being "aware" of some previous access to that variable. So, when a thread `t` attempts to write to a variable `z`, the thread must be aware of the previously occurring reads and writes to `z`. When a thread `t` attempts to read from `z`, the thread must be aware of the previous writes to `z` (since there are no read-read conflicts, there is no need for `t` to know of prior reads of `z`).
A race detector, then, keeps track of the accesses to each variable. Take one of the early VC-based race detection algorithms, known as Djit, as an example. There, each variable `z` is associated with two vector clocks: one for read accesses, `R_z`, and another for writes, `W_z`.
When `t` attempts to write to `z`, a race is flagged when either:

- there exists some write to `z` not known to `t`, meaning `W_z ⋢ C_t`, or
- there exists some read from `z` not known to `t`, meaning `R_z ⋢ C_t`.
When `t` attempts to read from `z`, a race is flagged when there exists a prior write not known to the thread, meaning `W_z ⋢ C_t`. (As discussed previously, there is no need to check `R_z` when a read is attempted, since there are no read-read conflicts.)
When a thread succeeds in accessing memory, the vector clock of the memory location is updated to reflect the time of the access. Precisely, when a thread `t` writes to `z`, the clock `W_z(t)` is updated to `C_t(t)`. If, instead, the access is a read, then `R_z(t)` is updated to `C_t(t)`.
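Putting the checks and updates together, here is a hedged sketch of Djit's core in Go, building on the `VC` type above (again, these are our names, not code from Grace or TSan):

```go
// Var holds Djit's per-variable state: one vector clock for
// reads (R_z) and one for writes (W_z).
type Var struct {
	R, W VC
}

func newVar() *Var { return &Var{R: VC{}, W: VC{}} }

// write flags a race unless t is aware of all prior reads and
// writes to z, i.e., unless W_z ⊑ C_t and R_z ⊑ C_t.
func write(Ct VC, t Tid, z *Var) (race bool) {
	race = !z.W.leq(Ct) || !z.R.leq(Ct)
	z.W[t] = Ct[t] // W_z(t) := C_t(t)
	return race
}

// read flags a race unless t is aware of all prior writes to z.
func read(Ct VC, t Tid, z *Var) (race bool) {
	race = !z.W.leq(Ct)
	z.R[t] = Ct[t] // R_z(t) := C_t(t)
	return race
}
```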
Although we won't explore this in detail here, it's worth pointing out that the detector described in the previous paragraph can be improved. As opposed to remembering writes to `z` in a vector clock `W_z`, a detector can remember only the clock value and the thread identifier associated with the most recent write to `z`. Similarly, the detector does not need to track all reads from `z`, only the reads that have accumulated after the most recent write. This improvement over Djit led to an algorithm known as FastTrack.
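In FastTrack terminology, such a pair of clock value and thread identifier is called an epoch. A minimal sketch of the idea, using the types above:

```go
// Epoch records a single access: the accessing thread's id and
// the value of its clock at the time of the access.
type Epoch struct {
	clock int
	tid   Tid
}

// knownTo reports whether the access is known to a thread whose
// vector clock is c, i.e., whether the access happens-before c.
func (e Epoch) knownTo(c VC) bool {
	return e.clock <= c[e.tid]
}
```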
The last piece of the puzzle is how a thread's vector clock is updated. Threads "learn" about each other's progress when they synchronize. In the case of Go, synchronization is done via message passing. In the case of locks, synchronization is done via `acquire` and `release` operations.
Each lock `m` is associated with a vector clock `L_m`. When a thread `t` acquires `m`, the thread learns about the memory accesses performed by the thread that last released the lock. More precisely, the thread's vector clock, `C_t`, is updated to `C_t ⊔ L_m`. When a thread `t` releases a lock `m`, the lock's vector clock, `L_m`, is updated to `C_t`, and the thread's clock is advanced, meaning `C_t(t) := C_t(t) + 1`.
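Continuing the sketch, the two synchronization rules look roughly as follows, with `C` mapping threads to their clocks and `L` mapping locks to theirs (an illustration of the rules above, not Grace's code):

```go
// Lock identifies a lock; any comparable key works for this sketch.
type Lock int

// acquire: t learns everything the last releaser of m knew,
// i.e., C_t := C_t ⊔ L_m.
func acquire(C map[Tid]VC, L map[Lock]VC, t Tid, m Lock) {
	C[t] = C[t].join(L[m])
}

// release: the lock remembers t's knowledge (L_m := C_t, copied),
// and t's own clock advances (C_t(t) := C_t(t) + 1). Assumes each
// thread's clock was initialized, e.g., C[t] = VC{t: 1}.
func release(C map[Tid]VC, L map[Lock]VC, t Tid, m Lock) {
	L[m] = C[t].join(VC{}) // joining with the empty VC makes a copy
	C[t][t]++
}
```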
In the happens-before set approach, a memory location keeps track of the most recent write to that location, and of the most recent reads that have accumulated after that write. Technically, not all reads need to be recorded, only the most recent reads that are unordered with respect to one another. For example, if a thread `t` reads from `z` and then reads from it again, the second read subsumes the first. Similarly, a thread keeps track of the reads and writes to memory that it is "aware of".
The difference with vector clocks is that we track individual accesses, which means that we consume more memory. Vector-clock-based race detectors have a worst-case memory consumption of `O(τ)` per thread, where `τ` is the number of entries in the vector, which is capped by the number of threads spawned during execution. The per-thread memory consumption of happens-before sets is `O(ντ)`, where `ν` is the number of shared variables in a program.
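To make the contrast concrete, here is a rough sketch of the state an HB-set detector keeps; this is our simplification for illustration, not Grace's actual representation:

```go
// Access identifies one individual memory access.
type Access struct {
	tid Tid
	id  int // unique per access, e.g., an event counter
}

// VarState remembers the most recent write to a variable and the
// unordered reads that accumulated after that write.
type VarState struct {
	lastWrite Access
	reads     map[Access]bool
}

// ThreadState is the set of accesses a thread is "aware of"; an
// access outside this set is concurrent with the thread, and a
// conflicting concurrent access constitutes a race.
type ThreadState struct {
	knows map[Access]bool
}
```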
Vector clocks' memory efficiency, when compared to happens-before sets, comes from the VC's ability to succinctly capture the per-thread accesses that take place in between advances of a clock. A thread's clock is advanced when the thread releases a lock. All accesses made by a thread `t` in a given clock `c` are captured by the clock: if another thread `u` "knows" the value `c` of `t`'s clock, then `u` is in happens-after with all accesses made by `t`, that is, all accesses up to when `t`'s clock was advanced to `c + 1`. In contrast, the happens-before set representation is much more coarse. We keep track of individual accesses, as opposed to lumping them together into a clock number. This coarseness explains the extra factor of `ν` in the worst-case analysis of the happens-before set solution. Although a disadvantage in the worst case, happens-before sets do provide some benefits.
The vector clocks associated with threads and locks grow monotonically. By growing monotonically we do not mean that time marches forward to increasing clock values. Instead, we mean that the number of clocks in a vector grows without provisions for the removal of entries. This growth can lead to the accumulation of "stale" information, where by stale we mean information that is no longer useful from the point of view of race detection. This growth stands in contrast to the HB-set approach to garbage collection: stale information is purged from happens-before sets, which means they can shrink back to size zero after having grown.
We conjecture that an approach that purges stale information from VCs, similar to the HB-set notion of garbage collection, would be highly beneficial.
We wrote a prototype of Grace, a happens-before-set based race detector for Go. The race detector observes and tracks relevant events from the execution of a Go program. The events are:

- `go`, goroutine initialization,
- `read` and `write` events to a variable, and
- operations on channels: `send`, `recv`, and `close`.
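For illustration, the shape of this event vocabulary, expressed in Go, might look as follows; the names here are hypothetical, not Grace's actual API:

```go
// EventKind enumerates the trace events Grace consumes.
type EventKind int

const (
	GoStart EventKind = iota // goroutine initialization
	Read                     // read of a variable
	Write                    // write to a variable
	Send                     // send on a channel
	Recv                     // receive from a channel
	Close                    // close of a channel
)

// Event is one entry of an observed execution trace.
type Event struct {
	Kind EventKind
	Tid  int    // the goroutine performing the event
	Obj  uint64 // address of the variable or channel involved
}
```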
The prototype is written in Python, in `grace.py`.
The prototype can be fed traces "manually" by calling functions on the `Grace` class. The prototype can also consume traces from TSan, using a layer, `t2g.py`, that translates calls to TSan into Grace events. To that end, TSan needs to be recompiled so as to print certain calls from the Go runtime; see `tsan_patch.diff`.
Go ships with a race detector based on TSan, which can be invoked with the `-race` command-line flag:

```
go run -race my_program.go
```
To see what `-race` does, we can compile (but not link) a Go program and inspect the object file produced. Take the `producer_consumers.go` file in `src/examples`:

```
$ go tool compile -race producer_consumers.go
$ go tool objdump producer_consumers.o
```

The first line above compiles the example and creates the file `producer_consumers.o`; the second line dumps the contents of the object file so we can read it.
If you search for the word `race` in the disassembly, you'll see many calls to functions like `runtime.racefuncenter`, `runtime.racefuncexit`, `runtime.raceread`, and `runtime.racewrite`. These are calls into the race detector. If you have the Go sources available to follow along, you can see that these functions are declared in `go/src/runtime/race.go`, but where and how are they implemented? They are "implemented" in a platform-specific way, inside the `race_amd64.s`, `race_arm64.s`, and `race_ppc64le.s` assembly files, depending on the architecture. The word "implementation" is in quotes because the implementation is simply thunk code that calls out to something else. Here is the "implementation" of `raceread`, which simply calls out to `__tsan_read`, passing some arguments:
```
// func runtime·raceread(addr uintptr)
// Called from instrumented code.
TEXT	runtime·raceread(SB), NOSPLIT, $0-8
	MOVD	addr+0(FP), R1
	MOVD	LR, R2
	// void __tsan_read(ThreadState *thr, void *addr, void *pc);
	MOVD	$__tsan_read(SB), R9
	JMP	racecalladdr<>(SB)
```
The actual implementation is in TSan, which ships with Go as a library. The location of the TSan library is the following:

```
`go env GOTOOLDIR`/../../../src/runtime/race
```

Go ships with TSan in a `.syso` library file built for your OS and architecture. For example, on Ubuntu we have:

```
$ cd `go env GOTOOLDIR`/../../../src/runtime/race
$ pwd
/usr/lib/go-1.10/src/runtime/race
$ ls *.syso
race_linux_amd64.syso
```
while on a Mac:

```
$ cd `go env GOTOOLDIR`/../../../src/runtime/race
$ pwd
/usr/local/Cellar/go/1.13.5/libexec/src/runtime/race
$ ls *.syso
race_darwin_amd64.syso
```
We can thus change the behavior of the race detector by checking TSan out from sources, modifying it, and replacing the `.syso` file that ships with Go by copying over our own `.syso`. (On Ubuntu, Go can be installed as a snap. Modifying the contents of a snap can be tricky. Instead, I recommend installing Go with `apt` so it is easier for us to replace the race detector.)
The modifications that we have made for Grace are simple: they mostly involve printing information about the calls to TSan to the console. So, when a Go program is run with `-race`, we get lots of output. We gather the output and feed it to the `t2g.py` script (short for TSan-to-Grace), which converts the printed text into calls to Grace.
To apply our patch to TSan, check TSan out from source, copy the diff file over, and apply it:

```
$ git clone https://git.llvm.org/git/compiler-rt.git
$ cd compiler-rt
$ cp path_to_grace/src/tsan_patch.diff .
$ git apply tsan_patch.diff
```
The Go-specific TSan runtime lives in `compiler-rt/lib/tsan/go` and can be built by running the `buildgo.sh` script. Running the script creates a `.syso` file that should then be copied to the directory:

```
`go env GOTOOLDIR`/../../../src/runtime/race
```
When a Go program is compiled, channel operations (like sends and receives) translate to calls into the Go runtime. Channels are implemented in `go/src/runtime/chan.go`. When race detection is enabled with `-race`, a successful send on a channel (function `chansend` in `chan.go`) yields three calls to TSan: first a call to `__tsan_read_pc`, then a call to `__tsan_acquire`, and then `__tsan_release`. The call to `__tsan_read_pc` passes along a pointer to the `chansend` function in the runtime. If you have applied our patch to TSan, when you run a Go program that uses channels with `-race`, you will see text like this being printed:
```
__tsan_read_pc,0x000001383428,tid=5,0x00c42009a070,0x000000481c03,0x000000431741,chansend
__tsan_read,0x000001393468,tid=6,0x0000004feb88,0x000000481cd9
__tsan_acquire,0x000001383428,tid=5,0x00c42009a0d0
__tsan_read_pc,0x000001393468,tid=6,0x00c42009a070,0x000000481cf3,0x000000431741,chansend
__tsan_release,0x000001383428,tid=5,0x00c42009a0d0
```
There is a bit to unpack here. The first thing to notice is that these prints involve activity from two different threads: one with `tid=5` and another with `tid=6`. Focusing on `tid=5`, the first line is a call to `__tsan_read_pc` passing the address of `chansend` as an argument. This line tells us that a send is starting. The third and the last lines are the acquisition and release of the lock protecting the entry in the channel's buffer where the message is placed. Note that the fourth line is also a `chansend`, but from `tid=6`. In other words, channel sends and receives from different threads can be interleaved.
TSan works at the level of locks; channels are not a first-class concept. Grace, on the other hand, works in the realm of channels; there are no locks.
We thus stitch the sequence of calls to TSan (`__tsan_read_pc`, `__tsan_acquire`, and `__tsan_release`) back together and interpret it as a send operation on a particular channel. Similarly, we reinterpret the calls to TSan related to the closing of, and receiving from, a channel.
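For illustration, here is a rough sketch, in Go, of the kind of grouping `t2g.py` performs; the actual script is Python, and the parsing details here (field positions, names) are our assumptions based on the output shown above:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// pending accumulates, per thread id, the calls that make up one
// channel send: a __tsan_read_pc naming chansend, then an acquire
// and a release of the lock guarding the channel's buffer entry.
type pending struct {
	sawSendPC  bool
	sawAcquire bool
	lockAddr   string
}

func main() {
	byTid := map[string]*pending{}
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		f := strings.Split(sc.Text(), ",")
		if len(f) < 4 {
			continue
		}
		call, tid := f[0], f[2] // e.g., "__tsan_acquire" and "tid=5"
		p := byTid[tid]
		if p == nil {
			p = &pending{}
			byTid[tid] = p
		}
		switch call {
		case "__tsan_read_pc":
			// only a read_pc whose last field names chansend starts a send
			p.sawSendPC = f[len(f)-1] == "chansend"
		case "__tsan_acquire":
			p.sawAcquire, p.lockAddr = p.sawSendPC, f[3]
		case "__tsan_release":
			if p.sawSendPC && p.sawAcquire && f[3] == p.lockAddr {
				fmt.Printf("send on channel %s by %s\n", p.lockAddr, tid)
			}
			*p = pending{} // reset for the next operation
		}
	}
}
```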
```
@InProceedings{fava.steffen:sbmf19,
  author    = "Daniel Fava and Martin Steffen",
  title     = "Ready, set, {G}o! {D}ata-race detection and the {G}o language",
  booktitle = "To appear in the pre-proceedings of the Brazilian Symposium on Formal Methods (SBMF)",
  year      = 2019,
  note      = "\url{http://arxiv.org/abs/1910.12643}",
}
```