Towards Jepsen-like tests #162
Comments
There's also the case when an action that returns a reference fails. In that case, subsequent actions that use the reference, and whose precondition requires the reference to exist in the model, will fail saying that the precondition was false.
Source: https://jepsen.io/consistency
@kderme: No, I don't think so. (It might make some tests easier to write, but shouldn't be necessary.)
The above PR adds the ability to complete histories. What completing a history does is simply append a response to each pending invocation. What's still not clear to me is: can we find bugs without completing?
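To make the idea of completing a history concrete, here is a minimal sketch (all names are hypothetical, not the library's actual types): every invocation that never received a response gets a synthetic response appended, so a linearisability checker only ever sees matched invoke/respond pairs.

```haskell
module Main where

-- Hypothetical event type: each operation is tagged with a process id.
data Event = Invoke Int String | Respond Int String
  deriving (Eq, Show)

-- Complete a history by appending a default response for every
-- invocation that has no matching response.
complete :: String -> [Event] -> [Event]
complete def evs =
  evs ++ [ Respond pid def | Invoke pid _ <- evs, pid `notElem` responded ]
  where
    responded = [ pid | Respond pid _ <- evs ]

main :: IO ()
main = do
  -- Process 2's write never got a response (e.g. it crashed).
  let h = [Invoke 1 "write 1", Respond 1 "ok", Invoke 2 "write 2"]
  mapM_ print (complete "<crashed>" h)
```

The interesting design question, raised above, is what the synthetic response should be when the transition function needs a real response to advance the model.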
@kderme is currently working on adding an example which uses rqlite (a distributed version of sqlite that uses Raft for consensus), hopefully this can serve as a test bed for experimenting with distributed systems and fault injection. See the following work-in-progress branch. As a first experiment, the idea is to try to trigger a stale read in the weak read consistency mode by either stopping and restarting nodes or by causing partitions (perhaps using blockade). |
I think the answer is yes and that there's a trade-off here:
I've also learned why Jepsen doesn't have a completion function: if an operation crashes, it advances the model purely from the request. This isn't possible in our case, as our transition function also involves the response. Let's make things a bit more concrete with an example: consider a simple counter that starts at
At this point the value of the counter could be
This seems weird: how can a read return two different values without an
Also note that the second thread cannot be used, as that could break the "single-threaded constraint: processes only do one thing at a time", as per the comment above.
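The counter situation can be sketched as follows (a toy illustration with assumed names and an assumed starting value of 0, not the actual test code): after an increment whose response was lost, the request may or may not have taken effect, so the model is in one of two candidate states, and a read of either value linearises.

```haskell
module Main where

newtype Counter = Counter Int
  deriving (Eq, Show)

-- Candidate models after an increment whose response was lost:
-- the increment may or may not have been applied.
afterCrashedIncr :: Counter -> [Counter]
afterCrashedIncr (Counter n) = [Counter n, Counter (n + 1)]

-- A read linearises if at least one candidate model explains it.
readLinearises :: Int -> [Counter] -> Bool
readLinearises v = any (\(Counter n) -> n == v)

main :: IO ()
main = do
  let candidates = afterCrashedIncr (Counter 0)
  print (readLinearises 0 candidates)  -- increment may not have happened
  print (readLinearises 1 candidates)  -- increment may have happened
  print (readLinearises 2 candidates)  -- no candidate explains this
```

This is why the checker cannot advance the model from the response alone: with a crashed operation, both candidate states have to be carried forward.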
Jepsen is a framework for testing distributed systems. It has been used to find
bugs in many systems, e.g. Kafka, Cassandra, RabbitMQ.
It is also based on performing randomly generated actions from multiple threads
and then using linearisation to ensure correctness. However, it is different in
that it has a thread, called the Nemesis, which is solely dedicated to fault
injection. Fault injections include skewed clocks, gc/io pauses (killall -s
STOP/CONT), lost packets, timeouts, and network partitions between the
distributed servers.
In order to account for the faults, Jepsen has a richer notion of history than
ours, which includes the possibility of operations failing or timing out. When an
action fails, we know for sure that it did not change the state of the system,
whereas if an operation timed out we don't know for sure (the state could have
changed, but the server's acknowledgement might not have reached us).
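That three-way distinction could be captured in a datatype along these lines (a hedged sketch with made-up names, not a proposed API): a failed call certainly left the state unchanged, while a timed-out one may or may not have.

```haskell
module Main where

-- Hypothetical outcome of an operation in a Jepsen-style history.
data Outcome resp
  = Ok resp    -- completed: the state definitely advanced
  | Failed     -- definitely did not change the state
  | TimedOut   -- unknown: the state may or may not have changed
  deriving Show

-- Only a failed operation lets us prune the "it happened" branch.
certainlyNoEffect :: Outcome resp -> Bool
certainlyNoEffect Failed = True
certainlyNoEffect _      = False

main :: IO ()
main = do
  print (certainlyNoEffect (Failed   :: Outcome ()))
  print (certainlyNoEffect (TimedOut :: Outcome ()))
  print (certainlyNoEffect (Ok ()))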
It would be neat if we could perform similar tests. Exactly how this is supposed
to work is still not clear. Some answers can perhaps be found in the original
linearizability paper or in the many blog posts and talks by "Aphyr", the author
of Jepsen, some of which are linked to in the README.
Perhaps a good start would be to only focus on failed actions to begin with.
This issue seems to pop up in some of our examples already, see discussion in
#159, and does not require a nemesis thread.
A first step might be to change the type of `Semantics` to account for possible
failures:

(This can later be extended to deal with timeouts.)

The `ResponseEvent` constructor will need to be changed accordingly, and
`linearise` as well. I guess the right thing to do in `linearise` is to not
update the model and not check the post-condition. We could also change the
post-condition to make assertions about `err` in addition to `resp`, but maybe
this is a refinement we can make when we see the need for it.
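One way the "don't update the model, don't check the post-condition on failure" rule could look, as a toy sketch (the names `step`, `postOk`, and the `Either Err Resp` shape are assumptions for illustration, not the actual `Semantics`/`linearise` types):

```haskell
module Main where

-- Hypothetical model and command types for a simple counter.
type Model = Int
type Err   = String

data Cmd  = Incr | Read          deriving Show
data Resp = Done | Value Int     deriving (Eq, Show)

-- Advance the model only on success; a failed action leaves it as-is.
step :: Model -> Cmd -> Either Err Resp -> Model
step m _    (Left _)  = m
step m Incr (Right _) = m + 1
step m Read (Right _) = m

-- Check the post-condition only on success; nothing to check on failure.
postOk :: Model -> Cmd -> Either Err Resp -> Bool
postOk _ _    (Left _)          = True
postOk m Read (Right (Value v)) = v == m
postOk _ _    (Right _)         = True

main :: IO ()
main = do
  let m0 = 0
      m1 = step m0 Incr (Right Done)               -- increment succeeded
      m2 = step m1 Incr (Left "connection reset")  -- increment failed
  print m2
  print (postOk m2 Read (Right (Value 1)))
```

Extending the post-condition to inspect the `err` side would just mean replacing the catch-all `Left` case of `postOk` with command-specific assertions.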
Thoughts?