Violation witness clarification #38

hernanponcedeleon · 2021-04-14T10:23:26Z

I'm working on extending Dartagnan for validation (currently with emphasis in violation witnesses for concurrency reachability where we only have one validator) and I have a couple of questions about (violation) witnesses.

Every paper I checked talk about (1) witness automata, validation as (2) product of program automaton and witness automaton and that the witness should (3) should restrict the search space. I understand these decisions from an historical perspective: the first verifiers / validators that were developed represent programs as automata. However, since currently SVCOMP have many tools not based on automata, such decisions start to seem rather restrictive. My understanding is however that these restrictions (or assumptions) are not really imposed by the standard, but rather by the tools implementing validation.

The standard says that the witnesses must conform to the GraphML format, thus syntactically following (1), but AFAIK it does not say anything about which semantics to give to the XML graph.

My idea is to convert the XML witness into an SMT formula (there are many ways of doing this with different advantages, see below) which in conjunction with the normal reachability formula would reduce the search space for the SMT solver (thus fulfilling (3)).

I think encoding the witness as an SMT formula (in my mind this is having BMC-validation) would overcome some of the limitations we currently have with validation:

Suppose we have the following program

1:   global x,y;
2:   thread-0           
3:   x++;                   

4:   thread-1
5:   y=1;

In general, the x++ instructions would result (in assembly or LLVM IR) in two memory accesses, one reading the initial value from memory and one writing to memory the value after adding 1. A verifier that reasons at this level can generate a witness with this scheduling 3 -> 5 ->3 (since begin/end_atomic() are not used, this context switch is possible).

Unfortunately CPAChecker (the only validator for concurrency currently available) assumes each C statement to be atomic and thus cannot validate the witness from above.

I have the impression (I might be wrong because I don't know much the details outside the reachability category) that violation witness validators rely on scheduling information. Unfortunately not every tool can generate such information (see e.g. the post by @vesalvojdani about witnesses in the data-races category here).

I think the SMT encoding of witnesses is very flexible. For example, if the verifier knows which statements are executed, but not in which order, we might encode that those lines should be executed but do not specify in which order. This would still reduce the search space, but not as much as interpreting also the order.

The XML graph induces an order but we can opt not to give it any meaning. This is where I think the difference I mentioned above between syntax and semantics of the graph needs to be clarified in the standard.

I think this flexibility would be useful too for witnesses in a (hopefully soon to come) category about weak memory models where instruction re-ordering and not-atomicity are possible.

Looking forward to see some discussion about the topic :D

The text was updated successfully, but these errors were encountered:

kfriedberger · 2021-04-15T08:24:44Z

For point 1)

A witness is represented as an automaton and it gives several possible program paths (see here), e.g., the automaton gives transitions for a program. The level of granularity for transitions is currently chosen to match program statements, because of an agreement of tool developers and validators. I am not sure where this is documented, but the overall understanding should be clear.
If the input language for your tool is LLVM-IR or assembly, then the witness could also match that representation. However, the number of validators for such cases is small. CPAchecker does currently not support lower-level languages (lower than C) and internally does not handle seperated read- and write-operations for transitions. If your tool expects seperate read- and write-operations for memory accesses, one needs to additionally specify this in the specification/documentation and, e.g., file an extension to the standard. It depends on the developers of validators to support that case.

For point 2)

Violation witnesses for concurrent programs of course rely on scheduling information, and it is represented as sequence of transitions in the witness, i.e., which thread applies which operation at which position in the trace. This is the most basic source of information for developers and shows the exact control flow that leads to a property violation. Also for data races, also the trace towards a property violation should be available, and it can be represented as sequence of program operations (currently neglecting read- and write operations).
It is possible to leave out (skip) program statements, causing the validator to look for transitions on its own. The so called "empty witness" consists of only two nodes (start and end node), with a single transition representing an arbitrary step in the program, e.g., entering the main function. This witness represents all possible traces towards the end node. CPAchecker will analyse the witness and try to find the property violation without further guidance (it might succeed to do so in many cases). CPAchecker will terminate after finding the first valid path, but in general, a validator can also report all possible paths represented by the witness.

An SMT-based encoding of program traces is also a good idea, but you need to give more details.
Maybe I did not yet understand your idea:

Are you writing about an intermediate step during validation or are you proposing a new format for witnesses?
How do you encode repeated assignments of variables?
This is already solved in the current GraphML-based format: seperate transitions, maybe using assumptions over variables.
How do you encode loops?
This is already solved in the current format: just build cycles in the automaton.
How is a non-SMT-based verification tool supposed to use the witness?
It is possible to transform a automaton-based witness into SMT, but not the other way around. Thus, the automaton-format might have a larger user-group and a validator that requires SMT-formulas can generate them from the automaton.
Can every verification tool export such an SMT-based witness?
Of course, transforming a trace inot SMT is possible, but can this be done in a consistent way for all tools? Alternatively, exporting an automaton might be easier, even for purely SMT-based tools.
How do purely SMT-based tools report information towards the user? I assume that there needs to be a readable sequence of statements anyway.

Of course, there is the possibility to extend the specification and, e.g., enrich the automaton with additional information based on SMT. For example, there is currently no possibility for some invariants to be annotated as pure C-expressions, because of missing quantifiers.

MartinSpiessl · 2021-04-15T08:25:46Z

My understanding is however that these restrictions (or assumptions) are not really imposed by the standard, but rather by the tools implementing validation.

Which standard are you referring to? You mean this repository? Generally this repository tries to give a formal description of the witness format. For the semantics I would look into the corresponding papers. For violation witnesses there are state-space guards and sink states in the witness automaton, their semantics is to restrict state space, cf. http://research.stahlbauer.net/FSE-2015-Testification.pdf :

The exploration of the state space can be restricted either by transitions to sink control states (that have no outgoing transitions, and thus the path exploration ends) or by restricting the state space using state-space guards at transitions.

The only thing that is really under specified is how the CFA for a program is constructed.

In general, the x++ instructions would result (in assembly or LLVM IR) in two memory accesses

We usually verify against the C standard(and GCC when it comes to implementation-defined behavior). The C standard itself does not state whether this are two memory accesses. In general for a program without data races the C standard usually guarantees that the outcome is identical to one with a fixed order of statements (which might not be unique -> race conditions). But I remember having discussed this before with you, some of the atomic functions in C11 might have introduced a way to explicitly marking data races as OK, but my memory is foggy. Do you have an example for this at hand? iirc one can give a relaxed memory order e.g. to this function:

C atomic_fetch_add_explicit( volatile A* obj, M arg, memory_order order );

So my guess is with x++ you really meant such a relaxed operation instead.

If you want to analyze programs with data races by assuming some memory model (x86-TSO?) then yes, a extension of the witness format would be needed, and at that point we could also be open to a completely new format (I see SMT just as one of many proposals, this should be discussed once it is clear what we actually want. To me it sounds like an SMT formula also would have the same problem, namely that you need to know about the interleaving).

I think it is most productive if you could provide an example program and example SMT witness such that we can figure out where the requirements are that are not fulfilled by the current format.

hernanponcedeleon · 2021-04-15T09:31:58Z

I think my description was not clear enough. To clarify I'm not proposing to change the witness format. As Martin said, this repo seems to focus mostly on the witness format (syntax). What it is not 100% clear to me is what must be the meaning of e.g. a graph edge.

My goal is to convert a witness graph into an SMT formula. This formula in conjunction with the formula representing all executions of the program and the formula representing the reachability condition would give me a BMC-based witness validator. No changes are required by developers of other tools, I just want to understand the meaning of each graph element (I need this to faithfully convert the graph into the formula) and hopefully clarify what is assumed by current validators (for technical reasons) and what is imposed by the standard (witness semantics).

For point 1)

A witness is represented as an automaton and it gives several possible program paths (see here), e.g., the automaton gives transitions for a program.

There are different ways of representing a program path. An automaton is one way, an SMT formula is another. My concern is that most papers seems to assume that program paths are explicitly represent as automata, which is clearly not true for all tools.

The level of granularity for transitions is currently chosen to match program statements, because of an agreement of tool developers and validators. I am not sure where this is documented, but the overall understanding should be clear.

I understand this agreement for technical reasons, but I don't think the C standard imposes this (see below my answer to Martin's comment) and I think we should be more general. One possibility would be to add a new graph tag atomic-statement:: Bool which tells the validator if instructions giving rise to transitions are considered atomic or not. We can be more fine grained ad use this tag at the edge level.

For point 2)

Violation witnesses for concurrent programs of course rely on scheduling information, and it is represented as sequence of transitions in the witness, i.e., which thread applies which operation at which position in the trace.

I don't fully agree with this. Suppose you have an assertion assert(x==1) and you can via e.g. constant propagation determine that the value of x at that point will be 0 not matter what the scheduling is. Then the assertion is violated, and you have an analysis that can soundly detect this, but it cannot generate any scheduling information. This is the kind of limitation that was mentioned by one of the goblint developer in the issue I linked in my original post.

It is possible to leave out (skip) program statements, causing the validator to look for transitions on its own. The so called "empty witness" consists of only two nodes (start and end node), with a single transition representing an arbitrary step in the program, e.g., entering the main function. This witness represents all possible traces towards the end node. CPAchecker will analyse the witness and try to find the property violation without further guidance (it might succeed to do so in many cases). CPAchecker will terminate after finding the first valid path, but in general, a validator can also report all possible paths represented by the witness.

I know that many tools are exploiting this empty witnesses (which kind of defeats the goals of having validation in the first place). However I think there are witnesses in the middle of the spectrum (left-hand-side: empty witnesses, right-hand-side: fully scheduling information). For example, for data races, it could be useful if the witness could say which two instructions form the data race (not as useful as the full scheduling, but definitely more useful than an empty witness). So, can I add this information in the tag of an edge without CPAChecker trying to match it with some program transition?

An SMT-based encoding of program traces is also a good idea, but you need to give more details.
Maybe I did not yet understand your idea:

As I said, I'm not proposing to change the format to SMT, but rather to understand what is the meaning of the graph elements so it would allow me to convert such witnesses into a logical formula.

Are you writing about an intermediate step during validation or are you proposing a new format for witnesses?

How do you encode repeated assignments of variables?
This is already solved in the current GraphML-based format: seperate transitions, maybe using assumptions over variables.

BMC techniques convert the program to the SSA form. The witness assumption would still refer to variables. It is the job of the validator developer to convert this to the corresponding SSA variable.

How do you encode loops?
This is already solved in the current format: just build cycles in the automaton.

This will definitely be a limitation of BMC-based-validation since BMC tools unroll loops. In principle we would have to do the same trick that verification tools do and thus the SMT formula generated from the witness will just represent bounded witness-executions.

How is a non-SMT-based verification tool supposed to use the witness?
It is possible to transform a automaton-based witness into SMT, but not the other way around. Thus, the automaton-format might have a larger user-group and a validator that requires SMT-formulas can generate them from the automaton.

Can every verification tool export such an SMT-based witness?
Of course, transforming a trace inot SMT is possible, but can this be done in a consistent way for all tools? Alternatively, exporting an automaton might be easier, even for purely SMT-based tools.

How do purely SMT-based tools report information towards the user? I assume that there needs to be a readable sequence of statements anyway.

The 3 questions above have the same answer. I hope by now it is clear, the witness format will not change so non-SMT-based verification tools should interpret witnesses in the same way they are doing it now.

We usually verify against the C standard(and GCC when it comes to implementation-defined behavior). The C standard itself does not state whether this are two memory accesses. In general for a program without data races the C standard usually guarantees that the outcome is identical to one with a fixed order of statements (which might not be unique -> race conditions).

The C standard does not say this are two accesses, but it does not say they are not either. My understanding is that for the program below, a witness saying "Starts executing line 3, execute line 5, finish executing line 5" should be valid, and unfortunately current validator cannot validate such witness.

1:   global x,y;
2:   thread-0           
3:   x++;                   

4:   thread-1
5:   y=1;

Notice that this notion of atomicity is independent of races. The program above suffers this "problem" but does not have a race.

But I remember having discussed this before with you, some of the atomic functions in C11 might have introduced a way to explicitly marking data races as OK, but my memory is foggy. Do you have an example for this at hand? iirc one can give a relaxed memory order e.g. to this function:

These two programs have the same executions, the first has a race, the second does not (according to C11)

thread-0
x=1;

thread-1
x=2;

thread-0
atomic_store(&x, 1, memory_order_relaxed);

thread-1
atomic_store(&x, 2, memory_order_relaxed);

hernanponcedeleon mentioned this issue Apr 26, 2021

Benchmarks for weak memory models sosy-lab/sv-benchmarks#1297

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Violation witness clarification #38

Violation witness clarification #38

hernanponcedeleon commented Apr 14, 2021

kfriedberger commented Apr 15, 2021

MartinSpiessl commented Apr 15, 2021

hernanponcedeleon commented Apr 15, 2021

For point 1)

For point 2)

Violation witness clarification #38

Violation witness clarification #38

Comments

hernanponcedeleon commented Apr 14, 2021

kfriedberger commented Apr 15, 2021

For point 1)

For point 2)

MartinSpiessl commented Apr 15, 2021

hernanponcedeleon commented Apr 15, 2021

For point 1)

For point 2)