
Discussion: Operations and Intrinsics #21

Open
bitwalker opened this issue Dec 4, 2019 · 1 comment
Labels
backend:codegen (Anything related to codegen), backend:ir (Anything related to IRs), discussion (Open discussions on topics for exploration), RFC (Proposals for new/modified functionality)

@bitwalker
Member

I decided to open up our discussion here @hansihe, since Slack is a bit awkward for long form writing.

Background

Hans and I had started talking about what kinds of operations (aka instructions), and/or intrinsics, EIR should have, and whether new ones are necessary/desirable. In particular, I'd like EIR to be expressive enough so that backends can generate efficient code from it; and more generally, ensure EIR is granular enough so that passes have plenty to work with for optimizations.

I'm in favor of adding some additional operations, and possibly defining some BIF-like functionality as intrinsics. Hans disagrees.

Operations vs Intrinsics

One thing I want to clarify is how, and why, I'm distinguishing between intrinsics and operations (i.e. instructions) in the IR.

Intrinsics

Intrinsics are:

  • Opaque during high-level optimization
  • No observable semantics from the perspective of the IR
  • Intended to be passed through unmodified to whatever tier of the compiler cares about them (this might be the code generator, or it may be passes which run late in the pipeline).
  • Meant to be eventually lowered to more primitive IR, and/or used as hints for optimization
    • The "more primitive IR" may be other EIR instructions, or another IR used by a later stage of the compiler. Ultimately, the intrinsics are stripped, leaving primitive operations in their place, or nothing at all.

The opaque nature of intrinsics is also their weakness: they cannot take advantage of any optimizations until they have been lowered, which may occur quite late in the pipeline. The result in such cases is either subpar code generation, or duplicate runs of early passes.

Generally speaking, intrinsics are best used as hints, or to invoke target-specific behavior. I think in this regard, the approach to intrinsics in EIR is on point, as they are entirely opaque.

Examples

A few examples from LLVM of things which are implemented as intrinsics rather than operations.

  • llvm.prefetch, hints to the code generator that some data should be prefetched, if supported by the target. This is purely a hint, and does not change the semantics of the program
  • llvm.memcpy, informs the code generator that a memcpy should be performed. Rather than inserting a call to memcpy in libc, LLVM allows the backend to choose how to lower the intrinsic (which in theory could be a runtime call, but in practice is done by generating more precise code directly). Semantically, this is just a function call, but it is not necessarily lowered as one.
  • llvm.sadd.with.overflow.*, informs the code generator that a signed addition with overflow should be performed. The precise lowering of this intrinsic is dependent on the target, using hardware support where present, or emulation as a fallback. Like llvm.memcpy, this is semantically a function call.
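To make the llvm.sadd.with.overflow contract concrete, here is a minimal Rust sketch of its semantics (the function name is mine; Rust's overflowing_add exposes the same value-plus-flag result the intrinsic produces):

```rust
// Sketch of the semantics of llvm.sadd.with.overflow.i32: a pure function
// returning the wrapped sum and an overflow flag. The backend is free to
// lower this to a hardware flag check or an emulated comparison.
fn sadd_with_overflow(a: i32, b: i32) -> (i32, bool) {
    a.overflowing_add(b) // wrapped result + "did it overflow" flag
}

fn main() {
    assert_eq!(sadd_with_overflow(1, 2), (3, false));
    assert_eq!(sadd_with_overflow(i32::MAX, 1), (i32::MIN, true));
}
```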

Operations

Conversely, operations:

  • Have observable semantics, and are thus not opaque
  • Are intended to participate in optimization and analysis
  • May also function as hints, but are fully general (i.e. they are not target-specific hints)
  • Are directly lowered to some equivalent operation, or to a set of equivalent, but more primitive operations

All tiers of the compiler operating on the IR must respect the semantics of each operation. In many cases, this happens explicitly, by handling each operation individually. In others, it happens implicitly. For example, writing a compiler pass for an optimization may only directly operate on a subset of operations, but must still preserve the semantics of all operations. This is the weakness of operations - they incur additional maintenance burden and cognitive load.

However, because of their nature, operations are what give an IR its expressive power. The granularity of IR operations directly impacts the kind of optimizations that can be performed - after all, how do you optimize a more general operation to one of higher specificity, if you have nothing but general operations?

So generally speaking, operations are best used as building blocks of varying generality, for things which have a direct impact on the program semantics/behavior.

Examples

Again, from LLVM, some examples of operations (instructions) which illustrate the difference with intrinsics.

  • call, a function calling primitive
  • br, a basic if/else control flow primitive
  • switch, a generalization of br which allows branching to multiple different locations
  • unreachable, acts as a hint for optimization, but its semantics are "undefined behavior". What is notable here is that this operation is fully general, and combined with its prominent role in optimization, it makes sense to have it as an operation rather than as an intrinsic
  • add/sub, some fundamental arithmetic operations. It would be hard to imagine these as intrinsics, since that would preclude constant propagation in many obvious cases, as well as other optimizations that take advantage of the semantics of these operations to combine them, rearrange them, etc., in various ways.
  • extractelement, an example of aggregate access which permits precise access to an element without copying/moving the value.
  • alloca, indicates that some memory should be dynamically allocated on the stack, rather than the heap
  • load/store, primitive memory read/write operations

These operations range from the more general to the more primitive. LLVM IR is quite low-level overall, so many of these operations wouldn't make sense at the level of EIR; but that isn't my point. I reference LLVM IR only to illustrate how the distinction between operations and intrinsics is handled in a very broadly used IR.

Problem

So I've laid out the difference between operations and intrinsics, but I haven't really explained why I'm bringing it up to begin with. To illustrate, I think it would help to look in detail at an issue I raised about one area of the IR:

When lowering Match, and dealing with a branch with a MatchKind type Tuple(arity), ListCell, or MapItem, the semantics indicate that one of the following occurs if matched:

  • Tuple(arity), then all arity elements in the tuple must be "unpacked" or loaded (or at the very least, pointers to the elements must be calculated)
  • ListCell requires that the head and tail of the cell be unpacked.
  • MapItem requires that a given key and value are unpacked

The unpacked values are then given as block arguments to the successor block of the branch. However, some or all of the values may not be live in the successor block at all, so we'd like to avoid generating those accesses at all.

Unfortunately, this is where we hit a limitation of the current IR. We can't eliminate the unnecessary accesses in EIR at all, because there are no fine-grained operations for accessing just one element of a tuple, or just the head/tail of a cons cell, etc. Instead, we have to lower EIR to another IR that does have those operations, then write a pass which eliminates the dead operations, and updates the argument list of the successor (as well as any other branches to that block). That is really not desirable - if EIR is to be a foundation on which all kinds of analyses and optimizations are based, it needs to be able to handle this kind of optimization.
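As a sketch of what such a pass could look like if EIR had fine-grained access operations, here is a hypothetical dead-unpack elimination over a successor's argument list (all types and names are illustrative, not actual EIR APIs):

```rust
// Hypothetical: each unpacked value feeding the successor block is a
// per-element access; `live` records which successor block arguments are
// actually used. With fine-grained ops in the IR, pruning dead accesses is
// a simple filter; today this has to happen in a lower-level IR instead.
#[derive(Debug, Clone, PartialEq)]
struct ExtractElement {
    index: usize, // element index within the matched tuple/cell
}

fn prune_dead_unpacks(unpacks: Vec<ExtractElement>, live: &[bool]) -> Vec<ExtractElement> {
    unpacks
        .into_iter()
        .zip(live.iter())
        .filter(|(_, &is_live)| is_live)
        .map(|(unpack, _)| unpack)
        .collect()
}

fn main() {
    // Match on a 3-tuple, but only elements 0 and 2 are live downstream.
    let unpacks = vec![
        ExtractElement { index: 0 },
        ExtractElement { index: 1 },
        ExtractElement { index: 2 },
    ];
    let kept = prune_dead_unpacks(unpacks, &[true, false, true]);
    assert_eq!(
        kept,
        vec![ExtractElement { index: 0 }, ExtractElement { index: 2 }]
    );
}
```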

To be clear, I'm not saying EIR must support these operations; a backend can handle these things on its own. But without more precision, EIR becomes significantly less useful for analysis and optimization, and it ends up making more sense to lower out of EIR as soon as possible and do all the optimization work in a more expressive IR, to avoid unnecessary duplication of effort. We're all better off if we can do more in EIR, and only lower out of it once most optimization has already been performed.

Proposal

I'd like to suggest the following additions to the IR. Note that I'm using a more SSA-like syntax here, but that is just to keep things readable.

Key:

  • %name, a virtual register/block argument/SSA-like value
  • T, a statically known type
  • ^block, a block name

Operations:

  • %val = extractelement(%term, %index_or_key), a general aggregate access operation
    • on tuples, it unpacks a specific element at the provided zero-based index
    • on cons cells, it unpacks the head via index 0, and the tail via index 1
    • on maps, it unpacks the value with the given key
    • alternatively, these could each be made into different operations to allow for better analysis
      • %val = OP(%cons) # where OP = car|cdr|head|tail
      • %val = extractvalue(%map, %key)
  • updateelement(%term, %index_or_key, %val), an in-place aggregate update operation
    • again, could also break this up into multiple typed operations
    • this is explicitly to enable optimizations which identify places where in-place mutation of a data structure is safe
  • is_type<T>(%term, ^yes, ^no) and is_type<T>(%term1, .., %termN, ^yes, ^no)
    • branches to ^yes if %term (or all terms) are of type T, ^no otherwise
    • can be lowered to a runtime call, bit masking/shifting, or elided in favor of a direct branch if analysis can prove that %term (or any term) is or isn't T statically
    • speaking of direct branches...
  • br ^block(%args..)
    • we currently unify branches with function calls, but I think it would make sense to define a branch operation and lower control-flow calls to explicit branches once we've resolved what is and isn't a "real" call. It will make for easier writing of analysis and optimization passes to boot

From my perspective, the above are pretty essential, and likely don't cover the full range of things that would be helpful to have in the IR, but certainly hit the major pain points. Feel free to bikeshed the naming/etc., but the semantics are really the important part.
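For concreteness, here is a rough Rust sketch of how these additions might be modeled as IR variants. Every type and name here is hypothetical, not the existing EIR representation:

```rust
// Hypothetical placeholder handles standing in for real IR values/blocks.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Value(u32);
#[derive(Debug, Clone, Copy, PartialEq)]
struct Block(u32);
#[derive(Debug, Clone, Copy, PartialEq)]
enum Type { Fixnum, Tuple, Cons, Map }

// The proposed operations, modeled as enum variants.
#[derive(Debug)]
enum Op {
    // %val = extractelement(%term, %index_or_key)
    ExtractElement { term: Value, index_or_key: Value, out: Value },
    // updateelement(%term, %index_or_key, %val)
    UpdateElement { term: Value, index_or_key: Value, val: Value },
    // is_type<T>(%term1, .., %termN, ^yes, ^no)
    IsType { ty: Type, terms: Vec<Value>, yes: Block, no: Block },
    // br ^block(%args..)
    Br { target: Block, args: Vec<Value> },
}

fn main() {
    // e.g. is_type<fixnum>(%a, %b, ^fast_math, ^slow_math)
    let op = Op::IsType {
        ty: Type::Fixnum,
        terms: vec![Value(0), Value(1)],
        yes: Block(1),
        no: Block(2),
    };
    assert!(matches!(op, Op::IsType { ty: Type::Fixnum, .. }));
}
```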

Things this enables:

  • Precise access of aggregate values
  • Elision of unnecessary accesses
  • In-place mutation
  • Type checking/monomorphization, i.e. is_type<fixnum>(%a, %b, ^fast_math, ^slow_math)
  • Lowering of conditional/multiway branches to direct branches when static analysis can prove which branch will be hit

Problems this introduces:

  • The IR currently already has multiple stages it goes through where various constructs are not used (i.e. Case is not present after pattern compilation). This would potentially introduce additional cases of this, as some operations would only be present after optimizations (br, updateelement).
  • Additional internal complexity, though I would argue that is unavoidable

A Note On Dialects

One thing that I think may alleviate the above is leaning in to the concept of "dialects" which kinda/sorta exists in EIR, but not really. MLIR is a fantastic example of how dialects can work in practice, and I would suggest taking a look if you haven't already.

In short though, you start with a high-level dialect, and then lower through progressively more primitive dialects as operations in the high-level dialect get converted to their constituent operations in lower-level dialects. However, multiple dialects can be in play at the same time, so it isn't your typical compiler pipeline that lowers from A -> B -> C, it is more like A -> AB -> ABC -> BC -> C.

To that point, each dialect can define partial or full conversions to another dialect, and in the case of MLIR, it will handle transitive conversions as long as there is a path from dialect A to dialect Z. For example, if I define an EIR dialect, and a conversion to the MLIR Standard dialect, I can convert directly from EIR to the LLVM IR dialect automatically, since the Standard dialect defines a conversion to LLVM IR dialect. Passes can perform partial lowerings of operations from one dialect to another as part of performing an optimization, which matches quite well to some of the patterns I mentioned earlier.

The only catch is that I'm not sure how best to design a dialect system in Rust. MLIR has the flexibility of C++ to work with, while Rust is much more strict (thankfully). That said, I think the most important pieces would fit naturally in the trait system, so it may be worth modeling the skeleton of MLIR's hierarchy in Rust to see where it falls apart, rather than discounting it completely.
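As a starting point for that experiment, here is a minimal trait-based sketch of a dialect with a conversion into another dialect, along the lines of MLIR's partial lowering. Everything here (dialect names, op names, the conversion shape) is hypothetical:

```rust
// Hypothetical skeleton: a dialect is a marker type, and a conversion maps
// an op in the source dialect to an op in the target dialect.
trait Dialect {
    const NAME: &'static str;
}

struct Eir;
impl Dialect for Eir {
    const NAME: &'static str = "eir";
}

struct Standard;
impl Dialect for Standard {
    const NAME: &'static str = "std";
}

// A conversion from Self's dialect into Target's dialect. Returning None
// means the op is left alone, i.e. a partial lowering: ops from both
// dialects can coexist until a later pass finishes the job.
trait ConvertTo<Target: Dialect>: Dialect {
    fn convert_op(op: &str) -> Option<String>;
}

impl ConvertTo<Standard> for Eir {
    fn convert_op(op: &str) -> Option<String> {
        match op {
            // Made-up op names, purely to show the shape of the mapping.
            "eir.is_type" => Some("std.cmp".to_string()),
            _ => None, // stays in the EIR dialect for now
        }
    }
}

fn main() {
    assert_eq!(
        <Eir as ConvertTo<Standard>>::convert_op("eir.is_type"),
        Some("std.cmp".to_string())
    );
    assert_eq!(<Eir as ConvertTo<Standard>>::convert_op("eir.match"), None);
}
```

Transitive conversions (EIR -> Standard -> LLVM) would need a registry or blanket impls on top of this, which is exactly where the C++-vs-Rust friction would show up.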

@hansihe
Member

hansihe commented Dec 5, 2019

I'm in favor of adding some additional operations, and possibly defining some BIF-like functionality as intrinsics. Hans disagrees.

I said that I disagreed to a certain degree, and I think the discussion would have been a bit more productive if you had waited for at least the base of my reasoning.

I am not against having intrinsics in general. True intrinsics are already used for a couple of things in the IR, namely representing the semantics needed for receive blocks.

I mainly took issue with the way you wanted to handle guard functions, and more generally how you wanted to handle some BIFs as intrinsics.

I think it's useful to define a couple of different cases:

  • Constructs that should be handled as intrinsics by some passes, but that correspond 1:1 to Erlang BIFs.
    • erlang:byte_size/1 - Can quite conceivably be handled as an intrinsic by an optimization pass, but doesn't need to be special-cased in any way. Simply having it be a normal function call to a CaptureFunction PrimOp lets other optimization passes transparently propagate constants to its MFA, potentially turning it into a constant CaptureFunction. It can then be trivially handled as an intrinsic in the relevant optimization passes, while keeping the exact same format in the cases where it is called dynamically.
    • There are countless other BIFs like this.
  • Promoted BIFs. These are BIFs that could be handled by the first mechanism just fine, but that have operations/primops that represent its semantics exactly, and give some direct advantages when promoted. Some examples:
    • erlang:</2 and the other compare BIFs. These have PrimOps that represent their semantics, and using those would give us free deduplication and potentially make them get scheduled in a more optimal spot.
    • erlang:apply - Whenever we resolve a static call to this, it should be promoted into either a direct function Call, or into a function Call to a CaptureFunction
  • True intrinsics. These may, but in general do not, have a 1:1 relationship to Erlang BIFs.
    Examples of these are:
    • receive_* - used for representing the semantics of receive blocks. These exist and are used today.
    • Anything else that does not fit into the other classes.

Proposal

  • %val = extractelement(%term, %index_or_key), a general aggregate access operation
    • on tuples, it unpacks a specific element at the provided zero-based index
    • on cons cells, it unpacks the head via index 0, and the tail via index 1
    • on maps, it unpacks the value with the given key
    • alternatively, these could each be made into different operations to allow for better analysis
      • %val = OP(%cons) # where OP = car|cdr|head|tail
      • %val = extractvalue(%map, %key)
  • updateelement(%term, %index_or_key, %val), an in-place aggregate update operation
    • again, could also break this up into multiple typed operations
    • this is explicitly to enable optimizations which identify places where in-place mutation of a data structure is safe

Both of these generally sound like a good idea to me, and we will need to have some variant of these eventually.

  • is_type<T>(%term, ^yes, ^no) and is_type<T>(%term1, .., %termN, ^yes, ^no)
    • branches to ^yes if %term (or all terms) are of type T, ^no otherwise
    • can be lowered to a runtime call, bit masking/shifting, or elided in favor of a direct branch if analysis can prove that %term (or any term) is or isn't T statically

This is quite trivially represented by a match with one branch matching on the type, and a fallback wildcard branch. I don't really see the value in having a very specific operation like this when the exact same semantics can be trivially expressed with a strictly more powerful operation.

It could be argued that it is desirable to have a variant of match that only checks a single condition instead of a list of them, but I don't see any reason to restrict it to type dispatch only.

  • br ^block(%args..)
    • we currently unify branches with function calls, but I think it would make sense to define a branch operation and lower control-flow calls to explicit branches once we've resolved what is and isn't a "real" call. It will make for easier writing of analysis and optimization passes to boot

This is not really true. The call operation has two variants:

  • ControlFlow - This is required to be a call to a local block. It cannot create a stack frame. This would be the branch operation you describe.
  • Function - This is required to create a new stack frame. If the call is to a block, the only way control flow can escape the scope of that block is through a call to one of the function's escape continuations. This makes it illegal for any of the outer function's escape continuations to be accessed from within the inner function.

These semantics are preserved in all stages of the IR.
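A minimal sketch of that distinction (the types are hypothetical, not the actual EIR definitions):

```rust
// Hypothetical model of the two call variants described above.
#[derive(Debug, PartialEq)]
enum CallKind {
    // Call to a local block: never creates a stack frame. This already
    // plays the role of the proposed `br` operation.
    ControlFlow,
    // Always creates a new stack frame; control can only escape through
    // one of the function's escape continuations.
    Function,
}

// A pass that wants "just the branches" can select on the variant.
fn is_branch(kind: &CallKind) -> bool {
    *kind == CallKind::ControlFlow
}

fn main() {
    assert!(is_branch(&CallKind::ControlFlow));
    assert!(!is_branch(&CallKind::Function));
}
```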

Dialects

I agree that having some way to represent this would be really beneficial, and I have a couple of ideas I would like to try out.
