Proof-of-concept draft: Add a simple vector extension to femtorv #53

Open: mbitsnbites wants to merge 2 commits into base: master

@mbitsnbites mbitsnbites commented Jan 10, 2022

Here is a draft of my ideas that I came up with yesterday.

Caveat emptor

Take it for what it is: A sketchbook proof of concept experiment, totally untested. I did not even try to build it, and I am pretty sure that some state transitions will not work properly for certain vector operations.

Functionality

As described in the code comment, this change tries to:

  • Map vector registers on top of the scalar register file.
  • Add the VSETVL instruction (from the V extension) to manipulate the vector length (VL) register.
  • Add logic for iterating over the vector register elements while staying in the EXECUTE or EXECUTE+WAIT_ALU_OR_MEM states, until VL vector elements have been processed.
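To make the three points above concrete, here is a small software model in C. It is only a sketch under assumptions that are not spelled out in this description: a maximum vector length of 4 (`MAX_VL`), element `i` of vector register `Vn` living in scalar register `4*n + i`, and a simplified VSETVL that just clamps the requested length. All function names are hypothetical and do not appear in the patch.

```c
#include <stdint.h>

#define NUM_SCALAR_REGS 32
#define MAX_VL 4 /* assumed build-time maximum vector length */

/* Assumed mapping: element `elem` of vector register `vreg` is stored in
 * scalar register vreg * MAX_VL + elem (so V0 overlaps x0..x3). */
unsigned vec_elem_reg(unsigned vreg, unsigned elem) {
    return vreg * MAX_VL + elem;
}

/* Simplified VSETVL: request `avl` elements, get at most MAX_VL. */
uint32_t model_vsetvl(uint32_t avl) {
    return avl < MAX_VL ? avl : MAX_VL;
}

/* Model of a vector add: one element per step, similar to staying in
 * EXECUTE until VL elements have been processed. */
void model_vadd(uint32_t regs[NUM_SCALAR_REGS],
                unsigned vd, unsigned vs1, unsigned vs2, uint32_t vl) {
    for (uint32_t vec_idx = 0; vec_idx < vl; ++vec_idx) {
        unsigned rd = vec_elem_reg(vd, vec_idx);
        if (rd != 0) { /* x0 stays hard-wired to zero */
            regs[rd] = regs[vec_elem_reg(vs1, vec_idx)]
                     + regs[vec_elem_reg(vs2, vec_idx)];
        }
    }
}
```

With this mapping, 32 scalar registers give eight vector registers, and V0 overlaps x0, which would explain why V0 cannot be used.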

Instruction encoding

Vector instructions are encoded using the two least significant bits of the instruction word: in RV32I these bits are always 11, so any other value indicates a vector operation. This is not compatible with the C extension, which uses the other values of these bits for its compressed instructions, so some other encoding trick must be used if you want to support that (I am not very versed in RISC-V instruction encoding, but the CUSTOM_0 - CUSTOM_3 pages could be a possibility).
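The decode rule can be sketched in a few lines of C. The function names are hypothetical; the only thing taken from the description is that low bits other than 11 mark a vector operation.

```c
#include <stdbool.h>
#include <stdint.h>

/* In RV32I (without the C extension) the two low bits of every
 * instruction word are 0b11, so anything else is treated as vector. */
bool is_vector_op(uint32_t insn) {
    return (insn & 0x3u) != 0x3u;
}

/* The scalar instruction that a vector encoding was derived from. */
uint32_t scalar_equivalent(uint32_t insn) {
    return insn | 0x3u;
}
```

For example, `0x0045a383` is the plain `lw t2, 4(a1)`, while `0x0045a381` differs only in the low bits and would be decoded as a vector load.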

Bugs / refactoring

I think that the source register lookup and the destination register index (rdId) are broken for multi-cycle instructions (load/store/div). Specifically, vecIdx is not always updated in the right state/cycle.

Furthermore, the source register lookup is currently done in two different places (really it needs to be done in three different places IIUC). It feels like this part can be refactored to solve both the out-of-sync vecIdx problem and possibly reduce LUT usage.

Possible improvements

More vector registers

The current implementation only provides eight vector registers, of which 3-5 are usable in practice (V0 can never be used, and some scalar registers must be spared for scalar operations). It would be very simple, and valuable, to add more vector registers. All that is required is to double (or quadruple?) the number of scalar registers in registerFile. It is mostly a matter of balancing the size of the core (e.g. the number of LUTs).

Stride based load/store

Another feature that I have not added, but that is quite powerful, is support for on-the-fly generation of address strides. I think that a feasible solution would be to add special handling of the case when src2IsVec = 1 and src2 is an immediate value (e.g. isLoad | isStore | isALUimm), such that the immediate value is replaced by an incrementing (registered) value as follows: [0, IMM, 2*IMM, 3*IMM, ...].
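As a rough illustration of the stride idea, the C model below produces the sequence [0, IMM, 2*IMM, ...] with a running accumulator instead of a multiplier, which is how I would expect a registered value to behave in hardware. The function names and the byte-addressed memory model are assumptions made for this sketch only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fills offsets[0..vl-1] with 0, imm, 2*imm, ... using an accumulator
 * that models the incrementing (registered) value. */
void stride_offsets(int32_t imm, size_t vl, int32_t *offsets) {
    int32_t acc = 0;
    for (size_t i = 0; i < vl; ++i) {
        offsets[i] = acc;
        acc += imm;
    }
}

/* Model of a strided vector load of 32-bit words: element i is read
 * from base + i * imm (in bytes). */
void model_strided_lw(const unsigned char *base, int32_t imm, size_t vl,
                      uint32_t *dst) {
    int32_t acc = 0;
    for (size_t i = 0; i < vl; ++i) {
        memcpy(&dst[i], base + acc, sizeof dst[i]);
        acc += imm;
    }
}
```

With IMM = 4 this degenerates to a contiguous load of 32-bit words; IMM = 8 picks every second word.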

Writing programs

This is of course a major problem at the moment. No compiler / toolchain supports the new vector instructions (except for VSETVL).

For prototyping purposes I would personally only write vectorized code directly in assembly language (that also gives better control over scalar register allocation), by first compiling the corresponding scalar code, and then hand-modifying the generated machine code to use the encoding for vector instructions (i.e. modify the 2 LSBs), and emitting them as .word directives.

For instance, the following C code:

void foo(int* dst, const int* src, int num) {
    for (int i = 0; i < num; ++i) {
        dst[i] = src[i];
    }
}

...could be implemented in RISC-V assembler with vector instructions (assuming that stride based load/store is supported):

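foo:
	blez	a2, .L2			# nothing to copy if num <= 0
.L1:
	vsetvl	a4, a2, zero		# a4 = VL = elements handled in this pass
	.word	0x0045a381	# lw	v7, +4(a1)  (immediate 4 = stride)
	.word	0x00752221	# sw	v7, +4(a0)  (immediate 4 = stride)
	sub	a2, a2, a4		# remaining elements
	addi	a0, a0, 16		# advance dst by 4 words
	addi	a1, a1, 16		# advance src by 4 words
	bnez	a2, .L1
.L2:
	ret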

As hand-encoding instructions this way is far from convenient, in the long run you probably want to patch some toolchain (e.g. binutils/as) to support these instructions to some degree.
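Short of patching a toolchain, the bit-patching step could at least be scripted. The C helpers below are a sketch of that idea, not part of this change: the names are hypothetical, and which non-11 bit pattern to use is left as a parameter (both .word values in the example above end in 01).

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Replaces the two least significant bits of a scalar RV32I encoding
 * with `low_bits` (expected to be something other than 0b11). */
uint32_t to_vector_encoding(uint32_t scalar_insn, uint32_t low_bits) {
    return (scalar_insn & ~UINT32_C(0x3)) | (low_bits & UINT32_C(0x3));
}

/* Formats an assembler line such as "\t.word\t0x0045a381\t# comment".
 * Returns the value of snprintf, i.e. the length of the full line. */
int format_word_directive(char *buf, size_t size, uint32_t insn,
                          const char *comment) {
    return snprintf(buf, size, "\t.word\t0x%08" PRIx32 "\t# %s",
                    insn, comment);
}
```

For instance, `to_vector_encoding(0x0045a383, 0x1)` turns the scalar `lw t2, 4(a1)` into the `0x0045a381` word used in the example.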

This is a proof of concept implementation that maps virtual vector
registers onto the scalar register file.
Implement VSETVL and vector op state transitions.

Also make the vector length build-time configurable.