Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

part-selects ($shift, $shiftx) create huge MUX-trees #3833

Closed
phsauter opened this issue Jul 6, 2023 · 25 comments · Fixed by #3883
Closed

part-selects ($shift, $shiftx) create huge MUX-trees #3833

phsauter opened this issue Jul 6, 2023 · 25 comments · Fixed by #3883
Labels
pending-verification This issue is pending verification and/or reproduction

Comments

@phsauter
Copy link
Contributor

phsauter commented Jul 6, 2023

Version

Yosys 0.28 (git sha1 0d6f4b0, gcc 11.2.0 -fPIC -Os)

On which OS did this happen?

Linux

Reproduction Steps

shift_issue.zip

make all

Then check the reports in output and compare them.

Expected Behavior

The two modules should be about the same size after synthesis as they do the same thing.

Actual Behavior

The module using part-select is about twice as large as the other one.


Background

During synthesis of our (soon-to-be) open-source SoC Iguana, we noticed that indexed part-selects like array[i*33+:33] leads to a very large number of muxes during techmap.
A large amount of them get optimized away later on but:

  • During techmap the ballooned number of mux cells increases the memory footprint substantially
  • Not all of them get optimized away, increasing the total cell area significantly

In the worst modules this leads to a roughly 4x-ish larger module than what we would expect.

Example

Attached I have a rather simple example resulting from converting the register-interface demux module using sv2v.
The verilog file has two modules, they should be identical in function but in one I replaced the problematic part-select with another construct.
The original module has a roughly 2x larger area than the manually rewritten one.
(it should be noted that a normal invocation of techmap would produce even worse results but the way techmap is called right now aids in seeing the problem).

To generate the netlists and reports to see the problem, run:

make all

Per default this will download the IHP13-PDK, you may change the used .lib to any other, it should not matter.

What we think happens

As far as we can tell the problem seems to be an inherent problem of how $shift and $shiftx work. They work with a generic shift amount, by which the data can be shifted, it essentially looses the information about how the parts in a part-select are grouped.
Together with their implementation in techmap.v, using a logarithmic barrel shifter, seems to result in yosys building large mux trees and then failing to optimize them completely later on.

So in an expression like array[i*33+:33], I know that always 33 bit belong together and any other shift value than a multiple of 33 is impossible.
This information seems to be lost and it just shifts by a generic amount, implemented using log-shifters.
For POW2 grouped data (or to be more specific, if the set-bits are compact), this is not an issue as a lot of bits will be constant over all possible indices . For example for array[j*32+:32] the optimization steps will later see that bits 0-4 of j*32 (the shift amount) will always be constant and the corresponding log-shifter stages are also constant and can be replaced by wiring.

However, if we have something like array[i*33+:33] this optimization is no longer possible. Since the LSB of the multiplier is set, any of the following bits could change depending on the value of i. This means no log-shifter stages can be removed and we generate a significant overhead, as we use a much more powerful operation (a generic shift) to implement the part selection.

The provided example should show that during techmap of $shiftx and $shift a large amount of mux cells are created, some of them get optimized away but not all of them, leading to a significant area-overhead.

Possible Solutions

I think the cleanest solution would be to have a dedicated $select (or whatever it will be called) construct that represents a selection of a group of bits from an array of groups, this would explicitly implement this concept.
This would give a significant amount of flexibility how it will be mapped later on and for different targets.

Though it seems $mux is already multibit capable and it might be possible to use it instead to represent a part-select.
This would likely be a simpler fix but might create other problems for different targets.

@phsauter phsauter added the pending-verification This issue is pending verification and/or reproduction label Jul 6, 2023
@daglem
Copy link
Contributor

daglem commented Aug 2, 2023

For writes, it could be possible to expand on the work I did earlier to optimize access to multirange packed arrays (within structs).
This takes the relevant bit stride into account when generating the CASE block requested via the (* nowrshmsk *) attribute. This CASE block is a substitute for $shift/$shiftx.

I actually also experimented with a similar optimization for reads, using a new attribute (* nordshift *). In my case, this didn't yield any resource savings, however it is not impossible that an expanded version of this could be beneficial in your case. I have rebased the branch for this in case it could be helpful: https://github.com/daglem/yosys/tree/nordshift

@povik
Copy link
Member

povik commented Aug 2, 2023

However, if we have something like array[i*33+:33] this optimization is no longer possible. Since the LSB of the multiplier is set, any of the following bits could change depending on the value of i. This means no log-shifter stages can be removed and we generate a significant overhead, as we use a much more powerful operation (a generic shift) to implement the part selection.

There's a shiftmul rule in the peepopt pass that is supposed to address exactly that. When it sees a shift driven by a mul that effectively work as a group select, the rule will round up the shift stride to the next power-of-2 and rearrange the input to the shift appropriately, so that those optimizations you describe can take place. It only matches on right shifts, so it will only optimize part-selects working as muxes, not the demux you have in your sample as well.

The reason why that rule doesn't fire off in your case seems to be due an earlier opt_expr pass replacing the bottom bit of the multiplication result with zero since 34 is even, and this way interfering with the detection in peepopt. (Both opt_expr and peepopt are called by the coarse block in synth).

I think the cleanest solution would be to have a dedicated $select (or whatever it will be called) construct that represents a selection of a group of bits from an array of groups, this would explicitly implement this concept.
This would give a significant amount of flexibility how it will be mapped later on and for different targets.

FWIW there's the $bmux cell which I think is exactly that. Maybe that's what the part-select in your sample would synthesize to through the Verific frontend? Who knows.

@phsauter
Copy link
Contributor Author

phsauter commented Aug 2, 2023

For writes, it could be possible to expand on the work I did earlier to optimize access to multirange packed arrays (within structs). This takes the relevant bit stride into account when generating the CASE block requested via the (* nowrshmsk *) attribute. This CASE block is a substitute for $shift/$shiftx.
Doy you by chance have a simple example for this?

We actually tried this before but didn't get (* nowrshmsk *) to do anything but its possible we just didn't use it properly.

@daglem
Copy link
Contributor

daglem commented Aug 2, 2023

For writes, it could be possible to expand on the work I did earlier to optimize access to multirange packed arrays (within structs). This takes the relevant bit stride into account when generating the CASE block requested via the (* nowrshmsk *) attribute. This CASE block is a substitute for $shift/$shiftx.

You could actually try out (* nowrshmsk *) without making any changes. Even though this is currently not ideal in that it will create an enormous CASE, it could just be that this will be properly optimized. When I implemented (* nowrshmsk *) for packed arrays within structs, the stride functionality only served to generate a smaller/nicer AST, it didn't really make any difference on the final resource usage.

I.e. (* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;

@phsauter
Copy link
Contributor Author

phsauter commented Aug 2, 2023

There's a shiftmul rule in the peepopt pass...

Damn, pmgen looks pretty damn complex, I am not quite sure what I am looking at there.
This is indeed extremely close to what I am looking for.

The reason why that rule doesn't fire off in your case seems to be due an earlier opt_expr pass replacing the bottom bit of the multiplication result with zero since 34 is even, and this way interfering with the detection in peepopt. (Both opt_expr and peepopt are called by the coarse block in synth).

I just played around with this for a bit and I think I found out why it didn't do anything in our case:

  1. making the bottom bit constant seems to be fine, at least I see no indication in the code why this shouldn't work
  2. The $mul needs to be connected directly to the $shiftx. In our case it often has an offset (ie something like data[3+33*index+:5] which can be read as "from packs of 33bits, select a pack, then read starting from bit pos 3, read 5 consecutive bits")
    Bonus: this also means you must run opt_expr before it since the initial mapping of a part select will always have an $add cell for the offset (without offset its just +0 but its still there). opt_expr will optimize this cell away.
  3. It doesn't like slicing downwards (data[33*index-:5]; even if you make sure index is >0)
  4. It has a seemingly arbitrary maximum shift-size of 20. Everything above will not be optimized this way (if (GetSize(shamt) > 20) reject;)
  5. Running wreduce before it is absolutely necessary, this was already the case but still worth mentioning. Otherwise the $mul will use the default width of the index and it again rejects the optimization.

So in conclusion, this should solve the simplest cases of this problem but anything moderately complex will be left as-is.

Still good to know it exists to help me with the pass I am currently working on. In the end it might be worth to replace this simple implementation with a more holistic optimization, though that means I need to understand what happens in pmgen, which I am not fond of xD

@phsauter
Copy link
Contributor Author

phsauter commented Aug 2, 2023

You could actually try out (* nowrshmsk *) without making any changes. Even though this is currently not ideal in that it will create an enormous CASE, it could just be that this will be properly optimized. When I implemented (* nowrshmsk *) for packed arrays within structs, the stride functionality only served to generate a smaller/nicer AST, it didn't really make any difference on the final resource usage.

I.e. (* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;

Quality-of-Results seems to be a lot better, not quite as good as my manual fix but pretty close to it.
What worries me is that CPU time increases to 18x and memory footprint to 10x of the initial run.
Since our memory footprint without a fix is already >300GB, I don't see us running this on the large modules.
Still, it shows that the general approach of replacing $shift/$shiftx cells is a good one and I can likely use it to get better estimates for the expected QoR with a fix.

@povik
Copy link
Member

povik commented Aug 2, 2023

@phsauter

I happen to be writing some patches for the shiftmul matcher, I sure invite you to take a look and provide feedback once it's posted.

just played around with this for a bit and I think I found out why it didn't do anything in our case:

  1. making the bottom bit constant seems to be fine, at least I see no indication in the code why this shouldn't work

It's the index <SigSpec> port(mul, \Y) === shamt instruction that restricts it. It means the shift amount in shamt must match what's on the output port of the $mul cell exactly (modulo wire aliases, since those are dealt with in the port() helper, and zero-padding from above, since that's explicitly stripped some lines above).

  1. The $mul needs to be connected directly to the $shiftx. In our case it often has an offset (ie something like data[3+33*index+:5] which can be read as "from packs of 33bits, select a pack, then read starting from bit pos 3, read 5 consecutive bits")
    Bonus: this also means you must run opt_expr before it since the initial mapping of a part select will always have an $add cell for the offset (without offset its just +0 but its still there). opt_expr will optimize this cell away.

If there isn't a pass already that deals with the constant offset in the shift and implements it as two consecutive shifts maybe we can add that as a separate pattern to peepopt, once that pattern is applied the regular shiftmul can deal with what's left.

  1. It has a seemingly arbitrary maximum shift-size of 20. Everything above will not be optimized this way (if (GetSize(shamt) > 20) reject;)

That's restriction on the bitsize of the shift amount, so it restricts the shift to 2**20.

@phsauter
Copy link
Contributor Author

phsauter commented Aug 2, 2023

Good to know that someone with more experience is also working on this :-)

Sorry for not seeing the GetSize(), that was stupid.

Below you can find what I have been working on so far, feel free to take whatever you want from it, though I doubt its 'good code' since I am still getting my grasp on the code-base.
As a note, it currently doesn't actually do anything since the final replacement is still missing but it does identify the cells and the necessary signals, parameters and so on.
https://github.com/phsauter/yosys/tree/opt-part-select

@povik
Copy link
Member

povik commented Aug 2, 2023

@phsauter No worries! See #3873

@daglem
Copy link
Contributor

daglem commented Aug 3, 2023

Quality-of-Results seems to be a lot better, not quite as good as my manual fix but pretty close to it. What worries me is that CPU time increases to 18x and memory footprint to 10x of the initial run. Since our memory footprint without a fix is already >300GB, I don't see us running this on the large modules. Still, it shows that the general approach of replacing $shift/$shiftx cells is a good one and I can likely use it to get better estimates for the expected QoR with a fix.

You could test something like this, still without changing anything in Yosys:

	typedef struct packed {
	    logic [NoPorts][117] port;
	} out_req_t;
	...
	(* nowrshmsk *) output out_req_t out_req_o;
	...
		out_req_o.port[in_select_i] = in_req_i;

Unfortunately packed multirange arrays are currently only available within packed structs in Yosys - otherwise this code could be further simplified.

@phsauter
Copy link
Contributor Author

phsauter commented Aug 3, 2023

You could test something like this, still without changing anything in Yosys ...

Maybe I am still doing it wrong but if I do that, it just reduces everything to one single out_req_o wire instead of it being a 117bit 1:13 demultiplexer.

@daglem
Copy link
Contributor

daglem commented Aug 3, 2023

You could test something like this, still without changing anything in Yosys ...

Maybe I am still doing it wrong but if I do that, it just reduces everything to one single out_req_o wire instead of it being a 117bit 1:13 demultiplexer.

Doesn't this wire have wiretype out_req_t? It looks plausible to me when I run yosys -p 'read_verilog -sv -dump_ast2 reg_demux.v' - the output contains the appropriate number of AST_CASE / AST_COND. Here is the complete modified module for reference:

module reg_demux (
	clk_i,
	rst_ni,
	in_select_i,
	in_req_i,
	in_rsp_o,
	out_req_o,
	out_rsp_i
);
	parameter [31:0] NoPorts = 32'd13;
	typedef struct packed {
	    logic [NoPorts][117] port;
	} out_req_t;
	input wire clk_i;
	input wire rst_ni;
	input wire [3:0] in_select_i;
	input wire [116:0] in_req_i;
	output reg [33:0] in_rsp_o;
//	output reg [(NoPorts * 117) - 1:0] out_req_o;
	(* nowrshmsk *) output out_req_t out_req_o;
	input wire [(NoPorts * 34) - 1:0] out_rsp_i;
	always @(*) begin
		out_req_o = 1'sb0;
		in_rsp_o = 1'sb0;
		
//		out_req_o[in_select_i * 117+:117] = in_req_i;
		out_req_o.port[in_select_i] = in_req_i;
		in_rsp_o = out_rsp_i[in_select_i * 34+:34];
	end
endmodule

@phsauter
Copy link
Contributor Author

phsauter commented Aug 3, 2023

In the final netlist I am getting:

  (* wiretype = "\\out_req_t" *)
  output out_req_o;
  wire out_req_o;

It notes the wiretype but the size is still only one bit. Is there some additional command I need to run after read_verilog?

@daglem
Copy link
Contributor

daglem commented Aug 3, 2023

Heh, sorry, I have no idea what is going on there. Could the netlist be wrong (do you use it for anything later)?
Hopefully someone else can explain.

@thommythomaso
Copy link

You could test something like this, still without changing anything in Yosys:

A synthesizer should be able to handle any code construct efficiently, including the ones mentioned in the issue above. Having to engineer sourcecode and adding pragmas to obtain good results is a non-sustainable solution.

@daglem
Copy link
Contributor

daglem commented Aug 3, 2023

You could test something like this, still without changing anything in Yosys:

A synthesizer should be able to handle any code construct efficiently, including the ones mentioned in the issue above. Having to engineer sourcecode and adding pragmas to obtain good results is a non-sustainable solution.

Welcome to reality!

@daglem
Copy link
Contributor

daglem commented Aug 3, 2023

In the final netlist I am getting:

  (* wiretype = "\\out_req_t" *)
  output out_req_o;
  wire out_req_o;

It notes the wiretype but the size is still only one bit. Is there some additional command I need to run after read_verilog?

Evidently I've been on vacation for too long. Change the typedef to this, and things should hopefully start working:

	typedef struct packed {
	    logic [NoPorts-1:0][117-1:0] port;
	} out_req_t;

@phsauter
Copy link
Contributor Author

phsauter commented Aug 4, 2023

Evidently I've been on vacation for too long. Change the typedef to this, and things should hopefully start working:

And I shouldn't just copy stuff blindly ^^

Sadly if I do this I now get the AST_CASE and AST_COND as I would expect but they only seem to checks whether in_select_i is equal to zero. Meaning it probably doesn't like multi-dimensional packed arrays.

If I change it to:

typedef struct packed {
	    logic [117-1:0] port[NoPorts-1:0];
	} out_req_t;

This then does actually give me the correct AST and initial RTLIL (even though it is kinda scary since now it isn't packed anymore).
But then somewhere inside synth -run begin:fine (ie in coarse synthesis) it converts it back to a shift.

The peepopt shiftmul approach looks pretty promising though so unless you are curious, I don't think its necessary to debug the nowrshmsk approach.

For documentation purposes, here is a zip file with the initial, manual fix and nowrshmsk approaches:
shift_issue2.zip

@daglem
Copy link
Contributor

daglem commented Aug 4, 2023

Sadly if I do this I now get the AST_CASE and AST_COND as I would expect but they only seem to checks whether in_select_i is equal to zero. Meaning it probably doesn't like multi-dimensional packed arrays.

I'm using (* nowrshmsk *) for multidimensional packed arrays in https://github.com/daglem/reDIP-SID to good effect, so it should work. I don't see anything wrong in the AST offhand - e.g. the output of yosys -p 'read_verilog -sv -dump_vlog2 reg_demux.v' looks like this:

...
          case ((in_select_i)*(117))
            0:
              (* wiretype = /** AST_STRUCT_ITEM **/ *)out_req_o[1520:0][116:0] = in_req_i;
            117:
              (* wiretype = /** AST_STRUCT_ITEM **/ *)out_req_o[1520:0][233:117] = in_req_i;
            234:
              (* wiretype = /** AST_STRUCT_ITEM **/ *)out_req_o[1520:0][350:234] = in_req_i;
...

I'd better look into avoiding the multiplication with a non-power-of-two number, but it should still work.

If I change it to:

typedef struct packed {
	    logic [117-1:0] port[NoPorts-1:0];
	} out_req_t;

This then does actually give me the correct AST and initial RTLIL (even though it is kinda scary since now it isn't packed anymore). But then somewhere inside synth -run begin:fine (ie in coarse synthesis) it converts it back to a shift.

Well, unpacked arrays within packed structs are not allowed by the LRM (it's a Yosys extension which should probably be removed), so I wouldn't use that in any case.

The peepopt shiftmul approach looks pretty promising though so unless you are curious, I don't think its necessary to debug the nowrshmsk approach.

I'd actually be very interested in having nowrshmsk work in your case, comparing it to the shift approach. Could you please have another look, and show how you get an incorrect AST?

@daglem
Copy link
Contributor

daglem commented Aug 4, 2023

PR #3875 gets rid of the multiplication by 117 in the (* nowrshmsk *) AST_CASE generated from reg_demux.v.

@daglem
Copy link
Contributor

daglem commented Aug 5, 2023

I reworked #3875 so that the optimizations for two-dimensional packed ranges also covers variable slices on the form dst[i*w +: w] = src, i.e. you should only have to make the following change to generate an optimized AST:

(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;

Does this work for you?

@phsauter
Copy link
Contributor Author

phsauter commented Aug 5, 2023

This weekend I don‘t have access to a computer, I will try it out on monday.

But you should be able to check as well by downloading the zip file and running make all and then checking the RTLIL/netlist.

@daglem
Copy link
Contributor

daglem commented Aug 5, 2023

This weekend I don‘t have access to a computer, I will try it out on monday.

Please do! I also made PR #3877 so that indexing of rvalues can be similarily optimized as well.

After adding both #3875 and #3877, you can do:

        (* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
        (* nordshift *) input wire [(NoPorts * 34) - 1:0] out_rsp_i;

But you should be able to check as well by downloading the zip file and running make all and then checking the RTLIL/netlist.

I'll have to leave it to you to check for correctness - perhaps the result is too good to be true?

$ cat output/*.rpt | grep "Chip area"
   Chip area for module '\reg_demux_noshift': 21636.720000
   Chip area for module '\reg_demux_pow2': 37214.742600
   Chip area for module '\reg_demux': 74589.114600

@phsauter
Copy link
Contributor Author

phsauter commented Aug 7, 2023

Sadly if I do this I now get the AST_CASE and AST_COND as I would expect but they only seem to checks whether in_select_i is equal to zero. Meaning it probably doesn't like multi-dimensional packed arrays.

Okay this issue was fixed by using a newer version of yosys (before it was 0.27 or 0.28, can't remember).
This gives me a correct solution with 32'700um^2 compared to the 37'200um^2 of the pow2 solution. This is roughly what I would expect, even a bit better.

If I apply your PRs and use:

module reg_demux_nowrshmsk_mdim_arr
// ...
(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
// ...

This results in an area of 50'000um^2, though for some reason now even the basic version is slightly worse (above 66k).
Same result for the typedef solution, a significant increase compared to the version without the PRs.

using:

module reg_demux_noshift
// ...
(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
(* nordshift *) input wire [(NoPorts * 34) - 1:0] out_rsp_i;
// ...

I also get 19k, this is both in-line with expectations from our commercial tools and it is also functionally equivalent, meaning this is a significant improvement over other solutions and should be seen as the gold standard.

Maybe we should move the following discussion over to #3875:
If I apply both PRs I get a base of 66k, with just PR #3875 its even higher at 75k and both the typedef and strided-array access are lower at 50k but only compared to this worse base, compared to typedef and yosys 0.32 which achieves 32k this is too high.
So I think there may be some bug in this PR.

TL;DR: Using both attributes generates fantastic results, PR #3875 diminishes the results from nowrshmsk alone, my issue was likely an old yosys version.

@daglem
Copy link
Contributor

daglem commented Aug 8, 2023

Applying both #3875 and #3877 should now yield something like this:

   Chip area for module '\reg_demux_noshift': 18548.573400
   Chip area for module '\reg_demux_nowrshmsk_mdim_arr': 37543.224600
   Chip area for module '\reg_demux_nowrshmsk_typedef': 37543.224600
   Chip area for module '\reg_demux_pow2': 37214.742600
   Chip area for module '\reg_demux': 71352.640800

Later optimization passes can yield different results depending on unrelated changes in the input file (and the phase of the moon, for all I know), so results will vary somewhat.
I've disabled the removal of multiplication for (* nowrshmsk *), since this for some reason only seemed to yield good results when combined with (* nordshift *).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pending-verification This issue is pending verification and/or reproduction
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants