part-selects ($shift, $shiftx) create huge MUX-trees #3833
Comments
For writes, it could be possible to expand on the work I did earlier to optimize access to multirange packed arrays (within structs). I actually also experimented with a similar optimization for reads, using a new attribute |
There's a The reason why that rule doesn't fire off in your case seems to be due to an earlier
FWIW there's the |
We actually tried this before but didn't get |
You could actually try out I.e. |
Damn,
I just played around with this for a bit and I think I found out why it didn't do anything in our case:
So in conclusion, this should solve the simplest cases of this problem, but anything moderately complex will be left as-is. Still good to know it exists to help me with the pass I am currently working on. In the end it might be worth replacing this simple implementation with a more holistic optimization, though that means I need to understand what happens in |
Quality-of-Results seems to be a lot better, not quite as good as my manual fix but pretty close to it. |
I happen to be writing some patches for the shiftmul matcher, I sure invite you to take a look and provide feedback once it's posted.
It's the
If there isn't a pass already that deals with the constant offset in the shift and implements it as two consecutive shifts maybe we can add that as a separate pattern to
That's a restriction on the bit size of the shift amount, so it restricts the shift to |
Good to know that someone with more experience is also working on this :-) Sorry for not seeing the Below you can find what I have been working on so far, feel free to take whatever you want from it, though I doubt it's 'good code' since I am still getting to grips with the code base. |
You could test something like this, still without changing anything in Yosys:
Unfortunately packed multirange arrays are currently only available within packed structs in Yosys - otherwise this code could be further simplified. |
Maybe I am still doing it wrong but if I do that, it just reduces everything to one single |
Doesn't this wire have wiretype out_req_t? It looks plausible to me when I run
|
In the final netlist I am getting:
(* wiretype = "\\out_req_t" *)
output out_req_o;
wire out_req_o;
It notes the wiretype but the size is still only one bit. Is there some additional command I need to run after |
Heh, sorry, I have no idea what is going on there. Could the netlist be wrong (do you use it for anything later)? |
A synthesizer should be able to handle any code construct efficiently, including the ones mentioned in the issue above. Having to engineer source code and add pragmas to obtain good results is not a sustainable solution. |
Welcome to reality! |
Evidently I've been on vacation for too long. Change the typedef to this, and things should hopefully start working:
|
And I shouldn't just copy stuff blindly ^^ Sadly if I do this I now get the If I change it to:
typedef struct packed {
    logic [117-1:0] port[NoPorts-1:0];
} out_req_t;
This then does actually give me the correct AST and initial RTLIL (even though it is kinda scary since now it isn't packed anymore). The For documentation purposes, here is a zip file with the initial, manual fix and nowrshmsk approaches: |
I'm using
I'd better look into avoiding the multiplication with a non-power-of-two number, but it should still work.
Well, unpacked arrays within packed structs are not allowed by the LRM (it's a Yosys extension which should probably be removed), so I wouldn't use that in any case.
I'd actually be very interested in having |
PR #3875 gets rid of the multiplication by 117 in the |
I reworked #3875 so that the optimizations for two-dimensional packed ranges also cover variable slices of the form dst[i*w +: w] = src, i.e. you should only have to make the following change to generate an optimized AST:
Does this work for you? |
This weekend I don't have access to a computer; I will try it out on Monday. But you should be able to check as well by downloading the zip file and running |
Please do! I also made PR #3877 so that indexing of rvalues can be optimized in a similar way. After adding both #3875 and #3877, you can do:
I'll have to leave it to you to check for correctness - perhaps the result is too good to be true?
|
Okay, this issue was fixed by using a newer version of yosys (before it was 0.27 or 0.28, can't remember). If I apply your PRs and use:
module reg_demux_nowrshmsk_mdim_arr
// ...
(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
// ...
This results in an area of 50'000 um^2, though for some reason now even the basic version is slightly worse (above 66k). Using:
module reg_demux_noshift
// ...
(* nowrshmsk *) output reg [(NoPorts * 117) - 1:0] out_req_o;
(* nordshift *) input wire [(NoPorts * 34) - 1:0] out_rsp_i;
// ...
I also get 19k. This is both in line with expectations from our commercial tools and functionally equivalent, meaning this is a significant improvement over other solutions and should be seen as the gold standard. Maybe we should move the following discussion over to #3875: TL;DR: Using both attributes generates fantastic results, PR #3875 diminishes the results from |
Applying both #3875 and #3877 should now yield something like this:
Later optimization passes can yield different results depending on unrelated changes in the input file (and the phase of the moon, for all I know), so results will vary somewhat. |
Version
Yosys 0.28 (git sha1 0d6f4b0, gcc 11.2.0 -fPIC -Os)
On which OS did this happen?
Linux
Reproduction Steps
shift_issue.zip
Then check the reports in output and compare them.
Expected Behavior
The two modules should be about the same size after synthesis as they do the same thing.
Actual Behavior
The module using part-select is about twice as large as the other one.
Background
During synthesis of our (soon-to-be) open-source SoC Iguana, we noticed that indexed part-selects like array[i*33+:33] lead to a very large number of muxes during techmap.
A large number of them get optimized away later on but:
In the worst modules this leads to a roughly 4x larger module than what we would expect.
Example
Attached I have a rather simple example resulting from converting the register-interface demux module using sv2v.
The Verilog file has two modules; they should be identical in function, but in one I replaced the problematic part-select with another construct.
The original module has a roughly 2x larger area than the manually rewritten one.
(It should be noted that a normal invocation of techmap would produce even worse results, but the way techmap is called right now aids in seeing the problem.)
To generate the netlists and reports to see the problem, run:
By default this will download the IHP13-PDK; you may change the .lib used to any other, it should not matter.
What we think happens
As far as we can tell, the problem seems to be inherent in how $shift and $shiftx work. They work with a generic shift amount by which the data can be shifted, which essentially loses the information about how the parts in a part-select are grouped.
Together with their implementation in techmap.v, which uses a logarithmic barrel shifter, this seems to result in yosys building large mux trees and then failing to optimize them completely later on.
So in an expression like array[i*33+:33], I know that 33 bits always belong together and that any shift value other than a multiple of 33 is impossible.
This information seems to be lost and it just shifts by a generic amount, implemented using log-shifters.
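To put rough numbers on this, the sketch below enumerates the offsets i*stride that a part-select can actually produce and reports which bits of the resulting shift amount ever change. It is only a Python model of the arithmetic, not Yosys code; the function name offset_bits and the choice of 32 array entries are assumptions made for illustration and are not taken from the attached example.

```python
def offset_bits(stride: int, entries: int) -> tuple[int, list[int]]:
    """Return (width of the shift amount, bit positions that ever toggle)
    for the offsets i * stride with i in range(entries)."""
    offsets = [i * stride for i in range(entries)]
    # Number of bits needed to represent the largest offset.
    width = max(offsets).bit_length()
    # A bit "toggles" if it takes both values 0 and 1 across the offsets.
    toggling = [
        b for b in range(width)
        if len({(off >> b) & 1 for off in offsets}) > 1
    ]
    return width, toggling


if __name__ == "__main__":
    # Hypothetical array with 32 entries, once with a power-of-two stride
    # and once with the stride of 33 discussed in this issue.
    for stride in (32, 33):
        width, toggling = offset_bits(stride, entries=32)
        print(f"stride {stride}: {width}-bit shift amount, "
              f"toggling bits {toggling}")
```

Under these assumed sizes, the stride of 32 leaves bits 0 to 4 of the 10-bit shift amount constant, while the stride of 33 toggles all ten bits. If each shift-amount bit drives one log-shifter stage, as described here, every toggling bit is a stage that cannot be reduced to plain wiring.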
For POW2 grouped data (or to be more specific, if the set bits are compact), this is not an issue, as a lot of bits will be constant over all possible indices. For example, for array[j*32+:32] the optimization steps will later see that bits 0-4 of j*32 (the shift amount) are always constant, so the corresponding log-shifter stages are also constant and can be replaced by wiring.
However, if we have something like array[i*33+:33], this optimization is no longer possible. Since the LSB of the multiplier is set, any of the following bits could change depending on the value of i. This means no log-shifter stages can be removed and we generate a significant overhead, as we use a much more powerful operation (a generic shift) to implement the part selection.
The provided example should show that during techmap of $shiftx and $shift a large number of mux cells are created; some of them get optimized away but not all of them, leading to a significant area overhead.
Possible Solutions
I think the cleanest solution would be to have a dedicated $select (or whatever it will be called) construct that represents a selection of a group of bits from an array of groups; this would explicitly implement the concept.
This would give a significant amount of flexibility in how it is mapped later on and for different targets.
Though it seems $mux is already multibit capable, and it might be possible to use it instead to represent a part-select.
This would likely be a simpler fix but might create other problems for different targets.
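As a behavioural sketch of the idea, the two Python functions below model the same part-select in both ways: once as a generic shift followed by a mask, and once as a selection among pre-sliced groups. The $select cell is only a proposal in this issue and does not exist; the function names, the 33-bit width and the 8 entries are assumptions chosen for illustration, and nothing here reflects how Yosys would implement either variant.

```python
def select_by_shift(array: int, index: int, width: int) -> int:
    """Part-select modelled as a generic right shift plus mask.
    The shift amount index * width may be any value, so the grouping
    into width-bit entries is not visible in the operation itself."""
    return (array >> (index * width)) & ((1 << width) - 1)


def select_by_group(array: int, index: int, width: int, entries: int) -> int:
    """Part-select modelled as a choice among fixed groups.
    The slices use constant offsets (pure wiring in hardware); only the
    final choice depends on index, like one wide entries-to-1 mux."""
    mask = (1 << width) - 1
    groups = [(array >> (k * width)) & mask for k in range(entries)]
    return groups[index]


if __name__ == "__main__":
    width, entries = 33, 8
    # Hypothetical test pattern: entry k holds the value k + 1.
    array = sum((k + 1) << (k * width) for k in range(entries))
    for i in range(entries):
        assert select_by_shift(array, i, width) == \
            select_by_group(array, i, width, entries)
    print("both models agree for all", entries, "indices")
```

Both models return the same value for every valid index; the difference is only in what information the operation carries. The group-based form makes the entry boundaries explicit, which is the property the proposed construct is meant to preserve.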