Optimization of nowrshmsk #3875

daglem · 2023-08-04T21:48:02Z

Generation of minimal AST_CASE for variable slices on the form dst[i*stride +: width] = src.

Corrections of corner cases, now correctly tested by tests/verilog/dynamic_range_lhs.v

phsauter · 2023-08-07T11:56:15Z

This PR seems to reduce the Quality-of-Results for nowrshmsk, the attached zip-file can serve as a base to investigate the issue.

shift_issue3.zip

yosys 0.32:

reg_demux: area: 72k; normalized runtime: 1
reg_demux_pow2: area: 37k; normalized runtime: 0.25
reg_demux_nowrshmsk_typedef: area: 33k; normalized runtime: 0.25
(reg_demux_nowrshmsk_mdim_arr: area: 36k; normalized runtime: 17.5)
mdim_arr only added for completeness, yosys 0.32 can't handle it properly

PR #3875:

reg_demux: area: 75k; normalized runtime: 0.9
reg_demux_pow2: area: 37k; normalized runtime: 0.13
reg_demux_nowrshmsk_typedef: area: 51k; normalized runtime: 0.15
reg_demux_nowrshmsk_mdim_arr: area: 51k; normalized runtime: 0.15

daglem · 2023-08-08T07:19:31Z

Good find!

The resource increase seems to be caused by the simplification of non_opt_range in verilog_parser.y, affecting reads from out_rsp_i. Perhaps some optimization pass relies on the exact pattern (@selfsz@((index)*(width)))+(0)) for the range LSB?

I'll see if I can avoid changing non_opt_range, instead extracting stride from the original AST.

BTW please don't use unpacked arrays within packed structs; this is not allowed by the LRM. The following (correct) typedef is functionally equivalent:

	typedef struct packed {
	    logic [12:0][116:0] port;
	} out_req_t;

phsauter · 2023-08-08T09:37:08Z

If you look at the reports in the attached zip, you can see that your solution is actually smaller right up to running abc, even if you look at the rtlil and verilog dumps during the synthesis process, they are usually almost identical.

So I see three possibilities:

The optimization potential cannot be detected anymore by abc (its likely that different abc scripts would then find it
The abc call does not pass the circuit the same way, maybe because of the attribute
The optimization potential is actually lost for some reason (I think thats unlikely)

Sadly I am somewhat out of time for the next 2-3 weeks (I will still try to work on it, just less).

shift_issue_pr3875.zip

daglem · 2023-08-08T10:47:14Z

If you look at the reports in the attached zip, you can see that your solution is actually smaller right up to running abc, even if you look at the rtlil and verilog dumps during the synthesis process, they are usually almost identical.

Yeah. I found that the real (current) optimization was always the simplification of the AST_CASE using stride, seemingly the removal of multiplication of index by stride will in certain scenarios only do harm ATM (I guess some optimization pass removes the multiplication in any case).

As a pragmatic measure I've disabled the removal of the multiplication for now, and at the same time I generalized the AST_CASE simplification to handle variable slices on the form dst[i*stride +: width] = src

Sadly I am somewhat out of time for the next 2-3 weeks (I will still try to work on it, just less).

If you like, you can probably use this PR together with #3877 right away for testing of the SoC design. When you continue working on optimization of the shift approach, the results from nowrshmsk + nordshift can hopefully serve as a comparison baseline.

phsauter · 2023-08-08T12:10:44Z

With disabled mult optimization (last commit):

reg_demux: area: 72k; normalized runtime: 1.06
reg_demux_pow2: area: 37k; normalized runtime: 0.15
reg_demux_nowrshmsk_typedef: area: 33k; normalized runtime: 0.18
reg_demux_nowrshmsk_mdim_arr: area: 33k; normalized runtime: 0.19

I can confirm that this solves the QoR regression on the given example.
I think it would still be interesting to explore why this happens but for now this PR looks fine and is a definite improvement overall.

And I am absolutely going to use this + PR #3877 as a golden reference for the other approach in #3873 and #3880.
This evening I am going to prep a larger test derived from the problematic SoC module for this purpose.

daglem · 2023-11-15T21:43:40Z

@povik I did a cleanup by squashing the commits.

As a torture test I've also run make test after temporarily setting bool use_case_method = true; in simplify.cc.

This resulted in two failures:

tests/svtypes/struct_dynamic_range.ys
- expected, since this test explicitly checks counts of mul, shift, and shiftx cells.
tests/various/dynamic_part_select/forloop_select.v
- inexplicably (to me) fails to prove equivalence with forloop_select_gate.v

Running ../../yosys -p "read_verilog -dump_vlog2 ./dynamic_part_select/forloop_select.v" yields code which AFAICT should yield identical results to forloop_select_gate.v, considering the constraints implied by the stride in each case. Setting stride = 1 in simplify.cc makes the test pass, but I am at a loss as to what the logical difference could be.

To confuse things further, the failure summary of sat in dynamic_part_select.log seems to show impossible combinations of ctrl, din, and dout for both gold and gate?!?

Do you have any idea what can be wrong here, either in the generated optimized CASE, or in the equivalence test?

daglem · 2023-11-16T06:06:04Z

The only hypothesis I've been able to come up with regarding the failure for tests/various/dynamic_part_select/forloop_select.v is that ctrl*sel in dout[(ctrl*sel)+:SLICE] <= din; can overflow, yielding an index which is not covered by the generated CASE.

Given that ctrl and sel are 4 and 5 bits, respectively, e.g. 3*15 = 45 (6 bits) would assumedly, due to SystemVerilog expression size determination rules, be truncated to 5 bits, yielding 13, while the generated CASE only covers 0 and 15.

povik · 2023-11-20T14:36:53Z

frontends/ast/simplify.cc

 				int start_bit = source_offset + i;
-				int end_bit = std::min(start_bit+result_width,source_width) - 1;


I think the earlier code here was confused since start_bit+result_width had the source_offset included but was compared to source_width which had not. So this led to incorrect behavior on wires with non-zero source_offset, right?

Btw shouldn't we rename all those source_ variables? It's the target of the assignment, so it's confusing.

I think the earlier code here was confused since start_bit+result_width had the source_offset included but was compared to source_width which had not. So this led to incorrect behavior on wires with non-zero source_offset, right?

Yes.

Btw shouldn't we rename all those source_ variables? It's the target of the assignment, so it's confusing.

I agree! I've been thinking about changing that, however I didn't dare to change the "source_" naming which originates with source_widthin the shift and mask code.

Should I change all source_offset and source_width to e.g. dst_offset and dst_width?

Yes, go ahead. I would suggest wire_offset and wire_width, but really anything will be an improvement it seems.

OK, changed to wire_offset and wire_width now.

povik · 2023-11-20T14:46:30Z

frontends/ast/simplify.cc

+				if (start_bit%bitno_div != 0 || (stride == 0 && start_bit != 0))
+					continue;
+
+				AstNode *cond = new AstNode(AST_COND, mkconst_int(start_bit, case_sign_hint, case_width_hint));


So here by forcing the sign and width of shift_expr on the start_bit constant in the case, can we make it overflow? Either way what's the purpose?

Yes, it can overflow. Not considering this can lead to incorrect behavior, as I found here: #3875 (comment)

OK, so we consider the overflow in figuring out the reachable values of shift_expr, but shouldn't we then take the overflown value for the bit-select on the target wire?

We do, by checking whether it is possible to generate the current index by a possibly overflown shift_expr (start_bit%gcd(stride, shift_mod) == 0).

I think I see now. So we are not relying on mkconst_int(start_bit, case_sign_hint, case_width_hint) wrapping around the constant since at that point it should never be outside the [min_offset,max_offset] range. If that's the case I suggest we have it as mkconst_int(start_bit, true) so it's clearer.

That would unfortunately make tests/verilog/dynamic_range_lhs.v fail again.
The width is required, since the bit pattern for a negative number depends on the width.

E.g. here is the diff between the working and non-working generated CASE in the first failure in the test above, generated with ../../yosys -DALT=1 -DSPAN=1 -DLEFT=-3 -DRIGHT=0 -p "read_verilog -dump_vlog2 ./dynamic_range_lhs.v":

--- case-ok.txt 2023-11-20 21:24:03.486296307 +0100 +++ case-not-ok.txt 2023-11-20 21:25:05.460245527 +0100 @@ -46,15 +46,15 @@ 1: begin case ((sel1)*(sel2)) - 2'b 00: + 0: out_u[0:0] = data; endcase case (/** AST_TO_SIGNED **/) - 2'b 10: + -2: out_s[-2:-2] = data; - 2'b 11: + -1: out_s[-1:-1] = data; - 2'b 00: + 0: out_s[0:0] = data; endcase end

daglem · 2023-11-20T15:09:51Z

All tests pass now with a forced use_case_method = true, except for the expected failure of tests/svtypes/struct_dynamic_range.ys

The possible overflow of i*stride required a different take on this in order to optimize the generated CASE. Lots of hours for a seemingly small change, as always 🙈

In the end I had to add an internal gcd function to make the CI run - sorry about all the force pushes.

IMO this is ready to be merged now, with a possible change of source_ to dst_ or similar.

povik · 2023-11-20T15:14:42Z

In the end I had to add an internal gcd function to make the CI run - sorry about all the force pushes.

No worries!

povik · 2023-11-20T17:40:17Z

frontends/ast/simplify.cc

+			long long shift_mod = 1ll << (case_width_hint - case_sign_hint);
+			// std::gcd requires C++17
+			// long long bitno_div = std::gcd(stride, shift_mod);
+			long long bitno_div = gcd((long long)stride, shift_mod);


I feel like this undoes much of the optimisation making bitno_div only be power-of-2, not that I don't see why it's done to address the overflown cases.

Well, it's not ideal, although power-of-2 will perfectly optimize common (most?) cases (e.g. gcd(8, 256) = 8), and partially optimize others (e.g. gcd(12, 256) = 4).

Since we know the value of stride, it might be possible to identify cases where the index expression cannot possibly overflow, and fully optimize those cases. I'll look into it.

How about now? 😅

All tests still pass with a forced use_case_method = true, and the results from running the tests in shift_issue3.zip above are:

Chip area for module '\reg_demux_nowrshmsk_mdim_arr': 19213.891200 Chip area for module '\reg_demux_nowrshmsk_typedef': 19115.271000 Chip area for module '\reg_demux_pow2': 5513.961600 Chip area for module '\reg_demux': 21857.661000

As far as I can tell, the PR now only brings improvements, without any regressions.

povik · 2023-12-07T12:47:38Z

@daglem Sorry for the protracted review. I should get to it soon.

povik · 2023-12-11T20:22:01Z

frontends/ast/ast.h

@@ -321,6 +321,9 @@ namespace AST
 		static AstNode *mkconst_str(const std::vector<RTLIL::State> &v);
 		static AstNode *mkconst_str(const std::string &str);

+		// helper function to create an AST node for a temporary register
+		AstNode *mktemp_logic(const std::string &name, AstNode *mod, bool nosync, int range_left, int range_right, bool is_signed);


This should be static

It can't be static, as long as it's encoding the filename and location in wire->str.

Is that a problem? If so, I guess I could make a separate (non-static) function to generate the name, and pass the result of that into mktemp_logic, but I'm not sure I see the benefit.

Ah sorry, missed that. No, it's not a problem per se

FYI I've also thought about enhancing the function later, adding a parameter to decide whether the name should be local (no name decoration) or global (name decorated as it is now).

I believe the function could then be used to replace most of the current code to instantiate temporaries, for which I could make a PR.

@povik As promised did some more tests with shift_issue3.zip (including reg_demux_noshift in the Makefile). I applied #3877 to reproduce increased chip area when an unconditional temporary was introduced for the rhs.

Without any changes (no temporary in this case):
Chip area for module '\reg_demux_noshift': 18689.907600
Forcing use of the temporary, which is currently assigned by AST_ASSIGN_EQ in the current block, instantiating a (* nosync *) register:
Chip area for module '\reg_demux_noshift': 18815.101200
Forcing use of the temporary using AST_ASSIGN at the module level instead, instantiating a wire, since it could possibly be argued that this should use less resources than a register in this case (it did the contrary):
Chip area for module '\reg_demux_noshift': 18896.749200

As it stands for this test case, I'm hesitant to make any further changes.

Perhaps someone wants to investigate why introducing a temporary makes any difference at all (e.g. should it have been completely optimized away in simplify.cc / genrtlil.cc in this case, since it's assigned from a simple 117 bit wire?), however this is outside the scope for this PR.

Thanks! I experimented with this, and as far as I can tell the difference in area is due to the chaotic nature of the synthesis flow, not because of any systematic effect of introducing the temporary. When that's the case I think we want to make the temporary non-conditional, just so we spare a bit of control flow in the congested simplify.cc code.

povik · 2024-01-10T15:49:14Z

The CI here is stuck because of the bug introduced in #4061 before it was fixed with #4067. @daglem Can you please rebase the branch? Or if you give me the go-ahead, I can push to your branch with a rebased version.

This makes tests/verilog/dynamic_range_lhs.v pass, after ensuring that nowrshmsk is actually tested. Stride is extracted from indexing of two-dimensional packed arrays and variable slices on the form dst[i*stride +: width] = src, and is used to optimize the generated CASE block. Also uses less confusing variable names for indexing of lhs wires.

Avoid repeating complex rvalue expressions for each condition.

povik · 2024-01-10T19:42:13Z

Thanks! Let's wait for CI to finish then merge.

daglem · 2024-01-10T19:42:23Z

The CI here is stuck because of the bug introduced in #4061 before it was fixed with #4067. @daglem Can you please rebase the branch? Or if you give me the go-ahead, I can push to your branch with a rebased version.

Done!

daglem requested a review from zachjs as a code owner August 4, 2023 21:48

daglem force-pushed the nowrshmsk-optimization branch from 679ecbd to fb77965 Compare August 4, 2023 22:02

daglem mentioned this pull request Aug 4, 2023

part-selects ($shift, $shiftx) create huge MUX-trees #3833

Closed

daglem force-pushed the nowrshmsk-optimization branch from fb77965 to 6d69930 Compare August 5, 2023 09:11

phsauter mentioned this pull request Aug 8, 2023

peepopt: Add shiftadd pattern #3883

Merged

3 tasks

povik self-assigned this Nov 13, 2023

daglem force-pushed the nowrshmsk-optimization branch from a2e6956 to 8a925ce Compare November 15, 2023 21:12

daglem force-pushed the nowrshmsk-optimization branch 3 times, most recently from 2292984 to df8e732 Compare November 20, 2023 14:12

povik reviewed Nov 20, 2023

View reviewed changes

daglem force-pushed the nowrshmsk-optimization branch from df8e732 to 34873af Compare November 20, 2023 14:42

povik reviewed Nov 20, 2023

View reviewed changes

daglem force-pushed the nowrshmsk-optimization branch from 34873af to b811e48 Compare November 20, 2023 14:59

povik reviewed Nov 20, 2023

View reviewed changes

daglem force-pushed the nowrshmsk-optimization branch from 1cf4925 to 7fec839 Compare November 21, 2023 13:45

daglem mentioned this pull request Nov 27, 2023

Added nordshift attribute #3877

Open

daglem force-pushed the nowrshmsk-optimization branch from edcd997 to be87ff0 Compare November 30, 2023 10:32

daglem force-pushed the nowrshmsk-optimization branch 2 times, most recently from 09026cc to d38a50d Compare December 9, 2023 20:58

daglem mentioned this pull request Dec 11, 2023

Frontend/AST: signed assign to indexed part-select #4064

Closed

povik reviewed Dec 11, 2023

View reviewed changes

daglem mentioned this pull request Dec 12, 2023

Respect the sign of the right operand of AST_SHIFT and AST_SHIFTX #4065

Merged

daglem force-pushed the nowrshmsk-optimization branch from d38a50d to 1e7b3fb Compare December 12, 2023 12:30

daglem mentioned this pull request Dec 12, 2023

Simplify and correct array slice assignment #4069

Merged

povik approved these changes Jan 10, 2024

View reviewed changes

daglem added 4 commits January 10, 2024 20:28

Add torture test for (* nowrshmsk *) stride optimization

a105d2c

Include x bits in test of lhs dynamic part-select

dbec704

Assign from rvalue via temporary register in nowrshmsk CASE

1a2b475

Avoid repeating complex rvalue expressions for each condition.

daglem force-pushed the nowrshmsk-optimization branch from 1e7b3fb to 1a2b475 Compare January 10, 2024 19:40

povik merged commit a921f59 into YosysHQ:master Jan 10, 2024
16 checks passed

phsauter mentioned this pull request Sep 26, 2024

Efficient handling of patterns emitted by sv2v #4615

Open

		int start_bit = source_offset + i;
		int end_bit = std::min(start_bit+result_width,source_width) - 1;

Optimization of nowrshmsk #3875

Optimization of nowrshmsk #3875

Conversation

daglem commented Aug 4, 2023 • edited Loading

phsauter commented Aug 7, 2023

daglem commented Aug 8, 2023 • edited Loading

phsauter commented Aug 8, 2023

daglem commented Aug 8, 2023

phsauter commented Aug 8, 2023

daglem commented Nov 15, 2023

daglem commented Nov 16, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

povik Nov 20, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

daglem commented Nov 20, 2023

povik commented Nov 20, 2023

povik Nov 20, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

povik commented Dec 7, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

povik commented Jan 10, 2024

povik commented Jan 10, 2024

daglem commented Jan 10, 2024

daglem commented Aug 4, 2023 •

edited

Loading

daglem commented Aug 8, 2023 •

edited

Loading

povik Nov 20, 2023 •

edited

Loading

povik Nov 20, 2023 •

edited

Loading