Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization of nowrshmsk #3875

Merged
merged 4 commits into from
Jan 10, 2024
Merged

Conversation

daglem
Copy link
Contributor

@daglem daglem commented Aug 4, 2023

Generation of minimal AST_CASE for variable slices on the form dst[i*stride +: width] = src.

Corrections of corner cases, now correctly tested by tests/verilog/dynamic_range_lhs.v

@phsauter
Copy link
Contributor

phsauter commented Aug 7, 2023

This PR seems to reduce the Quality-of-Results for nowrshmsk, the attached zip-file can serve as a base to investigate the issue.

shift_issue3.zip

yosys 0.32:

  • reg_demux: area: 72k; normalized runtime: 1
  • reg_demux_pow2: area: 37k; normalized runtime: 0.25
  • reg_demux_nowrshmsk_typedef: area: 33k; normalized runtime: 0.25
  • (reg_demux_nowrshmsk_mdim_arr: area: 36k; normalized runtime: 17.5)
    mdim_arr only added for completeness, yosys 0.32 can't handle it properly

PR #3875:

  • reg_demux: area: 75k; normalized runtime: 0.9
  • reg_demux_pow2: area: 37k; normalized runtime: 0.13
  • reg_demux_nowrshmsk_typedef: area: 51k; normalized runtime: 0.15
  • reg_demux_nowrshmsk_mdim_arr: area: 51k; normalized runtime: 0.15

@daglem
Copy link
Contributor Author

daglem commented Aug 8, 2023

Good find!

The resource increase seems to be caused by the simplification of non_opt_range in verilog_parser.y, affecting reads from out_rsp_i. Perhaps some optimization pass relies on the exact pattern (@selfsz@((index)*(width)))+(0)) for the range LSB?

I'll see if I can avoid changing non_opt_range, instead extracting stride from the original AST.

BTW please don't use unpacked arrays within packed structs; this is not allowed by the LRM. The following (correct) typedef is functionally equivalent:

	typedef struct packed {
	    logic [12:0][116:0] port;
	} out_req_t;

@phsauter
Copy link
Contributor

phsauter commented Aug 8, 2023

If you look at the reports in the attached zip, you can see that your solution is actually smaller right up to running abc, even if you look at the rtlil and verilog dumps during the synthesis process, they are usually almost identical.

So I see three possibilities:

  1. The optimization potential cannot be detected anymore by abc (its likely that different abc scripts would then find it
  2. The abc call does not pass the circuit the same way, maybe because of the attribute
  3. The optimization potential is actually lost for some reason (I think thats unlikely)

Sadly I am somewhat out of time for the next 2-3 weeks (I will still try to work on it, just less).

shift_issue_pr3875.zip

@daglem
Copy link
Contributor Author

daglem commented Aug 8, 2023

If you look at the reports in the attached zip, you can see that your solution is actually smaller right up to running abc, even if you look at the rtlil and verilog dumps during the synthesis process, they are usually almost identical.

Yeah. I found that the real (current) optimization was always the simplification of the AST_CASE using stride, seemingly the removal of multiplication of index by stride will in certain scenarios only do harm ATM (I guess some optimization pass removes the multiplication in any case).

As a pragmatic measure I've disabled the removal of the multiplication for now, and at the same time I generalized the AST_CASE simplification to handle variable slices on the form dst[i*stride +: width] = src

Sadly I am somewhat out of time for the next 2-3 weeks (I will still try to work on it, just less).

If you like, you can probably use this PR together with #3877 right away for testing of the SoC design. When you continue working on optimization of the shift approach, the results from nowrshmsk + nordshift can hopefully serve as a comparison baseline.

@phsauter
Copy link
Contributor

phsauter commented Aug 8, 2023

With disabled mult optimization (last commit):

  • reg_demux: area: 72k; normalized runtime: 1.06
  • reg_demux_pow2: area: 37k; normalized runtime: 0.15
  • reg_demux_nowrshmsk_typedef: area: 33k; normalized runtime: 0.18
  • reg_demux_nowrshmsk_mdim_arr: area: 33k; normalized runtime: 0.19

I can confirm that this solves the QoR regression on the given example.
I think it would still be interesting to explore why this happens but for now this PR looks fine and is a definite improvement overall.

And I am absolutely going to use this + PR #3877 as a golden reference for the other approach in #3873 and #3880.
This evening I am going to prep a larger test derived from the problematic SoC module for this purpose.

@phsauter phsauter mentioned this pull request Aug 8, 2023
3 tasks
@povik povik self-assigned this Nov 13, 2023
@daglem
Copy link
Contributor Author

daglem commented Nov 15, 2023

@povik I did a cleanup by squashing the commits.

As a torture test I've also run make test after temporarily setting bool use_case_method = true; in simplify.cc.

This resulted in two failures:

  • tests/svtypes/struct_dynamic_range.ys
    • expected, since this test explicitly checks counts of mul, shift, and shiftx cells.
  • tests/various/dynamic_part_select/forloop_select.v
    • inexplicably (to me) fails to prove equivalence with forloop_select_gate.v

Running ../../yosys -p "read_verilog -dump_vlog2 ./dynamic_part_select/forloop_select.v" yields code which AFAICT should yield identical results to forloop_select_gate.v, considering the constraints implied by the stride in each case. Setting stride = 1 in simplify.cc makes the test pass, but I am at a loss as to what the logical difference could be.

To confuse things further, the failure summary of sat in dynamic_part_select.log seems to show impossible combinations of ctrl, din, and dout for both gold and gate?!?

Do you have any idea what can be wrong here, either in the generated optimized CASE, or in the equivalence test?

@daglem
Copy link
Contributor Author

daglem commented Nov 16, 2023

The only hypothesis I've been able to come up with regarding the failure for tests/various/dynamic_part_select/forloop_select.v is that ctrl*sel in dout[(ctrl*sel)+:SLICE] <= din; can overflow, yielding an index which is not covered by the generated CASE.

Given that ctrl and sel are 4 and 5 bits, respectively, e.g. 3*15 = 45 (6 bits) would assumedly, due to SystemVerilog expression size determination rules, be truncated to 5 bits, yielding 13, while the generated CASE only covers 0 and 15.

@daglem daglem force-pushed the nowrshmsk-optimization branch 3 times, most recently from 2292984 to df8e732 Compare November 20, 2023 14:12
int start_bit = source_offset + i;
int end_bit = std::min(start_bit+result_width,source_width) - 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the earlier code here was confused since start_bit+result_width had the source_offset included but was compared to source_width which had not. So this led to incorrect behavior on wires with non-zero source_offset, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw shouldn't we rename all those source_ variables? It's the target of the assignment, so it's confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the earlier code here was confused since start_bit+result_width had the source_offset included but was compared to source_width which had not. So this led to incorrect behavior on wires with non-zero source_offset, right?

Yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw shouldn't we rename all those source_ variables? It's the target of the assignment, so it's confusing.

I agree! I've been thinking about changing that, however I didn't dare to change the "source_" naming which originates with source_widthin the shift and mask code.

Should I change all source_offset and source_width to e.g. dst_offset and dst_width?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, go ahead. I would suggest wire_offset and wire_width, but really anything will be an improvement it seems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, changed to wire_offset and wire_width now.

if (start_bit%bitno_div != 0 || (stride == 0 && start_bit != 0))
continue;

AstNode *cond = new AstNode(AST_COND, mkconst_int(start_bit, case_sign_hint, case_width_hint));
Copy link
Member

@povik povik Nov 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here by forcing the sign and width of shift_expr on the start_bit constant in the case, can we make it overflow? Either way what's the purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can overflow. Not considering this can lead to incorrect behavior, as I found here: #3875 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so we consider the overflow in figuring out the reachable values of shift_expr, but shouldn't we then take the overflown value for the bit-select on the target wire?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do, by checking whether it is possible to generate the current index by a possibly overflown shift_expr (start_bit%gcd(stride, shift_mod) == 0).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I see now. So we are not relying on mkconst_int(start_bit, case_sign_hint, case_width_hint) wrapping around the constant since at that point it should never be outside the [min_offset,max_offset] range. If that's the case I suggest we have it as mkconst_int(start_bit, true) so it's clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would unfortunately make tests/verilog/dynamic_range_lhs.v fail again.
The width is required, since the bit pattern for a negative number depends on the width.

E.g. here is the diff between the working and non-working generated CASE in the first failure in the test above, generated with ../../yosys -DALT=1 -DSPAN=1 -DLEFT=-3 -DRIGHT=0 -p "read_verilog -dump_vlog2 ./dynamic_range_lhs.v":

--- case-ok.txt	2023-11-20 21:24:03.486296307 +0100
+++ case-not-ok.txt	2023-11-20 21:25:05.460245527 +0100
@@ -46,15 +46,15 @@
             1:
               begin
                 case ((sel1)*(sel2))
-                  2'b 00:
+                  0:
                     out_u[0:0] = data;
                 endcase
                 case (/** AST_TO_SIGNED **/)
-                  2'b 10:
+                  -2:
                     out_s[-2:-2] = data;
-                  2'b 11:
+                  -1:
                     out_s[-1:-1] = data;
-                  2'b 00:
+                  0:
                     out_s[0:0] = data;
                 endcase
               end

@daglem
Copy link
Contributor Author

daglem commented Nov 20, 2023

All tests pass now with a forced use_case_method = true, except for the expected failure of tests/svtypes/struct_dynamic_range.ys

The possible overflow of i*stride required a different take on this in order to optimize the generated CASE. Lots of hours for a seemingly small change, as always 🙈

In the end I had to add an internal gcd function to make the CI run - sorry about all the force pushes.

IMO this is ready to be merged now, with a possible change of source_ to dst_ or similar.

@povik
Copy link
Member

povik commented Nov 20, 2023

In the end I had to add an internal gcd function to make the CI run - sorry about all the force pushes.

No worries!

long long shift_mod = 1ll << (case_width_hint - case_sign_hint);
// std::gcd requires C++17
// long long bitno_div = std::gcd(stride, shift_mod);
long long bitno_div = gcd((long long)stride, shift_mod);
Copy link
Member

@povik povik Nov 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this undoes much of the optimisation making bitno_div only be power-of-2, not that I don't see why it's done to address the overflown cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's not ideal, although power-of-2 will perfectly optimize common (most?) cases (e.g. gcd(8, 256) = 8), and partially optimize others (e.g. gcd(12, 256) = 4).

Since we know the value of stride, it might be possible to identify cases where the index expression cannot possibly overflow, and fully optimize those cases. I'll look into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about now? 😅

All tests still pass with a forced use_case_method = true, and the results from running the tests in shift_issue3.zip above are:

   Chip area for module '\reg_demux_nowrshmsk_mdim_arr': 19213.891200
   Chip area for module '\reg_demux_nowrshmsk_typedef': 19115.271000
   Chip area for module '\reg_demux_pow2': 5513.961600
   Chip area for module '\reg_demux': 21857.661000

As far as I can tell, the PR now only brings improvements, without any regressions.

@povik
Copy link
Member

povik commented Dec 7, 2023

@daglem Sorry for the protracted review. I should get to it soon.

@daglem daglem force-pushed the nowrshmsk-optimization branch 2 times, most recently from 09026cc to d38a50d Compare December 9, 2023 20:58
@@ -321,6 +321,9 @@ namespace AST
static AstNode *mkconst_str(const std::vector<RTLIL::State> &v);
static AstNode *mkconst_str(const std::string &str);

// helper function to create an AST node for a temporary register
AstNode *mktemp_logic(const std::string &name, AstNode *mod, bool nosync, int range_left, int range_right, bool is_signed);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be static

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't be static, as long as it's encoding the filename and location in wire->str.

Is that a problem? If so, I guess I could make a separate (non-static) function to generate the name, and pass the result of that into mktemp_logic, but I'm not sure I see the benefit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, missed that. No, it's not a problem per se

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I've also thought about enhancing the function later, adding a parameter to decide whether the name should be local (no name decoration) or global (name decorated as it is now).

I believe the function could then be used to replace most of the current code to instantiate temporaries, for which I could make a PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@povik As promised did some more tests with shift_issue3.zip (including reg_demux_noshift in the Makefile). I applied #3877 to reproduce increased chip area when an unconditional temporary was introduced for the rhs.

Without any changes (no temporary in this case):
Chip area for module '\reg_demux_noshift': 18689.907600
Forcing use of the temporary, which is currently assigned by AST_ASSIGN_EQ in the current block, instantiating a (* nosync *) register:
Chip area for module '\reg_demux_noshift': 18815.101200
Forcing use of the temporary using AST_ASSIGN at the module level instead, instantiating a wire, since it could possibly be argued that this should use less resources than a register in this case (it did the contrary):
Chip area for module '\reg_demux_noshift': 18896.749200

As it stands for this test case, I'm hesitant to make any further changes.

Perhaps someone wants to investigate why introducing a temporary makes any difference at all (e.g. should it have been completely optimized away in simplify.cc / genrtlil.cc in this case, since it's assigned from a simple 117 bit wire?), however this is outside the scope for this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I experimented with this, and as far as I can tell the difference in area is due to the chaotic nature of the synthesis flow, not because of any systematic effect of introducing the temporary. When that's the case I think we want to make the temporary non-conditional, just so we spare a bit of control flow in the congested simplify.cc code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done.

@povik
Copy link
Member

povik commented Jan 10, 2024

The CI here is stuck because of the bug introduced in #4061 before it was fixed with #4067. @daglem Can you please rebase the branch? Or if you give me the go-ahead, I can push to your branch with a rebased version.

This makes tests/verilog/dynamic_range_lhs.v pass, after ensuring that
nowrshmsk is actually tested.

Stride is extracted from indexing of two-dimensional packed arrays and
variable slices on the form dst[i*stride +: width] = src, and is used
to optimize the generated CASE block.

Also uses less confusing variable names for indexing of lhs wires.
Avoid repeating complex rvalue expressions for each condition.
@povik
Copy link
Member

povik commented Jan 10, 2024

Thanks! Let's wait for CI to finish then merge.

@daglem
Copy link
Contributor Author

daglem commented Jan 10, 2024

The CI here is stuck because of the bug introduced in #4061 before it was fixed with #4067. @daglem Can you please rebase the branch? Or if you give me the go-ahead, I can push to your branch with a rebased version.

Done!

@povik povik merged commit a921f59 into YosysHQ:master Jan 10, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants