MegaBoom, clock trees, macro placement and global routing congestion #4522

oharboe · 2024-01-11T06:33:55Z

oharboe
Jan 11, 2024
Collaborator

How can I tell if clock skew is increasing the minimum clock period for a design?

Here is my current understanding:

If two flip flops are not connected, then the clock skew between the clocks that drive those two flip flops doesn't matter because there is no timing path between these two flip flops.

Skew can be good and it can be bad. If there is a long timing path between two flip flops, then negative skew for the starting flip flop or positive skew for the capturing flip flop would make it easier to meet timing.

As a first order approximation though, the CTS will try to minimize clock skew, because in the end a very large clock skew will catch up with you and increase the minimum clock period.
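
To make the sign convention concrete, here is a minimal sketch of the textbook setup and hold checks for a single flip-flop to flip-flop path. All numbers are hypothetical picosecond values, not taken from MegaBoom, and skew is defined here as capture latency minus launch latency.

```python
def min_period(launch_latency, capture_latency, clk_to_q, logic_delay, setup_time):
    """Smallest clock period that meets setup on one launch -> capture path."""
    skew = capture_latency - launch_latency  # positive: capture clock arrives later
    return clk_to_q + logic_delay + setup_time - skew


def hold_slack(launch_latency, capture_latency, clk_to_q, min_logic_delay, hold_time):
    """Hold slack on the same path; negative means a hold violation."""
    arrival = launch_latency + clk_to_q + min_logic_delay
    required = capture_latency + hold_time
    return arrival - required


# Hypothetical path: 50 ps clk-to-q, 800 ps of logic, 30 ps setup time.
balanced = min_period(100, 100, 50, 800, 30)  # no skew
helped = min_period(100, 200, 50, 800, 30)    # capture clock 100 ps later
hurt = min_period(200, 100, 50, 800, 30)      # launch clock 100 ps later

# The same late capture clock that helps setup hurts hold on a short path.
short_path_hold = hold_slack(100, 200, 50, 20, 10)
```

In this sketch the late capture clock lowers the minimum period from 880 ps to 780 ps, the late launch clock raises it to 980 ps, and the short path ends up with a hold slack of -40 ps, which is why skew can be both good and bad.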

Latest MegaBoom update:

I have modified MegaBoom so that it no longer has a PLL, but a clock for the TileLink (top level memory/peripheral interface) and for the RISC-V core.

As I understand, though I don't know the code very well, the RISC-V core is connected to the TileLink via an asynchronous FIFO (or equivalent thereof).

Therefore there are no ChipTop inputs/outputs that have an insertion point relative to the clock for the RISC-V core. This seems like a clever way of doing things, because then the insertion latency of the RISC-V clock doesn't matter for the clock period (though I would expect clock uncertainty to grow with a long clock insertion latency).

>>> report_clock_skew
Clock clock_uncore
Latency      CRPR       Skew
system/tile_prci_domain/tile_reset_domain_boom_tile/lsu/_728430_/CLK ^
 882.50
system/tile_prci_domain/tile_reset_domain_boom_tile/core/int_issue_unit/slots_32/_3607_/CLK ^
1008.42      0.00    -125.92

Clock serial_tl_0_clock
Latency      CRPR       Skew
system/serial_tl_domain/_1154_/CLK ^
  67.76
system/serial_tl_domain/_1275_/CLK ^
  64.77      0.00       2.99
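
As a sanity check on how to read this report, the following sketch parses the text above and confirms that the Skew column equals the first latency minus the second. That relation is read off these two rows only. CRPR is 0.00 in both, so how CRPR enters the calculation is not shown here. The helper name `parse_clock_skew` is made up for this example.

```python
REPORT = """\
Clock clock_uncore
Latency      CRPR       Skew
system/tile_prci_domain/tile_reset_domain_boom_tile/lsu/_728430_/CLK ^
 882.50
system/tile_prci_domain/tile_reset_domain_boom_tile/core/int_issue_unit/slots_32/_3607_/CLK ^
1008.42      0.00    -125.92

Clock serial_tl_0_clock
Latency      CRPR       Skew
system/serial_tl_domain/_1154_/CLK ^
  67.76
system/serial_tl_domain/_1275_/CLK ^
  64.77      0.00       2.99
"""


def parse_clock_skew(report):
    """Return {clock: (latency_a, latency_b, crpr, skew)} from the pasted text."""
    result = {}
    clock = None
    numbers = []
    for line in report.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Clock":
            clock = fields[1]
            numbers = []
            continue
        try:
            values = [float(field) for field in fields]
        except ValueError:
            continue  # column header or pin name, not a row of numbers
        numbers.extend(values)
        if clock is not None and len(numbers) == 4:
            result[clock] = tuple(numbers)
    return result


skews = parse_clock_skew(REPORT)
for name, (latency_a, latency_b, crpr, skew) in skews.items():
    # With CRPR at zero, skew matches the latency difference to rounding.
    assert abs((latency_a - latency_b) - skew) < 0.005
    print(f"{name}: skew {skew} ps")
```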

Some notes:

The area of some of those macros is mocked to be smaller than it really is so as to fit this into 1000um x 1000um and have reasonable turnaround times in builds. The L2 is tiny in area...
The longest timing path is part of the top level design (macros are not involved), so nothing material should be lost by ignoring what is inside the macros for now. The macros are abstracts from floorplan, so completely unrealistic.
A quick test the other day put a flattened design at ca. 3mm^2 and 3 million instances (big caveat, as I write this from my flawed memory, I'm not currently studying a flattened design). To be revisited later maybe.
W.r.t. the global placement congestion, macro placement has a fix coming up, so I don't think there's anything interesting to study w.r.t. macro placement and global routing congestion until a new build has been done after that fix: mpl2: use hpwl to evaluate macro flipping #4519
Almost no hold cells: 20, where there used to be thousands or even tens of thousands. 🤯
Running time for CTS is now much more reasonable, 5000s down from 30000s. This is not entirely surprising: now that the clock tree is not pathologically formed, the repair job is probably quick.

Log                       Elapsed seconds
1_0_mem                          173
1_1_yosys                       3606
1_1_yosys_hier_report           3542
2_1_floorplan                    277
2_2_floorplan_io                   8
No elapsed time found in bazel-bin/logs/asap7/ChipTop/base/2_3_floorplan_tdms.log
2_4_floorplan_macro              496
2_5_floorplan_tapcell            130
2_6_floorplan_pdn                206
3_1_place_gp_skip_io             448
3_2_place_iop                     14
3_3_place_gp                    4564
3_4_place_resized                863
3_5_place_dp                    1033
4_1_cts                         5069
5_1_grt                        14525
Total                          34954

rovinski · 2024-01-11T07:47:29Z

rovinski
Jan 11, 2024
Collaborator

If two flip flops are not connected, then the clock skew between the clocks that drive those two flip flops doesn't matter because there is no timing path between these two flip flops.

Correct

Skew can be good and it can be bad. If there is a long timing path between two flip flops, then negative skew for the starting flip flop or positive skew for the capturing flip flop would make it easier to meet timing.

Correct

As a first order approximation though, the CTS will try to minimize clock skew, because in the end a very large clock skew will catch up with you and increase the minimum clock period.

The OR CTS engine tries to minimize skew because it is algorithmically simpler, but this is not necessarily optimal. Commercial engines use a technique called "concurrent clock optimization" which will look at the timing paths and purposefully skew certain registers if it makes timing better.

Concurrent clock optimization has a similar effect to register retiming - the former shifts the clock so it borrows setup time from one stage to give to another stage. The latter shifts logic from one stage to another stage and therefore also shifts setup time.
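
A toy illustration of that equivalence, as a sketch only: a three-register pipeline with hypothetical stage delays of 800 ps and 400 ps, ignoring clk-to-q, setup time and hold checks. Delaying the middle register's clock by 200 ps gives the same minimum period as moving 200 ps of logic from the first stage to the second.

```python
def pipeline_min_period(stage_delays, clock_latencies):
    """Smallest period meeting setup on every stage of a linear pipeline.

    stage_delays[i] is the logic delay between register i and register i + 1;
    clock_latencies[i] is the clock arrival time at register i.
    """
    return max(
        delay - (clock_latencies[i + 1] - clock_latencies[i])
        for i, delay in enumerate(stage_delays)
    )


# Zero skew: the slow first stage sets the period.
balanced = pipeline_min_period([800, 400], [0, 0, 0])

# Useful skew: the middle register is clocked 200 ps late, which lends
# 200 ps to the first stage and takes it from the second.
skewed = pipeline_min_period([800, 400], [0, 200, 0])

# Retiming: 200 ps of logic moves from the first stage to the second.
retimed = pipeline_min_period([600, 600], [0, 0, 0])

print(balanced, skewed, retimed)
```

Both approaches bring the period from 800 ps down to 600 ps in this model. In a real design the skewed clock would also have to pass hold checks, which this sketch does not model.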

As I understand, though I don't know the code very well, the RISC-V core is connected to the TileLink via an asynchronous FIFO (or equivalent thereof).

Yes, async FIFOs or other clock domain crossings (CDCs) are convenient ways to break up and decouple clock trees. Clock trees cannot become too large, because the larger they are, the more power they consume and the more difficult it is to minimize skew/jitter/uncertainty. A clock tree can become so large that the jitter becomes larger than the clock period, in which case timing is impossible to meet. There is a design tradeoff between how many clock domains there are and data latency because the CDCs add one or more cycles when transmitting data across the interface.

How can I tell if clock skew is increasing the minimum clock period for a design?

I might rephrase the question more simply as "How can I tell if clock skew is bad for a design?" because clock skew always impacts the clock period, as alluded to above. This is one of the areas where it takes a lot of intuition, experimentation, and heuristics to evaluate because the answer is rarely clear. A soft and perhaps not very useful rule of thumb would be "when the clock skew/jitter/uncertainty becomes a significant fraction of the clock period". There are no hard rules of thumb, because sometimes high skew can be tolerated in order to keep the design fully synchronous. I personally start to get suspicious if the skew is eating more than 20-40% of the clock period. But the size of the clock tree also matters, as does how much skew you would expect from a clock tree of that size.

There are some red flags, though, to identify purely suboptimal results. One is if there are many, many hold buffers being inserted. This is usually due to bad timing constraints, but it could also be due to bad skew in the clock tree.

Another red flag is if a path is failing both setup time and hold time. This most often happens not because of skew but because of jitter caused by on-chip variation. Jitter can cause clock edges to be both early and late, which means that if a path is failing both then the jitter is too high.

oharboe · 2024-01-11T11:52:46Z

oharboe
Jan 11, 2024
Collaborator Author

This is the asynchronous connection between TileLink and the rest of the system in the Verilog code.

Here is the expected gray counter and the corresponding Chisel code.
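
For readers unfamiliar with why a Gray counter is expected in an asynchronous crossing, here is a small Python sketch. It is not the Chisel or Verilog from MegaBoom. It only shows the property that matters: consecutive Gray codes differ in exactly one bit, so a pointer sampled in the other clock domain is seen as either the old or the new value.

```python
def to_gray(n):
    """Convert a binary count to its reflected Gray code."""
    return n ^ (n >> 1)


def from_gray(g):
    """Convert a reflected Gray code back to the binary count."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


BITS = 3  # hypothetical pointer width
sequence = [to_gray(i) for i in range(2 ** BITS)]

# Every step, including the wrap-around, flips exactly one bit.
for i, code in enumerate(sequence):
    following = sequence[(i + 1) % len(sequence)]
    assert bin(code ^ following).count("1") == 1

# The conversion round-trips for every value of the counter.
assert all(from_gray(to_gray(i)) == i for i in range(2 ** BITS))
```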

oharboe · 2024-01-11T14:07:53Z

oharboe
Jan 11, 2024
Collaborator Author

I chased down the synchronous reset although it doesn't show up in the most critical path at the ChipTop level.

For now, I have created a macro out of the BranchPredictor to rein in build times. In that macro the synchronous reset has a very large fanout, which obviously is a disaster for timing.

After some investigation at the top level, I have found out that MegaBoom, as documented, is relying heavily on register retiming and that the synchronous reset is in fact pipelined.

However, since the design is hierarchical and not flattened, the design won't be able to take advantage of these three pipeline stages. Also, yosys does not support retiming.

Retiming in OpenROAD/yosys has been discussed in some detail previously; here I wanted to share the results of my investigation into synchronous reset specifically for MegaBoom.

7 replies

rovinski Jan 12, 2024
Collaborator

Using CTS for a reset signal would cause higher power and area for no performance improvement.

CTS is used to balance skew so that the signal reaches endpoints at the same time. This isn't required for reset because it only needs to arrive before the next clock edge. Skew balancing would require extra power.
Data signals like reset can be pipelined but clocks cannot. Regular buffering along with pipelining is good enough.
Reset signals from a system perspective generally aren't important. It doesn't affect performance if it takes 1 cycle to reset vs. 10 cycles.
Similarly, if the clock is slowed down during reset, the penalty to performance is not that high because it only happens once during startup.

Because of these, it saves area, power, and tool time to relax constraints on reset signals.

oharboe Jan 12, 2024
Collaborator Author

Makes sense.

Additionally, I would say that a synchronous reset should not be thought of as a reset.

It is better thought of as a synchronous enable signal with a giant fan-out. The solution to handling this fanout is indeed what you describe above: add pipeline stages.

MegaBoom does add these pipeline stages (though I think 3 stages for a fanout of at least 50000 is a bit light at higher frequencies), but it relies on retiming. There is nothing wrong with relying on retiming for MegaBoom's case. I speculate that the project was under tight time and resource constraints for its tapeout and that relying on retiming was one of many ways to cut corners, so they did.
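
A back-of-the-envelope sketch of why 3 stages looks light for 50000 endpoints. It assumes the reset pipeline is built as a balanced tree in which every register at every stage drives the same number of loads, which only holds if the registers are duplicated in the RTL or by the tools. The fanout limit of 16 is a hypothetical number, not a MegaBoom or ASAP7 figure.

```python
def stages_needed(endpoints, max_fanout):
    """Pipeline stages needed so no register drives more than max_fanout loads."""
    stages = 1
    reach = max_fanout  # endpoints reachable from one first-stage register
    while reach < endpoints:
        reach *= max_fanout
        stages += 1
    return stages


def fanout_per_register(endpoints, stages):
    """Smallest per-register fanout f such that f ** stages >= endpoints."""
    fanout = 1
    while fanout ** stages < endpoints:
        fanout += 1
    return fanout


# With 3 stages, each register in a balanced tree still drives 37 loads.
three_stage_fanout = fanout_per_register(50000, 3)

# With a hypothetical limit of 16 loads per register, 4 stages are needed.
stages_at_16 = stages_needed(50000, 16)

print(three_stage_fanout, stages_at_16)
```

Without any duplication the picture is much worse, because the single last-stage register drives all 50000 endpoints on its own.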

The synchronous reset signal can easily be extended with more pipeline stages, but it requires changes to the RTL.

If one uses macros, then that has to be reflected in the RTL: there needs to be reset pipeline stages outside of the macros and inside of the macros.

To my mind, the problem of the distribution of any high fanout synchronous signal is a problem that has to be solved in synthesis and even in the RTL prior to synthesis.

Yosys + OpenROAD adds a constraint on the RTL, for now, w.r.t. synchronous high fanout signals because the entire design can't be flattened (too long build times) and there is no retiming in Yosys. Yosys + OpenROAD, or indeed any tool, puts many constraints on the RTL. Commercial tools, being much more mature, have fewer constraints on RTL.

That said, register retiming replaces one problem with another. The control over what is generated is abstracted and there is a verification concern too. It is not unheard of that projects won't touch register retiming with a ten foot pole. My experience with register retiming in FPGA has been exclusively positive, but I'm unsure where I would put register retiming in Yosys' feature priorities.

I say there are many constraints, so to back that up a bit, I'll mention one more: if I want to define a macro for the RISC-V core itself (BoomTile), because I want to add 4 CPU cores before the L2, then I should consider having one clock domain per core. This can be achieved with an asynchronous interface (TileLink) between the CPU core and the L2 & peripherals (SystemBus) in the image below.

If you read the paper, you will find such an asynchronous clock crossing in the diagrams:

What I will try to do next is to create a BoomTile with an asynchronous clock crossing. For now that alone, excluding the SystemBus and beyond and putting it into a macro, is plenty for Yosys + OpenROAD to manage and it makes sense in a scenario where 4 cores are connected to the SystemBus and L2.

rovinski Jan 12, 2024
Collaborator

Recall the discussion here The-OpenROAD-Project/OpenROAD-flow-scripts#1710 (comment). Resets require register cloning, not retiming. Retiming can be done to some extent in synthesis (it depends on the timing model). Gate cloning largely can only be done with physical information and is therefore done during place and route.

Here is an example where reset is pipelined and it needs to reach two macros:

No matter how much you pipeline the reset signal, the optimal place for the last register is half way between the two macros. Register cloning can do something like this:

A duplicate register is created for the third stage of the pipeline. Before, the fanout was to both Macro 1 and Macro 2. Now, one register fans out only to Macro 1 and another fans out only to Macro 2. The fanout for the second pipeline register is increased from 1 to 2.

But, the net result is that the registers can be moved closer to the macros and the critical path is shorter. One thing to note is that this will change the netlist to differ from the RTL model because it is adding registers, but the netlist is still guaranteed to be equivalent.
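
The effect can be sketched with a toy 1-D placement model. Everything here is hypothetical: the macro positions, the register positions, and the use of wire length as a stand-in for delay. It only illustrates the before and after described above, not what any tool actually does.

```python
MACRO_1, MACRO_2 = 0.0, 1000.0  # hypothetical positions in um along a line


def longest_wire(edges):
    """Longest driver-to-load distance among (driver_pos, load_pos) pairs."""
    return max(abs(driver - load) for driver, load in edges)


# Without cloning: the single last-stage register sits half way between
# the macros and drives both of them.
r2 = r3 = 500.0
before = longest_wire([(r2, r3), (r3, MACRO_1), (r3, MACRO_2)])

# With cloning: the second-stage register now fans out to two copies of
# the third stage, and each copy moves toward the macro it drives.
r2 = 500.0
r3a, r3b = 250.0, 750.0
after = longest_wire([(r2, r3a), (r2, r3b), (r3a, MACRO_1), (r3b, MACRO_2)])

print(before, after)
```

In this model the longest wire drops from 500 um to 250 um, at the cost of one extra register and a fanout of 2 on the second stage.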

To my mind, the problem of the distribution of any high fanout synchronous signal is a problem that has to be solved in synthesis and even in the RTL prior to synthesis.

This is an issue that spans both logical design and physical design. It will require some iteration between the two to reach a reasonable solution. The physical design dictates the timing and constraints on the reset signal, and the RTL needs to adjust how many pipeline stages there are for functional correctness, ensuring test benches are set up correctly, etc.

oharboe Jan 12, 2024
Collaborator Author

Makes sense. I'm struggling with the distinctions between all these optimizations, but repetition helps. Thanks!

maliberty Jan 12, 2024
Maintainer

Cloning the register wouldn't be that hard in OR but it also requires a verification methodology that can support it. I'm not sure one exists in OSS tools (I could be wrong). If it's a scan register it is still more complex.

oharboe · 2024-01-12T21:05:45Z

oharboe
Jan 12, 2024
Collaborator Author

For my part, the questions were answered so closing.
