To add some more technical context for those interested, the most challenging part was the 6-level binary adder tree. In Chisel, it’s easy to write reduceTree(_ +& _), but at 130 nm, routing congestion becomes a real problem when you're trying to meet a 10 ns clock period with 86k cells. I ended up manually inserting pipeline registers between the 3rd and 4th levels of the tree to balance the slack across the two halves.
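For anyone who wants to see the arithmetic behind that split, here is a rough software model in Python. It is not the Chisel source, just a sketch under assumptions: a full 6-level binary tree (so 64 inputs), 8-bit inputs (the real input width isn't stated above), and a hypothetical helper named adder_tree. It shows that a width-expanding add like +& grows the result by one bit per level, and that a cut after level 3 leaves three adder levels on each side of the registers.

```python
def adder_tree(values, in_width, cut_after_level=3):
    """Software model of a binary adder tree with one register stage.

    values:          inputs; the count must be a power of two (64 -> 6 levels)
    in_width:        bit width of each input (an assumption, not from the design)
    cut_after_level: tree level after which the pipeline registers sit
    Returns (total, output width per level, adder levels per pipeline stage).
    """
    n = len(values)
    if n < 2 or n & (n - 1):
        raise ValueError("input count must be a power of two, at least 2")

    widths = []
    level, width = list(values), in_width
    while len(level) > 1:
        # Pairwise sums; a width-expanding add keeps the carry bit,
        # so every level is one bit wider than the previous one.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        width += 1
        widths.append(width)

    depth = len(widths)
    if not 0 < cut_after_level < depth:
        raise ValueError("cut must fall strictly inside the tree")
    # Combinational adder levels before and after the register cut.
    stages = [cut_after_level, depth - cut_after_level]
    return level[0], widths, stages


# Worst case for 64 inputs of 8 bits each: every input at 255.
total, widths, stages = adder_tree([255] * 64, in_width=8)
print(total, widths, stages)
# 16320 [9, 10, 11, 12, 13, 14] [3, 3]
```

The model only counts levels. In real silicon the later levels have wider adders, so equal level counts don't guarantee equal delay, and the best cut point still has to come from the timing report.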
I’m also curious about the hold violations I encountered. OpenROAD fixed them by inserting a large number of delay buffers, which is why the utilization is around 39% despite the logic being quite dense. Has anyone here had experience balancing area against hold slack for high-speed dataflow designs like this?