Open VDF: Öztürk Multiplier Design Overview

0_6AayQsff3TdL7JgF.png

Design Features

The current architecture under consideration is based on the Öztürk multiplier, which was presented in this MIT VDF Day talk. This design includes several features that are critical to a low latency hardware implementation.

  • Redundant polynomial representation — Eliminates the need to propagate a carry across 2k bits. A 2k carry would result in a very deep gate depth and slow design. Instead our longest carry is across a single digit and can be varied by choosing different digit sizes. In general carry propagation is a big challenge for low latency.

  • Fully unrolled pipeline — A typical traditional throughput oriented design uses pipelining to share resources and reduce area and power. In contrast this design stamps out enough hardware to eliminate sharing and extract all available parallelism. While resource sharing is not required, the design can be pipelined if desired.

  • Log depth compressor trees — Summing of partial products and reduction values comprises the bulk of the hardware. The design makes use of adder trees, resulting in a gate depth that is O(log n) with the number of inputs to sum.

  • Fixed modulus — With a fixed modulus the lookup tables are constants. We use 1-bit lookup tables, which means for each bit above 2k we’re choosing either the constant reduction value or zero to send to the compression trees. Zeros in the precomputed reduction values will be eliminated by the synthesis tool through constant propagation. In the end the reduction tables get absorbed into the reduction compression trees.

  • Critical path optimization — One clever optimization is to reduce from both the top and the bottom simultaneously, as opposed to from the top (Barrett) or bottom (Montgomery) alone. In doing so we can take advantage of the squarer triangle shape. The shallow portions needed for reduction (red) need to be fast to generate the reduction values and drive the reduction compression tree. The deepest portion of the triangle white can also be the slowest, since that result can intercept the end of the reduction compression tree. This reduces the overall critical path depth for the design and saves area and power.

Redundant Polynomial Representation Explained

The MSU architecture is optimized for a repeated-squaring setting. A novel redundant representation is utilized to achieve extremely low latency per modular squaring operation.

For efficiency of modular squaring operations, we utilize two levels of redundancy:

  • Each coefficient holds one extra bit to hold the extra carry bit resulting from coefficient-wise additions. This way, carry propagation is contained within each coefficient.

  • Each polynomial representing an integer holds an extra coefficient to reduce levels of reduction. The reduction operation consists of accumulating ~1024 2048-bit numbers together, producing a 2048+11 bit result. Instead of adding a second level of reduction to reduce these extra 11 bits, we utilize an extra redundant coefficient. This extra coefficient has negligible effect on the complexity of squaring and reduction operations, and the result of the reduction operation is still contained within 2048+11 bits.

As stated, the purpose of the redundant representation is to break the carry chain. Typically a squarer would have to propagate carries across the entire bit width being squared, resulting in a very deep critical path and long latency. In this design we break the carry chain at polynomial coefficient boundaries, resulting in a longest carry chain in the 20’s of bits.

Upcoming Posts

In the upcoming posts we will dive into more detail on designing a low latency MSU by exploring:

  • Carry propagate adders (CPA)

  • Carry save adders (CSA) and compressor trees

  • Partial product generation (PPG)

  • Ideal coefficient bitwidth

  • Complete MSU design

Source Code

Source code for the designs presented here are available on GitHub at supranational/hardware and is licensed under the Apache 2 license.