Topic 10 · Timing, Numerics & PPA

Numerical Architecture Trade-offs

Video 2 of 4 · ~15 minutes

Dr. Mike Borowczak · Electrical & Computer Engineering · CECS · UCF


🌍 Where This Lives

In Industry

Every DSP chip markets by “multiply-accumulate operations per second.” A Qualcomm Hexagon DSP does ~10^12 MACs/s because each multiplier is a dedicated block. A Google TPU has 65,536 multipliers in one systolic array. Even your phone's voice assistant runs on custom arithmetic units optimized for matrix math. Architecture choices at this layer determine how many FLOPS per watt you get, the central metric of the AI era.

In This Course

Your Topic 5 counter uses a + 1. Your Topic 11 UART uses baud_cnt < HALF_PERIOD. Your capstone with filters or transforms will need real multipliers. Choosing the right arithmetic architecture for the job is the difference between “fits at 25 MHz” and “doesn't fit.”

⚠️ + Is Not One Thing

❌ Wrong Model

“The + operator in Verilog always produces the same hardware. What else could ‘addition’ be?”

✓ Right Model

Synthesizers choose among several adder architectures based on context, target, and width: ripple-carry (simple, slow — O(N) delay), carry-lookahead (fast, larger — O(log N) delay), carry-select (medium), Kogge-Stone (fastest, largest, ASIC-only usually). On iCE40, Yosys uses the dedicated SB_CARRY chain — essentially fast ripple with hardware acceleration.

The receipt: A 32-bit Kogge-Stone adder is ~4× larger but ~4× faster than a 32-bit ripple adder. Different architectures for different optimization targets.

👁️ I Do — Adder Architectures

Architecture | Delay (N-bit) | Area (N-bit) | Use when
Ripple-carry | O(N) | O(N) | Narrow (<16 bits), low clock
iCE40 SB_CARRY chain | O(N), fast | O(N) | Default on iCE40; use this
Carry-lookahead | O(log N) | O(N log N) | ASIC, wide, high clock
Carry-select | O(√N) | O(N) | Middle-ground FPGA designs
Kogge-Stone | O(log N) | O(N log N) | Very wide, aggressive ASIC
My thinking: On iCE40, always write c = a + b and let Yosys use SB_CARRY. The dedicated carry chain is fast (propagates in dedicated routing, not LUTs). Don't hand-roll ripple-carry adders — they're strictly worse. Trust the tool.
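
The "write + and trust the tool" advice can be sketched as a parameterized adder. This is a minimal illustration, not course code: the module name add_n and its port names are assumptions. The result is one bit wider than the operands so the carry out is kept.

```verilog
// Hypothetical example module; names are illustrative, not from the course repo.
module add_n #(parameter N = 16) (
    input  wire [N-1:0] i_a,
    input  wire [N-1:0] i_b,
    output wire [N:0]   o_sum   // one extra bit holds the carry out
);
    // A plain "+" leaves the adder structure to the synthesizer;
    // on iCE40 it is expected to map onto the SB_CARRY chain.
    assign o_sum = i_a + i_b;
endmodule
```

Because the left-hand side is N+1 bits wide, the addition is evaluated at N+1 bits and the carry is not truncated.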

👁️ I Do — Multiplication: The LUT Explosion

// 8-bit unsigned multiplier
assign product = a * b;
// compact source, expensive silicon
  • 4×4 multiply: ~20 LUTs
  • 8×8 multiply: ~80 LUTs
  • 16×16 multiply: ~350 LUTs (27% of an HX1K!)
  • 32×32 multiply: does not fit
RTL diagram of a single-cycle parallel multiplier. Two operand inputs a (orange) and b (orange) feed a combinational block (red) labeled 'N×N multiplier — N² AND gates → partial products + carry tree of N adders'. The result product[2N-1:0] (green) emerges combinationally. The block is annotated with area scaling: 8×8 ≈ 80 LUTs, 16×16 ≈ 350 LUTs.
The reality: Multiplication is fundamentally O(N^2) in area for a parallel implementation. On a chip with only 1280 LUTs, you can afford only one or two wide multipliers. This is why DSP-heavy FPGAs (Xilinx 7-series, Intel Cyclone) include hardened multiplier blocks (DSP48 slices on Xilinx, DSP blocks on Intel) to offload the area cost from general-purpose LUTs.

🤝 We Do — Sequential Shift-and-Add Multiplier

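module mul_seq #(parameter W = 8) (
    input  wire           i_clk, i_start,
    input  wire [W-1:0]   i_a, i_b,
    output reg  [2*W-1:0] o_p, output reg o_done
);
    reg [W-1:0]  r_a, r_mask;       // latched multiplicand; multiplier, shifted right each step
    reg [2*W-1:0] r_sum;            // running sum of partial products
    reg [$clog2(W)-1:0] r_step;     // current bit position, 0 .. W-1
    reg r_busy;                     // high while a multiply is in flight
    // Sum after this step's conditional add; also feeds o_p on the final step.
    wire [2*W-1:0] w_next = r_mask[0] ? r_sum + ({{W{1'b0}}, r_a} << r_step) : r_sum;
    always @(posedge i_clk) begin
        if (i_start) begin
            r_a <= i_a; r_sum <= 0; r_step <= 0; r_busy <= 1; o_done <= 0;
            r_mask <= i_b;
        end else if (r_busy) begin
            r_sum <= w_next;
            r_mask <= r_mask >> 1;
            r_step <= r_step + 1;
            // Capture w_next, not r_sum: r_sum does not yet include the final partial product.
            if (r_step == W-1) begin r_busy <= 0; o_done <= 1; o_p <= w_next; end
        end
    end
endmodule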
RTL diagram of mul_seq sequential multiplier. i_clk (blue) clocks five registers (purple): r_a (latched a), r_mask (shift-right b), r_sum (2W-bit accumulator), r_step (cycle counter), and r_busy (state flag). i_a and i_b (orange) feed the latches; i_start (magenta) initiates a transaction. A central red shift-and-add block (r_a << r_step) accumulates into r_sum when r_mask[0]=1. r_sum loops back for accumulation. Outputs o_p (green) and o_done (green) emit when r_step reaches W-1. The single adder is reused W times.
Together: Instead of 80 LUTs in parallel, we use ~20 LUTs across W=8 clock cycles. One adder is reused; the state lives in r_* registers. Saves 75% area at the cost of 8× latency. Classic throughput-vs-area tradeoff.
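
One common variation on the same idea shifts the multiplicand left by one bit each cycle instead of shifting it by r_step, so no variable shifter is needed. The sketch below is an assumption-laden illustration, not the course's reference design: the module name mul_seq_shift is hypothetical, it accumulates directly in o_p, and its LUT count has not been measured here.

```verilog
// Hypothetical variant of mul_seq: same ports, multiplicand pre-shifted each cycle.
module mul_seq_shift #(parameter W = 8) (
    input  wire           i_clk, i_start,
    input  wire [W-1:0]   i_a, i_b,
    output reg  [2*W-1:0] o_p,
    output reg            o_done
);
    reg [2*W-1:0]       r_a;            // multiplicand at the current bit weight
    reg [W-1:0]         r_b;            // multiplier, consumed LSB first
    reg [$clog2(W)-1:0] r_step;         // counts the W add/shift cycles (assumes W >= 2)
    reg                 r_busy = 1'b0;  // high while a multiply is in flight

    // Accumulator value after this cycle's conditional add.
    wire [2*W-1:0] w_sum = r_b[0] ? o_p + r_a : o_p;

    always @(posedge i_clk) begin
        if (i_start) begin
            r_a    <= {{W{1'b0}}, i_a};
            r_b    <= i_b;
            o_p    <= {(2*W){1'b0}};
            r_step <= 0;
            r_busy <= 1'b1;
            o_done <= 1'b0;
        end else if (r_busy) begin
            o_p    <= w_sum;            // accumulate in the output register
            r_a    <= r_a << 1;         // next bit weight
            r_b    <= r_b >> 1;         // next multiplier bit
            r_step <= r_step + 1'b1;
            if (r_step == W-1) begin
                r_busy <= 1'b0;
                o_done <= 1'b1;         // o_p is complete on this same edge
            end
        end
    end
endmodule
```

The trade-off in this sketch is that o_p shows intermediate sums while the multiply is running, so a consumer should sample it only when o_done is high.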

Fixed-Point Arithmetic: Q-Format

Integers can't represent fractions, but full floating-point is expensive (~500 LUTs for a 32-bit FMUL). Fixed-point is the middle way:

Q4.12 format: 16 bits total = 4 integer bits + 12 fractional bits
  Range: -8.0 to +7.9998
  Resolution: 2^-12 = 0.000244

  Example: decimal 3.5 → binary 0011.100000000000 = 0x3800
           (3.5 * 2^12 = 14336 = 0x3800)
Key insight: A Q4.12 multiply is just a 16×16 integer multiply (giving a 32-bit Q8.24 product), followed by a right-shift of 12 to rescale. No special hardware is needed, only integer operations and careful scaling. This is how most signal processing on FPGAs and DSPs was done before floating-point became cheap enough for real-time use, and it remains the norm on small FPGAs.
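
The multiply-then-rescale step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the module name q4_12_mul and its ports are hypothetical, operands are signed, and the result is truncated with no rounding or saturation.

```verilog
// Hypothetical helper: signed Q4.12 x Q4.12 -> Q4.12, truncating, no saturation.
module q4_12_mul (
    input  wire signed [15:0] i_a,   // Q4.12
    input  wire signed [15:0] i_b,   // Q4.12
    output wire signed [15:0] o_p    // Q4.12
);
    // A 16x16 signed multiply yields a 32-bit Q8.24 product.
    wire signed [31:0] w_full = i_a * i_b;
    // Dropping the low 12 bits rescales Q8.24 to Q8.12. Keeping only bits [27:12]
    // also discards the top 4 integer bits, so results outside -8.0 .. +7.9998 wrap.
    assign o_p = w_full[27:12];
endmodule
```

Worked example: 3.5 × 2.0 is 0x3800 × 0x2000 = 0x07000000, and bits [27:12] of that product are 0x7000, which is 28672 / 4096 = 7.0 in Q4.12.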

🧪 You Do — Signed Arithmetic Gotcha

reg  [7:0] a = 8'hFF;    // unsigned
wire [8:0] sum = a + 1;  // what is sum?
RTL diagram of an 8-bit + 1 adder. a[7:0] (orange) and constant 1 feed an 8-bit adder block (yellow, uses SB_CARRY chain) producing sum[8:0] (green). Two interpretation panels at the bottom: green box reads it as unsigned, 255+1 = 256 = 9'h100; red box reads it as signed two's complement, −1+1 = 0 = 9'h000. Same wires, two interpretations.

What does sum equal?

Answer: sum = 9'h100 = 256. a is unsigned, so 8'hFF = 255. Plus 1 = 256. The 9-bit result correctly extends to hold the carry.
The trap: If you had declared a as reg signed [7:0] a = 8'hFF;, then a = -1 (two's complement), and sum = 9'h000 = 0. Same bits, different interpretation. Always use the signed keyword explicitly when you mean signed arithmetic; never rely on the unsigned default.
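
The two interpretations can be checked in simulation. The following is a simulation-only sketch with illustrative names (signed_demo, u, s). It also adds a third case that is an assumption beyond the slide: per the Verilog sizing and signedness rules, mixing a signed operand with an unsigned literal such as 1'b1 makes the whole expression unsigned.

```verilog
// Simulation-only sketch; module and signal names are illustrative.
module signed_demo;
    reg        [7:0] u = 8'hFF;     // unsigned: 255
    reg signed [7:0] s = 8'hFF;     // signed:   -1

    wire [8:0] sum_u   = u + 1;     // zero-extended: 255 + 1 = 9'h100
    wire [8:0] sum_s   = s + 1;     // sign-extended:  -1 + 1 = 9'h000
    wire [8:0] sum_mix = s + 1'b1;  // 1'b1 is unsigned, so s is zero-extended: 9'h100

    initial begin
        #1;  // let the continuous assignments settle
        $display("sum_u=%h sum_s=%h sum_mix=%h", sum_u, sum_s, sum_mix);
        $finish;
    end
endmodule
```

The plain decimal literal 1 is signed, which is why sum_s stays signed, while the sized literal 1'b1 is unsigned and silently changes the result.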

📐 LUT Cost Cheat Sheet (iCE40 LUT4)

Back-of-the-envelope estimates before you run make stat. Numbers are LUT4s on iCE40; + assumes the SB_CARRY chain is used.

Construct | Verilog | ≈ LUT4 count | Why
Single k-input gate | &{a,b,c,...} (k inputs) | ⌈(k−1)/3⌉ | Cascade LUT4s, 3 new inputs each
N-bit adder (carry chain) | a + b | ≈ N | 1 LUT + 1 SB_CARRY per bit
N-bit adder (no carry chain) | hand-rolled ripple | ≈ 2N | Sum + carry both eat LUTs
N-bit equality compare | a == b | ≈ ⌈N/2⌉ + ⌈(⌈N/2⌉−1)/3⌉ | 2 bit pairs per LUT4, then AND-tree
N-bit magnitude compare | a < b | ≈ N | Subtractor, keep only the borrow/sign bit
M:1 mux, 1-bit data | case(sel) | ≈ ⌈2(M−1)/3⌉ | A 4:1 mux has 6 inputs, so 2 LUT4s; build a tree
M:1 mux, W-bit data | indexed select | ≈ W · ⌈2(M−1)/3⌉ | Replicated per bit
N-bit register | always @(posedge clk) | 0 LUTs (N FFs) | FFs are free of LUTs
N×N parallel multiplier | a * b | ≈ N² (no DSP block) | Partial products + carry tree
How to use it: estimate before you synthesize, then run make stat and check. If your guess and the tool disagree by more than ~20%, something interesting happened. Find out what: was the carry chain not inferred? Was a constant folded away? Did a case statement collapse? That gap is where the learning is.
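
As a practice run, here is a small module annotated with estimates from the adder, magnitude-compare, and register rows of the cheat sheet. The module pwm_n is hypothetical and the numbers in the comments are back-of-the-envelope guesses, not synthesis results.

```verilog
// Hypothetical module for practicing estimates; not from the course repo.
module pwm_n #(parameter N = 8) (
    input  wire         i_clk,
    input  wire [N-1:0] i_duty,
    output reg          o_pwm
);
    reg [N-1:0] r_cnt = {N{1'b0}};   // N-bit register: 0 LUTs, N FFs

    always @(posedge i_clk) begin
        r_cnt <= r_cnt + 1'b1;       // N-bit adder on the carry chain: ~N LUTs
        o_pwm <= (r_cnt < i_duty);   // N-bit magnitude compare: ~N LUTs, plus 1 FF
    end
endmodule
// Estimate for N = 8: about 8 + 8 = 16 LUT4s and 9 FFs.
```

If the tool reports something well away from 16 LUTs, that is the prompt to look at what was inferred for the incrementer and the comparator.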
▶ LIVE DEMO

Adder & Multiplier Synthesis Comparison

~5 minutes

▸ COMMANDS

cd lecture_examples/week3_day10/d10_s2_ex1/
make stat WIDTH=8
make stat WIDTH=16
make stat WIDTH=32
# vs sequential multiplier:
make stat_seq WIDTH=16

▸ EXPECTED STDOUT

Parallel mult:
  8-bit:  82 LUTs
  16-bit: 348 LUTs
  32-bit: FAILS (too big)

Sequential mult:
  16-bit: 58 LUTs, 16 cycles

▸ KEY OBSERVATION

Same function, 6× area reduction by accepting 16× latency. When to pick which? Throughput requirement. One-sample-per-100-cycles audio processing? Sequential. One-sample-per-cycle filter tap? Parallel.

🤖 Check the Machine

Ask AI: “I need a 24×24 fixed-point Q8.16 multiplier. Parallel won't fit on iCE40. Design a sequential shift-and-add multiplier and estimate area and latency.”

TASK

AI designs a sequential wide multiplier.

BEFORE

Predict: ~70 LUTs + 24 cycles. Must document Q8.16 scaling.

AFTER

Strong AI discusses Q-format scaling. Weak AI treats it as integer-only.

TAKEAWAY

Fixed-point requires explicit scaling discussion. AI skipping it = bug.

Key Takeaways

+ is a family of architectures. On iCE40, use SB_CARRY (write +, trust tool).

 Multiplication is O(N^2) area. Wide multipliers fill small FPGAs fast.

 Sequential multiplier: time-for-area. Essential when parallel won't fit.

 Fixed-point (Q-format) = integer math + implied binary point + right-shift.

 Signed arithmetic: always declare signed explicitly.

Match the arithmetic architecture to the throughput requirement. Never default to parallel.

🔗 Transfer

PPA: Performance, Power, Area

Video 3 of 4 · ~11 minutes

▸ WHY THIS MATTERS NEXT

You've now made timing and arithmetic choices. Video 3 steps back: every hardware choice sits on three axes — Performance (Fmax, throughput), Power (static + dynamic), Area (LUTs, EBRs, dollars). Real engineering lives in that three-dimensional space. You'll learn to chart your design, do design-space exploration, and report honestly.