Topic 10 · Timing, Numerics & PPA

Numerical Architecture Trade-offs

Video 2 of 4 · ~15 minutes

Dr. Mike Borowczak · Electrical & Computer Engineering · CECS · UCF

Video 2 of 4. What + and * actually build — the architectures behind the operators.

🌍 Where This Lives

In Industry

DSP chips are marketed by “multiply-accumulate operations per second.” A Qualcomm Hexagon DSP does ~10¹² MACs/s because each multiplier is a dedicated block. The first-generation Google TPU has 65,536 multipliers (a 256×256 grid) in one systolic array. Even your phone's voice assistant runs on custom arithmetic units optimized for matrix math. Architecture choices at this layer determine how many FLOPS per watt you get, a central metric of the AI era.

In This Course

Your Topic 5 counter uses a + 1. Your Topic 11 UART uses baud_cnt < HALF_PERIOD. Your capstone with filters or transforms will need real multipliers. Choosing the right arithmetic architecture for the job is the difference between “fits at 25 MHz” and “doesn't fit.”

⚠️ `+` Is Not One Thing

❌ Wrong Model

“The + operator in Verilog always produces the same hardware. What else could ‘addition’ be?”

✓ Right Model

Synthesizers choose among several adder architectures based on context, target, and width: ripple-carry (simple, slow, O(N) delay), carry-lookahead (fast, larger, O(log N) delay), carry-select (in between, O(√N) delay), and Kogge-Stone (fastest, largest, usually ASIC-only). On iCE40, Yosys uses the dedicated SB_CARRY chain, which is essentially a ripple-carry adder accelerated by dedicated hardware.

The receipt: A 32-bit Kogge-Stone adder is ~4× larger but ~4× faster than a 32-bit ripple adder. Different architectures for different optimization targets.

👁️ I Do — Adder Architectures

Architecture	Delay (N-bit)	Area (N-bit)	Use when
Ripple-carry	O(N)	O(N)	Narrow (<16 bits), low clock
iCE40 `SB_CARRY` chain	O(N), fast	O(N)	Default on iCE40 — use this
Carry-lookahead	O(log N)	O(N log N)	ASIC, wide, high clock
Carry-select	O(√N)	O(N)	Middle-ground FPGA designs
Kogge-Stone	O(log N)	O(N log N)	Very wide, aggressive ASIC

My thinking: On iCE40, always write c = a + b and let Yosys use SB_CARRY. The dedicated carry chain is fast (propagates in dedicated routing, not LUTs). Don't hand-roll ripple-carry adders — they're strictly worse. Trust the tool.

Architecture table. On iCE40, the SB_CARRY dedicated chain beats everything hand-written. Always write +, not hand-rolled full adders.
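A minimal sketch of "write +, trust the tool." The module name add_n and its port names are illustrative assumptions, not part of the course repository. The only technique shown is the standard idiom of assigning the sum to an (N+1)-bit concatenation so the carry-out is kept.

```verilog
// Hypothetical module: an N-bit adder written with plain `+`.
// The synthesizer, not the source code, picks the adder architecture.
module add_n #(parameter N = 16) (
    input  wire [N-1:0] i_a, i_b,
    input  wire         i_cin,
    output wire [N-1:0] o_sum,
    output wire         o_cout
);
    // The left-hand side is N+1 bits wide, so the addition is
    // evaluated at N+1 bits and the carry-out is not truncated.
    assign {o_cout, o_sum} = i_a + i_b + i_cin;
endmodule
```

Assuming the file is saved as add_n.v, something like yosys -p "read_verilog add_n.v; synth_ice40 -top add_n; stat" should report the cell counts, including any SB_CARRY cells the tool inferred.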

👁️ I Do — Multiplication: The LUT Explosion

// 8-bit unsigned multiplier
wire [7:0]  a, b;
wire [15:0] product;       // N×N needs 2N bits to hold the full result
assign product = a * b;
// compact source, expensive silicon

4×4 multiply: ~20 LUTs
8×8 multiply: ~80 LUTs
16×16 multiply: ~350 LUTs (27% HX1K!)
32×32 multiply: does not fit

The reality: Multiplication is fundamentally O(N²) in area for a parallel implementation. On a chip with only 1280 LUTs (the iCE40 HX1K), you can only afford 1-2 wide multipliers. This is why DSP-heavy FPGAs (Xilinx 7-series, Intel Cyclone) include hardened multiplier blocks (DSP48 slices on Xilinx, embedded multiplier/DSP blocks on Intel): they offload the area cost from general-purpose LUTs.
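To see the O(N²) growth for yourself, a width-parameterized wrapper is handy. This is a sketch under assumptions: the module name mul_par and the WIDTH parameter are made up here for illustration and may not match the files used in the live demo.

```verilog
// Hypothetical module: parallel unsigned multiplier with a width knob.
// Synthesize at WIDTH = 4, 8, 16 and compare the LUT counts.
module mul_par #(parameter WIDTH = 8) (
    input  wire [WIDTH-1:0]   i_a, i_b,
    output wire [2*WIDTH-1:0] o_p    // full-width product, nothing truncated
);
    assign o_p = i_a * i_b;
endmodule
```

Doubling WIDTH should roughly quadruple the LUT count, which is the pattern in the numbers above.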

🤝 We Do — Sequential Shift-and-Add Multiplier

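module mul_seq #(parameter W = 8) (
    input  wire           i_clk, i_start,
    input  wire [W-1:0]   i_a, i_b,
    output reg  [2*W-1:0] o_p, output reg o_done
);
    reg [W-1:0]  r_a, r_mask;            // multiplicand, remaining multiplier bits
    reg [2*W-1:0] r_sum;                 // running sum of partial products
    reg [$clog2(W)-1:0] r_step;          // current bit position
    reg r_busy;
    // Sum after this step: add the shifted multiplicand if the current multiplier bit is set.
    wire [2*W-1:0] w_next = r_mask[0] ? r_sum + ({{W{1'b0}}, r_a} << r_step) : r_sum;
    always @(posedge i_clk) begin
        if (i_start) begin
            r_a <= i_a; r_sum <= 0; r_step <= 0; r_busy <= 1; o_done <= 0;
            r_mask <= i_b;
        end else if (r_busy) begin
            r_sum  <= w_next;
            r_mask <= r_mask >> 1;
            r_step <= r_step + 1;
            // Latch w_next, not r_sum: r_sum does not yet include the final partial product.
            if (r_step == W-1) begin r_busy <= 0; o_done <= 1; o_p <= w_next; end
        end
    end
endmodule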

Together: Instead of 80 LUTs in parallel, we use ~20 LUTs across W=8 clock cycles. One adder is reused; the state lives in r_* registers. Saves 75% area at the cost of 8× latency. Classic throughput-vs-area tradeoff.
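A sequential multiplier is easy to get subtly wrong on the last step, so it is worth checking exhaustively. The testbench below is a sketch, not part of the course repository: the name tb_mul_seq is hypothetical, and it assumes mul_seq from above is compiled alongside it. It tries every 8-bit operand pair and counts how often o_p differs from the simulator's own a * b.

```verilog
`timescale 1ns/1ps
// Hypothetical self-checking testbench for mul_seq (W = 8).
module tb_mul_seq;
    localparam W = 8;
    reg            clk = 0, start = 0;
    reg  [W-1:0]   a = 0, b = 0;
    wire [2*W-1:0] p;
    wire           done;
    integer i, j, errors;

    mul_seq #(.W(W)) dut (.i_clk(clk), .i_start(start),
                          .i_a(a), .i_b(b), .o_p(p), .o_done(done));

    always #5 clk = ~clk;   // free-running clock, 10 ns period

    initial begin
        errors = 0;
        for (i = 0; i < (1 << W); i = i + 1)
            for (j = 0; j < (1 << W); j = j + 1) begin
                // Drive operands away from the sampling edge, pulse start for one clock.
                @(negedge clk); a = i; b = j; start = 1;
                @(negedge clk); start = 0;
                wait (done);            // o_done was cleared by the start pulse
                @(negedge clk);
                if (p !== i * j) errors = errors + 1;
            end
        $display("mismatches: %0d of %0d", errors, 1 << (2*W));
        $finish;
    end
endmodule
```

A correct multiplier should report zero mismatches. Operands with the top bit of i_b set are the ones to watch, because they exercise the final partial product.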

Fixed-Point Arithmetic: Q-Format

Integers can't represent fractions, but full floating-point is expensive (~500 LUTs for a 32-bit FMUL). Fixed-point is the middle way:

Q4.12 format: 16 bits total = 4 integer bits (including the sign bit) + 12 fractional bits
  Range: -8.0 to +7.9998 (signed two's complement)
  Resolution: 2^-12 = 0.000244

  Example: decimal 3.5 → binary 0011.100000000000 = 0x3800
           (3.5 * 2^12 = 14336 = 0x3800)

Key insight: A Q4.12 multiply is just a 16×16 integer multiply followed by a right-shift of 12 to rescale: the raw 32-bit product has 24 fractional bits, and the shift brings it back to 12. No special hardware is needed, just integer operations and careful scaling. This is how signal processing on FPGAs and DSPs was traditionally done, before floating-point was cheap enough for real-time use.
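The multiply-then-rescale step can be sketched in a few lines. The module name qmul_q4_12 is hypothetical, and this version simply truncates: it does no rounding and no saturation, so a result outside the Q4.12 range wraps around.

```verilog
// Hypothetical module: signed Q4.12 x Q4.12 -> Q4.12, truncating.
module qmul_q4_12 (
    input  wire signed [15:0] i_a, i_b,   // Q4.12 operands
    output wire signed [15:0] o_p         // Q4.12 product (no rounding, no saturation)
);
    // Full-precision product: Q4.12 * Q4.12 = Q8.24, held in 32 bits.
    wire signed [31:0] w_full = i_a * i_b;
    // Dropping the low 12 bits is the right-shift by 12 (24 -> 12 fractional bits).
    // Keeping bits [27:12] also discards the upper 4 integer bits.
    assign o_p = w_full[27:12];
endmodule
```

Worked check: 3.5 × 2.0 is 0x3800 × 0x2000 = 0x07000000, and bits [27:12] of that are 0x7000, which is 28672 / 4096 = 7.0. By contrast, 3.5 × 3.5 = 12.25 does not fit in Q4.12, so this sketch would return a wrapped value. A real design has to choose between saturating and widening the result.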

🧪 You Do — Signed Arithmetic Gotcha

reg  [7:0] a = 8'hFF;    // unsigned
wire [8:0] sum = a + 1;  // what is sum?

What does sum equal?

Answer: sum = 9'h100 = 256. a is unsigned, so 8'hFF = 255. Plus 1 = 256. The 9-bit result correctly extends to hold the carry.

The trap: If you'd declared a as reg signed [7:0] a = 8'hFF;, then a = -1 (two's complement), and sum = 9'h000 = 0. Same bits, different interpretation. Always use the signed keyword explicitly when you mean signed arithmetic; never leave it to the implied (unsigned) default.

The signed/unsigned bug is one of the most common arithmetic bugs in Verilog. Explicit signed declaration is mandatory in any signed-arithmetic design.
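Both cases, plus one more trap, fit in a tiny simulation. This is a sketch with a hypothetical testbench name. The third line shows that a single unsigned operand, here the sized literal 1'b1, makes the whole expression unsigned even though s is declared signed.

```verilog
// Hypothetical testbench: same bits, three different sums.
module tb_signed_gotcha;
    reg        [7:0] u = 8'hFF;      // unsigned: 255
    reg signed [7:0] s = 8'hFF;      // signed:   -1

    wire [8:0] sum_u   = u + 1;      // 255 + 1, zero-extended
    wire [8:0] sum_s   = s + 1;      // -1 + 1, sign-extended (plain 1 is signed)
    wire [8:0] sum_mix = s + 1'b1;   // 1'b1 is unsigned, so s is zero-extended

    initial begin
        #1;
        // prints: sum_u=100 sum_s=000 sum_mix=100
        $display("sum_u=%h sum_s=%h sum_mix=%h", sum_u, sum_s, sum_mix);
    end
endmodule
```

Declaring the variable signed is necessary but not sufficient: every operand in the expression has to be signed, or the arithmetic silently falls back to unsigned.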

📐 LUT Cost Cheat Sheet (iCE40 LUT4)

Back-of-the-envelope estimates before you run make stat. Numbers are LUT4s on iCE40; + assumes the SB_CARRY chain is used.

Construct	Verilog	≈ LUT4 count	Why
Single k-input gate	`&{a,b,c,...}` (k inputs)	⌈(k−1)/3⌉	Cascade LUT4s, 3 new inputs each
N-bit adder (carry chain)	`a + b`	≈ N	1 LUT + 1 SB_CARRY per bit
N-bit adder (no carry chain)	hand-rolled ripple	≈ 2N	Sum + carry both eat LUTs
N-bit equality compare	`a == b`	≈ ⌈N/3⌉ + ⌈N/9⌉	XOR per bit, then OR-tree
N-bit magnitude compare	`a < b`	≈ N	Subtractor + sign bit
M:1 mux, 1-bit data	`case(sel)`	≈ ⌈2(M−1)/3⌉	A 4:1 mux has 6 inputs, so it takes 2 LUT4s; build a tree of those
M:1 mux, W-bit data	indexed select	≈ W · ⌈2(M−1)/3⌉	Replicated per bit
N-bit register	`always @(posedge clk)`	0 LUTs (N FFs)	FFs are free of LUTs
N×N parallel multiplier	`a * b`	≈ N² (no DSP block)	Partial products + carry tree

How to use it: estimate before you synthesize, then run make stat and check. If your guess and the tool disagree by more than ~20%, something interesting happened. Find out what: was the carry chain not inferred? Was a constant folded away? Did a case statement collapse? That gap is where the learning is.

This is the table I wish I'd had on day one. None of these numbers are exact — synthesis will fold constants, share logic, and surprise you — but they're close enough to make a prediction before you hit the tool. The k-input gate row is the cascade rule from Topic 1. The adder row tells you why an N-bit add costs roughly N LUTs on iCE40, not 2N: the carry chain takes half the work into dedicated routing. The compare and mux rows tell you why a wide case statement or a giant priority encoder can dominate your area. The multiplier row is the punchline of this whole video: parallel multiplication scales as N squared, which is why a 16x16 multiplier is already a quarter of the iCE40. Use this table to predict, then use yosys stat to verify. The gap between prediction and reality is where you actually learn how the synthesizer thinks.
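Here is the predict-then-verify loop on a small example. The module est_demo is hypothetical, and the counts in the comments are predictions read off the adder, magnitude-compare, and register rows of the table, not measured results.

```verilog
// Hypothetical module for practicing estimates (N = 16).
module est_demo (
    input  wire        i_clk,
    input  wire [15:0] i_a, i_b,
    output reg  [15:0] o_sum,
    output reg         o_lt
);
    always @(posedge i_clk) begin
        o_sum <= i_a + i_b;   // adder row:             about 16 LUT4
        o_lt  <= i_a < i_b;   // magnitude-compare row: about 16 LUT4
    end                       // register row:          17 FFs, 0 LUT4
endmodule
// Prediction: about 32 LUT4 and 17 FFs. Synthesize and compare.
```

If the tool reports something well outside roughly 26 to 38 LUT4s, that is the ~20% gap worth investigating.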

▶ LIVE DEMO

Adder & Multiplier Synthesis Comparison

~5 minutes

▸ COMMANDS

cd lecture_examples/week3_day10/d10_s2_ex1/
make stat WIDTH=8
make stat WIDTH=16
make stat WIDTH=32
# vs sequential multiplier:
make stat_seq WIDTH=16

▸ EXPECTED STDOUT

Parallel mult:
  8-bit:  82 LUTs
  16-bit: 348 LUTs
  32-bit: FAILS (too big)

Sequential mult:
  16-bit: 58 LUTs, 16 cycles

▸ KEY OBSERVATION

Same function, 6× area reduction by accepting 16× latency. When to pick which? Throughput requirement. One-sample-per-100-cycles audio processing? Sequential. One-sample-per-cycle filter tap? Parallel.

🤖 Check the Machine

Ask AI: “I need a 24×24 fixed-point Q8.16 multiplier. Parallel won't fit on iCE40. Design a sequential shift-and-add multiplier and estimate area and latency.”

TASK

AI designs a sequential wide multiplier.

BEFORE

Predict: ~70 LUTs + 24 cycles. Must document Q8.16 scaling.

AFTER

Strong AI discusses Q-format scaling. Weak AI treats it as integer-only.

TAKEAWAY

Fixed-point requires explicit scaling discussion. AI skipping it = bug.

Key Takeaways

① + is a family of architectures. On iCE40, use SB_CARRY (write +, trust tool).

② Multiplication is O(N²) area. Wide multipliers fill small FPGAs fast.

③ Sequential multiplier: time-for-area. Essential when parallel won't fit.

④ Fixed-point (Q-format) = integer math + implied binary point + right-shift.

⑤ Signed arithmetic: always declare signed explicitly.

Match the arithmetic architecture to the throughput requirement. Never default to parallel.

🔗 Transfer

PPA: Performance, Power, Area

Video 3 of 4 · ~11 minutes

▸ WHY THIS MATTERS NEXT

You've now made timing and arithmetic choices. Video 3 steps back: every hardware choice sits on three axes — Performance (Fmax, throughput), Power (static + dynamic), Area (LUTs, EBRs, dollars). Real engineering lives in that three-dimensional space. You'll learn to chart your design, do design-space exploration, and report honestly.
