Video 2 of 4 · ~15 minutes
Dr. Mike Borowczak · Electrical & Computer Engineering · CECS · UCF
Every DSP vendor markets its chips by “multiply-accumulate operations per second.” A Qualcomm Hexagon DSP does ~10^12 MACs/s because each multiplier is a dedicated block. A Google TPU has 65,536 multipliers in one systolic array. Even your phone's voice assistant runs on custom arithmetic units optimized for matrix math. Architecture choices at this layer determine how many FLOPS per watt you get, the central metric of the AI era.
Your Topic 5 counter uses a + 1. Your Topic 11 UART uses baud_cnt < HALF_PERIOD. Your capstone with filters or transforms will need real multipliers. Choosing the right arithmetic architecture for the job is the difference between “fits at 25 MHz” and “doesn't fit.”
+ Is Not One Thing
“The + operator in Verilog always produces the same hardware. What else could ‘addition’ be?”
Synthesizers choose among several adder architectures based on context, target, and width: ripple-carry (simple, slow — O(N) delay), carry-lookahead (fast, larger — O(log N) delay), carry-select (medium), Kogge-Stone (fastest, largest, ASIC-only usually). On iCE40, Yosys uses the dedicated SB_CARRY chain — essentially fast ripple with hardware acceleration.
| Architecture | Delay (N-bit) | Area (N-bit) | Use when |
|---|---|---|---|
| Ripple-carry | O(N) | O(N) | Narrow (<16 bits), low clock |
| iCE40 SB_CARRY chain | O(N), fast | O(N) | Default on iCE40; use this |
| Carry-lookahead | O(log N) | O(N log N) | ASIC, wide, high clock |
| Carry-select | O(√N) | O(N) | Middle-ground FPGA designs |
| Kogge-Stone | O(log N) | O(N log N) | Very wide, aggressive ASIC |
Write c = a + b and let Yosys use SB_CARRY. The dedicated carry chain is fast because carries propagate in dedicated routing, not through LUTs. Don't hand-roll ripple-carry adders; on this target they're strictly worse. Trust the tool.
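As a minimal sketch of the "write +, trust the tool" advice, a parameterized adder might look like the following. The module name add_n and its port names are illustrative, not taken from the course repository.

```verilog
// Hypothetical example module: an N-bit adder written with plain "+".
// On iCE40, Yosys is expected to map this onto the SB_CARRY chain.
module add_n #(parameter N = 16) (
  input  wire [N-1:0] i_a,
  input  wire [N-1:0] i_b,
  output wire [N:0]   o_sum   // one extra bit so the carry out is not lost
);
  // The N+1-bit left-hand side widens the addition, so bit N is the carry.
  assign o_sum = i_a + i_b;
endmodule
```

Declaring the output one bit wider than the operands is the design choice worth noticing: it keeps the carry without any explicit carry logic in the source.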
// 8-bit unsigned multiplier
assign product = a * b;
// compact source, expensive silicon
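// Sequential shift-and-add multiplier: W cycles per product, one adder reused.
module mul_seq #(parameter W = 8) (
  input  wire           i_clk, i_start,
  input  wire [W-1:0]   i_a, i_b,
  output reg  [2*W-1:0] o_p,
  output reg            o_done
);
  reg [W-1:0]         r_a, r_mask;   // multiplicand, remaining multiplier bits
  reg [2*W-1:0]       r_sum;         // running sum of partial products
  reg [$clog2(W)-1:0] r_step;        // index of the multiplier bit being processed
  reg                 r_busy;

  // Running sum including this cycle's partial product (added only if the bit is set).
  wire [2*W-1:0] w_next = r_mask[0] ? r_sum + ({{W{1'b0}}, r_a} << r_step) : r_sum;

  always @(posedge i_clk) begin
    if (i_start) begin
      r_a <= i_a; r_sum <= 0; r_step <= 0; r_busy <= 1; o_done <= 0;
      r_mask <= i_b;
    end else if (r_busy) begin
      r_sum  <= w_next;
      r_mask <= r_mask >> 1;
      r_step <= r_step + 1;
      // Publish w_next, not r_sum: r_sum still holds the value from before
      // this cycle's add, so using it would drop the final partial product.
      if (r_step == W-1) begin r_busy <= 0; o_done <= 1; o_p <= w_next; end
    end
  end
endmodule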
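The shift-and-add multiplier above can be exercised with a small self-checking testbench. The sketch below assumes the mul_seq ports exactly as declared above; the testbench name, the run task, and the chosen operand pairs are illustrative. The 1 × 128 case is included because its only partial product comes from the most significant multiplier bit, which is handled on the final step.

```verilog
`timescale 1ns/1ps
// Hypothetical self-checking testbench for mul_seq (W = 8).
module tb_mul_seq;
  localparam W = 8;
  reg            i_clk = 0, i_start = 0;
  reg  [W-1:0]   i_a = 0, i_b = 0;
  wire [2*W-1:0] o_p;
  wire           o_done;
  integer        errors = 0;

  mul_seq #(.W(W)) dut (
    .i_clk(i_clk), .i_start(i_start),
    .i_a(i_a), .i_b(i_b), .o_p(o_p), .o_done(o_done)
  );

  always #5 i_clk = ~i_clk;

  // Start one multiplication, wait for o_done, compare against a * b.
  task run(input [W-1:0] a, input [W-1:0] b);
    begin
      @(negedge i_clk); i_a = a; i_b = b; i_start = 1;
      @(negedge i_clk); i_start = 0;
      wait (o_done);
      @(negedge i_clk);
      if (o_p !== a * b) begin
        errors = errors + 1;
        $display("MISMATCH %0d * %0d: got %0d", a, b, o_p);
      end
    end
  endtask

  initial begin
    run(8'd0,   8'd0);
    run(8'd1,   8'd128);   // partial product from the MSB of the multiplier
    run(8'd13,  8'd11);
    run(8'd255, 8'd255);
    if (errors == 0) $display("all tests passed");
    $finish;
  end
endmodule
```

With a simulator such as Icarus Verilog, this could be run with something like `iverilog -o sim tb_mul_seq.v mul_seq.v && vvp sim`, where the file names are assumptions.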
Integers can't represent fractions, but full floating-point is expensive (~500 LUTs for a 32-bit FMUL). Fixed-point is the middle way:
Q4.12 format: 16 bits total = 4 integer bits + 12 fractional bits
Range: -8.0 to +7.9998
Resolution: 2^-12 = 0.000244
Example: decimal 3.5 → binary 0011.100000000000 = 0x3800
(3.5 * 2^12 = 14336 = 0x3800)
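To make the Q4.12 format concrete, here is a minimal sketch of a Q4.12 × Q4.12 multiply. The module and signal names are illustrative, and the sketch deliberately truncates without rounding or saturation. Multiplying two Q4.12 values gives a Q8.24 product, so the Q4.12 result is recovered by dropping the low 12 fractional bits (a right shift by 12) and keeping 16 bits.

```verilog
// Hypothetical example: signed Q4.12 multiply, truncating, no saturation.
module q4_12_mul (
  input  wire signed [15:0] i_a,   // Q4.12
  input  wire signed [15:0] i_b,   // Q4.12
  output wire signed [15:0] o_p    // Q4.12
);
  // 16 x 16 signed multiply: the full product is Q8.24 in 32 bits.
  wire signed [31:0] w_full = i_a * i_b;

  // Bits [27:12] are the Q8.24 product shifted right by 12, truncated to 16 bits.
  // Results outside -8.0 .. +7.9998 overflow silently in this sketch.
  assign o_p = w_full[27:12];
endmodule
```

Worked check: 3.5 is 0x3800 and 0.5 is 0x0800. Their integer product is 14336 × 2048 = 0x01C00000, and bits [27:12] give 0x1C00 = 7168, which is 7168 / 4096 = 1.75.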
reg [7:0] a = 8'hFF; // unsigned
wire [8:0] sum = a + 1; // what is sum?
What does sum equal?
a is unsigned, so 8'hFF = 255. Plus 1 = 256. The 9-bit result correctly extends to hold the carry.
If you instead declare a as reg signed [7:0] a = 8'hFF;, then a = -1 (two's complement), and sum = 9'h000 = 0. Same bits, different interpretation. Always use the signed keyword explicitly when you mean signed arithmetic; never leave it to the implied default.
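The two interpretations can be compared side by side in simulation. This is a small sketch with illustrative names (tb_signed_demo, a_u, a_s); it prints each sum in hex.

```verilog
// Hypothetical demo: same bit pattern, unsigned vs. signed declaration.
module tb_signed_demo;
  reg        [7:0] a_u = 8'hFF;   // unsigned: 255
  reg signed [7:0] a_s = 8'hFF;   // signed:   -1

  // The operands are extended before the add: zero-extended for a_u,
  // sign-extended for a_s, because the literal 1 is itself signed.
  wire [8:0] sum_u = a_u + 1;
  wire [8:0] sum_s = a_s + 1;

  initial begin
    #1;
    $display("unsigned: sum = %h", sum_u);  // prints 100 (256)
    $display("signed:   sum = %h", sum_s);  // prints 000 (0)
    $finish;
  end
endmodule
```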
Back-of-the-envelope estimates before you run make stat.
Numbers are LUT4s on iCE40; + assumes the SB_CARRY chain is used.
| Construct | Verilog | ≈ LUT4 count | Why |
|---|---|---|---|
| Single k-input gate | &{a,b,c,...} (k inputs) | ⌈(k−1)/3⌉ | Cascade LUT4s, 3 new inputs each |
| N-bit adder (carry chain) | a + b | ≈ N | 1 LUT + 1 SB_CARRY per bit |
| N-bit adder (no carry chain) | hand-rolled ripple | ≈ 2N | Sum + carry both eat LUTs |
| N-bit equality compare | a == b | ≈ ⌈N/3⌉ + ⌈N/9⌉ | XOR per bit, then OR-tree |
| N-bit magnitude compare | a < b | ≈ N | Subtractor + sign bit |
| M:1 mux, 1-bit data | case(sel) | ≈ ⌈(M−1)/3⌉ | Tree of 4:1 muxes per output bit |
| M:1 mux, W-bit data | indexed select | ≈ W · ⌈(M−1)/3⌉ | Replicated per bit |
| N-bit register | always @(posedge clk) | 0 LUTs (N FFs) | FFs are free of LUTs |
| N×N parallel multiplier | a * b | ≈ N² (no DSP block) | Partial products + carry tree |
Then run make stat and check.
If your guess and the tool disagree by more than ~20%, something interesting happened. Find out what: was the carry chain not inferred? Was a constant folded away? Did a case statement collapse? That gap is where the learning is.
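One way to practice the estimate-then-check loop is a small module that combines a few rows of the table. The sketch below uses illustrative names, and the counts in the comments are predictions from the table above, not measured results.

```verilog
// Hypothetical estimation exercise: predict the LUT4 count, then synthesize.
module est_demo (
  input  wire        i_clk,
  input  wire [15:0] i_a,
  input  wire [15:0] i_b,
  output reg  [15:0] o_sum,   // 16-bit adder on the carry chain: predict ~16 LUT4
  output reg         o_lt     // 16-bit magnitude compare:        predict ~16 LUT4
);
  // 17 output flip-flops: predict 0 LUT4 for the registers themselves.
  always @(posedge i_clk) begin
    o_sum <= i_a + i_b;
    o_lt  <= i_a < i_b;
  end
endmodule
// Prediction from the table: roughly 32 LUT4 and 17 FFs in total.
```

Outside the course Makefile, a comparable report can be obtained with something like `yosys -p "synth_ice40 -top est_demo; stat" est_demo.v`, where the file name is an assumption.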
~5 minutes
▸ COMMANDS
cd lecture_examples/week3_day10/d10_s2_ex1/
make stat WIDTH=8
make stat WIDTH=16
make stat WIDTH=32
# vs sequential multiplier:
make stat_seq WIDTH=16
▸ EXPECTED STDOUT
Parallel mult:
8-bit: 82 LUTs
16-bit: 348 LUTs
32-bit: FAILS (too big)
Sequential mult:
16-bit: 58 LUTs, 16 cycles
▸ KEY OBSERVATION
Same function, 6× area reduction by accepting 16× latency. When to pick which? Throughput requirement. One-sample-per-100-cycles audio processing? Sequential. One-sample-per-cycle filter tap? Parallel.
Ask AI: “I need a 24×24 fixed-point Q8.16 multiplier. Parallel won't fit on iCE40. Design a sequential shift-and-add multiplier and estimate area and latency.”
TASK
AI designs a sequential wide multiplier.
BEFORE
Predict: ~70 LUTs + 24 cycles. Must document Q8.16 scaling.
AFTER
Strong AI discusses Q-format scaling. Weak AI treats it as integer-only.
TAKEAWAY
Fixed-point requires explicit scaling discussion. AI skipping it = bug.
① + is a family of architectures. On iCE40, use SB_CARRY (write +, trust tool).
② Multiplication is O(N²) area. Wide multipliers fill small FPGAs fast.
③ Sequential multiplier: time-for-area. Essential when parallel won't fit.
④ Fixed-point (Q-format) = integer math + virtual decimal + right-shift.
⑤ Signed arithmetic: always declare signed explicitly.
🔗 Transfer
Video 3 of 4 · ~11 minutes
▸ WHY THIS MATTERS NEXT
You've now made timing and arithmetic choices. Video 3 steps back: every hardware choice sits on three axes — Performance (Fmax, throughput), Power (static + dynamic), Area (LUTs, EBRs, dollars). Real engineering lives in that three-dimensional space. You'll learn to chart your design, do design-space exploration, and report honestly.