A pipelined processor doesn’t execute one instruction at a time. Five instructions are in flight simultaneously, each at a different stage: one being fetched, one decoded, one executing, one accessing memory, one writing its result. This works beautifully — until one instruction produces a value that the very next instruction needs to read.
Consider two consecutive add instructions where the second uses the result of the first. The first instruction computes its result in the Execute stage. In that same cycle, the second instruction is already in Decode, reading its operands from the register file. Without intervention, the second instruction reads a stale value, because the first will not write its result back until it reaches Writeback two stages later.
This tutorial dissects the bypass module — sixty lines of Verilog that solve this problem without ever stalling the pipeline for register-to-register dependencies. It detects the conflict in real time and reroutes the computed value directly from the later pipeline stage back to the ALU input, before the clock edge commits the wrong answer.
This technique is called “data forwarding” or “bypassing.” It’s one of the three canonical solutions to data hazards in pipelined processors (the others being stalling and compiler-level instruction reordering). Nearly every modern pipelined CPU, from microcontroller-class cores such as the ARM Cortex-M4 to Apple’s M-series, implements some form of it.
What You Need to Know First
- The five stages of a classic RISC pipeline (Fetch, Decode, Execute, Memory, Writeback)
- How an instruction’s opcode, source registers, and destination register are encoded in a 32-bit word
- The role of pipeline latches (registers that separate stages)
Key instruction format for this ISA:
| Bits | Field | Description |
|---|---|---|
| 31–27 | Opcode | Instruction type (5 bits) |
| 26–22 | RD/RT | Destination register; read as source B by I-type instructions |
| 21–17 | RS | Source register A |
| 16–12 | RT/RD | Source register B (R-type) |
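To make the field layout concrete, here is a minimal sketch of a decoder that slices a 32-bit instruction word along the bit ranges in the table. The module name and port names are hypothetical and do not come from the project sources; only the bit positions are taken from the table.

```verilog
// Illustrative sketch only: module and port names are assumptions,
// the bit ranges follow the instruction format table.
module inst_fields_sketch (
    input  wire [31:0] inst,
    output wire [4:0]  opcode,    // bits 31-27: instruction type
    output wire [4:0]  field_hi,  // bits 26-22: destination register
    output wire [4:0]  rs,        // bits 21-17: source register A
    output wire [4:0]  field_lo   // bits 16-12: source register B for R-type
);
    // Pure wiring: each field is a fixed slice of the instruction word.
    assign opcode   = inst[31:27];
    assign field_hi = inst[26:22];
    assign rs       = inst[21:17];
    assign field_lo = inst[16:12];
endmodule
```

For example, driving `inst` with `{5'd0, 5'd3, 5'd1, 5'd2, 12'd0}` yields `opcode = 0`, `field_hi = 3`, `rs = 1`, and `field_lo = 2`.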
Files under discussion:
- modules/processor/bypass.v (60 lines)
- modules/processor/stall.v (26 lines)
- modules/processor/processor.v (lines 64–68, 167–168, 230–231)
The Bypass Module as a Three-Way Traffic Cop
Think of the pipeline as five workers on an assembly line. Worker 3 (Execute) just finished computing a value and is about to pass it forward. But Worker 2 (Decode) is simultaneously reading the old version of that same value from the shared parts shelf (register file).
The bypass module is a supervisor standing between Worker 2 and the parts shelf. Before Worker 2 grabs a value, the supervisor checks: “Did Worker 3 or Worker 4 just produce a newer version of this part?” If so, the supervisor intercepts the read and hands over the fresh value instead.
flowchart TD
DX["Decode/Execute Stage\n(needs operands)"]
XM["Execute/Memory Stage\n(has fresh result)"]
MW["Memory/Writeback Stage\n(has older result)"]
RF["Register File\n(has oldest value)"]
MUX["4:1 Bypass Mux"]
ALU["ALU Input"]
XM -->|"Priority 1: xm_O"| MUX
MW -->|"Priority 2: data_writeReg"| MUX
RF -->|"Priority 3: dx_A / dx_B"| MUX
MUX --> ALU
DX -.->|"Bypass select\n(from bypass.v)"| MUX
Inside the Bypass Decision Engine
Step 1: Extract register addresses from the pipeline stages
Each pipeline stage carries a full copy of the instruction it’s executing. The bypass module extracts the destination register from the Execute/Memory (XM) and Memory/Writeback (MW) stages, and the source registers from the current Decode/Execute (DX) stage:
wire [4:0] dx_rs = dx_inst[21:17]; // source A
wire [4:0] dx_rt = dx_inst[26:22]; // source B (I-type)
wire [4:0] dx_rd = dx_inst[16:12]; // source B (R-type)
wire [4:0] xm_rd = (xm_is_setx || xm_ovf) ? 5'd30 : xm_inst[26:22];
wire [4:0] mw_rd = (mw_is_setx || mw_ovf) ? 5'd30 : mw_inst[26:22];
The override for setx and overflow forces the destination register to R30 (the status register) when those conditions are active. This ensures the forwarding network sends the status value to any subsequent instruction reading R30, regardless of what the instruction’s RD field says.
Step 2: Resolve which operand slot to use for source B
R-type instructions encode their second operand in bits [16:12], while I-type instructions use bits [26:22]. A single mux resolves this before the comparison begins:
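// dx_op is the opcode field of the DX instruction (bits 31-27 in the format table)
wire dx_is_r_inst = (dx_op == 5'b00000); // R-type instructions share opcode 00000
wire [4:0] dx_r2 = dx_is_r_inst ? dx_rd : dx_rt; // R-type: bits [16:12]; I-type: bits [26:22]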
Step 3: Detect the conflict with two guards
The core comparison is simple: does the DX source register match the XM or MW destination? Each of the two comparisons carries a guard that rejects false matches on R0:
wire dx_rs_xm_rd_eq = (dx_rs == xm_rd) && (xm_rd != 5'b0);
wire dx_rs_mw_rd_eq = (dx_rs == mw_rd) && (mw_rd != 5'b0);
The != 5'b0 guard is essential. R0 is hardwired to zero in this ISA, so even if an instruction names R0 as its destination, the result is discarded. Without this guard, an instruction reading R0 would receive a forwarded value, instead of the constant zero, whenever an older instruction still in XM or MW has a destination field of 0.
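The snippet above covers only source A. A minimal sketch of the full comparison stage might look like the following, assuming source B is checked the same way using the resolved dx_r2 from Step 2. That symmetry is an assumption, as are the module name and the output names; the source does not show the source B comparisons.

```verilog
// Illustrative sketch: guarded match detection for both ALU operands.
// Output names and the source-B comparisons are assumptions.
module hazard_match_sketch (
    input  wire [4:0] dx_rs,      // source A of the instruction in DX
    input  wire [4:0] dx_r2,      // resolved source B of the instruction in DX
    input  wire [4:0] xm_rd,      // destination of the instruction in XM
    input  wire [4:0] mw_rd,      // destination of the instruction in MW
    output wire       a_hits_xm,
    output wire       a_hits_mw,
    output wire       b_hits_xm,
    output wire       b_hits_mw
);
    // A destination of R0 never counts as a match, because R0 always reads as zero.
    wire xm_valid = (xm_rd != 5'd0);
    wire mw_valid = (mw_rd != 5'd0);

    assign a_hits_xm = xm_valid && (dx_rs == xm_rd);
    assign a_hits_mw = mw_valid && (dx_rs == mw_rd);
    assign b_hits_xm = xm_valid && (dx_r2 == xm_rd);
    assign b_hits_mw = mw_valid && (dx_r2 == mw_rd);
endmodule
```

Factoring the nonzero check into `xm_valid` and `mw_valid` keeps the guard in one place, so it cannot be forgotten on one of the four comparisons.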
Step 4: Apply priority and drive the mux select
When both XM and MW stages have matching destinations, the XM result is newer and takes priority. Tri-state drivers encode this:
wire bypassA0 = !xm_is_sw && !xm_is_branch && dx_rs_xm_rd_eq; // XM hit
wire bypassA1 = !mw_is_sw && !mw_is_branch && dx_rs_mw_rd_eq; // MW hit
tri_state2 sA0(.out(bypass_selectA), .in(2'd0), .en(bypassA0));
tri_state2 sA1(.out(bypass_selectA), .in(2'd1), .en(bypassA1 && !bypassA0));
tri_state2 sA2(.out(bypass_selectA), .in(2'd2), .en(!bypassA0 && !bypassA1));
Stores and branches are excluded (!xm_is_sw && !xm_is_branch) because they don’t produce register results — forwarding from them would inject garbage into the ALU.
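The three tri-state drivers have mutually exclusive enables, so exactly one of them drives bypass_selectA at any time. The same priority can be written as a nested conditional, which is a different technique from the one in bypass.v but should produce the same select value. This sketch assumes tri_state2 drives its input onto the output when en is high and leaves it floating otherwise; its definition is not shown in the source, and the module name below is hypothetical.

```verilog
// Illustrative alternative to the tri-state encoding: a priority chain.
// Not the form used in bypass.v; select values match the text
// (0 = XM, 1 = MW, 2 = register file).
module bypass_select_sketch (
    input  wire       hit_xm,   // corresponds to bypassA0 in the text
    input  wire       hit_mw,   // corresponds to bypassA1 in the text
    output wire [1:0] select
);
    // XM is checked first, so it wins when both stages match.
    assign select = hit_xm ? 2'd0 :
                    hit_mw ? 2'd1 :
                             2'd2;
endmodule
```

Many FPGA fabrics have no internal tri-state buffers, and synthesis tools typically convert drivers like these into multiplexer logic anyway, so the two forms tend to end up as similar hardware.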
Step 5: The mux at the ALU input
In processor.v, the bypass select drives a four-input mux at each ALU operand:
mux_4 alu_mux1(
.out(alu_A_in),
.select(alu_bypass_selectA),
.in0(xm_O), // select=0: forward from XM stage
.in1(data_writeReg), // select=1: forward from MW stage
.in2(dx_A), // select=2: use original register value
.in3(32'b0) // select=3: unused
);
The entire forwarding decision — conflict detection, priority encoding, mux selection — executes combinationally within a single clock cycle. No pipeline stall.
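The definition of mux_4 is not shown here. As a rough idea of what such a block does, the following is a behavioral sketch of a 32-bit four-input mux with the same port names as the instantiation above. The module name and the case-statement implementation are assumptions, not the project's actual mux_4.

```verilog
// Illustrative behavioral 4:1 mux; the project's real mux_4 may be
// built differently (for example, structurally from 2:1 muxes).
module mux_4_sketch (
    input  wire [1:0]  select,
    input  wire [31:0] in0, in1, in2, in3,
    output reg  [31:0] out
);
    // Purely combinational: out follows the selected input with no clock.
    always @(*) begin
        case (select)
            2'd0:    out = in0;  // forward from XM
            2'd1:    out = in1;  // forward from MW
            2'd2:    out = in2;  // original register value
            default: out = in3;  // select = 3, unused by the bypass logic
        endcase
    end
endmodule
```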
When Forwarding Isn’t Enough: The Load-Use Stall
There’s one case where forwarding fails. A load instruction doesn’t produce its result until the end of the Memory stage. If the very next instruction needs that value in Execute, even the XM forwarding path is too late — the data hasn’t arrived yet.
The stall module detects this specific pattern and freezes the pipeline for exactly one cycle:
assign stall = dx_lw
&& fd_uses_alu
&& ((fd_rs == dx_rd)
|| ((fd_rt == dx_rd) && !fd_sw));
The stall signal must engage before the bypass mux commits to a forwarding selection. In processor.v, both are computed in parallel, and the stall prevents the DX latch from advancing — so even if the bypass logic produces a selection, the downstream stage never sees it. If the stall were delayed by one cycle, a wrong value would propagate through the ALU silently.
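To see the stall condition in action, here is a small self-checking sketch that wraps the equation above in a module and exercises three cases. The module names, the testbench, and the chosen register numbers are all hypothetical. The comment on the store exemption reflects one plausible reading (a store's data operand is not needed by the ALU), which the source does not state explicitly.

```verilog
// Illustrative wrapper around the stall equation from the text.
module stall_detect_sketch (
    input  wire       dx_lw,        // instruction in DX is a load
    input  wire       fd_uses_alu,  // instruction in FD consumes register operands
    input  wire       fd_sw,        // instruction in FD is a store
    input  wire [4:0] fd_rs,
    input  wire [4:0] fd_rt,
    input  wire [4:0] dx_rd,        // destination of the load in DX
    output wire       stall
);
    assign stall = dx_lw && fd_uses_alu &&
                   ((fd_rs == dx_rd) || ((fd_rt == dx_rd) && !fd_sw));
endmodule

// Minimal testbench: prints a FAIL line if any case disagrees.
module stall_detect_tb;
    reg       dx_lw, fd_uses_alu, fd_sw;
    reg [4:0] fd_rs, fd_rt, dx_rd;
    wire      stall;

    stall_detect_sketch dut (
        .dx_lw(dx_lw), .fd_uses_alu(fd_uses_alu), .fd_sw(fd_sw),
        .fd_rs(fd_rs), .fd_rt(fd_rt), .dx_rd(dx_rd), .stall(stall)
    );

    initial begin
        // Load into r5, next instruction reads r5 as source A: stall.
        dx_lw = 1'b1; fd_uses_alu = 1'b1; fd_sw = 1'b0;
        fd_rs = 5'd5; fd_rt = 5'd2; dx_rd = 5'd5;
        #1 if (stall !== 1'b1) $display("FAIL: load-use on source A should stall");

        // Same load, but r5 is only the second operand of a store: no stall.
        fd_sw = 1'b1; fd_rs = 5'd1; fd_rt = 5'd5;
        #1 if (stall !== 1'b0) $display("FAIL: store operand should not stall");

        // Instruction in DX is not a load: never stall.
        dx_lw = 1'b0; fd_sw = 1'b0; fd_rs = 5'd5;
        #1 if (stall !== 1'b0) $display("FAIL: non-load should not stall");

        $display("stall checks finished");
        $finish;
    end
endmodule
```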
Why This Is Fast Enough at 100 MHz
The forwarding network adds combinational logic between the pipeline latches and the ALU. This logic — two five-bit comparators, two AND gates, a four-input mux — must settle within one clock half-period (5 ns at 100 MHz).
The worst-case critical path for operand A:
- XM latch outputs xm_inst[26:22] (destination register)
- Five-bit equality comparator: dx_rs == xm_rd (~2 gate delays)
- AND gate with !xm_is_sw exclusion (~1 gate delay)
- Tri-state enable on bypass select (~1 gate delay)
- Four-input mux selects xm_O (~2 gate delays)
- ALU receives alu_A_in
Total: approximately six gate delays. At 100 MHz with latches sampling on the falling edge, this leaves margin even on a moderately slow FPGA fabric. The logic is shallow — just comparators and a mux, no sequential elements or memory accesses in the path.
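As a rough sanity check on that budget, the arithmetic can be worked through using only the figures above. This deliberately ignores setup time, clock skew, and routing delay, all of which would tighten the margin on real hardware.

```latex
T_{\text{half}} = \frac{1}{2 f_{\text{clk}}} = \frac{1}{2 \times 100\,\text{MHz}} = 5\,\text{ns},
\qquad
N_{\text{levels}} = 2 + 1 + 1 + 2 = 6,
\qquad
t_{\text{level,max}} \approx \frac{T_{\text{half}}}{N_{\text{levels}}} = \frac{5\,\text{ns}}{6} \approx 0.83\,\text{ns}
```

Under those simplifications, the forwarding path fits in the half-period as long as each logic level averages under roughly 0.83 ns.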
The Invisible Bugs This Logic Prevents
Forwarding from a store. A store instruction doesn’t write to a register — it writes to memory. Without the !xm_is_sw guard, a store followed by an instruction with a matching register field would forward the memory address (not useful data) into the ALU.
Forwarding from a branch. Branch instructions compare two registers but don’t produce a result. Their RD field is meaningless as a forwarding source. Without !xm_is_branch, a branch followed by a matching instruction receives a forwarded value that represents nothing.
Forwarding R0. R0 is always zero. If instruction A writes to R0 (a no-op by convention) and instruction B reads R0, forwarding A’s computed result instead of zero violates the ISA contract. The xm_rd != 5'b0 guard enforces this invariant.
Priority inversion. If both XM and MW match, using the MW value (older) instead of the XM value (newer) corrupts the computation. The bypassA1 && !bypassA0 condition ensures XM always wins when both match.
The same pattern scales to wider pipelines. A seven-stage design might need forwarding from three or four stages instead of two, but the structure is identical: extract destination registers from later stages, compare against source registers in the current stage, prioritize by recency, drive a mux. The bypass.v module is a minimal, correct implementation of a pattern that applies universally.
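To illustrate how the priority step might generalize, here is a sketch of a parameterized select encoder for N forwarding sources. It is not part of bypass.v. The module name, the parameter, and the convention that hit[0] is the newest stage are all assumptions made for this example, and it relies on $clog2 from Verilog-2005.

```verilog
// Illustrative N-source priority select: the newest matching stage wins.
// With N = 2 this reproduces the encoding in the text
// (0 = XM, 1 = MW, 2 = register file).
module fwd_priority_sketch #(
    parameter N = 3                        // number of later stages that can forward
) (
    input  wire [N-1:0]           hit,     // hit[0] = newest stage, hit[N-1] = oldest
    output reg  [$clog2(N+1)-1:0] select   // 0..N-1 = forward from that stage, N = register file
);
    integer i;

    always @(*) begin
        select = N;                        // default: no match, use the register file value
        // Walk from oldest to newest so a newer match overwrites an older one.
        for (i = N - 1; i >= 0; i = i - 1)
            if (hit[i])
                select = i;
    end
endmodule
```

The loop unrolls at synthesis time into a priority chain, so a deeper pipeline costs one extra level of selection logic per additional forwarding source rather than a structural redesign.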