
The Instruction That Needs an Answer Before the Question Is Asked

A detailed walkthrough of the `bypass` module in this Wordle project — a combinational forwarding network that detects and resolves data hazards in a five-stage RISC pipeline, ensuring correct execution without stalling for register-to-register dependencies.

Published

Sat Aug 09 2025

Technologies Used

RISC, Verilog

Intermediate · 22 minutes

When Two Pipeline Stages Both Want the Same Register at the Same Time

A pipelined processor does not execute one instruction at a time. Five instructions are in flight simultaneously, each at a different stage: one is being fetched, one decoded, one executed, one accessing memory, and one writing its result. This works beautifully — until one instruction produces a value that the very next instruction needs to read.

Consider two consecutive add instructions where the second uses the result of the first. The first instruction computes its result in the Execute stage. The second instruction reads its operands in the Decode stage, during the same cycle the first is still executing, and a full two cycles before the first instruction's result reaches the register file in Writeback. Without intervention, the second instruction reads a stale value, because the first instruction has not written its result back yet.

This tutorial dissects the bypass module — sixty lines of Verilog that solve this problem without ever stalling the pipeline for register-to-register dependencies. It detects the conflict in real time and reroutes the computed value directly from the later pipeline stage back to the ALU input, before the clock edge commits the wrong answer.

🔵 Deep Dive: This technique is called “data forwarding” or “bypassing.” It is one of the three canonical solutions to data hazards in pipelined processors (the others being stalling and compiler-level instruction reordering). Every modern CPU from the ARM Cortex-M4 to Apple’s M-series implements some form of forwarding.
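The timing of the hazard is easiest to see by stepping two dependent instructions through the stages cycle by cycle. This is a hypothetical Python sketch, not project code; it only tracks which stage each instruction occupies in each cycle:

```python
# Cycle-by-cycle stage occupancy for two dependent instructions in a
# classic 5-stage pipeline. Illustrative only: names and issue cycles
# are assumptions, not part of the Verilog project.
STAGES = ["F", "D", "X", "M", "W"]

def stage_of(issue_cycle, cycle):
    """Return the stage an instruction occupies at a given cycle, or None."""
    idx = cycle - issue_cycle
    return STAGES[idx] if 0 <= idx < len(STAGES) else None

# i1: add r3, r1, r2   (issued at cycle 0; writes r3 in Writeback)
# i2: add r4, r3, r3   (issued at cycle 1; reads r3 in Decode)
for cycle in range(6):
    print(f"cycle {cycle}: i1={stage_of(0, cycle)} i2={stage_of(1, cycle)}")

# i2 reads r3 in Decode at cycle 2, but i1 does not write r3 back
# until Writeback at cycle 4: two cycles too late without forwarding.
```

Running the loop makes the window concrete: the read happens at cycle 2, the writeback at cycle 4.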

Prerequisites: Pipeline Stages and Register Encoding

Concepts you should understand:

  • The five stages of a classic RISC pipeline (Fetch, Decode, Execute, Memory, Writeback)
  • How an instruction’s opcode, source registers, and destination register are encoded in a 32-bit word
  • The role of pipeline latches (registers that separate stages)

Key instruction format for this ISA:

| Bits  | Field | Description                                        |
|-------|-------|----------------------------------------------------|
| 31–27 | Opcode | Instruction type (5 bits)                         |
| 26–22 | RD/RT | Destination register                               |
| 21–17 | RS    | Source register A                                  |
| 16–12 | RT/RD | Source register B (R-type) or destination (I-type) |
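The bit ranges in the table map directly to shift-and-mask operations. Here is a hypothetical Python helper (not part of the project) that slices a 32-bit word the same way the Verilog part-selects do:

```python
# Field extraction matching the instruction format table above.
# Hypothetical helper for illustration; the real extraction happens
# via Verilog part-selects like inst[26:22].
def fields(word):
    """Slice a 32-bit instruction word into its encoded fields."""
    return {
        "opcode": (word >> 27) & 0x1F,  # bits 31-27
        "rd_rt":  (word >> 22) & 0x1F,  # bits 26-22
        "rs":     (word >> 17) & 0x1F,  # bits 21-17
        "rt_rd":  (word >> 12) & 0x1F,  # bits 16-12
    }

# Example: opcode 0 (R-type), destination r3, sources r1 and r2
word = (0 << 27) | (3 << 22) | (1 << 17) | (2 << 12)
print(fields(word))  # {'opcode': 0, 'rd_rt': 3, 'rs': 1, 'rt_rd': 2}
```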

Files under discussion:

  • modules/processor/bypass.v (60 lines)
  • modules/processor/stall.v (26 lines)
  • modules/processor/processor.v (lines 64–68, 167–168, 230–231)

The Forwarding Network: A Three-Way Data Rerouter

Think of the pipeline as five workers on an assembly line. Worker 3 (Execute) just finished computing a value and is about to pass it forward. But Worker 2 (Decode) is simultaneously reading the old version of that same value from the shared parts shelf (register file).

The bypass module is a supervisor who stands between Worker 2 and the parts shelf. Before Worker 2 grabs a value, the supervisor checks: “Did Worker 3 or Worker 4 just produce a newer version of this part?” If so, the supervisor intercepts the read and hands over the fresh value instead.

flowchart TD
    DX["Decode/Execute Stage\n(needs operands)"]
    XM["Execute/Memory Stage\n(has fresh result)"]
    MW["Memory/Writeback Stage\n(has older result)"]
    RF["Register File\n(has oldest value)"]
    MUX["4:1 Bypass Mux"]
    ALU["ALU Input"]

    XM -->|"Priority 1: xm_O"| MUX
    MW -->|"Priority 2: data_writeReg"| MUX
    RF -->|"Priority 3: dx_A / dx_B"| MUX
    MUX --> ALU
    DX -.->|"Bypass select\n(from bypass.v)"| MUX

Inside the Bypass Decision Engine

Step 1: Identify who is producing what

Each pipeline stage carries a full copy of the instruction it is executing. The bypass module extracts the destination register from the Execute/Memory (XM) and Memory/Writeback (MW) stages, and the source registers from the current Decode/Execute (DX) stage:

// What register is the DX instruction reading?
wire [4:0] dx_rs = dx_inst[21:17];       // source A
wire [4:0] dx_rt = dx_inst[26:22];       // source B (I-type)
wire [4:0] dx_rd = dx_inst[16:12];       // source B (R-type)

// What register did the XM instruction write to?
wire [4:0] xm_rd = (xm_is_setx || xm_ovf) ? 5'd30 : xm_inst[26:22];

// What register did the MW instruction write to?
wire [4:0] mw_rd = (mw_is_setx || mw_ovf) ? 5'd30 : mw_inst[26:22];

Notice the override for setx and overflow: when these conditions are active, the destination register is forced to R30 (the status register), regardless of what the instruction’s RD field says. This ensures the forwarding network sends the status value to any subsequent instruction reading R30.

Step 2: Decide which operand slot to forward

R-type instructions encode their second operand in bits [16:12], while I-type instructions use bits [26:22]. A single mux resolves this before the comparison begins:

wire dx_is_r_inst = (dx_op == 5'b00000);
wire [4:0] dx_r2 = dx_is_r_inst ? dx_rd : dx_rt;  // operand B depends on type

Step 3: Detect the conflict

The core comparison is simple — does the DX source register match the XM or MW destination? Two critical guards prevent false matches:

// Does DX's source A match XM's destination? (Exclude R0 — it's always zero.)
wire dx_rs_xm_rd_eq = (dx_rs == xm_rd) && (xm_rd != 5'b0);
wire dx_rs_mw_rd_eq = (dx_rs == mw_rd) && (mw_rd != 5'b0);

The != 5'b0 guard is essential. R0 is hardwired to zero in this ISA — even if an instruction writes to R0, the result is discarded. Without this guard, every instruction that uses R0 as a source would incorrectly trigger a forward.
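The comparison-plus-guard is a one-line boolean. Here is a software model of it (an illustrative sketch, not the Verilog itself):

```python
# Software model of the match-with-guard comparison from bypass.v.
# Illustrative sketch: register numbers are plain ints here.
def hazard_match(dx_src, producer_rd):
    """True only when the registers match AND the producer's dest is not R0."""
    return dx_src == producer_rd and producer_rd != 0

assert hazard_match(5, 5)       # genuine hazard: forward
assert not hazard_match(0, 0)   # R0 is hardwired to zero: never forward
assert not hazard_match(5, 7)   # different registers: no hazard
```

Dropping the `producer_rd != 0` term would make the first R0 case return True, which is exactly the false forward the guard prevents.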

Step 4: Apply priority and drive the mux select

When both the XM and MW stages have matching destinations, the XM result is newer and takes priority. Tri-state drivers encode this priority:

wire bypassA0 = !xm_is_sw && !xm_is_branch && dx_rs_xm_rd_eq;  // XM hit
wire bypassA1 = !mw_is_sw && !mw_is_branch && dx_rs_mw_rd_eq;  // MW hit

tri_state2 sA0(.out(bypass_selectA), .in(2'd0), .en(bypassA0));             // XM: select 0
tri_state2 sA1(.out(bypass_selectA), .in(2'd1), .en(bypassA1 && !bypassA0)); // MW: select 1
tri_state2 sA2(.out(bypass_selectA), .in(2'd2), .en(!bypassA0 && !bypassA1)); // Original: select 2

Stores and branches are excluded (!xm_is_sw && !xm_is_branch) because they do not produce register results — forwarding from them would inject garbage into the ALU.
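The tri-state priority scheme reduces to a simple if/elif chain. A software model of the select logic for one operand slot (illustrative sketch, mirroring the encoding used above):

```python
# Software model of the bypass priority encoder for one operand slot.
# Select values mirror the Verilog: 0 = XM forward, 1 = MW forward,
# 2 = original register-file value. Illustrative sketch only.
def bypass_select(xm_hit, mw_hit):
    if xm_hit:        # XM result is newest: highest priority
        return 0
    if mw_hit:        # MW result is older but still fresher than the file
        return 1
    return 2          # no hazard: use the value read in Decode

assert bypass_select(True,  True)  == 0  # both match: newer XM wins
assert bypass_select(False, True)  == 1
assert bypass_select(False, False) == 2
```

The `xm_hit` and `mw_hit` inputs here stand in for the full guarded conditions (`bypassA0`, `bypassA1`), which already fold in the store/branch exclusions and the R0 guard.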

Step 5: The mux connects to the ALU

In processor.v, the bypass select drives a four-input mux at each ALU input:

mux_4 alu_mux1(
    .out(alu_A_in),
    .select(alu_bypass_selectA),
    .in0(xm_O),           // select=0: forward from XM stage
    .in1(data_writeReg),  // select=1: forward from MW stage
    .in2(dx_A),           // select=2: use original register value
    .in3(32'b0)           // select=3: unused
);

The entire forwarding decision — conflict detection, priority encoding, and mux selection — executes combinationally within a single clock cycle. No pipeline stall is required.

When Forwarding Is Not Enough: The Load-Use Stall

There is one case where forwarding fails. A load-word instruction (LW) does not produce its result until the end of the Memory stage. If the very next instruction needs that value in the Execute stage, even the XM forwarding path is too late — the data has not arrived yet.

The stall module detects this specific pattern and freezes the pipeline for exactly one cycle:

assign stall = dx_lw                  // the DX instruction is a load
            && fd_uses_alu             // the FD instruction needs the ALU
            && ((fd_rs == dx_rd)       // ...and its source A matches the load's destination
             || ((fd_rt == dx_rd) && !fd_sw));  // ...or source B matches (unless it's a store)

🔴 Danger: The stall signal must engage before the bypass mux commits to a forwarding selection. In processor.v, both are computed in parallel, and the stall prevents the DX latch from advancing — so even if the bypass logic produces a selection, the downstream stage never sees it. If the stall were delayed by one cycle, a single wrong value would propagate through the ALU, corrupting one instruction’s result silently.
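The stall condition is a pure boolean function of five signals. A software model of the expression (an illustrative sketch of the logic in stall.v, not the module itself):

```python
# Software model of the load-use stall condition from stall.v.
# Illustrative sketch: signal names follow the Verilog assign above.
def load_use_stall(dx_lw, fd_uses_alu, fd_rs, fd_rt, dx_rd, fd_sw):
    return (dx_lw and fd_uses_alu
            and (fd_rs == dx_rd
                 or (fd_rt == dx_rd and not fd_sw)))

# lw r5 followed by add r6, r5, r1: must stall one cycle
assert load_use_stall(True, True, fd_rs=5, fd_rt=1, dx_rd=5, fd_sw=False)
# lw r5 followed by a store whose second register is r5: excluded per !fd_sw
assert not load_use_stall(True, True, fd_rs=1, fd_rt=5, dx_rd=5, fd_sw=True)
# DX instruction is not a load: forwarding handles it, no stall
assert not load_use_stall(False, True, fd_rs=5, fd_rt=1, dx_rd=5, fd_sw=False)
```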

Timing on the Critical Path: Why This Design Is Fast Enough

The forwarding network adds combinational logic between the pipeline latches and the ALU. This logic — two five-bit comparators, two AND gates, and a four-input mux — must settle within a single clock half-period (5 ns at 100 MHz).

Consider the worst-case critical path for operand A:

  1. XM latch outputs xm_inst[26:22] (destination register)
  2. Five-bit equality comparator: dx_rs == xm_rd (~2 gate delays)
  3. AND gate with !xm_is_sw exclusion (~1 gate delay)
  4. Tri-state enable on bypass select (~1 gate delay)
  5. Four-input mux selects xm_O (~2 gate delays)
  6. ALU receives alu_A_in

Total: approximately six gate delays. At 100 MHz (10 ns period), with latches sampling on the falling edge (5 ns half-period), this leaves margin even for a moderately slow FPGA fabric. The design achieves timing closure because the forwarding logic is shallow — just comparators and a mux, with no sequential elements or memory accesses in the path.
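The slack calculation above can be sketched numerically. The per-gate delay here is an assumed figure for illustration; real FPGA timing comes from the synthesis toolchain, not back-of-envelope math:

```python
# Back-of-envelope timing check for the forwarding path.
# GATE_DELAY_NS is a hypothetical assumption, not a measured value.
GATE_DELAY_NS = 0.5

# Gate levels per the critical-path list: comparator (2) + AND (1)
# + tri-state enable (1) + mux (2)
gate_levels = 2 + 1 + 1 + 2
path_delay = gate_levels * GATE_DELAY_NS

half_period = 5.0  # ns: falling-edge sampling at 100 MHz
slack = half_period - path_delay
print(f"path: {path_delay} ns, slack: {slack} ns")  # path: 3.0 ns, slack: 2.0 ns
```

Even if the assumed per-gate delay were doubled, the path would still fit inside the 5 ns half-period, which is why the shallow logic closes timing comfortably.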

The Invisible Bugs: Forwarding from the Wrong Instruction

Bug 1: Forwarding from a store. A store instruction (SW) does not write to a register — it writes to memory. If the bypass logic forwarded from a store’s “destination” field, it would inject the memory address (not useful data) into the ALU. The !xm_is_sw guard prevents this.

Bug 2: Forwarding from a branch. Branch instructions compare two registers but do not produce a result. Their RD field is meaningless in this context. Without the !xm_is_branch exclusion, a branch followed by an instruction with a matching register would receive a forwarded value that represents nothing.

Bug 3: Forwarding R0. R0 is always zero. If instruction A writes to R0 (a no-op by convention) and instruction B reads R0, the bypass would forward A’s computed result instead of zero. The xm_rd != 5'b0 guard enforces the R0 invariant.

Bug 4: Priority inversion. If both XM and MW match, using the MW value (which is older) would overwrite the XM value (which is newer). The bypassA1 && !bypassA0 condition ensures XM always wins.

You Now Know How to Eliminate Data Stalls With Combinational Forwarding

The core skill is recognizing that a data hazard is a conflict between pipeline stages, not between instructions. The bypass module does not look at the original program — it compares live instruction metadata across three pipeline stages and makes a real-time routing decision.

This same pattern scales to wider pipelines. A seven-stage pipeline might need forwarding from three or four stages instead of two, but the structure is identical: extract destination registers from later stages, compare against source registers in the current stage, prioritize by recency, drive a mux. The bypass.v module is a minimal, correct implementation of this universal pattern.
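The generalization can be written as a single function: take the destination registers of all later stages, ordered newest-first, and return the first guarded match. This is an illustrative sketch, not project code:

```python
# Generalized forwarding-source selection for an N-stage pipeline.
# later_stage_dests is ordered newest-first (e.g. [XM, MW] for this
# project). Returns the index of the youngest matching stage, or None
# to fall back to the register-file value. Illustrative sketch only.
def forward_source(src_reg, later_stage_dests):
    for i, dest in enumerate(later_stage_dests):
        # None marks a stage with no register result (store, branch, bubble)
        if dest is not None and dest == src_reg and dest != 0:
            return i
    return None

# Two-stage case matches the bypass.v behavior:
assert forward_source(5, [5, 5]) == 0      # newest stage wins
assert forward_source(5, [7, 5]) == 1
assert forward_source(0, [0, 0]) is None   # R0 guard
# A deeper pipeline just supplies more stages:
assert forward_source(9, [3, 7, 9, 9]) == 2
```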
