One Processor, Two Worlds: Running FPGA Hardware on a Desktop Without Changing a Line of RTL

The FPGA Is Gone — But the Processor Still Needs to Think It Has a Keyboard and a Screen

You have designed a processor. It runs on an FPGA, reads a PS/2 keyboard, and drives a VGA display. Now you need to test it — but you do not have the FPGA board. You cannot plug a PS/2 keyboard into your laptop. Your monitor does not have a VGA input.

The conventional answer is to rewrite the design for simulation. Strip out the keyboard module, replace the display with waveform probes, and verify behavior by reading signal traces. This works for functional verification, but it does not let you play the game. You cannot see letters appear on a grid. You cannot type a guess and watch it turn green.

This tutorial dissects the abstraction layer that lets the same MIPS processor — with identical pipeline logic, register file, and memory — run interactively on a desktop computer. The simulation top module (sim_top.v) replaces exactly two boundaries: where keystrokes enter and where pixels leave. Everything between those boundaries is untouched RTL.

This technique, simulating the logic while virtualizing its I/O, is the mirror image of hardware-in-the-loop testing, where the real device is kept and its environment is simulated. It shows up in automotive, aerospace, and game console development wherever physical hardware is expensive or unavailable during the development cycle. The principle is the same at every scale: decouple I/O from logic, then swap the I/O layer.

What You Need to Understand to Follow Along

Concepts:

How Verilator compiles Verilog into a C++ class with input/output member variables
The difference between a module’s port interface and its internal logic
Clock domain relationships (system clock vs. pixel clock)
The SDL3 event loop and texture rendering model

Tools:

Verilator >= 5.0
CMake >= 3.16
SDL3 (fetched automatically by CPM)
C++17 compiler

Files under discussion:

sim/sim_top.v — simulation top-level (282 lines)
sim/main.cpp — C++ SDL3 harness (182 lines)
modules/processor/Wrapper.v — FPGA top-level (131 lines)
modules/memory/regfile.v — register file with dual-write R15 (122 lines)

The Boundary Diagram: What Changes and What Does Not

The processor core (pipeline, ALU, register file, ROM, RAM) is a black box driven by a clock and a reset; its instruction and data memory interfaces are wired up inside the box. It does not know or care whether its inputs come from physical hardware or a C++ program.

The abstraction boundary is drawn at exactly two points: the keyboard input path and the display output path. Everything inside the boundary is shared code. Everything outside is swapped per target.

flowchart TB
    subgraph SHARED["Shared RTL (identical in both targets)"]
        CPU["5-Stage Processor"]
        REG["Register File\n(R0–R31)"]
        ROM["Instruction ROM\n(main.mem)"]
        RAM["Data RAM\n(words.mem)"]
        LFSR["LFSR + Word ROM"]
        VGA["VGA Timing\n(simple_480p)"]
        SPR["Sprite Engine\n(board + letter)"]
        COLOR["Color Logic\n(green/yellow/gray)"]
    end

    subgraph FPGA["FPGA I/O Layer (Wrapper.v)"]
        PS2["PS/2 Interface\n(VHDL)"]
        KB["Keyboard_input.v"]
        ILA["ILA Debug Probe"]
        VGAOUT["VGA Pins\n(4-bit R/G/B)"]
    end

    subgraph SIM["Simulation I/O Layer (sim_top.v + main.cpp)"]
        SDL["SDL3 Event Loop"]
        TEX["Texture Renderer"]
        VCD["VCD Trace Output"]
    end

    PS2 -->|"scan code"| KB -->|"ASCII → letter"| REG
    SDL -->|"scancode → letter"| REG

    REG -->|"word1–word6"| SPR
    CPU <--> REG
    CPU <--> ROM
    CPU <--> RAM
    LFSR -->|"actual word"| REG

    VGA --> SPR --> COLOR
    COLOR -->|"4-bit paint"| VGAOUT
    COLOR -->|"8-bit SDL"| TEX

Dissecting the Swap: FPGA Keyboard vs. Simulated Keyboard

The FPGA path (Wrapper.v, lines 95–100)

On the FPGA, a physical PS/2 keyboard generates a serial clock-data signal pair. The Ps2Interface VHDL module decodes this protocol into eight-bit scan codes. A Verilog module (Keyboard_input) buffers the scan code, looks up its ASCII value in a 256-entry ROM, and asserts a letter_ready pulse. The wrapper then encodes the ASCII value into the format the processor expects:

// Wrapper.v — FPGA keyboard path
wire [6:0] ascii0;
wire letter_ready;

Keyboard_input kbin(.clk(clock), .ps2_clk(ps2_clk), .ps2_data(ps2_data),
                     .ascii_val0(ascii0), .letter_ready(letter_ready));

assign letter = ascii0 - 65 + {1'b1, 31'b0};  // ASCII 'A'=65 → code 0, bit 31 set

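This path cannot be reused as-is under Verilator. The PS/2 interface is written in VHDL, which Verilator does not compile, and the testbench would have to generate bit-accurate PS/2 clock and data waveforms just to deliver one keystroke. The scan-code ROM lookup exists only to serve that protocol, so the simulation bypasses all three.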

The simulation path (main.cpp, lines 155–166)

In simulation, SDL3 captures keyboard events from the operating system. The C++ harness converts the scancode directly to the same 32-bit format the processor expects, then drives the Verilator model’s input ports:

// main.cpp — Simulation keyboard path
if (ev.key.scancode >= SDL_SCANCODE_A && ev.key.scancode <= SDL_SCANCODE_Z) {
    int code = ev.key.scancode - SDL_SCANCODE_A;    // A=0, B=1, ..., Z=25
    top->letter    = static_cast<uint32_t>(code) | 0x80000000u;  // bit 31 = enable
    top->letter_en = 1;
    letter_en_countdown = LETTER_EN_CYCLES;          // hold for 8 clock toggles
}

The critical insight is that the register file does not care which path produced the value. Both paths deliver a 32-bit value on the letter port and a pulse on letter_en. The processor reads R15 and sees identical data regardless of the source.
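
The equivalence of the two encodings can be checked in isolation. The following is a minimal sketch, assuming the FPGA path always delivers uppercase ASCII; the helper names (encode_from_ascii, encode_from_index, encodings_match) are hypothetical and do not appear in the project.

```cpp
#include <cassert>
#include <cstdint>

// Bit 31 marks the value as a fresh letter, as in both snippets above.
constexpr uint32_t kLetterValid = 0x80000000u;

// FPGA path: mirrors `ascii0 - 65 + {1'b1, 31'b0}` from Wrapper.v.
// Assumption: the scan-code ROM yields uppercase ASCII only.
uint32_t encode_from_ascii(uint8_t ascii) {
    assert(ascii >= 'A' && ascii <= 'Z');
    return static_cast<uint32_t>(ascii) - 65u + kLetterValid;
}

// Simulation path: mirrors `code | 0x80000000u` from main.cpp,
// where `index` is the zero-based letter index (A=0 ... Z=25).
uint32_t encode_from_index(int index) {
    assert(index >= 0 && index < 26);
    return static_cast<uint32_t>(index) | kLetterValid;
}

// Returns true when both paths agree for every letter A..Z.
bool encodings_match() {
    for (int i = 0; i < 26; ++i) {
        const uint8_t ascii = static_cast<uint8_t>('A' + i);
        if (encode_from_ascii(ascii) != encode_from_index(i)) return false;
    }
    return true;
}
```

Addition and bitwise OR give the same result here because the letter index never reaches bit 31, so no carry can disturb the flag.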

The dual-write register that makes this possible (regfile.v, lines 93–102)

R15 is unique in this register file. It accepts writes from two sources — the external keyboard and the processor itself — with the keyboard taking priority:

wire en15_proc;
and enable15_proc(en15_proc, write_select[15], ctrl_writeEnable);

wire en15 = en15_proc | letter_en;                     // either source can write
wire [31:0] d15 = letter_en ? letter : data_writeReg;  // keyboard has priority

register32 register15(.d(d15), .clk(clock), .q(read15), .clr(ctrl_reset), .en(en15));

This dual-write design exists because the assembly code needs to clear R15 after reading a letter (addi $letter, $r0, 0). If R15 were writable only from the keyboard, the processor’s clear instruction would silently fail, and R15 would retain the old letter value forever, causing the game loop to process the same letter repeatedly.

If both sources write on the same clock edge, the priority mux ensures the keyboard value wins. In practice this overlap is extremely unlikely: the simulation harness pulses letter_en for only eight clock toggles (four clock cycles), and the processor clears R15 many cycles later. But the priority mux guarantees a defined result even in the degenerate case.
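
The write-enable and priority logic can be restated as a small behavioral model. This is a sketch of R15 only, not of the register file; the struct and function names are illustrative, and the fields are named after the Verilog signals they stand for.

```cpp
#include <cassert>
#include <cstdint>

// Inputs that reach R15 on one rising clock edge (illustrative grouping).
struct R15Inputs {
    bool     proc_write;  // write_select[15] AND ctrl_writeEnable
    uint32_t proc_data;   // data_writeReg
    bool     letter_en;   // keyboard strobe
    uint32_t letter;      // keyboard value
};

// Value held by R15 after the clock edge.
uint32_t r15_next(uint32_t current, const R15Inputs& in) {
    const bool en15 = in.proc_write || in.letter_en;  // either source can write
    if (!en15) return current;                        // register holds its value
    return in.letter_en ? in.letter : in.proc_data;   // keyboard has priority
}
```

The four cases worth checking are: no write (value held), keyboard only, processor only (the clear instruction), and both at once (keyboard wins).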

Dissecting the Swap: VGA Pins vs. SDL Texture

The FPGA output (artix_wordle.v)

On the FPGA, the display controller outputs four-bit red, green, and blue channels plus hsync and vsync signals to physical VGA pins. Once the board converts the four-bit channels to analog levels, the monitor reconstructs the image from those signals.

The simulation output (sim_top.v, lines 269–279)

In simulation, there is no monitor. Instead, the module outputs ten-bit pixel coordinates (sdl_sx, sdl_sy), a data-enable flag (sdl_de), and eight-bit color channels — all registered on the pixel clock edge:

always @(posedge clk_pix) begin
    sdl_sx <= sx;
    sdl_sy <= sy;
    sdl_de <= de;
    sdl_r  <= {2{paint_r}};   // 4-bit to 8-bit: 0xA becomes 0xAA
    sdl_g  <= {2{paint_g}};
    sdl_b  <= {2{paint_b}};
end
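
For readers more at home in C++ than Verilog, the replication {2{paint_r}} corresponds to copying the nibble into both halves of a byte. A minimal sketch, with a hypothetical function name:

```cpp
#include <cassert>
#include <cstdint>

// C++ equivalent of the Verilog replication {2{nibble}} for a 4-bit channel.
uint8_t expand4to8(uint8_t nibble) {
    assert(nibble <= 0x0F);
    return static_cast<uint8_t>((nibble << 4) | nibble);
}
```

Replicating the nibble is the same as multiplying by 17, so the sixteen input levels spread evenly over 0 to 255, and full-scale 0xF maps to 0xFF rather than 0xF0.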

The C++ harness reads these outputs on every clock evaluation and writes them into a pixel buffer:

if (top->sdl_de) {
    Pixel *p = &framebuffer[top->sdl_sy * H_RES + top->sdl_sx];
    p->a = 0xFF;
    p->r = top->sdl_r;
    p->g = top->sdl_g;
    p->b = top->sdl_b;
}

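At the end of each frame (detected when sdl_sy rolls past the active region), the buffer is uploaded to an SDL texture and presented to the window. The result matches what the VGA monitor would show: same pixels, same colors, and the same frame timing in simulated time. The wall-clock frame rate depends on how fast the host can evaluate the model, so it reaches sixty frames per second only if the simulation keeps up.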

The Clock Domain Problem That Does Not Exist

In a real FPGA, the 25 MHz pixel clock is generated by a phase-locked loop (PLL) from the 100 MHz system clock. PLLs are vendor-specific IP — Xilinx, Intel, and Lattice all have different primitives. This is a portability problem.

The simulation avoids it entirely with a two-bit counter:

reg [1:0] pixCounter = 0;
always @(posedge clk) pixCounter <= pixCounter + 1;
wire clk_pix = pixCounter[1];   // toggles at 1/4 of clk frequency

This divider is deterministic — clk_pix has a fixed phase relationship to clk, with no jitter and no startup delay. In hardware, a PLL achieves the same result with better jitter characteristics, but for simulation, the divider is functionally identical and completely portable across any Verilog simulator.

The C++ harness simply toggles the main clock on every iteration. It does not need to know about the pixel clock at all — the Verilog model handles the division internally, and the SDL outputs update at the correct rate automatically:

while (running) {
    top->clk ^= 1;   // toggle main clock
    top->eval();      // Verilator evaluates all combinational + sequential logic
    // ... sample outputs, render frame when ready ...
}
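
The relationship between harness iterations and pixel clocks can be confirmed with a software model of the divider. This is a sketch that imitates the two-bit counter in plain C++; it is not Verilator output, and the struct and function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Software stand-in for the clk toggle plus the 2-bit pixCounter.
struct ClockDivider {
    bool    clk = false;
    uint8_t pix_counter = 0;  // models reg [1:0] pixCounter

    // One harness iteration: toggle clk, count on the rising edge only.
    void toggle() {
        clk = !clk;
        if (clk) pix_counter = (pix_counter + 1) & 0x3;
    }

    // Models wire clk_pix = pixCounter[1].
    bool clk_pix() const { return (pix_counter >> 1) & 1; }
};

// Harness iterations between two successive rising edges of clk_pix,
// or -1 if two edges were not seen within the search window.
int toggles_per_pixel_clock() {
    ClockDivider d;
    bool prev = d.clk_pix();
    int first_edge = -1;
    for (int i = 0; i < 64; ++i) {
        d.toggle();
        const bool now = d.clk_pix();
        if (now && !prev) {
            if (first_edge < 0) first_edge = i;
            else return i - first_edge;
        }
        prev = now;
    }
    return -1;
}
```

The model reports eight iterations per pixel clock: four clk cycles per clk_pix period, and two toggles per clk cycle.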

What Breaks When You Get the Abstraction Wrong

Scenario 1: The letter pulse is too short. If letter_en is held high for only one clock toggle (half a cycle) instead of eight, the register file might miss the write entirely: en15 must be high at a rising clock edge, and a single toggle may turn out to be the falling edge. Holding the pulse for eight toggles spans four full clock cycles, so four rising edges fall inside it whatever the clock phase when the key event arrives.
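
The dependence on clock phase is easy to demonstrate with a toy model of the countdown. This sketch assumes the harness decrements a counter once per clock toggle and drops letter_en when it reaches zero; the real countdown logic in main.cpp is not shown in this tutorial, and the function name is hypothetical.

```cpp
#include <cassert>

// Counts rising clock edges that occur while letter_en is high.
// hold_toggles: how many harness iterations the pulse is held.
// clk_start:    clock level at the moment the key event arrives.
int rising_edges_captured(int hold_toggles, bool clk_start) {
    assert(hold_toggles >= 0);
    bool clk = clk_start;
    int countdown = hold_toggles;
    bool letter_en = countdown > 0;
    int captured = 0;
    while (letter_en) {
        clk = !clk;            // equivalent of top->clk ^= 1
        if (clk) ++captured;   // rising edge seen with letter_en high
        if (--countdown == 0) letter_en = false;
    }
    return captured;
}
```

A one-toggle pulse captures one edge if the clock was low and none if it was high, while an eight-toggle pulse captures four edges in both cases.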

Scenario 2: The pixel buffer overruns. The framebuffer holds 640×480 pixels, but the timing generator also counts through the blanking intervals, so the coordinates reach (799, 524). The sdl_de signal is high only during the active region, and the C++ harness additionally checks x < H_RES && y < V_RES before writing. If a write ever went through with blanking coordinates, it would land outside the framebuffer array and corrupt the stack or heap. The bounds check is therefore not optional defensive programming but a hard correctness requirement: it holds even if sdl_de and the coordinates ever disagree.
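
A guarded write might look like the sketch below. It assumes a Pixel struct with one byte per channel (the byte order here is a guess, since the tutorial only shows the field names) and wraps the write in a hypothetical write_pixel helper so the guard can be exercised on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int H_RES = 640;
constexpr int V_RES = 480;

// Channel order is an assumption; only the field names come from main.cpp.
struct Pixel { uint8_t b, g, r, a; };

// Writes one pixel only if data-enable is set and (x, y) is inside the
// visible area. Returns true when the framebuffer was modified.
bool write_pixel(std::vector<Pixel>& fb, bool de, int x, int y,
                 uint8_t r, uint8_t g, uint8_t b) {
    assert(fb.size() == static_cast<std::size_t>(H_RES) * V_RES);
    if (!de || x < 0 || x >= H_RES || y < 0 || y >= V_RES) return false;
    Pixel& p = fb[static_cast<std::size_t>(y) * H_RES + x];
    p.a = 0xFF;
    p.r = r;
    p.g = g;
    p.b = b;
    return true;
}
```

The last visible pixel (639, 479) is accepted, while the last blanking coordinate (799, 524) is rejected even when the data-enable argument is true.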

Scenario 3: The frame boundary fires more than once. Frame rendering triggers when sdl_sy == 480 && sdl_sx == 0. Because the pixel clock divides the main clock by four, and the C++ loop toggles the main clock once per iteration, the same (sx, sy) pair is visible for approximately eight consecutive iterations, so the render call fires several times per frame. Every repeat presents the same framebuffer contents, so there is no visible artifact. The repeats are not free, though: SDL_RenderPresent is not a no-op, and each trigger costs a texture upload and a present, which can block when vsync is enabled. Triggering only on the first iteration at the boundary avoids the redundant work.
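
One way to present exactly once per frame is to remember whether the previous iteration was already at the boundary. The following is a sketch of that edge detector with no SDL dependency; the class and function names are hypothetical, and the boundary coordinates are the ones quoted above.

```cpp
#include <cassert>

// Fires once each time the beam arrives at the frame-end position.
class FrameEdge {
public:
    // Returns true only on the first iteration at (sx == 0, sy == 480).
    bool update(int sx, int sy) {
        const bool at_end = (sy == 480 && sx == 0);
        const bool fire = at_end && !was_at_end_;
        was_at_end_ = at_end;
        return fire;
    }

private:
    bool was_at_end_ = false;
};

// Counts how many presents would be issued if the boundary coordinates
// stayed visible for `iterations` consecutive harness iterations.
int presents_for(int iterations) {
    FrameEdge edge;
    int presents = 0;
    for (int i = 0; i < iterations; ++i) {
        if (edge.update(0, 480)) ++presents;
    }
    return presents;
}
```

In the harness loop, the texture upload and the present call would sit inside the branch taken when update returns true.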

What the Abstraction Boundary Teaches Beyond Verilog

The transferable skill here isn’t Verilator syntax or SDL3 API calls. It’s I/O boundary isolation: if all platform-specific behavior is concentrated at the input and output edges, the entire core becomes portable by construction.

In this project, the processor doesn’t know it’s running on an FPGA. It doesn’t know it’s running in a simulator. It fetches instructions from ROM, reads registers, computes ALU results, and writes back — identically in both environments. The only difference is who fills R15 with a letter and who reads the color of pixel (302, 147).

In software architecture, the same pattern appears as Ports and Adapters (hexagonal architecture): the domain logic is the core, and I/O adapters such as database drivers, HTTP handlers, and message queues are swappable shells. The FPGA Wordle project is a hardware implementation of the same idea.
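
The software form of the pattern can be sketched in a few lines of C++17. Everything here is illustrative: LetterSource plays the role of the letter/letter_en port, QueuedKeys stands in for a desktop event queue, and Core is a one-register caricature of the processor.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <queue>

// Port: the only thing the core knows about where letters come from.
struct LetterSource {
    virtual ~LetterSource() = default;
    virtual std::optional<uint32_t> poll() = 0;
};

// Adapter: a queue of key presses, standing in for a desktop event loop.
class QueuedKeys : public LetterSource {
public:
    void press(char c) {
        assert(c >= 'A' && c <= 'Z');
        pending_.push(static_cast<uint32_t>(c - 'A') | 0x80000000u);
    }
    std::optional<uint32_t> poll() override {
        if (pending_.empty()) return std::nullopt;
        const uint32_t value = pending_.front();
        pending_.pop();
        return value;
    }

private:
    std::queue<uint32_t> pending_;
};

// Core: depends on the port, never on a concrete adapter.
class Core {
public:
    explicit Core(LetterSource& source) : source_(source) {}
    void step() {
        if (auto value = source_.poll()) r15_ = *value;
    }
    uint32_t r15() const { return r15_; }

private:
    LetterSource& source_;
    uint32_t r15_ = 0;
};
```

A second adapter that decoded PS/2 scan codes could be passed to the same Core without touching it, which is the software counterpart of swapping Wrapper.v for sim_top.v.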