What Happens When You Design Every Layer of the Stack
Imagine playing Wordle — but instead of opening a browser, the game runs on a processor you designed yourself. Not a pre-built ARM core, not a soft CPU IP block, but a five-stage pipelined MIPS processor hand-wired in Verilog, executing assembly you wrote, rendering to a VGA display through a sprite engine you built, reading input from a PS/2 keyboard through a protocol handler you integrated. Every instruction fetch, every pixel color, every hazard stall — authored from scratch.
That’s what this project is. FPGA Wordle is a fully hardware-implemented word-guessing game targeting the Artix-7 FPGA, with a parallel desktop simulation path via Verilator and SDL3. The target outcome is a playable, interactive game that proves the entire stack works in concert.
The Gap Between Understanding a CPU and Actually Building One
Computer architecture courses teach pipelines, hazards, and forwarding in the abstract. You draw diagrams, trace instructions through stages on paper, answer exam questions about what happens when a load is followed by an add. There’s an enormous gulf between understanding these concepts and orchestrating them into a system that does something useful.
Wordle is a trivial problem algorithmically. The real challenge is integration: making a processor that handles data hazards while simultaneously driving a VGA display at sixty frames per second, reading asynchronous keyboard input, validating words against a dictionary of nearly thirteen thousand entries stored in block RAM, and doing all of this within the timing constraints of a physical FPGA clock domain. The best way to prove you understand a computer is to build one that does something real.
A Game Console Built from First Principles
A pipelined processor that handles its own hazards. The five-stage MIPS pipeline includes a dedicated forwarding unit that detects register dependencies across the Execute and Memory stages and reroutes data before it would cause a stall. When forwarding isn’t possible — such as when a load instruction’s result is consumed immediately — a separate stall unit freezes the pipeline for exactly one cycle. Multiply and divide operations, which require multiple clock cycles, serialize gracefully without corrupting in-flight instructions.
A real-time display engine driven by register state. Six processor registers are hardwired directly to the VGA display controller. When the processor writes a guess into any of these registers, the new word appears on screen within the next frame — no bus transaction, no interrupt, no display driver. The rendering pipeline reads a precomputed board bitmap, overlays fifty-by-fifty-pixel character sprites, and color-codes each tile based on a combinational comparison against the target word.
A dictionary validator running in assembly. The game validates every five-letter guess against a dictionary of nearly thirteen thousand words using binary search implemented entirely in MIPS assembly. The algorithm was hand-optimized to avoid the multiply-divide unit — using arithmetic right shifts for the midpoint calculation instead of division — keeping each search iteration at pipeline speed rather than waiting thirty-three cycles for a hardware divider.
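The shift-based midpoint can be sketched in a few lines. This is a behavioral model in Python, not the project's MIPS assembly; the function name `contains` and the use of a sorted list of strings are assumptions made for illustration.

```python
def contains(words, guess):
    """Binary search over a sorted list of words.

    The midpoint is computed with a right shift instead of a division,
    mirroring the trick described above. For non-negative integers,
    (lo + hi) >> 1 equals (lo + hi) // 2.
    """
    lo, hi = 0, len(words) - 1
    while lo <= hi:
        mid = (lo + hi) >> 1          # single shift, no divider involved
        if words[mid] == guess:
            return True
        if words[mid] < guess:
            lo = mid + 1              # guess sorts after the midpoint
        else:
            hi = mid - 1              # guess sorts before the midpoint
    return False


# Hypothetical five-word dictionary, already sorted.
dictionary = ["apple", "crane", "ghost", "plumb", "zesty"]
print(contains(dictionary, "crane"))  # True
print(contains(dictionary, "crank"))  # False
```

In assembly the same step would presumably be one add followed by one shift-right instruction, both of which complete at normal pipeline speed.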
A dual-target architecture. The same processor, register file, and game logic run identically on physical FPGA hardware and on a desktop computer via Verilator simulation with SDL3 rendering. Only the I/O boundary changes: PS/2 scan codes on hardware become SDL keyboard events in simulation, VGA sync signals become pixel buffer writes. The processor is never modified.
Why These Design Choices
Tri-state bus architecture for the register file. I used thirty-two tri-state drivers per read port, one per register, with a decoder enabling exactly one driver at a time. A software engineer’s instinct would be to use a multiplexer tree, but the tri-state formulation keeps every register’s driver independent and handles memory-mapped I/O registers without additional decode logic. One caveat: 7-series devices such as the Artix-7 have no internal tri-state buffers, so synthesis lowers these drivers to equivalent multiplexer logic, and the benefit lies in how the design is expressed rather than in the routing fabric. The keyboard input register is just another tri-state driver on the same bus, gated by its own enable signal. Adding special-purpose registers costs exactly one driver per port, not a hand-widened mux.
Zero-latency display coupling through register hardwiring. A conventional approach would memory-map the display buffer and have the processor write through store instructions. Instead, six registers have their outputs wired directly to the display controller as combinational signals. The moment the processor’s writeback stage commits a value, the display controller sees it on the very next clock edge. This eliminates an entire class of synchronization problems and guarantees display updates are never more than one frame behind the processor. This decision was only possible because the game state is small enough — six thirty-two-bit words — to justify dedicated wiring.
Assembly-level pipeline awareness in the binary search. The dictionary search avoids the hardware divider entirely. A divide operation would stall the pipeline for over thirty cycles per iteration. With fourteen iterations per search, that compounds to over four hundred wasted cycles per guess validation. Replacing division with a single-cycle arithmetic right shift makes the search run at full pipeline throughput. This is an optimization that only makes sense when you know your own pipeline’s latency characteristics because you designed it.
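The cycle arithmetic behind that claim is easy to check. The sketch below assumes a round upper bound of 13,000 dictionary entries and takes the thirty-three-cycle divide latency from the text; the function names are hypothetical.

```python
def worst_case_iterations(n_words):
    """Worst-case loop count for binary search: floor(log2(n)) + 1."""
    return n_words.bit_length()


def midpoint_cycles(n_words, cycles_per_midpoint):
    """Total cycles spent on midpoint calculation in one worst-case search."""
    return worst_case_iterations(n_words) * cycles_per_midpoint


N_WORDS = 13_000       # assumption: round upper bound on the dictionary size
DIVIDE_CYCLES = 33     # worst-case divider latency quoted in the text
SHIFT_CYCLES = 1       # a shift completes at normal pipeline speed

print(worst_case_iterations(N_WORDS))            # 14
print(midpoint_cycles(N_WORDS, DIVIDE_CYCLES))   # 462
print(midpoint_cycles(N_WORDS, SHIFT_CYCLES))    # 14
```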
The Hardest Problem: Three Hazard Systems That Cannot Interfere
The most complex engineering in the project is the interaction between three independent hazard-resolution mechanisms that must coexist without corrupting each other.
The forwarding unit continuously compares the destination register of instructions in the Execute and Memory stages against the source registers of the instruction currently in Decode. When a match is detected, it reroutes the computed value back to the ALU inputs before the next clock edge. This logic runs combinationally — it must produce a valid forwarding selection before the ALU begins its computation.
The stall unit monitors a different condition: whether a load-word instruction in the Execute stage has its result consumed by the very next instruction. Forwarding can’t solve this because the loaded value isn’t available until the Memory stage completes. The stall unit freezes the program counter and the Fetch-Decode latch for one cycle.
The third mechanism governs multiply and divide operations, which require seventeen to thirty-three clock cycles. A dedicated latch captures the operands at the onset of the operation and holds the pipeline frozen until the result is ready.
The critical design challenge: one stall signal has to gate all three mechanisms. If a load-use hazard and a multiply operation were to trigger simultaneously, or if a branch instruction needed to flush the pipeline while a multiply was in progress, incorrect gating could allow corrupted data to propagate. The solution is a unified stall signal that is the logical OR of every stall condition, freezing the entire front end of the pipeline whenever any hazard is active. Both the stall logic and the forwarding logic examine the same instruction metadata, so they’re computed in parallel, and the stall prevents the downstream stage from latching an invalid forwarding result.
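A behavioral model makes the interaction easier to see. The Python sketch below is not the project's Verilog: the `Instr` record, the signal names, and the choice to treat register 0 as "no destination" are all assumptions. It computes the forwarding selects and the stall conditions side by side, then ORs the stall conditions into one signal.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    op: str = "nop"   # e.g. "add", "lw", "mul"
    rd: int = 0       # destination register; 0 means "writes nothing"
    rs: int = 0       # first source register
    rt: int = 0       # second source register


def forward_select(src, ex_rd, mem_rd):
    """Choose where an ALU operand comes from.

    The Execute-stage match wins because it holds the newer value.
    """
    if src != 0 and src == ex_rd:
        return "EX"
    if src != 0 and src == mem_rd:
        return "MEM"
    return "REG"


def hazard_controls(decode, ex, mem, multdiv_busy):
    """Compute forwarding selects and the unified stall in parallel."""
    fwd_a = forward_select(decode.rs, ex.rd, mem.rd)
    fwd_b = forward_select(decode.rt, ex.rd, mem.rd)
    # Load-use: the load in Execute feeds the very next instruction.
    load_use = ex.op == "lw" and ex.rd != 0 and ex.rd in (decode.rs, decode.rt)
    # Unified stall: any active condition freezes the front end.
    stall = load_use or multdiv_busy
    return {"fwd_a": fwd_a, "fwd_b": fwd_b, "stall": stall}


# A load followed immediately by a consumer: the forwarding select still
# says "EX", but the stall keeps that stale selection from being latched.
print(hazard_controls(Instr("add", 4, 3, 0), Instr("lw", 3), Instr(), False))
```

Note that the load-use case produces a forwarding select and a stall at the same time; the stall is what makes the premature select harmless.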
What Building a Computer Teaches You About Building Software
Abstraction boundaries are the architecture. The most important decisions weren’t how to implement the ALU or the VGA timing generator. They were where to draw the line between the processor and the display, between the keyboard and the register file, between hardware and simulation. The register file’s special-purpose outputs, the simulation top module’s I/O ports, the parameterized memory paths — these boundary decisions determined whether the project would be a monolithic FPGA design or a portable, testable system. The value is not in the components, but in the contracts between them.
Constraints produce clarity. Working within a five-stage pipeline with no out-of-order execution forces you to think about every instruction’s dependency chain. There’s no garbage collector, no virtual memory, no operating system to absorb your mistakes. This constraint made the binary search implementation better — not despite the limitation, but because of it. The shift-instead-of-divide optimization emerges only when you can’t hide behind abstraction. In product engineering, artificial constraints — fixed deadlines, limited APIs, performance budgets — serve the same clarifying function.
End-to-end ownership reveals integration risk. When you own every layer from transistor to pixel, you discover that the hardest bugs live at boundaries. A register file that maps word outputs to the wrong indices. A display module that declares its inputs as one-bit instead of thirty-two-bit. An assembly program that tries to clear a register the hardware doesn’t allow it to write. None of these bugs exist within any single module — they only manifest when modules meet. This is the strongest argument for full-stack ownership: the person who understands both sides of an interface finds the bugs that neither specialist would.
Where This Goes Next
Input editing with backspace support. The current assembly accumulates letters into a word register through irreversible bit-shifting. Supporting backspace requires either a dedicated clear instruction in the ISA, a bitmask-and-rewrite pattern in assembly, or a small input buffer managed by the keyboard controller. This is a compelling extension because it touches every layer of the stack — hardware, ISA, assembly, and display — for a single user-facing feature.
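As a rough illustration of the assembly-only option, here is one possible encoding in Python. The packing is an assumption, not the project's actual format: five bits per letter, codes 1 to 26 for A to Z, newest letter in the low bits, and 0 meaning an empty slot. Under that encoding a right shift undoes a push, provided no letter has already been shifted out of the top.

```python
BITS = 5                              # assumed bits per letter
MASK = (1 << BITS) - 1                # selects one letter field
WORD_MASK = (1 << (5 * BITS)) - 1     # five letters fit in 25 bits


def push_letter(word, ch):
    """Append a letter: shift existing letters up, OR in the new code."""
    code = ord(ch.upper()) - ord("A") + 1
    return ((word << BITS) | code) & WORD_MASK


def backspace(word):
    """Drop the newest letter by shifting its field out of the low bits."""
    return word >> BITS


def decode(word):
    """Unpack the register back into a string, oldest letter first."""
    letters = []
    while word:
        letters.append(chr((word & MASK) - 1 + ord("A")))
        word >>= BITS
    return "".join(reversed(letters))


word = 0
for ch in "CRA":
    word = push_letter(word, ch)
print(decode(word))             # CRA
print(decode(backspace(word)))  # CR
```

Whatever the real encoding is, the display side would still need to treat an empty slot as a blank tile, which is why the feature reaches past the assembly layer.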
Correct duplicate-letter scoring. The current comparison logic checks whether a guessed letter exists anywhere in the target word, which produces incorrect yellow highlights when a letter appears multiple times. Proper Wordle scoring requires a frequency-aware algorithm that consumes matches greedily — greens first, then yellows up to the remaining count. Implementing this in combinational Verilog without sequential state is a non-trivial puzzle.
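The two-pass rule described above can be written down as a software reference model. This is a Python sketch for checking expected colors, not a combinational Verilog design; the function name `score` and the color strings are assumptions.

```python
from collections import Counter


def score(guess, target):
    """Frequency-aware Wordle scoring: greens first, then yellows."""
    result = ["gray"] * len(guess)
    remaining = Counter()  # target letters not consumed by a green

    # Pass 1: exact matches consume their letter outright.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            result[i] = "green"
        else:
            remaining[t] += 1

    # Pass 2: left to right, a yellow is granted only while copies remain.
    for i, g in enumerate(guess):
        if result[i] != "green" and remaining[g] > 0:
            result[i] = "yellow"
            remaining[g] -= 1

    return result


# Target "abide" has one E, so only the first E in "speed" turns yellow.
print(score("speed", "abide"))
# ['gray', 'gray', 'yellow', 'gray', 'yellow']
```

A model like this is useful as a golden reference when testing a hardware implementation in simulation, since the expected tile colors for any guess and target pair can be generated automatically.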
Game state transitions and restart. The game currently halts on a win with an infinite no-op loop. A complete implementation would render a victory or defeat screen, wait for restart input, re-trigger the random word generator, and reset all game-state registers. This extends the project’s state machine from a linear sequence to a proper game lifecycle — the difference between a demo and a product.
Try It Out
Check out the live demo or explore the source code on GitHub.