
FPGA Wordle: Building a Complete Game Console from Transistors to Touchpoints

A deep dive into the design and implementation of FPGA Wordle, a fully hardware-implemented word-guessing game that runs on a custom five-stage pipelined MIPS processor, with real-time VGA display and PS/2 keyboard input — all built from first principles in Verilog and assembly.

Published

Sun Aug 03 2025

Technologies Used

FPGA MIPS Verilog WebAssembly

What Happens When You Design Every Layer of the Stack

Imagine playing Wordle — but instead of opening a browser, the game runs on a processor you designed yourself. Not a pre-built ARM core, not a soft CPU IP block, but a five-stage pipelined MIPS processor hand-wired in Verilog, executing assembly you wrote, rendering to a VGA display through a sprite engine you built, reading input from a PS/2 keyboard through a protocol handler you integrated. Every instruction fetch, every pixel color, every hazard stall — authored from scratch.

That is what this project delivers. FPGA Wordle is a fully hardware-implemented word-guessing game targeting the Artix-7 FPGA, with a parallel desktop simulation path via Verilator and SDL3. The target audience is anyone who wants to understand what it really means to build a computer from the ground up — and the target outcome is a playable, interactive game that proves the entire stack works in concert.

The Gap Between “Understanding” a CPU and Actually Building One

Computer architecture courses teach you about pipelines, hazards, and forwarding in the abstract. You draw diagrams, you trace instructions through stages on paper, and you answer exam questions about what happens when a load is followed by an add. But there is an enormous gulf between understanding these concepts and orchestrating them into a system that actually does something useful — something with real-time display, physical keyboard input, and game logic that depends on every layer functioning correctly.

The pain point is not computational. Wordle is a trivial problem algorithmically. The real challenge is integration: making a processor that handles data hazards while simultaneously driving a VGA display at sixty frames per second, reading asynchronous keyboard input, validating words against a dictionary of nearly thirteen thousand entries stored in block RAM, and doing all of this within the timing constraints of a physical FPGA clock domain. This project exists because the best way to prove you understand a computer is to build one that does something no one can dismiss as a toy.

A Game Console Built from First Principles

The system delivers four tightly integrated capabilities, each representing a distinct engineering discipline:

  • A Pipelined Processor That Handles Its Own Hazards. The five-stage MIPS pipeline includes a dedicated forwarding unit that detects register dependencies across the Execute and Memory stages and reroutes data before it would cause a stall. When forwarding is not possible — such as when a load instruction’s result is consumed immediately — a separate stall unit freezes the pipeline for exactly one cycle. Multiply and divide operations, which require multiple clock cycles, serialize gracefully without corrupting in-flight instructions.

  • A Real-Time Display Engine Driven by Register State. Six processor registers are hardwired directly to the VGA display controller. When the processor writes a guess into any of these registers, the new word appears on screen within the next frame — no bus transaction, no interrupt, no display driver. The rendering pipeline reads a precomputed board bitmap, overlays fifty-by-fifty-pixel character sprites, and color-codes each tile based on a combinational comparison against the target word.

  • A Dictionary Validator Running in Assembly. The game validates every five-letter guess against a dictionary of nearly thirteen thousand words using binary search implemented entirely in MIPS assembly. The algorithm was hand-optimized to avoid the multiply-divide unit — using arithmetic right shifts for the midpoint calculation instead of division — keeping each search iteration at pipeline speed rather than waiting thirty-three cycles for a hardware divider.

  • A Dual-Target Architecture. The same processor, register file, and game logic run identically on physical FPGA hardware and on a desktop computer via Verilator simulation with SDL3 rendering. Only the I/O boundary changes: PS/2 scan codes on hardware become SDL keyboard events in simulation, and VGA sync signals become pixel buffer writes. The processor itself is never modified.

Engineering the Invisible: Architecture and the Reasoning Behind It

The Stack

Layer | Technology | Role
Processor Design | Verilog (IEEE 1364-2001) | Five-stage pipeline, ALU, forwarding, stalling, multiply-divide
Peripheral Interface | VHDL | PS/2 keyboard protocol handling
Game Logic | MIPS Assembly | Main loop, input handling, binary search, win/lose detection
Display Rendering | Verilog | VGA timing generation, sprite lookup, color-coded tile rendering
Simulation Harness | C++17, SDL3, CMake | Desktop execution without FPGA hardware
Build Tooling | Verilator, CPM, custom MIPS assembler | Compilation, simulation, cross-platform assembly
Data Pipeline | Python | Dictionary encoding, sprite conversion, memory file generation

Why These Choices Matter

Tri-state bus architecture for the register file. The register file uses thirty-two tri-state drivers per read port, one per register, with a decoder enabling exactly one driver at a time. A software engineer’s instinct would be to write a multiplexer tree directly, and on the Artix-7 that is what the hardware ends up being: 7-series devices have no internal tri-state buffers, so the synthesis tool lowers the drivers into multiplexer logic. The benefit is in how the source is organized. Each register owns its driver and its enable, so the memory-mapped I/O registers need no additional decode logic. The keyboard input register, for instance, is simply another tri-state driver on the same bus, gated by its own enable signal. Adding a special-purpose register means adding one driver per port in the Verilog rather than editing a wider mux.
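
The read-port behavior described here can be modeled in a few lines of Python. This is a behavioral sketch, not the project's Verilog: the names `one_hot_decode`, `tri_state`, `read_port`, and `HIGH_Z` are hypothetical, and a floating driver is represented by `None`.

```python
from typing import List, Optional

HIGH_Z = None  # stand-in for a driver that is not driving the bus


def one_hot_decode(index: int, width: int) -> List[bool]:
    """Return `width` enable lines with exactly one of them asserted."""
    if not 0 <= index < width:
        raise ValueError("register index out of range")
    return [i == index for i in range(width)]


def tri_state(value: int, enable: bool) -> Optional[int]:
    """A driver passes its value when enabled and floats otherwise."""
    return value if enable else HIGH_Z


def read_port(registers: List[int], index: int) -> int:
    """Resolve the shared bus for one read port."""
    enables = one_hot_decode(index, len(registers))
    driven = [tri_state(v, en) for v, en in zip(registers, enables)]
    active = [v for v in driven if v is not HIGH_Z]
    # Exactly one driver may be active; anything else is contention
    # or a floating bus.
    if len(active) != 1:
        raise RuntimeError("bus contention or floating bus")
    return active[0]


# Illustrative contents: register i holds i * 10.
regs = [i * 10 for i in range(32)]
value = read_port(regs, 7)
```

In this model, adding a special-purpose register is one more entry in the list of drivers and one more enable line, which mirrors the per-driver cost the paragraph describes.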

Zero-latency display coupling through register hardwiring. A conventional approach would memory-map the display buffer and have the processor write pixels or characters through store instructions. Instead, six registers in the register file have their outputs wired directly to the display controller as combinational signals. The moment the processor’s writeback stage commits a value, the display controller sees it on the very next clock edge. This eliminates an entire class of synchronization problems and guarantees that display updates are never more than one frame behind the processor. This decision was only possible because the game state is small enough — six thirty-two-bit words — to justify dedicated wiring.

Assembly-level pipeline awareness in the binary search. The dictionary search avoids the hardware divider entirely. A divide operation would stall the pipeline for over thirty cycles per iteration, and with fourteen iterations per search, that compounds to over four hundred wasted cycles per guess validation. By replacing division with a single-cycle arithmetic right shift, the search runs at full pipeline throughput. This is not a generic optimization — it is a decision that only makes sense when you know your own pipeline’s latency characteristics because you designed it.
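
A minimal Python model of the search loop, assuming the dictionary is a sorted array of integer-encoded words. The encoding, the stand-in data, and the name `search` are illustrative and not taken from the project's assembly. The `>> 1` plays the role of the single-cycle arithmetic right shift; at the article's figures, a divide in every iteration would cost up to 33 × 14 = 462 cycles per lookup.

```python
from typing import List, Tuple


def search(words: List[int], key: int) -> Tuple[bool, int]:
    """Binary search that also reports how many iterations it took."""
    lo, hi = 0, len(words) - 1
    iterations = 0
    while lo <= hi:
        iterations += 1
        mid = (lo + hi) >> 1  # shift right by one instead of dividing by two
        if words[mid] == key:
            return True, iterations
        if words[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, iterations


# 13,000 sorted stand-in values (an assumed round figure for the
# "nearly thirteen thousand" entries mentioned in the article).
dictionary = list(range(0, 26000, 2))

# Worst case over every present key and every gap between keys.
worst = max(search(dictionary, k)[1] for k in range(-1, 26001))
```

With 13,000 entries the loop never runs more than fourteen times, because 2^13 = 8,192 < 13,000 < 16,384 = 2^14.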

The Hardest Problem: Three Hazard Systems That Cannot Interfere

The most complex engineering in the project is not any single module. It is the interaction between three independent hazard-resolution mechanisms that must coexist without corrupting each other.

The forwarding unit continuously compares the destination register of instructions in the Execute and Memory stages against the source registers of the instruction currently in Decode. When a match is detected, it reroutes the computed value back to the ALU inputs before the next clock edge, eliminating what would otherwise be a two-cycle data dependency stall. This logic runs combinationally — it must produce a valid forwarding selection before the ALU begins its computation.

Simultaneously, the stall unit monitors a different condition: whether a load-word instruction in the Execute stage has its result consumed by the very next instruction. Forwarding cannot solve this because the loaded value is not available until the Memory stage completes. The stall unit freezes the program counter and the Fetch-Decode latch for one cycle, allowing the load to complete before its consumer proceeds.
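
The two checks described so far can be written as a small behavioral model. This is a sketch under stated assumptions, not the project's Verilog: the `Instr` record and both function names are hypothetical, register 0 is assumed to be the hardwired MIPS zero register, and giving the Execute-stage producer priority over the Memory-stage one is the conventional choice rather than something the article states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    dest: Optional[int] = None   # register written, if any
    src_a: Optional[int] = None  # first source register, if any
    src_b: Optional[int] = None  # second source register, if any
    is_load: bool = False        # True for a load-word instruction


def forward_select(src: Optional[int], in_execute: Instr, in_memory: Instr) -> str:
    """Pick where one source operand of the Decode-stage instruction comes from."""
    if src is None or src == 0:  # assumed: $zero is never forwarded
        return "regfile"
    if in_execute.dest == src:   # the younger producer holds the newest value
        return "execute"
    if in_memory.dest == src:
        return "memory"
    return "regfile"


def load_use_stall(in_decode: Instr, in_execute: Instr) -> bool:
    """True when a load in Execute feeds the very next instruction."""
    if not in_execute.is_load or in_execute.dest in (None, 0):
        return False
    return in_execute.dest in (in_decode.src_a, in_decode.src_b)


# lw $8, ... followed immediately by add $9, $8, $10
lw = Instr(dest=8, is_load=True)
add = Instr(dest=9, src_a=8, src_b=10)
needs_stall = load_use_stall(add, lw)
```

Both functions are pure, which matches the article's point that the real logic is combinational: the outputs depend only on what is currently sitting in each stage.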

The third mechanism governs multiply and divide operations, which require seventeen to thirty-three clock cycles. A dedicated latch captures the operands at the onset of the operation and holds the pipeline frozen until the result is ready, at which point it is injected directly into the writeback stage.

The critical design challenge is that a single stall signal gates all three mechanisms. If a load-use hazard and a multiply operation were to trigger simultaneously — or if a branch instruction needed to flush the pipeline while a multiply was in progress — incorrect gating could allow corrupted data to propagate. The solution is a unified stall signal that is the logical OR of all three conditions, freezing the entire front end of the pipeline whenever any hazard is active. This is conceptually simple but required careful verification: the stall must engage before the forwarding unit commits to a selection, yet the forwarding logic depends on the same instruction metadata that the stall logic examines. The design resolves this by computing both in parallel and using the stall to prevent the downstream stage from latching an invalid forwarding result.
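
The effect of the unified stall on the front end can be sketched as follows. The `FrontEnd` record, the `clock` function, and the example values are assumptions for illustration; the only behavior taken from the article is that the stall is an OR of the individual conditions and that it holds the program counter and the Fetch-Decode latch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrontEnd:
    pc: int            # program counter
    fetch_decode: int  # instruction word held in the Fetch-Decode latch


def unified_stall(*conditions: bool) -> bool:
    """Logical OR of every hazard condition that can freeze the front end."""
    return any(conditions)


def clock(front: FrontEnd, fetched: int, stall: bool) -> FrontEnd:
    """Advance one clock edge; a stalled front end keeps its state."""
    if stall:
        return front
    return FrontEnd(pc=front.pc + 4, fetch_decode=fetched)


start = FrontEnd(pc=0x40, fetch_decode=0xAAAA)
front = start

# Cycle 1: a load-use hazard and a busy multiplier at the same time.
# Cycles 2 and 3: only the multiplier is still busy.
for load_use, muldiv_busy in [(True, True), (False, True), (False, True)]:
    front = clock(front, fetched=0xBBBB, stall=unified_stall(load_use, muldiv_busy))
held = front

# Cycle 4: nothing is stalling, so the front end finally advances.
front = clock(front, fetched=0xBBBB, stall=unified_stall(False, False))
```

Because the conditions are OR-ed, overlapping hazards need no special case: the front end stays frozen until the last one clears.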

What Building a Computer Teaches You About Building Software

Abstraction boundaries are the architecture. The most important decision in this project was not how to implement the ALU or the VGA timing generator. It was where to draw the line between the processor and the display, between the keyboard and the register file, between hardware and simulation. The register file’s special-purpose outputs, the simulation top module’s I/O ports, the parameterized memory paths — these boundary decisions determined whether the project would be a monolithic FPGA design or a portable, testable system. The same principle applies to any software architecture: the value is not in the components, but in the contracts between them.

Constraints produce clarity. Working within a five-stage pipeline with no out-of-order execution forces you to think about every instruction’s dependency chain. There is no garbage collector, no virtual memory, no operating system to absorb your mistakes. This constraint made the binary search implementation better — not despite the limitation, but because of it. The shift-instead-of-divide optimization is the kind of insight that emerges only when you cannot hide behind abstraction. In product engineering, artificial constraints — fixed deadlines, limited APIs, performance budgets — serve the same clarifying function.

End-to-end ownership reveals integration risk. When you own every layer from transistor to pixel, you discover that the hardest bugs live at the boundaries. A register file that maps word outputs to the wrong indices. A display module that declares its inputs as one-bit instead of thirty-two-bit. An assembly program that tries to clear a register the hardware does not allow it to write. None of these bugs exist within any single module — they only manifest when modules meet. This is the strongest argument for full-stack ownership, even in software teams: the person who understands both sides of an interface will find the bugs that neither specialist would.

Where This Goes Next

Three natural extensions would elevate the project from a proof-of-concept to a polished product:

Input editing with backspace support. The current assembly accumulates letters into a word register through irreversible bit-shifting. Supporting backspace requires either a dedicated clear instruction in the ISA, a bitmask-and-rewrite pattern in assembly, or a small input buffer managed by the keyboard controller. This is a compelling extension because it touches every layer of the stack — hardware, ISA, assembly, and display — for a single user-facing feature.
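
One way the bitmask-and-rewrite option could look, modeled in Python. Everything about the layout is an assumption made for illustration and is not the project's actual encoding: letters are stored as five-bit codes in fixed slots, slot 0 in the low bits, code 0 means an empty slot, and a separate letter count is kept alongside the word register.

```python
from typing import Tuple

LETTER_BITS = 5                        # assumed width of one letter code
LETTER_MASK = (1 << LETTER_BITS) - 1   # 0b11111
MAX_LETTERS = 5
WORD_MASK = 0xFFFFFFFF                 # keep results within a 32-bit register


def encode(ch: str) -> int:
    """Map 'a'..'z' to 1..26 so that 0 can mean an empty slot."""
    return ord(ch.lower()) - ord("a") + 1


def push_letter(word: int, count: int, ch: str) -> Tuple[int, int]:
    """Write a letter into the next free slot; ignore input when full."""
    if count >= MAX_LETTERS:
        return word, count
    return word | (encode(ch) << (LETTER_BITS * count)), count + 1


def backspace(word: int, count: int) -> Tuple[int, int]:
    """Clear the most recently written slot with a mask; no-op when empty."""
    if count == 0:
        return word, count
    count -= 1
    slot_mask = LETTER_MASK << (LETTER_BITS * count)
    return word & ~slot_mask & WORD_MASK, count


def type_keys(keys: str) -> Tuple[int, int]:
    """Replay a key sequence; '<' stands for the backspace key."""
    word, count = 0, 0
    for key in keys:
        word, count = backspace(word, count) if key == "<" else push_letter(word, count, key)
    return word, count
```

Under these assumptions, typing `cra<o` leaves the register in the same state as typing `cro`, which is the property a backspace feature needs regardless of which of the three approaches is chosen.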

Correct duplicate-letter scoring. The current comparison logic checks whether a guessed letter exists anywhere in the target word, which produces incorrect yellow highlights when a letter appears multiple times. Proper Wordle scoring requires a frequency-aware algorithm that consumes matches greedily — greens first, then yellows up to the remaining count. Implementing this in combinational Verilog without sequential state is a non-trivial puzzle that would demonstrate advanced hardware design skill.
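
The difference between the two scoring rules is easiest to see side by side. The sketch below is a software model in Python, not a proposal for the combinational Verilog: `score_naive` follows the membership-only check described above, and `score` is the usual two-pass approach (greens first, then yellows limited by the letters still unmatched). Function names and color strings are illustrative.

```python
from collections import Counter
from typing import List


def score_naive(guess: str, target: str) -> List[str]:
    """Yellow whenever the letter appears anywhere in the target."""
    return [
        "green" if g == t else "yellow" if g in target else "gray"
        for g, t in zip(guess, target)
    ]


def score(guess: str, target: str) -> List[str]:
    """Frequency-aware scoring: greens first, then yellows up to the remaining count."""
    result = ["gray"] * len(guess)
    remaining = Counter()  # target letters not consumed by a green

    # Pass 1: mark exact matches and count the unmatched target letters.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            result[i] = "green"
        else:
            remaining[t] += 1

    # Pass 2: hand out yellows left to right while unmatched letters remain.
    for i, g in enumerate(guess):
        if result[i] != "green" and remaining[g] > 0:
            result[i] = "yellow"
            remaining[g] -= 1
    return result


naive = score_naive("eerie", "crane")
fixed = score("eerie", "crane")
```

For the guess "eerie" against the target "crane", the naive rule marks the first two letters yellow even though the only "e" in the target is already matched by the final green; the frequency-aware rule leaves them gray.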

Game state transitions and restart. The game currently halts on a win with an infinite no-op loop. A complete implementation would render a victory or defeat screen, wait for a restart input, re-trigger the random word generator, and reset all game-state registers. This extends the project’s state machine from a linear sequence to a proper game lifecycle, which is the difference between a demo and a product.

Try It Out

Check out the live demo or explore the source code on GitHub.
