RTL Development Guide
Project Overview
This is a multi-cycle non-pipelined RISC-V RV32IMACF CPU implementation in SystemVerilog with Rust-based verification.
Key Components:
- Architecture: Multi-cycle non-pipelined design with 12-state FSM (including S_ATOMIC_RMW for atomic operations) and variable-latency memory support
- Memory Interface: Ready/valid handshaking for instruction and data memory operations
- Instruction Set: RV32IMACF_Zicsr (118 instructions: 40 base + 8 multiply/divide + 11 atomic + 27 compressed + 26 floating-point + 6 CSR)
Module Hierarchy
top (CPU)
├── fetch_buffer (RV32C fetch buffer - manages compressed instruction alignment)
├── decompress (RV32C instruction decompressor - combinational)
├── decoder (Instruction decoder)
├── alu (ALU operations - RV32I + M extension)
│ └── div_unit (Hardware division unit)
├── regfile (Register file)
├── csr_file (Control and Status Registers)
├── branch_unit (Branch comparison)
├── mem_interface (Memory interface logic)
└── writeback_mux (Result selection)
Multi-cycle Architecture
FSM States
The CPU uses a 12-state finite state machine:
- S_BOOT (0x0): After reset, waits for boot signal before first fetch
- S_FETCH (0x1): Request instruction from memory, wait for
imem_ready
- S_DECODE (0x2): Decode instruction, read registers
- S_EXECUTE (0x3): Execute ALU operation
- S_MEM_ADDR (0x4): Calculate memory address for load/store
- S_MEM_READ (0x5): Request data from memory, wait for
dmem_ready
- S_MEM_WRITE (0x6): Write data to memory, wait for
dmem_ready
- S_WRITEBACK (0x7): Write result to destination register
- S_BRANCH (0x8): Evaluate branch condition and update PC
- S_CSR (0x9): Execute CSR operation
- S_ATOMIC_RMW (0xB): Atomic read-modify-write operations
- S_HALT (0xA): ECALL/EBREAK halt state
Instruction Cycle Counts
Different instruction types require different numbers of cycles:
| Instruction Class |
Base Cycles |
States |
| R-type (ADD, SUB, etc.) |
4 |
FETCH → DECODE → EXECUTE → WRITEBACK |
| I-type Arithmetic |
4 |
FETCH → DECODE → EXECUTE → WRITEBACK |
| Load (LW, LH, LB) |
5 |
FETCH → DECODE → MEM_ADDR → MEM_READ → WRITEBACK |
| Store (SW, SH, SB) |
4 |
FETCH → DECODE → MEM_ADDR → MEM_WRITE |
| Branch |
3 |
FETCH → DECODE → BRANCH |
| Jump (JAL/JALR) |
4 |
FETCH → DECODE → EXECUTE → WRITEBACK |
| Upper Immediate |
4 |
FETCH → DECODE → EXECUTE → WRITEBACK |
| M-Extension (MUL/DIV) |
4 |
FETCH → DECODE → EXECUTE → WRITEBACK |
| System (FENCE) |
2 |
FETCH → DECODE |
| System (ECALL/EBREAK) |
2 |
FETCH → DECODE → HALT |
| CSR Operations |
4 |
FETCH → DECODE → CSR → WRITEBACK |
Note: Memory latency adds additional cycles. For example, with 3-cycle memory latency, a load instruction takes 5 base cycles + 3 cycles in FETCH + 3 cycles in MEM_READ = 11 total cycles.
Memory Interface Signals
The multi-cycle design adds handshaking signals:
Instruction Memory:
imem_req (output): CPU requests instruction fetch
imem_ready (input): Memory has valid instruction data
imem_addr (output): Instruction address
imem_data (input): Instruction data
Data Memory:
dmem_req (output): CPU requests memory operation
dmem_ready (input): Memory operation complete
dmem_addr (output): Data address
dmem_wdata (output): Write data
dmem_rdata (input): Read data
dmem_we (output): Write enable
dmem_re (output): Read enable
dmem_size (output): Operation size (byte/halfword/word)
Instruction Completion Signal
instr_complete (output): High for 1 cycle when instruction finishes execution
Supported Instructions
RV32I Base (40 instructions):
- Arithmetic: ADD, ADDI, SUB
- Logic: AND, ANDI, OR, ORI, XOR, XORI
- Shifts: SLL, SLLI, SRL, SRLI, SRA, SRAI
- Comparison: SLT, SLTI, SLTU, SLTIU
- Branches: BEQ, BNE, BLT, BGE, BLTU, BGEU
- Memory: LW, LH, LB, LHU, LBU, SW, SH, SB
- Upper Immediate: LUI, AUIPC
- Jumps: JAL, JALR
- Memory Ordering: FENCE
- System: ECALL, EBREAK
M Extension - Integer Multiplication and Division (8 instructions):
- Multiplication: MUL, MULH, MULHSU, MULHU
- Division: DIV, DIVU
- Remainder: REM, REMU
A Extension - Atomic Instructions (11 instructions):
- Load-Reserved/Store-Conditional: LR.W, SC.W
- Atomic Memory Operations: AMOSWAP.W, AMOADD.W, AMOXOR.W, AMOAND.W, AMOOR.W
- Atomic MIN/MAX: AMOMIN.W, AMOMAX.W, AMOMINU.W, AMOMAXU.W
C Extension - Compressed Instructions (27 instructions):
- Quadrant 0: C.ADDI4SPN, C.LW, C.SW
- Quadrant 1: C.NOP, C.ADDI, C.JAL, C.LI, C.ADDI16SP, C.LUI, C.SRLI, C.SRAI, C.ANDI, C.SUB, C.XOR, C.OR, C.AND, C.J, C.BEQZ, C.BNEZ
- Quadrant 2: C.SLLI, C.LWSP, C.JR, C.MV, C.EBREAK, C.JALR, C.ADD, C.SWSP
- Benefits: 16-bit encoding (vs 32-bit standard), 25-30% code size reduction, seamless mixing with standard instructions
F Extension - Single-Precision Floating-Point (26 instructions):
- Arithmetic: FADD.S, FSUB.S, FMUL.S, FDIV.S, FSQRT.S
- Fused Multiply-Add: FMADD.S, FMSUB.S, FNMSUB.S, FNMADD.S
- MIN/MAX: FMIN.S, FMAX.S
- Sign Injection: FSGNJ.S, FSGNJN.S, FSGNJX.S
- Comparisons: FEQ.S, FLT.S, FLE.S
- Conversions: FCVT.W.S, FCVT.WU.S, FCVT.S.W, FCVT.S.WU
- Load/Store: FLW, FSW
- Move/Classify: FMV.X.W, FMV.W.X, FCLASS.S
- Features: 32-register FP file (f0-f31), IEEE 754-2008 compliant, FCSR for rounding modes and exception flags
Zicsr Extension (6 instructions):
- CSR Access: CSRRW, CSRRS, CSRRC, CSRRWI, CSRRSI, CSRRCI
Key Design Decisions
- Multi-cycle execution: Instructions take 3-5+ base cycles plus variable memory latency
- FSM-based control: 12-state finite state machine
- Variable-latency memory: Ready/valid handshaking on instruction and data memory interfaces
- Exposed memory ports: Instruction and data memory are external (managed by testbench)
- Register x0 hardwired to zero: Hardware enforcement (not just software convention)
- Separate branch unit: Dedicated branch comparison logic (not ALU-based)
- CSR support: Full Control and Status Register implementation (Zicsr extension)
- FIFO-based debug: MMIO FIFO at 0x40000000 for host communication with packet protocol
- Staging registers: Flip-flop based intermediate storage for multi-cycle operation (FPGA-safe, no latches)
Coding Conventions
Signal Naming
- Use
snake_case for signal names
- Prefix with purpose:
imem_, dmem_, alu_, etc.
- Keep ports consistent with RISC-V naming:
rs1, rs2, rd, funct3, etc.
Reset Conventions
- Use synchronous resets only in project RTL modules.
- Default to active-high reset ports as
rst for internal RTL modules. The supported FPGA flows map these resets efficiently without extra inversion logic, so active-high remains the project default.
- For sequential logic, use
always_ff @(posedge clk) (or the local clock domain) and perform reset inside the block with if (rst).
- Reserve active-low resets for special cases, usually external board or device signals that already arrive active-low. Convert those signals to the internal active-high convention as close to the boundary as practical.
- When a datapath payload register has a separate
valid, pending, or similar control bit, reset the control bit rather than the payload register itself. Write or refresh the payload whenever you capture new data, typically in the same branch where you set/assert the control bit, and downstream logic must ignore the payload whenever that control bit is low.
- Avoiding resets on payload-only registers reduces reset fanout and routing congestion on FPGA hardware without changing functional behavior.
Timing and Frequency Priority
- Maximizing achievable Fmax and timing margin is a repo priority. Prefer registered signals and shorter combinational cones to improve timing closure against the documented target constraint.
- Default to multi-cycle staging when needed instead of forcing large arithmetic, mux, compare, or control cones into a single cycle for Fmax.
- Prefer registering intermediate values and outputs at natural boundaries (
register -> logic -> register) whenever that does not require major architectural changes.
- Signals that cannot be registered without major architectural changes—especially ready/valid-style handshake returns such as
*_ready—are exempt from this guidance.
- When an unregistered path is kept for architectural reasons, document the exemption clearly so the timing trade-off is explicit.
default_nettype Guards (MANDATORY)
Every .sv file must begin with `default_nettype none and end with `default_nettype wire:
`default_nettype none
// … file content …
`default_nettype wire
- The opening
`default_nettype none turns implicit net declarations into compile errors, catching undeclared signals before they silently become 1-bit wires.
- The closing
`default_nettype wire restores the default so the guard does not bleed into other files included after this one.
- All project
.sv files already carry these guards. Any new file added under rtl/ must include them.
Linting
# Lint SystemVerilog files before committing (RTL files are in subdirectories)
find rtl/common -name '*.sv' -exec verilator --lint-only --Wno-MULTITOP {} +
All SystemVerilog code should pass Verilator linting before being committed.
FPGA Synthesis Verification
# Verify RTL can be synthesized to FPGA (whenever SystemVerilog is modified)
(cd rtl/fpga && make)
Important: CI automatically runs FPGA synthesis verification on all SystemVerilog changes. The design must successfully synthesize to the default ECP5 target (ecp5_icepi_zero) using Yosys/nextpnr-ecp5.
Key constraints:
- Target frequency: 50 MHz
- Resource limit: 24,288 LUT4-class combinational cells
- The default open-source target keeps
ENABLE_F_EXT=0; check current build reports for the latest resource/timing headroom
Debugging Hardware
CRITICAL RULE: When debugging hardware, NEVER rely heavily on abstract reasoning about what signals "should" be doing.
Correct Debugging Approach
- Add
$display() statements to observe actual signal values
- Print state transitions to see FSM behavior
- Observe timing with cycle-by-cycle output
- Base hypotheses on concrete data from simulation
- Verify assumptions with additional instrumentation
Example Debug Instrumentation
always_ff @(posedge clk) begin
if (state == S_FETCH) begin
$display("FETCH: pc=%h instr=%h imem_ready=%b", pc, imem_data, imem_ready);
end
if (state == S_EXECUTE) begin
$display("EXECUTE: alu_op=%h rs1_data=%h rs2_data=%h result=%h",
alu_op, rs1_data, rs2_data, alu_result);
end
end
What NOT to Do
- ❌ Assuming signal values without checking them
- ❌ Predicting FSM state transitions without observation
- ❌ Guessing timing relationships
- ❌ Reasoning through complex logic without concrete data
Key Principle: Treat hardware debugging like experimental science - observe first, then reason based on evidence.