3-Stage RISC-V CPU
1. Overview
This project implements a 3-stage pipelined RV32I RISC-V CPU in Verilog, taken through synthesis and place-and-route in SkyWater 130 nm. The design emphasizes pipeline structuring under synchronous SRAM constraints, cache-driven stall behavior, and timing-aware microarchitectural tradeoffs, rather than ISA extensions or IPC optimizations.
The ASIC implementation closed timing at a 20.0 ns clock period (50 MHz) post-PAR. Post-layout timing analysis showed that branch resolution and cache interface logic dominate critical paths, consistent with expectations for a short pipeline coupled to synchronous memories. An earlier FPGA implementation of the same architectural design was used for early software bring-up and control-path validation.
Together, these projects demonstrate how a simple RISC-V microarchitecture can be adapted across platforms, with the ASIC implementation explicitly addressing physical realism, memory timing, stall behavior, and post-layout timing closure.
Note: This writeup describes my design approach and engineering process for a course project. Implementation details and source code are not included to respect academic integrity.
2. ASIC Implementation
Technology and Design Flow
The ASIC implementation targeted a standard-cell flow with synchronous SRAM macros, reflecting constraints commonly encountered in industrial ASIC designs.
- ISA: 32-bit RISC-V (RV32I)
- RTL: Verilog (SystemVerilog testbenches)
- Process: SkyWater 130 nm
- SRAMs: SRAM22-generated hard macros
- Flow Orchestration: Hammer
EDA Toolchain
- Simulation: Synopsys VCS
- Synthesis: Cadence Genus
- Place & Route + CTS: Cadence Innovus
- Post-layout Timing Analysis: Cadence Innovus
Physical verification (DRC/LVS) was out of scope for this project; full physical verification was performed in a separate tapeout project.
Pipeline Architecture
The CPU implements a 3-stage pipeline (IF/ID, EX/MEM, WB) with:
- Forwarding logic to resolve data hazards
- Stall control integrated with cache behavior
- Branch resolution with misprediction handling
The design required careful analysis of hazard conditions and bypass timing to minimize stalls while maintaining timing closure.
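To illustrate the branch-handling style described above, the following is a generic sketch under a predict-not-taken assumption; it is not the project's RTL, and all signal names are hypothetical:

```verilog
// Generic sketch (hypothetical names, not the project's RTL):
// next-PC selection under predict-not-taken. A branch that resolves
// taken in EX redirects fetch and flushes the younger instruction.
module next_pc_sel (
    input  wire [31:0] pc,            // current fetch PC
    input  wire [31:0] branch_target, // computed in EX
    input  wire        branch,        // EX holds a branch instruction
    input  wire        cmp_taken,     // branch comparator result
    output wire [31:0] next_pc,
    output wire        flush_if       // squash the wrongly fetched inst
);
    wire taken = branch && cmp_taken;  // resolved in EX
    assign flush_if = taken;
    assign next_pc  = taken ? branch_target : pc + 32'd4;
endmodule
```

In a short pipeline like this one, resolving branches one stage after fetch keeps the misprediction penalty to a single bubble, at the cost of putting the comparator and target mux on a timing-critical path.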
Datapath and Control
Key Datapath Components
The datapath includes:
- Program Counter and next-PC logic
- Instruction cache interface
- Register file
- Immediate generator and branch comparator
- ALU
- Data cache interface
Supporting logic includes:
- Pipeline stall control driven by cache state
- Partial load/store handling for byte and halfword accesses (illustrated in the sketch below)
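The extraction logic for sub-word loads looks roughly like the following. This is a generic RV32I sketch with hypothetical names, not the project's RTL:

```verilog
// Generic sketch (hypothetical names, not the project's RTL):
// sign/zero extension for RV32I sub-word loads, selected by funct3
// and the low bits of the load address.
module load_extend (
    input  wire [31:0] mem_rdata,  // word read from the data cache
    input  wire [1:0]  addr_lsb,   // byte offset within the word
    input  wire [2:0]  funct3,     // LB/LH/LW/LBU/LHU encoding
    output reg  [31:0] load_data
);
    wire [7:0]  b = mem_rdata >> (8  * addr_lsb);     // selected byte
    wire [15:0] h = mem_rdata >> (16 * addr_lsb[1]);  // selected halfword

    always @(*) begin
        case (funct3)
            3'b000:  load_data = {{24{b[7]}},  b};  // LB  (sign-extend)
            3'b001:  load_data = {{16{h[15]}}, h};  // LH  (sign-extend)
            3'b010:  load_data = mem_rdata;         // LW
            3'b100:  load_data = {24'b0, b};        // LBU (zero-extend)
            3'b101:  load_data = {16'b0, h};        // LHU (zero-extend)
            default: load_data = mem_rdata;
        endcase
    end
endmodule
```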
Stalling and Hazard Management
The forwarding network minimizes stalls from data hazards, with remaining stalls primarily driven by cache behavior. Cache policy selection required careful analysis of stall frequency versus implementation complexity in a short pipeline.
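A minimal sketch of the bypass idea for one operand, with hypothetical names and not the project's RTL:

```verilog
// Generic sketch (hypothetical names, not the project's RTL):
// EX-stage bypass in a 3-stage pipeline. If the instruction now in WB
// writes the register EX is about to read, forward the WB result
// instead of the stale register-file output. x0 never forwards.
module forward_mux (
    input  wire        wb_reg_write, // WB stage writes the register file
    input  wire [4:0]  wb_rd,        // destination register in WB
    input  wire [4:0]  ex_rs1,       // source register read in EX
    input  wire [31:0] wb_result,    // value being written back
    input  wire [31:0] rf_rdata1,    // register-file read (possibly stale)
    output wire [31:0] alu_op_a      // forwarded operand into the ALU
);
    wire hit = wb_reg_write && (wb_rd != 5'd0) && (wb_rd == ex_rs1);
    assign alu_op_a = hit ? wb_result : rf_rdata1;
endmodule
```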
Control Logic
Control logic is implemented using explicit combinational decoding with control signals pipelined alongside datapath values.
This approach avoids ROM-based decode, simplifies debugging, and maps cleanly onto RV32I semantics. It also avoids introducing additional timing pressure in a tight pipeline.
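The shape of such a decoder, with defaults assigned up front, looks roughly as follows (a generic sketch with hypothetical names, not the project's RTL):

```verilog
// Generic sketch (hypothetical names, not the project's RTL):
// explicit combinational decode with defaults assigned first, so every
// path drives every output and no latch or feedback loop is inferred.
module decode_sketch (
    input  wire [6:0] opcode,
    output reg        reg_write,
    output reg        mem_read,
    output reg        mem_write,
    output reg        branch
);
    always @(*) begin
        // Safe defaults first: unknown opcodes decode as a NOP.
        reg_write = 1'b0;
        mem_read  = 1'b0;
        mem_write = 1'b0;
        branch    = 1'b0;
        case (opcode)
            7'b0110011: reg_write = 1'b1;                            // OP
            7'b0010011: reg_write = 1'b1;                            // OP-IMM
            7'b0000011: begin reg_write = 1'b1; mem_read = 1'b1; end // LOAD
            7'b0100011: mem_write = 1'b1;                            // STORE
            7'b1100011: branch    = 1'b1;                            // BRANCH
            default: ;  // defaults already cover this case
        endcase
    end
endmodule
```

The default-first discipline shown here is the same one flagged in the Known Limitations section: missing defaults in complex decode blocks are a common source of inferred latches and combinational feedback.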
Cache Architecture
4 KiB Direct-Mapped Write-Back Cache
The CPU integrates a direct-mapped, write-back cache with:
- Cache lines sized for efficient refill over a multi-beat external memory interface
- Separate data and metadata storage using dedicated SRAM macros
- Multi-state FSM managing: idle detection, tag comparison, multi-cycle refill, and dirty eviction
The cache organization balanced timing closure constraints with refill efficiency, avoiding unnecessarily deep FSM pipelines while maintaining single-cycle hit latency. Data storage was partitioned into multiple SRAM banks to enable concurrent access and simplify physical placement. Bank organization was chosen to align with the external memory interface width, minimizing refill latency while meeting timing. The refill process balances memory bandwidth utilization against state machine complexity, completing line refills over multiple memory transactions without introducing unnecessary pipeline bubbles.
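The controller's state graph follows the classic direct-mapped write-back pattern. The following is a generic sketch with hypothetical names, not the project's RTL:

```verilog
// Generic sketch (hypothetical names, not the project's RTL): control
// FSM for a direct-mapped write-back cache. Hits resolve in COMPARE;
// a miss to a dirty line detours through WRITEBACK before REFILL.
module cache_fsm_sketch (
    input  wire       clk, rst,
    input  wire       req_valid,   // CPU access pending
    input  wire       hit,         // tag match on the indexed line
    input  wire       line_dirty,  // victim line has been written
    input  wire       last_beat,   // final beat of a multi-beat transfer
    output reg  [1:0] state
);
    localparam [1:0] IDLE = 2'd0, COMPARE = 2'd1,
                     WRITEBACK = 2'd2, REFILL = 2'd3;

    always @(posedge clk) begin
        if (rst) state <= IDLE;
        else case (state)
            IDLE:      if (req_valid)       state <= COMPARE;
            COMPARE:   if (hit)             state <= IDLE;      // 1-cycle hit
                       else if (line_dirty) state <= WRITEBACK; // evict first
                       else                 state <= REFILL;
            WRITEBACK: if (last_beat)       state <= REFILL;    // line flushed
            REFILL:    if (last_beat)       state <= COMPARE;   // retry access
        endcase
    end
endmodule
```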
Timing Closure and Physical Design
Timing Results
The design closed timing at a 20.0 ns clock period (50 MHz), with positive slack margins both post-synthesis and post-PAR.
Post-layout timing analysis confirmed that branch-related logic and cache interface paths represented the critical path. This aligns with expectations for a short pipeline with synchronous instruction memory.
Clock Tree Synthesis
Clock tree synthesis produced a reasonably balanced tree, with insertion-delay skew between the shortest and longest branches well within the timing budget. The residual skew was likely driven by SRAM macro placement, suggesting that improved floorplanning could recover additional margin.
Optimization Decisions
Key areas of optimization included:
- Forwarding network design to minimize data hazard stalls
- Cache write policy selection balancing performance and complexity
- Branch prediction strategy suitable for a short pipeline
- Strategic SRAM macro placement during physical design
These optimizations required analyzing performance tradeoffs across benchmark workloads and balancing against timing closure constraints.
Observed tradeoffs:
- SRAM placement affected area and routing modestly, but performance gains were limited at the fixed clock constraint.
- Cache behavior dominated stalls. Further gains would likely require deeper pipelining, prefetching, or higher associativity.
Performance Results
Benchmarks were run under four configurations (cache/no-cache, forwarding/no-forwarding) using an intentionally idealized single-cycle main memory model. This setup isolates the structural overhead introduced by the cache itself, rather than modeling a realistic memory hierarchy.
Under this assumption, cache miss penalties dominate execution time, making the cache appear unfavorable. With a realistic multi-cycle main memory, the performance trends would be expected to reverse.
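To make the expected reversal concrete (hypothetical numbers for illustration, not measured results): modeling average memory access time as AMAT = hit_time + miss_rate × miss_penalty, a 1-cycle main memory gives an uncached AMAT of 1 cycle, so a cache with 1-cycle hits and any nonzero miss rate can only add overhead. If main memory instead took 20 cycles, an uncached access would cost 20 cycles, while a cache with a 90% hit rate and a ~20-cycle miss penalty would average roughly 1 + 0.1 × 20 = 3 cycles per access.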
Five benchmarks exercised different workload characteristics:
- Memory-intensive random access patterns
- Compute-bound operations with data dependencies
- Recursive control flow
- Streaming memory access
- Mixed memory and compute operations
Key Observations:
- Forwarding provides substantial performance gains (typically 40-60%, with compute-intensive workloads showing even higher improvements) by minimizing data hazard stalls
- Cache and forwarding address different bottlenecks: forwarding resolves pipeline dependencies while caching reduces memory latency
- The cache + forwarding configuration represents the optimal balance for this architecture because:
  - Forwarding provides speedup by eliminating data hazard stalls (up to 88% in compute-intensive workloads)
  - Combined, they provide the best performance without requiring deeper pipelining or more complex hazard detection
3. FPGA-based Implementation (Separate Project)
A separate FPGA-based CPU implemented the same high-level RV32I architecture. This project targeted functional validation and software bring-up under FPGA-specific constraints.
Purpose and Relationship to ASIC Design
While the FPGA design informed architectural decisions, the ASIC implementation was developed independently and focuses on synchronous memory integration, cache microarchitecture, and physical timing closure.
Key Differences
- Asynchronous memory interfaces instead of synchronous SRAMs
- Simplified cache model without write-back behavior
- ~55 MHz target frequency on Xilinx PYNQ FPGA
- No explicit placement, routing, or CTS control
- Additional FPGA-specific peripherals (UART, buttons, synchronized I/O)
Address Space Partitioning and Memory-Mapped I/O
The FPGA implementation included address space partitioning and memory-mapped I/O to support software bring-up. A UART-based boot flow allowed programs to be loaded into memory at runtime, while memory-mapped counters enabled CPI measurement directly in hardware.
This infrastructure enabled interactive debugging and validation of control flow, memory operations, and pipeline behavior under real software execution.
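The CPI counters follow the standard memory-mapped pattern. The addresses and names below are hypothetical, not the project's actual memory map:

```verilog
// Generic sketch (hypothetical addresses/names, not the project's
// memory map): free-running cycle and retired-instruction counters
// exposed as memory-mapped registers, so software can compute CPI as
// cycle_cnt / instr_cnt.
module cpi_counters (
    input  wire        clk, rst,
    input  wire        instr_retired, // pulses once per retired instruction
    input  wire [31:0] addr,          // CPU load address within MMIO region
    output reg  [31:0] mmio_rdata
);
    reg [31:0] cycle_cnt, instr_cnt;

    always @(posedge clk) begin
        if (rst) begin
            cycle_cnt <= 32'd0;
            instr_cnt <= 32'd0;
        end else begin
            cycle_cnt <= cycle_cnt + 32'd1;
            if (instr_retired) instr_cnt <= instr_cnt + 32'd1;
        end
    end

    // Read decode: low address bits select a counter (hypothetical map).
    always @(*) begin
        case (addr[7:0])
            8'h10:   mmio_rdata = cycle_cnt;
            8'h14:   mmio_rdata = instr_cnt;
            default: mmio_rdata = 32'd0;
        endcase
    end
endmodule
```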
Overall Significance
The FPGA implementation acted as a physical implementation of the high-level architecture, while the ASIC implementation focused on microarchitectural tradeoffs and physical design constraints. Taken together, these projects demonstrate that the same architectural design principles can be applied across platforms with differing constraints.
4. Known Limitations
During post-synthesis simulation, an inferred combinational feedback path highlighted the importance of strict default-assignment discipline in complex control logic. While this issue did not affect timing closure, it would be addressed in a follow-on revision through stricter combinational structuring, explicit defaults, and synthesis-time assertions.
Resolving this would be the first priority before extending the architecture.
5. Engineering Takeaways
- Synchronous SRAM latency strongly influences pipeline partitioning.
- Cache design must be evaluated via stall behavior, not just hit rate.
- Short pipelines concentrate worst-case timing in control and branch paths.
- Clean RTL discipline becomes critical once designs reach synthesis and PAR.
Note: I received approval from course staff to publish this writeup.