3-Stage RISC-V CPU
1. Overview
This project implements a 3-stage pipelined RV32I RISC-V CPU in Verilog, taken through synthesis and place-and-route in SkyWater 130 nm. The design emphasizes pipeline structuring under synchronous SRAM constraints, cache-driven stall behavior, and timing-aware microarchitectural tradeoffs, rather than ISA extensions or IPC optimizations.
The ASIC implementation closed timing at a 20.0 ns clock period (50 MHz) post-PAR. Post-layout timing analysis showed that branch resolution and cache interface logic dominate critical paths, consistent with expectations for a short pipeline coupled to synchronous memories. An earlier FPGA implementation of the same architectural design was used for early software bring-up and control-path validation.
Together, these projects demonstrate how a simple RISC-V microarchitecture can be adapted across platforms, with the ASIC implementation explicitly addressing physical realism, memory timing, stall behavior, and post-layout timing closure.
Note: This writeup describes my design approach and engineering process for a course project. Implementation details and source code are not included to respect academic integrity.
2. ASIC Implementation
Technology and Design Flow
The ASIC implementation targeted a standard-cell flow with synchronous SRAM macros, reflecting constraints commonly encountered in industrial ASIC designs.
- ISA: 32-bit RISC-V (RV32I)
- RTL: Verilog (SystemVerilog testbenches)
- Process: SkyWater 130 nm
- SRAMs: SRAM22-generated hard macros
- Flow Orchestration: Hammer
EDA Toolchain
- Simulation: Synopsys VCS
- Synthesis: Cadence Genus
- Place & Route + CTS: Cadence Innovus
- Post-layout Timing Analysis: Cadence Innovus
Physical verification (DRC/LVS) was out of scope for this project; full physical verification was performed in a separate tapeout project.
Pipeline Architecture
The CPU implements a 3-stage pipeline (IF/ID, EX/MEM, WB) with:
- Forwarding logic to resolve data hazards
- Stall control integrated with cache behavior
- Branch resolution with misprediction handling
The design required careful analysis of hazard conditions and bypass timing to minimize stalls while maintaining timing closure.
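To illustrate the branch-handling style described above, the following is a generic sketch under a predict-not-taken assumption; it is not the project's RTL, and all signal names are hypothetical:

```verilog
// Generic sketch (hypothetical names, not the project's RTL):
// next-PC selection under predict-not-taken. A branch that resolves
// taken in EX redirects fetch and flushes the younger instruction.
module next_pc_sel (
    input  wire [31:0] pc,            // current fetch PC
    input  wire [31:0] branch_target, // computed in EX
    input  wire        branch,        // EX holds a branch instruction
    input  wire        cmp_taken,     // branch comparator result
    output wire [31:0] next_pc,
    output wire        flush_if       // squash the wrongly fetched inst
);
    wire taken = branch && cmp_taken;  // resolved in EX
    assign flush_if = taken;
    assign next_pc  = taken ? branch_target : pc + 32'd4;
endmodule
```

In a short pipeline like this one, resolving branches one stage after fetch keeps the misprediction penalty to a single bubble, at the cost of putting the comparator and target mux on a timing-critical path.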
Datapath and Control
Key Datapath Components
The datapath includes:
- Program Counter and next-PC logic
- Instruction cache interface
- Register file
- Immediate generator and branch comparator
- ALU
- Data cache interface
Supporting logic includes:
- Pipeline stall control driven by cache state
- Partial load/store handling for byte and halfword accesses (illustrated in the sketch below)
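The extraction logic for sub-word loads looks roughly like the following. This is a generic RV32I sketch with hypothetical names, not the project's RTL:

```verilog
// Generic sketch (hypothetical names, not the project's RTL):
// sign/zero extension for RV32I sub-word loads, selected by funct3
// and the low bits of the load address.
module load_extend (
    input  wire [31:0] mem_rdata,  // word read from the data cache
    input  wire [1:0]  addr_lsb,   // byte offset within the word
    input  wire [2:0]  funct3,     // LB/LH/LW/LBU/LHU encoding
    output reg  [31:0] load_data
);
    wire [7:0]  b = mem_rdata >> (8  * addr_lsb);     // selected byte
    wire [15:0] h = mem_rdata >> (16 * addr_lsb[1]);  // selected halfword

    always @(*) begin
        case (funct3)
            3'b000:  load_data = {{24{b[7]}},  b};  // LB  (sign-extend)
            3'b001:  load_data = {{16{h[15]}}, h};  // LH  (sign-extend)
            3'b010:  load_data = mem_rdata;         // LW
            3'b100:  load_data = {24'b0, b};        // LBU (zero-extend)
            3'b101:  load_data = {16'b0, h};        // LHU (zero-extend)
            default: load_data = mem_rdata;
        endcase
    end
endmodule
```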
Stalling and Hazard Management
The forwarding network minimizes stalls from data hazards, with remaining stalls primarily driven by cache behavior. Cache policy selection required careful analysis of stall frequency versus implementation complexity in a short pipeline.
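A minimal sketch of the bypass idea for one operand, with hypothetical names and not the project's RTL:

```verilog
// Generic sketch (hypothetical names, not the project's RTL):
// EX-stage bypass in a 3-stage pipeline. If the instruction now in WB
// writes the register EX is about to read, forward the WB result
// instead of the stale register-file output. x0 never forwards.
module forward_mux (
    input  wire        wb_reg_write, // WB stage writes the register file
    input  wire [4:0]  wb_rd,        // destination register in WB
    input  wire [4:0]  ex_rs1,       // source register read in EX
    input  wire [31:0] wb_result,    // value being written back
    input  wire [31:0] rf_rdata1,    // register-file read (possibly stale)
    output wire [31:0] alu_op_a      // forwarded operand into the ALU
);
    wire hit = wb_reg_write && (wb_rd != 5'd0) && (wb_rd == ex_rs1);
    assign alu_op_a = hit ? wb_result : rf_rdata1;
endmodule
```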
Control Logic
Control logic is implemented using explicit combinational decoding with control signals pipelined alongside datapath values.
This approach avoids ROM-based decode, simplifies debugging, and maps cleanly onto RV32I semantics. It also avoids introducing additional timing pressure in a tight pipeline.
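The shape of such a decoder, with defaults assigned up front, looks roughly as follows (a generic sketch with hypothetical names, not the project's RTL):

```verilog
// Generic sketch (hypothetical names, not the project's RTL):
// explicit combinational decode with defaults assigned first, so every
// path drives every output and no latch or feedback loop is inferred.
module decode_sketch (
    input  wire [6:0] opcode,
    output reg        reg_write,
    output reg        mem_read,
    output reg        mem_write,
    output reg        branch
);
    always @(*) begin
        // Safe defaults first: unknown opcodes decode as a NOP.
        reg_write = 1'b0;
        mem_read  = 1'b0;
        mem_write = 1'b0;
        branch    = 1'b0;
        case (opcode)
            7'b0110011: reg_write = 1'b1;                            // OP
            7'b0010011: reg_write = 1'b1;                            // OP-IMM
            7'b0000011: begin reg_write = 1'b1; mem_read = 1'b1; end // LOAD
            7'b0100011: mem_write = 1'b1;                            // STORE
            7'b1100011: branch    = 1'b1;                            // BRANCH
            default: ;  // defaults already cover this case
        endcase
    end
endmodule
```

The default-first discipline shown here is the same one flagged in the Known Limitations section: missing defaults in complex decode blocks are a common source of inferred latches and combinational feedback.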
Cache Architecture
4 KiB Direct-Mapped Write-Back Cache
The CPU integrates a direct-mapped, write-back cache with:
- Cache lines sized for efficient refill over a multi-beat external memory interface
- Separate data and metadata storage using dedicated SRAM macros
- Multi-state FSM managing: idle detection, tag comparison, multi-cycle refill, and dirty eviction
The cache organization balanced timing closure constraints with refill efficiency, avoiding unnecessarily deep FSM pipelines while maintaining single-cycle hit latency. Data storage was partitioned into multiple SRAM banks to enable concurrent access and simplify physical placement. Bank organization was chosen to align with the external memory interface width, minimizing refill latency while meeting timing. The refill process balances memory bandwidth utilization against state machine complexity, completing line refills over multiple memory transactions without introducing unnecessary pipeline bubbles.
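The controller's state graph follows the classic direct-mapped write-back pattern. The following is a generic sketch with hypothetical names, not the project's RTL:

```verilog
// Generic sketch (hypothetical names, not the project's RTL): control
// FSM for a direct-mapped write-back cache. Hits resolve in COMPARE;
// a miss to a dirty line detours through WRITEBACK before REFILL.
module cache_fsm_sketch (
    input  wire       clk, rst,
    input  wire       req_valid,   // CPU access pending
    input  wire       hit,         // tag match on the indexed line
    input  wire       line_dirty,  // victim line has been written
    input  wire       last_beat,   // final beat of a multi-beat transfer
    output reg  [1:0] state
);
    localparam [1:0] IDLE = 2'd0, COMPARE = 2'd1,
                     WRITEBACK = 2'd2, REFILL = 2'd3;

    always @(posedge clk) begin
        if (rst) state <= IDLE;
        else case (state)
            IDLE:      if (req_valid)       state <= COMPARE;
            COMPARE:   if (hit)             state <= IDLE;      // 1-cycle hit
                       else if (line_dirty) state <= WRITEBACK; // evict first
                       else                 state <= REFILL;
            WRITEBACK: if (last_beat)       state <= REFILL;    // line flushed
            REFILL:    if (last_beat)       state <= COMPARE;   // retry access
        endcase
    end
endmodule
```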
Timing Closure and Physical Design
Timing Results
The design closed timing at a 20.0 ns clock period (50 MHz), with positive slack margins both post-synthesis and post-PAR.
Post-layout timing analysis confirmed that branch-related logic and cache interface paths represented the critical path. This aligns with expectations for a short pipeline with synchronous instruction memory.
Clock Tree Synthesis
Clock tree synthesis produced a reasonably balanced tree, with insertion-delay skew between the shortest and longest branches well within the timing budget. The residual skew was likely driven by SRAM macro placement, suggesting that improved floorplanning could recover additional margin.
Optimization Decisions
Key areas of optimization included:
- Forwarding network design to minimize data hazard stalls
- Cache write policy selection balancing performance and complexity
- Branch prediction strategy suitable for a short pipeline
- Strategic SRAM macro placement during physical design
These optimizations required analyzing performance tradeoffs across benchmark workloads and balancing against timing closure constraints.
Observed tradeoffs:
- SRAM placement affected area and routing modestly, but performance gains were limited at the fixed clock constraint.
- Cache behavior dominated stalls. Further gains would likely require deeper pipelining, prefetching, or higher associativity.
Performance Results
Benchmarks were run under four configurations (cache/no-cache, forwarding/no-forwarding) using an intentionally idealized single-cycle main memory model. This setup isolates the structural overhead introduced by the cache itself, rather than modeling a realistic memory hierarchy.
Under this assumption, cache miss penalties dominate execution time, making the cache appear unfavorable. With a realistic multi-cycle main memory, the performance trends would be expected to reverse.
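To make the expected reversal concrete (hypothetical numbers for illustration, not measured results): modeling average memory access time as AMAT = hit_time + miss_rate × miss_penalty, a 1-cycle main memory gives an uncached AMAT of 1 cycle, so a cache with 1-cycle hits and any nonzero miss rate can only add overhead. If main memory instead took 20 cycles, an uncached access would cost 20 cycles, while a cache with a 90% hit rate and a ~20-cycle miss penalty would average roughly 1 + 0.1 × 20 = 3 cycles per access.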
Five benchmarks exercised different workload characteristics:
- Memory-intensive random access patterns
- Compute-bound operations with data dependencies
- Recursive control flow
- Streaming memory access
- Mixed memory and compute operations
Key Observations:
- Forwarding provides substantial performance gains (typically 40-60%, with compute-intensive workloads showing even higher improvements) by minimizing data hazard stalls
- Cache and forwarding address different bottlenecks: forwarding resolves pipeline dependencies while caching reduces memory latency
- The cache + forwarding configuration represents the optimal balance for this architecture because:
  - Forwarding provides speedup by eliminating data hazard stalls (up to 88% in compute-intensive workloads)
  - Combined, they provide the best performance without requiring deeper pipelining or more complex hazard detection
3. FPGA-based Implementation (Separate Project)
A separate FPGA-based CPU implemented the same high-level RV32I architecture. This project targeted functional validation and software bring-up under FPGA-specific constraints.
Purpose and Relationship to ASIC Design
While the FPGA design informed architectural decisions, the ASIC implementation was developed independently and focuses on synchronous memory integration, cache microarchitecture, and physical timing closure.
Key Differences
- Asynchronous memory interfaces instead of synchronous SRAMs
- Simplified cache model without write-back behavior
- ~55 MHz target frequency on Xilinx PYNQ FPGA
- No explicit placement, routing, or CTS control
- Additional FPGA-specific peripherals (UART, buttons, synchronized I/O)
Address Space Partitioning and Memory-Mapped I/O
The FPGA implementation included address space partitioning and memory-mapped I/O to support software bring-up. A UART-based boot flow allowed programs to be loaded into memory at runtime, while memory-mapped counters enabled CPI measurement directly in hardware.
This infrastructure enabled interactive debugging and validation of control flow, memory operations, and pipeline behavior under real software execution.
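The CPI counters follow the standard memory-mapped pattern. The addresses and names below are hypothetical, not the project's actual memory map:

```verilog
// Generic sketch (hypothetical addresses/names, not the project's
// memory map): free-running cycle and retired-instruction counters
// exposed as memory-mapped registers, so software can compute CPI as
// cycle_cnt / instr_cnt.
module cpi_counters (
    input  wire        clk, rst,
    input  wire        instr_retired, // pulses once per retired instruction
    input  wire [31:0] addr,          // CPU load address within MMIO region
    output reg  [31:0] mmio_rdata
);
    reg [31:0] cycle_cnt, instr_cnt;

    always @(posedge clk) begin
        if (rst) begin
            cycle_cnt <= 32'd0;
            instr_cnt <= 32'd0;
        end else begin
            cycle_cnt <= cycle_cnt + 32'd1;
            if (instr_retired) instr_cnt <= instr_cnt + 32'd1;
        end
    end

    // Read decode: low address bits select a counter (hypothetical map).
    always @(*) begin
        case (addr[7:0])
            8'h10:   mmio_rdata = cycle_cnt;
            8'h14:   mmio_rdata = instr_cnt;
            default: mmio_rdata = 32'd0;
        endcase
    end
endmodule
```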
Overall Significance
The FPGA implementation acted as a physical implementation of the high-level architecture, while the ASIC implementation focused on microarchitectural tradeoffs and physical design constraints. Taken together, these projects demonstrate that the same architectural design principles can be applied across platforms with differing constraints.
4. Known Limitations
During post-synthesis simulation, an inferred combinational feedback path highlighted the importance of strict default-assignment discipline in complex control logic. While this issue did not affect timing closure, it would be addressed in a follow-on revision through stricter combinational structuring, explicit defaults, and synthesis-time assertions.
Resolving this would be the first priority before extending the architecture.
5. Engineering Takeaways
- Synchronous SRAM latency strongly influences pipeline partitioning.
- Cache design must be evaluated via stall behavior, not just hit rate.
- Short pipelines concentrate worst-case timing in control and branch paths.
- Clean RTL discipline becomes critical once designs reach synthesis and PAR.
Note: I received approval from course staff to publish this writeup.