Pipelined Simple RISC Machine - UBC CPEN 211
This post discusses the performance improvements that are achievable by creating a pipelined Simple RISC Machine (SRM) for UBC CPEN 211.
The original idea for this post was to write a guide describing how I pipelined my Lab 8 design and achieved the speed-up reported below. Unfortunately, this wasn’t possible due to the terms of the Academic Integrity Pledge that all CPEN 211 students have to sign. So instead, I thought I’d share my performance findings so that those who are considering a pipelined Lab 8 submission can see if it’s worthwhile.
About CPEN 211
CPEN 211 - Introduction to Microcomputers is an introductory course in Digital Logic Design and Computer Architecture taught at the University of British Columbia (UBC).
Retrieved from the CPEN 211 course page:
Boolean algebra; combinational and sequential circuits; organization and operation of microcomputers, memory addressing modes, representation of information, instruction sets, machine and assembly language programming, systems programs, I/O structures, I/O interfacing and I/O programming, introduction to digital system design using microcomputers.
Lab Competition
The majority of the labs in the course are centred around building a simple multi-cycle RISC processor. There are numerous possible correct solutions (so long as they are compatible with the auto-grader).
A competition is held near the end of the term where designs are ranked against each other based on speed.
The purpose of this post is to illustrate the advantages of a pipelined design for current students who are considering taking this route.
Pipelined RISC Machine
I implemented and tested 3 different microarchitectures in order to compare performance. The designs are summarised in the table below:
| Metric | Design A | Design B | Design C |
|---|---|---|---|
| f_max (MHz) | 94.48 | 175.25 | 133.74 |
| Peak Instructions/Cycle | 0.402 | 0.270 | 0.859 |
| Datapath Stages | 2 | 4 | 5 |
| Branch Prediction | NO | NO | STATIC |
| Operand Forwarding | NO | NO | YES |
| Peak Speed-up | 1.00 | 1.30 | 3.14 |
| Geometric Mean Speed-up | 1.00 | 1.25 | 2.63 |
- Design A is a single-cycle design. It’s designed to complete the majority of its instructions in a single clock cycle. It’s also the design I submitted for the competition in my year.
- Design B is a multi-cycle design and completes its instructions in multiple clock cycles. Only 1 instruction is being executed at any given time.
- Design C (the focus of this post) is a more aggressively pipelined version of Design B. It is based on the 5-stage pipeline architecture described in Chapter 4 of Patterson & Hennessy.
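As a rough sanity check on the table, peak throughput can be approximated as f_max multiplied by peak instructions per cycle. The sketch below uses only the f_max and Peak Instructions/Cycle values from the table; the names `DESIGNS`, `peak_mips`, and `relative_throughput` are mine and purely illustrative. Relative to Design A, this product comes out at roughly 1.25 for Design B and 3.02 for Design C. That is close to, but not identical to, the Peak Speed-up row, which presumably comes from measured benchmark runs rather than from this product.

```python
# f_max (MHz) and peak instructions/cycle, copied from the table above.
DESIGNS = {
    "A": {"f_max_mhz": 94.48, "peak_ipc": 0.402},
    "B": {"f_max_mhz": 175.25, "peak_ipc": 0.270},
    "C": {"f_max_mhz": 133.74, "peak_ipc": 0.859},
}


def peak_mips(f_max_mhz, peak_ipc):
    """Peak throughput in millions of instructions per second.

    Assumption: throughput is simply clock rate times instructions per cycle.
    """
    return f_max_mhz * peak_ipc


def relative_throughput(designs, baseline="A"):
    """Peak throughput of every design divided by that of the baseline."""
    base = peak_mips(**designs[baseline])
    return {name: peak_mips(**d) / base for name, d in designs.items()}


if __name__ == "__main__":
    for name, ratio in relative_throughput(DESIGNS).items():
        mips = peak_mips(**DESIGNS[name])
        print(f"Design {name}: {mips:.1f} MIPS peak, {ratio:.2f}x Design A")
```

The comparison shows why neither clock speed nor instructions per cycle is enough on its own: Design B has the highest f_max but the lowest instructions per cycle, while Design C gives up some clock speed and still comes out well ahead.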
Design A was 2.4 times faster than the reference design for the Fall 2020 competition and Design C was 2.63 times faster than A.
Because a geometric mean was used to compute both values, this implies that Design C is 2.40 × 2.63 = 6.31 times as fast as the reference design.
This isn’t entirely accurate because a different set of benchmarks was used for the competition, but it does give us some idea of relative performance.
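The reason the two figures can be multiplied is that the geometric mean of per-benchmark ratios composes exactly when both comparisons use the same benchmark set. The sketch below demonstrates this with made-up run times (the lists `ref_times`, `a_times`, and `c_times` are hypothetical, not measured data), then applies the same multiplication to the two figures quoted in this post.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark run times in arbitrary units. NOT measured data.
ref_times = [12.0, 30.0, 8.0]
a_times = [5.0, 12.0, 3.5]
c_times = [2.0, 4.0, 1.5]

ref_over_a = geometric_mean(r / a for r, a in zip(ref_times, a_times))
a_over_c = geometric_mean(a / c for a, c in zip(a_times, c_times))
ref_over_c = geometric_mean(r / c for r, c in zip(ref_times, c_times))

# On a shared benchmark set, the two geometric means multiply exactly.
assert math.isclose(ref_over_a * a_over_c, ref_over_c)

# Figures from the post: 2.40 (reference to A) and 2.63 (A to C).
estimate = 2.40 * 2.63
print(f"Estimated speed-up of Design C over the reference: {estimate:.2f}")
```

When the two ratios come from different benchmark sets, as they do here, the identity no longer holds exactly, so the product is only an estimate.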
Design C
Design C is a mostly faithful implementation of the 5-stage pipelined architecture described in Patterson & Hennessy. The only major architectural differences are the location of the branch predictor (ID stage instead of IF) and how the memory bus is wired up. The high-level Patterson & Hennessy design is shown below:
Conclusions & Recommendations
The overall design effort took about 2 weeks of continuous work time. Note that this doesn’t include the time I spent reading & learning about pipelining.
With that said, it seems possible for someone to implement and submit a similar design for the competition, but only if they started at least two weeks early (and learned about pipelining prior to this).
The best approach would probably be to build the machine alongside the regular labs and then swap it in for Lab 8.