ARM vs RISC-V Cores | System-Level Comparison: Latency and Power

Methodology for a System-Level Comparison of ARM and RISC-V Cores for Latency and Power Consumption

System-level analysis is the only methodology to compare the performance and power consumption of two processor architectures. Unfortunately, processor models to perform the comparison are not readily available. This article describes a performance and power comparison methodology using system-level IP for the SiFive RISC-V U74 and ARM Cortex-A53 processors. The system model of each processor is validated using the Dhrystone and microbench benchmark suites. The validated processor system models are then used to assemble a System-on-Chip (SoC) design, run application use cases, and compare the resulting power and performance statistics. Everyone from technical marketing to hardware engineers needs benchmark results, but current approaches are difficult, inaccurate, and time-consuming. In this paper, we discuss our modeling experience using processor generators that create models of the RISC-V and ARM cores at stochastic, hybrid, and micro-architecture levels of detail. This system-level IP generator helps system design engineers, research engineers, and scholars gain insight into the performance and power of an application on each core under different parameter and configuration changes. Unlike virtual prototyping, this system-level IP can handle power, performance, and behavior analysis together.

Growing Importance of System Modeling & Challenges

In the early stages of a new product design, the product requirements are only coarsely defined. The architecture exploration effort must refine them into a well-articulated, structured plan that addresses both the problem and the solution. Modeling the system and analyzing the results supports the architect in several ways throughout the project life cycle, Fig. 1.

Figure 1. Project life cycle incorporating simulation and modeling

System-level modeling and analysis is a way to evaluate different solutions. It is a form of design analysis whose aim is to expose a design's properties and test different configurations. It is an extremely effective approach for engineering design projects that can reduce development time, uncover problems early in the development cycle, and improve the overall design [5]. System-on-Chip design involves assembling various subsystems and protocols, which introduces a great deal of unpredictability. Modeling and performance analysis for SoC designs are therefore becoming more and more important, as they help remove surprises during implementation and verification and reduce time to market [4]. The performance of the processor subsystem is a particular cause for concern, as it is one of the key blocks in the entire SoC [3]. At the very beginning, the architect may have defined only a few requirements, such as achieving 50% better performance than a competing product or a 2x improvement in power over the product's predecessor. They then have to identify a processor core that satisfies those requirements. Whether to go with an ARM, RISC-V, MIPS, or MicroBlaze architecture is just one of the very first choices an architect must make. Next, they need to look through the product catalog and select the right processor core (e.g., ARM Cortex-A53, ARM Cortex-A77, SiFive U74). A frequent issue in system design is that, after a processor core has been selected and the physical implementation is done, the final results for an application or use case fall short of the expected performance. A great deal of effort, time, and resources is then spent fine-tuning the parameters available to the architect to recover the performance. This is why system-level modeling is such a valuable solution for architects.
System-level modeling is done at the very early stages of the design process and thus helps determine whether the expected performance can be achieved with the current configuration. If the results are not sufficient, the architect can reconfigure the parameters, try a different number of processor cores or a different interconnect, add another cache level or vary cache sizes, or even try a different processor core or architecture [9,10,11,12]. Since this happens before the implementation, or even the RTL, is available, the findings made during system-level modeling give the architect insight that allows the design to be built without major hiccups in the later stages of the process [6,7,8].
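This kind of early design-space exploration can be sketched as a coarse analytical sweep over the parameters the architect can change. The short Python sketch below is purely illustrative: the latency model, its constants, and the performance target are invented for this example and do not come from any vendor data or from the tool described in this paper.

```python
# Hypothetical early design-space exploration: sweep a few architectural
# parameters through a coarse analytical latency model and check each
# configuration against a performance target. All constants are invented.

def estimated_latency_ns(num_cores, l2_kib, mem_latency_ns=100.0):
    """Toy latency estimate: larger caches cut miss traffic, extra
    cores share the remaining work with imperfect scaling."""
    miss_rate = 0.20 * (256.0 / l2_kib)          # halves when L2 doubles
    per_instr = 1.0 + miss_rate * mem_latency_ns # ns per instruction
    scaling = 0.8 * num_cores + 0.2              # Amdahl-style penalty
    return per_instr * 1_000_000 / scaling       # for 1M instructions

target_ns = 5_000_000  # hypothetical requirement
for cores in (1, 2, 4):
    for l2 in (256, 512, 1024):
        lat = estimated_latency_ns(cores, l2)
        verdict = "meets" if lat <= target_ns else "misses"
        print(f"{cores} core(s), {l2} KiB L2: {lat/1e6:.2f} ms, {verdict} target")
```

Even a throwaway model like this lets the architect discard whole regions of the configuration space before any detailed simulation is run.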

However, there are many challenges that an architect faces while building the system model of their design. Some of the key challenges are listed below:

  1. Lack of information on the settings configured by the vendor, which forces the user to reverse engineer the design until the system model's results match the real system's statistics.
  2. Too much time spent designing a virtual prototype that is only sufficient to get within 75–80% accuracy.
  3. Lack of a platform where an architect can assemble the entire SoC from components accurate to the micro-architecture level.
  4. Lack of up-to-date processor core IPs with which to make the required analysis.

In this paper, we describe how we overcame these challenges and achieved 90–98% accuracy compared with a real board. The resulting accurate processor IP was then plugged into an SoC system model and used to identify bottlenecks and incorrect operation in the SoC design. This saved 30% of the project resources (four months of effort) that would otherwise have been spent identifying and resolving the bug.

We selected the ARM Cortex-A53 and SiFive U74 cores to demonstrate the efficiency of the proposed methodology. The following sections explain the processor architectures of the A53 and U74, the key parameters and features to focus on, the mapping of parameters to the system model, the validation of the system model, the use of the developed processor IP in an SoC design, and the comparative performance of an application on the ARM and RISC-V cores.

Inefficient Approaches and the Solution

There are several approaches to modeling and performance analysis. Examples of system modeling include:

  1. RTL schematics – These models match implementation-level accuracy. Unfortunately, they are not available at the start of the project, and large tests cannot be conducted because of simulation speed. A significant amount of time and effort has therefore already been spent by the time the design can be simulated.
  2. Excel sheets – Easy to use, but they provide no dynamism and thus cannot account for the bottlenecks that arise when multiple parallel tasks run across the system.
  3. Analytical models – Quick to simulate, but not accurate.
  4. SystemC plugins – A widely used methodology, but any change to a plugin requires a lot of effort. This approach has a steep learning curve for assembling complex data-flow models and is not an easy process for the architect.
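To make the trade-off behind the third approach concrete, an analytical model can be as small as a closed-form queueing estimate. The sketch below is a generic illustration, not part of any tool mentioned here: an M/M/1 estimate of mean bus latency evaluates instantly, but it assumes Poisson arrivals and ignores burstiness and arbitration, which is exactly why such models trade accuracy for speed.

```python
# Illustrative analytical model: closed-form M/M/1 estimate of the mean
# time a bus request spends queued plus in service, W = 1 / (mu - lambda).
# Rates are in requests per microsecond (i.e., MHz); result is in ns.

def mm1_bus_latency_ns(request_rate_mhz, service_rate_mhz):
    if request_rate_mhz >= service_rate_mhz:
        raise ValueError("bus saturated: utilization >= 1")
    return 1e3 / (service_rate_mhz - request_rate_mhz)

# As utilization climbs from 50% to 90%, latency grows non-linearly.
for load in (100, 150, 180):
    print(f"load {load} MHz -> {mm1_bus_latency_ns(load, 200):.1f} ns")
```

The whole sweep runs in microseconds, but none of these numbers would survive contact with a bursty real workload, which motivates the configurable stochastic-to-cycle-accurate approach presented next.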

This is where the system-level modeling methodology comes in. In this paper, we present a plugin that is easy to configure and modify. Depending on the level of accuracy needed, the model can sit at the stochastic or cycle-accurate level. Moreover, this is done at the very start of the design process, so the findings are accounted for in the design flow and the later stages become far less error-prone. Our approach is illustrated in Figure 2.

Figure 2. Proposed Methodology. Different levels of abstraction depending on the requirement

Processor Architecture Overview

ARM Cortex A53

The Cortex-A53 processor is a mid-range, low-power processor that implements the Armv8-A architecture. It has two levels of cache and an AMBA 4 memory interface [1]. The Cortex-A53 block diagram is shown in Fig. 3.

Figure 3. ARM Cortex A53 block diagram [1]

The focus is on modeling a single Cortex-A53 core; once validated, the model is scaled up to a multi-core design. The Cortex-A53 is a dual-issue, in-order processor: instructions issue, execute, and commit in program order through an 8-stage pipeline. The pipeline consists of three sections for instruction fetch, instruction decode, and execute.
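The dual-issue, in-order behavior described above can be illustrated with a minimal issue model: up to two instructions leave the front end per cycle, but the second slot is lost whenever it depends on the first. This is a deliberately simplified sketch; the A53's real issue rules (execution-unit pairing, forwarding, load-use delays) are far more detailed.

```python
# Minimal sketch of dual-issue, in-order instruction issue. Instructions
# are (dest, src) register pairs; the second issue slot in a cycle is
# used only if that instruction does not read the first slot's result.

def issue_cycles(instrs):
    cycles, i = 0, 0
    while i < len(instrs):
        cycles += 1
        i += 1  # first slot always issues
        if i < len(instrs) and instrs[i][1] != instrs[i - 1][0]:
            i += 1  # no back-to-back dependency: second slot issues too
    return cycles

independent = [("r1", "r0"), ("r2", "r0"), ("r3", "r0"), ("r4", "r0")]
chained     = [("r1", "r0"), ("r2", "r1"), ("r3", "r2"), ("r4", "r3")]
print(issue_cycles(independent))  # 2 cycles: both pairs dual-issue
print(issue_cycles(chained))      # 4 cycles: every pair is dependent
```

Even this toy model shows why the same core can deliver anywhere between one and two instructions per cycle depending on the instruction mix, which is what the pipeline parameters in the system model have to capture.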

The features of the A53 core [1] are listed in Table 1.

The model layout of the ARM Cortex-A53 built using VisualSim Architect is shown in Fig. 4.

Figure 4. ARM Cortex A53 built using VisualSim Architect platform

The VisualSim Architect Hybrid Processor modeling suite was used to build the model. The processor pipeline was set up using the parameters provided by the Hybrid Processor library module, as shown in Fig. 5.

Figure 5. ARM Cortex A53 Processor pipeline settings and parameters being mapped to VisualSim Hybrid Processor module

Fig. 5 shows how the 8 stages of the pipeline are set up and how the processor parameters – processor clock speed, instructions per cycle, number of registers, etc. – are configured. We then used the Instruction_Set library module provided by VisualSim Architect to define the ARM ISA. Fig. 6 shows how the Instruction_Set module is set up: the number of execution units, the instructions per execution unit, instruction latencies, etc. are configured here.

Figure 6. Arm v8-A ISA defined using the Instruction_Set library module in VisualSim Architect

The cache, bus, and memory parameters were set up within the ranges specified in the technical reference manual provided by ARM.
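The parameter mapping described in this section can be thought of as a plain configuration structure plus sanity checks, as in the sketch below. The field names and values are placeholders chosen for this illustration; they are not the actual VisualSim parameter names, and the numbers should always be taken from the vendor's technical reference manual.

```python
# Illustrative parameter map for a processor system model. All field
# names and values here are hypothetical placeholders, not real tool
# fields; real values come from the vendor's reference manual.

a53_model = {
    "pipeline_stages": 8,
    "issue_width": 2,        # dual-issue, in-order
    "clock_mhz": 1200,
    "l1_icache_kib": 32,
    "l1_dcache_kib": 32,
    "l2_cache_kib": 1024,
    "bus": "AMBA4",
    "isa": "Armv8-A",
}

def check_model(cfg):
    """Coarse sanity checks before committing to a long simulation."""
    assert cfg["pipeline_stages"] > 0 and cfg["issue_width"] >= 1
    assert cfg["l2_cache_kib"] >= cfg["l1_dcache_kib"], "L2 smaller than L1"
    assert cfg["clock_mhz"] > 0
    return True

print(check_model(a53_model))
```

Keeping the configuration in one declarative structure like this makes it trivial to diff two candidate configurations or to sweep a single field across a range of values.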

SiFive U74

SiFive’s U74 Core Complex is a full-Linux-capable, cache-coherent 64-bit RISC‑V processor available as an IP block. The U74 Core Complex memory system consists of a Data Cache and Instruction Cache [2]. The U74 Core block diagram is shown below in Fig. 7.

The U74 Core Complex includes a 64-bit U7 RISC‑V core, which has a dual-issue, in-order execution pipeline with a peak execution rate of two instructions per clock cycle. The U7 core supports machine, supervisor, and user privilege modes, as well as the standard Multiply (M), Single-Precision Floating Point (F), Double-Precision Floating Point (D), Atomic (A), Compressed (C), and Bit Manipulation (B) RISC‑V extensions (RV64GCB). It supports the TileLink interconnect standard.

The features of the U74 core [2] are listed in Table 2.

The model layout of the SiFive U74 core built using VisualSim Architect is shown in Fig. 8.

Figure 8. SiFive U74 single core complex built using VisualSim Architect platform

The U74 system model was built using the Hybrid Processor modeling suite. We made use of the cycle-accurate cache, TileLink, and DDR plugins available in the tool library, which made assembling the system model easy. The processor architecture – pipeline stages and settings – was mapped to the system model using the parameters provided by the Hybrid Processor module, as shown in Fig. 9.

Fig. 9 shows how the 8 stages of the pipeline are set up and how the processor parameters – processor clock speed, instructions per cycle, outstanding load-store access count, etc. – are configured. We then used the Instruction_Set library module to define the RISC-V ISA. Fig. 10 shows how the Instruction_Set module is set up: the number of execution units, the instructions per execution unit, instruction latencies, etc. are configured here.

Figure 9. SiFive U74 pipeline stages and settings being mapped to the VisualSim Hybrid processor module
Figure 10. RISC-V RV64GCB ISA defined using the Instruction_Set library module in VisualSim Architect

The cache, bus, and memory parameters were set up within the ranges specified in the manual provided by SiFive.

System Model Validation

Setting up the demo models for both the ARM Cortex-A53 and the SiFive U74 was relatively easy because we reused the plugins provided with VisualSim Architect, but many parameters still had to be set. To measure the level of accuracy this methodology could achieve, we surveyed the market and selected boards for comparing the system model results against real hardware: the Xilinx ZCU102 board for the ARM Cortex-A53 model and the HiFive Unmatched HF105-000 board for the SiFive U74.

Based on the ZCU102 datasheet [13], the cache, bus, and memory parameters were set. Some of the key parameters are shown in Fig. 11.

Figure 11. The System Level Model parameters which are setup in correlation with the ZCU102 board settings

Once the parameters were set up, we selected a few benchmarks to run on both the system model and the board to gauge the accuracy. The selected benchmarks were Dhrystone [15] and the microbench suite [16].

Once the simulation was done, the tool printed the statistics across all cache levels, buses, and memory into a CSV file, which made the statistics easy to analyze. The generated statistics are shown in Fig. 12.

Figure 12. The CSV file generated during the simulation, which provides all the required information: application latency, cache hit/miss ratio, prefetch count, number of evictions, latency across each cache level, number of write-backs, etc.

Using the statistics listed in Fig. 12, the comparison with the ZCU102 board was made. The data is put together in Table 3.
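A comparison like the one in Table 3 boils down to computing, per benchmark, an accuracy ratio between the model's exported CSV statistics and the board measurement. The sketch below shows one hedged way to do this in Python; the column names and all numbers are hypothetical stand-ins, not the actual exported fields or measured results.

```python
# Hypothetical model-vs-board comparison: parse per-benchmark latencies
# from CSV-style model output, compute accuracy as min/max of the two
# values, and flag anything below a 90% threshold for investigation.
import csv
import io

model_csv = """benchmark,latency_us
dhrystone,410
mm,980
stc,1210
"""
board_us = {"dhrystone": 400, "mm": 640, "stc": 800}  # invented numbers

def accuracy(model, real):
    """Symmetric accuracy ratio in (0, 1]: 1.0 means a perfect match."""
    return min(model, real) / max(model, real)

for row in csv.DictReader(io.StringIO(model_csv)):
    name, lat = row["benchmark"], float(row["latency_us"])
    acc = accuracy(lat, board_us[name])
    flag = "" if acc >= 0.90 else "  <-- investigate"
    print(f"{name:10s} {acc:5.1%}{flag}")
```

With the invented numbers above, dhrystone lands in the high-90s while mm and stc fall well below the threshold, mirroring the kind of outliers that triggered the debugging described next.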

While the majority of the benchmarks performed within the 90–99% range, a few performed poorly – mm and stc (Table 3). To debug the cause of such a large delta, we utilized the Insight_Engine provided by VisualSim Architect along with PostProcessor batch-mode simulation. Batch mode made it possible to run multiple parameter variations in parallel on different cores, while the Insight_Engine swept across each module in the system model, keeping track of every activity within them. A trace generated at the end of the simulation gave us the time consumed by each instruction across the different pipeline stages. Comparing this information with the details in the ARM technical reference manual, we found that a parameter controlling the outstanding Data Cache accesses from the load-store unit had not been set. The required parameter was added to the Hybrid Processor module as shown in Fig. 13.

Figure 13. The maximum number of outstanding Data Cache accesses is set using the parameter "Outstanding_Req_Count". The required value was obtained from the ARM documentation.

With the updated model, the benchmarks were run again; the results are shown in Table 4.

All the benchmarks are now within 90–99% accuracy compared with the results observed on the real board. This experiment shows that acceptable accuracy levels can be achieved easily using a system model whose parameters and settings are taken from the technical reference documents available online.

With the performance accuracy taken care of, we went ahead and added power modeling to the A53 system model. For this purpose, we used the PowerTable plugin provided with the power modeling suite in VisualSim Architect. The PowerTable accepts a power value per state per device, so we were able to enter the Active, Standby, Wait, Idle, and Retention power values as shown in Fig. 14.

Figure 14. Power value per state for the Cortex A53 core being setup using the PowerTable plugin

Fig. 14 shows that the power value per state is not static; rather, the values are computed dynamically. Each variable linked in Fig. 14 is calculated at runtime and the results are plotted. The calculations for each variable are shown in Fig. 15.

Figure 15. The equation corresponding to each variable. These variables are recalculated every time a state change occurs for the corresponding device.
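The state-based power accounting described above can be reduced to a simple principle: each device accumulates energy as power-in-state times time-in-state, and average power is total energy over total time. The sketch below illustrates that principle with invented per-state values; it is not the PowerTable plugin's actual implementation.

```python
# Hedged sketch of state-based power accounting: energy accumulates as
# power(state) * time-in-state, and average power is energy over elapsed
# time. The per-state power values below are invented for illustration.

power_mw = {"Active": 450.0, "Wait": 120.0, "Idle": 40.0, "Standby": 5.0}

def average_power_mw(state_trace):
    """state_trace: list of (state, start_ms, end_ms) intervals,
    contiguous and in order. Returns average power in mW."""
    energy = sum(power_mw[s] * (end - start) for s, start, end in state_trace)
    elapsed = state_trace[-1][2] - state_trace[0][1]
    return energy / elapsed  # mW*ms over ms yields mW

trace = [("Active", 0, 6), ("Wait", 6, 8), ("Idle", 8, 10)]
print(f"{average_power_mw(trace):.0f} mW")  # -> 302 mW
```

Because each state transition only adds one interval to the trace, this bookkeeping is cheap enough to run alongside a performance simulation, which is what makes combined power and performance analysis practical.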

The simulation was run at different clock frequencies, and the average power observed was recorded and compared against the real A53 core. The comparison is shown in Table 5.

From Table 5, it can be seen that accuracy levels between 90% and 99% were obtained when comparing the power profile across the different clock frequencies.

Similarly, the SiFive U74 system model was validated. We used the Dhrystone benchmark to compare the accuracy between the SiFive U74 core on the HF105-000 board [14] and the system model. The result is shown in Table 6.

Application Use Cases – Comparing ARM vs RISC-V

With the system models validated for both the ARM and RISC-V cores, we plugged each into a desired SoC design and ran application sequences on it. The performance and power obtained with the ARM core and with the RISC-V core were then compared. We selected a multimedia application as the workload. The block diagram of the target System-on-Chip (SoC) is shown in Fig. 16.

Figure 16. SoC Hardware architecture for running the target multimedia application

The same SoC design was built on VisualSim Architect and is shown below, Fig. 17.

Figure 17. System model representation of the SoC which runs multimedia application

Fig. 17 shows the software architecture (at the bottom) as well as the hardware architecture (at the top). The Processor_SUBSYSTEM block within the hardware architecture contains the processor system model – either the ARM Cortex-A53 or the SiFive U74. The simulation was run with each core and the results were recorded.

Fig. 18 shows the Power consumed across the SoC when we used ARM Core.

Figure 18. Average Power across SoC running multimedia application which uses ARM core

Fig. 19 shows the number of frames processed per unit time when using an ARM core.

Figure 19. Performance analysis of the multimedia application when using ARM core

Fig. 20 shows the Power consumed across the SoC when we used RISC-V Core.

Figure 20. Average Power across SoC running multimedia application which uses RISC-V core

Fig. 21 shows the number of frames processed per unit time when using a RISC-V core.

Figure 21. Performance analysis of the multimedia application when using RISC-V core

The results show the power consumed and the number of frames processed per unit time for the ARM core and the RISC-V core. The two processor subsystems have different configurations – different pipeline settings, cache settings, instruction latencies, and interconnect protocols – which yield different power and performance statistics while running the same multimedia application. The architect can fine-tune the parameters to get better results. This methodology makes it possible to evaluate how a target application would perform under different configurations, thereby giving the architect early insight into what to expect from each configuration and how to optimize the design.

The proposed methodology for comparing processor cores, or different configurations of the same core, using system-level modeling proved to be a success. We observed that the system-level model achieved 90–99% accuracy compared with the real board for both power and performance. Architects can now use the system model to evaluate an application on the target architecture, analyze the performance and power metrics, and decide whether to go forward with the proposed design or whether changes are required to meet the requirements. Since this is done very early in the design process, a lot of time and resources can be saved. We compared the time it took to meet the design requirements with and without system modeling; the comparison can be seen in Fig. 22.

Figure 22. Comparison of the resources taken to complete a project with and without system modeling

The process of modeling a processor core was made easier because a well-defined template was already provided to the user, which made it straightforward to map the parameters from the official technical reference manual. This saved a lot of time, as is evident from Fig. 22.

Acknowledgment

We would like to thank Eric Sondhi from ARM and the GEM5 development team for providing us with data for validating the system model.

References

[1] ARM, “ARM Cortex-A53 MPCore Processor, Technical Reference manual,” Revision: r0p4, 2018.

[2] SiFive Inc., SiFive U74 Core Complex Manual, 21G1.01.00, 2021.

[3] Thin-Fong Tsuei, and Wayne Yamamoto, “A Processor Queuing Simulation model for Multiprocessor system performance analysis,” Sun Microsystems Inc., Unpublished

[4] Gerrit Muller, “System Modelling and Analysis; a practical approach,” Gaudi project, 2021.

[5] Carlos M. Betemps, Mateus S. de Melo, Amir M. Rahmani, Antonio Miele, Nikil Dutt, and Bruno Zatt, “Exploring Heterogeneous Task-Level Parallelism in a BMA Video Coding Application using System-Level Simulation,” in VIII Brazilian Symposium on Computing Systems Engineering, 2018.

[6] A. Asaduzzaman, M. Moniruzzaman, K.K. Chidella, and P. Tamtam, “An efficient simulation method using VisualSim to assess autonomous power systems,” in SoutheastCon, 2016.

[7] K. S. Kushal, Manju Nanda, and J. Jayanthi, “Transaction-Based Models (TBM) and Evaluation of their throughput,” in IEEE Recent Advances in Intelligent Computational Systems (RAICS), 2015.

[8] Md Moniruzzaman, Abu Asaduzzaman, and Muhammad F. Mridha, “Optimizing Controller Area Network System for Vehicular Automation,” in 5th International Conference on Informatics, Electronics and Vision (ICIEV), 2016.

[9] Cagkan Erbas, Andy D. Pimentel, Mark Thompson, and Simon Polstra, “A Framework for System-Level Modeling and Simulation of Embedded Systems Architectures,” in EURASIP Journal on Embedded Systems, 2007

[10] Waleed Khan, Nasru Minallah, and Naina Said, “Benchmarking 4x ARM Cortex-A7 CPU and 4x ARM Cortex-A53 for Multimedia Systems using JPEG Compression,” in International Conference on Computing, Mathematics and Engineering Technologies – iCoMET, 2018.

[11] Benjamin Schwaller, Barath Ramesh, and Alan D. George, “Investigating TI KeyStone II and quad-core ARM Cortex-A53 architectures for on-board space processing,” in IEEE High Performance Extreme Computing Conference (HPEC), 2017.

[12] Michael J. Cannizzaro, Evan W. Gretok, and Alan D. George, “RISC-V Benchmarking for Onboard Sensor Processing,” in IEEE Space Computing Conference (SCC), 2021.

[13] Xilinx, “ZCU102 Evaluation Board User Guide,” UG1182 (v1.6), June 2019.

[14] SiFive, “HF105 Datasheet”.

[15] Reinhold P. Weicker, “Dhrystone Benchmark (Ada Version 2): Rationale and Measurement Rules,” in Ada Letters, July 1989.

[16] Rajagopalan Desikan, Doug Burger, and Stephen W. Keckler, “Measuring Experimental Error in Microprocessor Simulation,” in 28th Annual International Symposium on Computer Architecture (ISCA), 2001.