Learning Objectives
Use the framework available in VisualSim to create a multi-core processor design for the following objectives:
- Determine the accurate sizing of the hardware components (speed and buffering requirements).
- Evaluate the performance improvement between a single and dual processor system.
- Evaluate different Cache/SDRAM data access schemes.
- Find the average response time of the system for various application rates.
- Optimize the application on the target architecture.
Design Methodology
The block diagram of this system is shown in Figure 1. The system has two identical cores on a shared bus with a cache and SRAM.
Figure 1: Dual Processor Design Block Diagram
The system supports the following possible tasks:
- The CPU has the data for processing transactions without accessing external memory.
- The CPU must access data in the Cache to complete the processing of transactions.
The latency of a task depends on where the data needed to process the
transaction resides. The data may be available at the CPU, in the
cache, or in RAM.
In this modeling exercise, we assume that the CPU has the data 60% of the time
and must fetch it from the external Cache the remaining 40% of the time.
- The initial inter-arrival time between transactions is 1.0 seconds.
- For the dual-processor configuration, the incoming transactions are randomly distributed between CPU1 and CPU2.
VisualSim Model
The key concept in modeling is to model all the elements that are
relevant to the analysis and abstract out the rest. The five
architectural elements (Bus, Cache, SDRAM and CPUs) and the instruction
set are relevant. Items such as the RTOS are not included in this study
as they are common overhead for all the operations.
To translate the above requirements into a model, consider the model to
have five basic elements - architecture, behavior, workload, model
parameters, and analysis. All of the requirements above must fit into
one of these structures.
Refer to Figure 2 for the proposed VisualSim model.
Figure 2: Dual Processor Design VisualSim Model
The VisualSim model can be found at the following location:
$VS/doc/Training_Material/Tutorial/WebHelp/Tutorial/Performance_Modeling/Dual_Processor_System.xml
Description of the VisualSim Model
A brief description of the VisualSim Model is listed below:
- The architectural element CPU is represented by the
SystemResource_Extend block. The other architectural elements - Bus,
Cache, and RAM - use the hardware accurate blocks.
- The Mapper block is used to send transactions from the behavior flow to the CPU.
- Transactions are generated using the Traffic block.
Decisions to send a transaction to the relevant CPU and to the relevant
task are made in the Expression block.
- Analysis reports are captured using the ResourceStatistics block.
Building of the VisualSim Model
Assumptions
The procedures outlined below make the following assumptions:
- The user is aware of the process of connecting the blocks.
- Only brief details of the blocks are given. For further details, the user may refer to the block documentation.
- For ease of navigation, the procedural flow is segmented under
sub-topics such as Initial Setup, Architectural Elements, Transaction
Generation, Tasks Handling, and Resource Statistics and Reports.
- Red lines and rectangles in the screenshots indicate the block that
is being built. They are not part of the VisualSim modeling tool.
- To assign a customized name to a parameter or a block,
right-click the block and select Customize Name. In the resulting
window, enter the Name.
Blocks Used
The following table lists the blocks to be used in this model.
S.No. | Library Block | Description
1 | Digital Simulator | The Digital Simulator implements the discrete-event Model of Computation (MoC). The simulator maintains a notion of current time, and processes events chronologically in this time. Used to model elements that change with time such as hardware, software and networks.
2 | Traffic | Outputs a new Data Structure (DS) at time intervals specified by the "Time_Distribution" setting. A Data Structure is also known as a transaction and contains a list of Field Names + Values.
3 | ExpressionList | Executes a sequence of expressions in order. The default block contains one input and one output. The user can add multiple input and output ports.
4 | Mapper | Works with the separation of the behavior and architecture methodology. In this methodology, the mapper block is placed in behavior flow at every location where timed resources are required.
5 | Text_Display | Displays the values arriving on the input port in a text display dialog.
6 | TimeDataPlotter | Plots the incoming data on the Y-Axis against the current simulation time on the X-axis. Every wire connected to this block input is considered a separate dataset and plotted separately.
7 | SystemResource_Extended | Forms the architecture part of the behavior and architecture separation methodology.
8 | BusArbiter | The Bus Arbiter block is the Arbiter for a Bus Interface.
9 | BusInterface | Connects devices to the BusArbiter and has a queue for each port.
10 | RAM | Combines the operation of a basic memory controller (delay function) and the memory array. Handles pre-fetch, read, write, refresh, and controller operations.
11 | ArchitectureSetup | Handles all the address mapping, routing, plotting, statistics, and debugging for the hardware modeling components.
12 | Cache | Emulates a cache in an architectural model. There are interfaces on both sides of the block for connectivity.
13 | ResourceStatistics | Outputs or resets the statistics for all the SystemResource, Channel, Channel_N, Server, and Queues in the model.
Initial Setup
In the initial setup, you set the model parameters and instantiate a
Digital Simulator. Model parameters are used to control the system
configurations.
- Create a new block diagram (File > New Block Diagram Editor).
- Define the following parameters from Model Setup > Parameter.
- CPU_Time: 6.0e-3
- Sim_Time: 50.0
- DRAM_Speed: 250.0
- Cache_Speed: 250.0
- Bus_Speed: 250
- Number_of_CPUs: 2
- Task_Rate: 1.0
- Data_Size: 64
The parameter CPU_Time is the task
processing time excluding memory reference time. The parameter
Number_of_CPUs allows the designer to conduct trade-offs between single-core
and multi-core architectures. Task_Rate controls the rate at
which tasks are generated for an application.
- Drag the Digital block from Model Setup > DigitalSimulator onto the block diagram.
Figure 3: Digital Simulator
- Double click the Digital block and set the StopTime parameter as
Sim_Time. The value is taken from the value that was set for Sim_Time
in model parameters.
The rest of the parameters remain as default settings.
Figure 4: Digital Simulator Parameters
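As a rough sanity check on the model parameters defined in this section, the following Python sketch derives a few back-of-the-envelope figures. It is not VisualSim code. It assumes that the speed parameters are in MHz and that Task_Rate is the inter-arrival time in seconds; the function name derived_figures is hypothetical, and memory access time is ignored.

```python
# Back-of-the-envelope figures derived from the model parameters.
# Assumptions: speeds are in MHz, Task_Rate is the inter-arrival time in seconds.
PARAMS = {
    "CPU_Time": 6.0e-3,
    "Sim_Time": 50.0,
    "DRAM_Speed": 250.0,
    "Cache_Speed": 250.0,
    "Bus_Speed": 250.0,
    "Number_of_CPUs": 2,
    "Task_Rate": 1.0,
    "Data_Size": 64,
}


def derived_figures(p):
    """Return rough estimates; memory access time is ignored."""
    tasks = p["Sim_Time"] / p["Task_Rate"]                 # approximate transactions generated
    busy = tasks * p["CPU_Time"]                           # total CPU busy time in seconds
    util = busy / (p["Sim_Time"] * p["Number_of_CPUs"])    # mean utilization per CPU
    bus_cycle_ns = 1.0e3 / p["Bus_Speed"]                  # one bus clock period in ns
    return {"tasks": tasks, "cpu_utilization": util, "bus_cycle_ns": bus_cycle_ns}


figures = derived_figures(PARAMS)
print(figures)
```

With these values the CPUs are very lightly loaded, which is why the Further Analysis section suggests changing Task_Rate and Number_of_CPUs.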
Architectural Elements
In this stage, you configure blocks to represent the architectural elements - Bus, Cache, SRAM, and CPUs.
- Use the SystemResource_Extend blocks from Resources > SystemResource_Extend to represent the CPUs.
The reason for selecting a statistical
CPU block is to gather more information about the processing platform
and the performance at system level. Once the architecture is selected,
a detailed processor block can be used to replace a statistical
processor block.
- In the respective SystemResource_Extend blocks,
- Enter “CPU1” and “CPU2” for the This_Scheduler_Name parameter.
The Mapper uses this name to call the respective blocks to execute a
transaction.
- To turn on the request for individual statistics, select the Add_Scheduler_Times_to_DS check box.
- Select Non_Blocking_FCFS for the SchedulingType.
The rest of the parameters remain as default settings.
Figure 5: CPU1 Parameters
Figure 6: CPU2 Parameters
- Drag the BusArbiter and BusInterface
blocks from HardwareDevices > BusArbiter and HardwareDevices >
BusInterface to represent the Bus.
- In the BusArbiter block, set Bus_Speed_Mhz to Bus_Speed, the model parameter defined in the Initial Setup.
- Note: The "Devices_Attached.." starts from Port_Number_1 and goes incrementally.
The rest of the parameters retain the default settings.
Figure 7: BusArbiter Parameters
- Add two BusInterfaces from HardwareDevices.
The BusInterface receives/sends data
traffic from/to master/slave device based on the arbitration algorithm
defined in Bus Arbiter.
- In the second BusInterface, change the “Port_Name_1”
parameter to
“Port_Name_3” and the “Port_Name_2” parameter to “Port_Name_4”. Make
sure the Port Numbers are in order. If you add the Device names
into the Bus_Arbiter field for Attached_Devices, the port number and
the name order in the array must match.
The rest of the parameters retain the default settings. The first BusInterface retains all the default parameter settings.
Figure 8: BusInterface 2 Parameters
- Now instantiate a Cache block from ProcessorGenerator > Cache. Connect the Cache block to the right side of the BusInterface block.
- Configure the Cache block as shown below.
Figure 9: Cache Parameters
- Instantiate a RAM block from Memory > RAM. Connect this block to the right side of the BusInterface2 block. Configure the block as shown below.
Figure 10: RAM Parameters
- Now, instantiate two DeviceInterface blocks from HardwareSetup > DeviceInterface. Double-click each block and configure it as shown below.
Figure 11: DeviceInterface Parameters
Note that for the second DeviceInterface block, IO_Name must be set to CPU2.
The DeviceInterface blocks handle communication between the memory subsystem and the CPUs.
- Instantiate an Architecture_Setup block from HardwareSetup > ArchitectureSetup. This block maintains the routing information between components and generates statistics.
- The model with CPUs, Bus, Cache, and Memory should look as shown below.
Figure 12: Model A
Note that the hardware architecture
definitions are not complete yet as we need more information on
External Cache access. This information is explained later in this
tutorial.
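To relate Data_Size and the speed parameters to the sizing objective, the following Python sketch estimates the ideal time to move one transaction's data across a clocked link. It is only an illustration: the function transfer_time_ns and the 4-byte link width are hypothetical, the speed is assumed to be in MHz, and arbitration, queuing, and protocol overhead modeled by the VisualSim blocks are ignored.

```python
import math


def transfer_time_ns(data_size_bytes, speed_mhz, width_bytes):
    """Ideal time to move one transaction's data, ignoring arbitration and protocol overhead."""
    cycles = math.ceil(data_size_bytes / width_bytes)   # clock cycles needed on the link
    cycle_ns = 1.0e3 / speed_mhz                        # clock period in ns
    return cycles * cycle_ns


# Data_Size = 64 bytes and Bus_Speed = 250 come from the model parameters;
# the 4-byte width is a hypothetical value chosen only for illustration.
print(transfer_time_ns(64, 250.0, 4))
```

Comparing this ideal figure with the latency reported by the simulation gives a feel for how much time is spent waiting rather than transferring.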
Transaction Generation
To generate transactions, use the Traffic block. In addition, you
use a Mapper block to send the transactions to the relevant CPU.
- Drag and drop the Traffic block from Traffic > Traffic to generate transactions.
- In the Traffic block,
- Set the Data_Structure_Name as “Processor_DS”.
- Set Task_Rate for Value_1, which is the parameter for Time_Distribution.
- Select Fixed (Value_1) for Time_Distribution.
The rest of the parameters remain as default settings.
Figure 13: Traffic Parameters
- Set up an ExpressionList block from Behavior > ExpressionList to route the transaction to a relevant CPU and then to a relevant task.
- In the block, enter the following details for the Expression_List
parameter. (Note: Lines enclosed in /* */ are comments and can be ignored).
/*select the processor1 or processor2 */
input.Select_Processor = irand (0, 1)
input.Task_Destination = (input.Select_Processor == 0 && 2 == Number_of_CPUs)?"CPU1":"CPU2"
input.Task_Number = (input.Task_Destination=="CPU1")?1:2
/* Data Size */
input.A_Bytes = Data_Size
/* command */
input.A_Command = "Read"
The rest of the parameters remain as default settings.
Note that the expressions above define simple logic to decide whether a task
executes on CPU1 or CPU2. The logic to identify whether the CPU needs to
access data from the external Cache or already has the data in its
registers is defined in the next step.
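The routing expressions can be mirrored in plain Python to check how transactions are split between the CPUs. This is a sketch, not VisualSim code: it assumes that irand(0, 1) returns 0 or 1 with equal probability, and the function names route and destination_counts are hypothetical.

```python
import random
from collections import Counter


def route(number_of_cpus, rng):
    """Mirror the Task_Destination / Task_Number expressions for one transaction."""
    # Assumption: irand(0, 1) yields 0 or 1 with equal probability.
    select_processor = rng.randint(0, 1)
    destination = "CPU1" if (select_processor == 0 and number_of_cpus == 2) else "CPU2"
    task_number = 1 if destination == "CPU1" else 2
    return destination, task_number


def destination_counts(number_of_cpus, n=10_000, seed=1):
    """Count how many of n transactions are routed to each CPU."""
    rng = random.Random(seed)
    return Counter(route(number_of_cpus, rng)[0] for _ in range(n))


dual = destination_counts(2)
single = destination_counts(1)
print(dual, single)
```

With Number_of_CPUs set to 2 the transactions split roughly evenly. With Number_of_CPUs set to 1 the condition is never true, so every transaction resolves to "CPU2" and the single-processor case exercises the CPU2 block.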
- Instantiate two ExpressionList blocks from Behavior > ExpressionList. These blocks identify whether there is a Cache Hit.
- Double-click each ExpressionList block and configure it as shown below.
Here, we define a condition to check whether the CPU requires external
cache access or the data is already available in a register. We
assume that the data is available in the registers about 60% of the time;
the remaining 40% of the time it must be accessed from the external
cache/memory. This decision is used in the
architecture definitions to access data from the Cache if required; otherwise
the available data is used.
- Create an additional output port to release the processor.
/*Select the type of transaction */
input.Get_Data = rand(0.0, 100.0)
input.Need_Data = (input.Get_Data < 60.0)?"Reg":"Cache_Hit" /* "Reg" means data is available in a register; "Cache_Hit" means get data from cache */
Figure 14: Decision 1 Parameters
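The decision expressions can be checked outside the tool with a short Python sketch. It assumes that rand(0.0, 100.0) draws uniformly from that range; the function name need_data and the sample count are hypothetical, and this is not VisualSim code.

```python
import random


def need_data(rng, reg_percent=60.0):
    """Mirror the Get_Data / Need_Data expressions for one transaction."""
    get_data = rng.uniform(0.0, 100.0)   # stands in for rand(0.0, 100.0)
    return "Reg" if get_data < reg_percent else "Cache_Hit"


rng = random.Random(7)
samples = [need_data(rng) for _ in range(20_000)]
reg_fraction = samples.count("Reg") / len(samples)
print(f"Reg fraction: {reg_fraction:.3f}")
```

Over many transactions the fraction of "Reg" results settles near 0.6, matching the 60%/40% assumption stated in the Design Methodology.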
- Connect these two ExpressionList blocks to the top output ports of the CPUs.
- Instantiate a SystemResource_Done block from Resources > SystemResource_Done
to release the processor so that it can take the next instruction for processing.
Connect the “Done” port of the ExpressionList block to SystemResource_Done.
- Set up a Mapper block from Mappers > Mapper to issue tasks to the CPUs.
- In the Mapper block,
- Enter “Task_Destination” as the Parent_Scheduler_Name.
- Enter “Task_Number” for “Task_Number”.
- Set Task_Priority as 0.
- Set Task_Time as CPU_Time.
Task_Plot_ID retains the default value.
Figure 15: Mapper Parameters
- The assembled model should look as shown below. Save the model at this point.
Figure 16: Model B
Now the model is functionally complete. To understand the system
performance and resource statistics we must insert blocks to capture
the reports. The next section talks about generating Statistics and
Reports.
Resource Statistics and Reports
Multiple blocks are available to generate statistics and reports. In
this tutorial, we generate the processing latency for the tasks, the resource
utilizations (Bus, Cache, DRAM, and CPUs), and the activity profile for the
CPUs.
The processing latency plot helps the designer understand the total
time that the target platform takes to process a task, including
memory reference time. The resource utilization statistics provide
details on platform utilization for the current set of applications and
help in making decisions about future task processing requirements.
You may use the blocks detailed in the following steps to generate the reports mentioned above.
- Attach a TimeDataPlotter from Results > TimeDataPlotter
to the plot output (task_plot) port of the CPU blocks. This block plots
the activity of the tasks at these devices on the Y-axis against the
current simulation time on the X-axis.
- Set up an ExpressionList block from Behavior > ExpressionList to route the processing statistics and latency.
- Right-click the ExpressionList block, select Customize > Ports, and add an output port named latency (matching the name used in Output_Ports).
Figure 17: Customize Ports
- In the ExpressionList block, enter details for the following parameters
- Expression_List: /* No Expressions. */
- Output_Ports: output,latency
- Output_Values: input,input.TIME_RESPONSE
- Output_Condition: true,true
Figure 18: ExpressionList 2 Parameters
- Attach a TextDisplay block from Results > TextDisplay to output transactions.
- Attach a TimeDataPlotter from Results > TimeDataPlotter to the latency port to output the latency details.
- Use the ResourceStatistics block from Results > ResourceStatistics and a TextDisplay from Results > TextDisplay to output the statistics.
- In the ResourceStatistics block, enter details for the following parameters.
- Resource_Name: {"CPU1", "CPU2"} /* list of all the Resources as strings in an array */
- Resource_Length: {1,1} /* Number of Queues in each Resource and match the order in Resource_Name */
- Number_of_Samples: 1
- Statistics: true
- SimTime: Sim_Time
The Block_Name and Statistics parameters retain the default settings.
Figure 19: Resource Statistics Parameters
- Add a TextDisplay to the output port of the ResourceStatistics block and of the Architecture_Setup block.
The model should look as shown below.
Figure 20: Final Model
Statistics and Reports
Statistics and reports must be analyzed to determine whether the system
meets the requirements. The latency plot below shows that external
memory access is very limited, because the requested data is available
in the external cache most of the time. This tells us
that the designer must select a shared L2 Cache with a minimum hit rate of
95%.
The following images depict the results of the analysis.
Figure 21: Transactions Out
Figure 22: Collective Statistics
Figure 23: Processing Latency
By default, the Timing Diagram plot that shows the timeline activity for
the CPUs does not provide names for the timing diagram sequences, as
shown in the figure below. To associate names with each plot,
select Edit > Format and define the Y-ticks as shown below.
Figure 24: Set Plot Format
Figure 25: TimeLine Usage
Further Analysis
Perform the following modifications to conduct further analysis.
- Vary the inter-arrival time of the transactions by changing
Task_Rate (0.1 -> 5.0) to analyze the effect on the utilizations.
- Change the Number_of_CPUs value from 2 to 1. Check the statistics.
- Experiment by modifying the clock speed for the hardware devices.
- Evaluate response time computation by using the TIME field and TNOW. (End-to-End Latency)
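Before running these experiments in the tool, the expected trends can be estimated with a small Python sketch. It is a simplified stand-in for the model, not VisualSim code: it assumes fixed inter-arrival traffic, first-come-first-served service on each CPU, and no memory access time. It treats the TIME field as the creation timestamp of a transaction and TNOW as the simulation time when the transaction completes. The function name simulate is hypothetical.

```python
import random


def simulate(task_rate, number_of_cpus, cpu_time=6.0e-3, sim_time=50.0, seed=3):
    """Fixed inter-arrival traffic served first-come-first-served by one or two CPUs.

    Returns (mean utilization per CPU, mean end-to-end latency in seconds).
    Memory access time is ignored; only CPU_Time is modeled.
    """
    rng = random.Random(seed)
    free_at = [0.0] * number_of_cpus          # time at which each CPU becomes idle
    busy = 0.0
    latencies = []
    n_tasks = int(round(sim_time / task_rate))
    for i in range(1, n_tasks + 1):
        created = i * task_rate               # plays the role of the TIME field
        cpu = rng.randrange(number_of_cpus)   # random distribution across CPUs
        start = max(created, free_at[cpu])    # wait if the chosen CPU is still busy
        done = start + cpu_time               # plays the role of TNOW at completion
        free_at[cpu] = done
        busy += cpu_time
        latencies.append(done - created)      # end-to-end latency = TNOW - TIME
    utilization = busy / (sim_time * number_of_cpus)
    return utilization, sum(latencies) / len(latencies)


for rate in (0.1, 0.5, 1.0, 5.0):
    for cpus in (1, 2):
        util, latency = simulate(rate, cpus)
        print(f"Task_Rate={rate} CPUs={cpus} utilization={util:.4f} latency={latency:.6f}s")
```

In this simplified setting the utilization rises as Task_Rate shrinks and doubles when moving from two CPUs to one. Differences between these estimates and the simulated statistics point to the Bus, Cache, and RAM effects that the sketch leaves out.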