Block Overview
The Processor block in VisualSim supports preemption at the instruction level. Using this feature, the user can explore the impact of priority on the execution latency of tasks. To enable preemption, the user must add a hidden parameter called Preemption_Enable and set it to true (Preemption_Enable:true).
Make sure to send one task at a time to the Processor. External to the Processor, keep track of the A_Priority value of the currently executing task. The first task starts executing. When a task with a higher priority than the current task becomes available at the Proc_Preemption block, it is sent to the Processor block. The Processor interrupts the current task, sets certain fields in it, and sends it out via the instr_out port. The fields set are A_Preempt:true (a new field) and A_IDY=A_IDX, which is the last instruction executed. The new task then starts executing.
If the current task encounters a DMA (or large load operation), the Processor starts the DMA by sending the task to the external DMA, and also sends the task out on the instr_out port. The fields modified are A_Start_DMA=true and A_IDY=A_IDX. Another task can then be sent in to start executing. When the DMA operation has completed and is returned to the Processor, the Processor sends the same data structure out on the instr_out port, this time with A_Start_DMA:false.
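The preemption hand-off described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not VisualSim code: the Task and ToyProcessor classes, the submit and step methods, and the rule that a larger A_Priority number means higher priority are all assumptions made for the example. Only the field names (A_Priority, A_IDX, A_IDY, A_Preempt) and the instr_out port come from the text.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    A_Priority: int          # assumption: larger number = higher priority
    A_Instruction: list
    A_IDX: int = 0           # assumption: count of instructions executed so far
    A_IDY: int = 0
    A_Preempt: bool = False


class ToyProcessor:
    """Hypothetical stand-in for a Processor with Preemption_Enable:true."""

    def __init__(self):
        self.current = None
        self.instr_out = []   # tasks leaving on the instr_out port

    def step(self):
        """Execute one instruction of the current task, if there is one."""
        task = self.current
        if task is None:
            return
        task.A_IDX += 1
        if task.A_IDX >= len(task.A_Instruction):
            self.instr_out.append(task)   # task finished normally
            self.current = None

    def submit(self, task):
        """Offer a task; return True if it starts executing."""
        if self.current is None:
            self.current = task
            return True
        if task.A_Priority > self.current.A_Priority:
            old = self.current
            old.A_Preempt = True          # mark the interrupted task
            old.A_IDY = old.A_IDX         # remember where it stopped
            self.instr_out.append(old)
            self.current = task
            return True
        return False                      # caller keeps the task for later


def demo():
    proc = ToyProcessor()
    low = Task("low", 1, ["ADD", "SUB", "MUL", "DIV"])
    high = Task("high", 5, ["ADD", "ADD"])
    proc.submit(low)
    proc.step()
    proc.step()
    proc.submit(high)                     # preempts "low" after 2 instructions
    return proc, low, high
```

In this sketch the external logic would pick the preempted task off instr_out and resubmit it later, resuming from A_IDY.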
17 examples are provided in the VS_AR/doc/Doc_Support/ directory of VisualSim. They are named Processor_Model_PPT_xx.xml.
Instruction Set Setup
Architecture Setup
Timing Diagram
Basic Processor Model - Shows the setup of the processor block
Multi-Thread Processor - Shows the definition of a multi-threaded processor
Multi-Core Processor - Shows the use of the Processor block for defining multi-core
SIMD Processor Model - Single Instruction, Multi Data
MIMD Processor Model - Multi Instruction, Multi Data
Using SoftGen to generate profile-based synthetic instructions for the Processor
MIMD Processor Model - Multi Instruction, Multi Data
Changing the Clock Speed - Used when a specific operation or the stage of the pipeline needs to be expanded
Using multi cycle delay for the flush from annotated C-code
Preemption Enabled - Adding preemption to the Processor
Static Checker - Check the correctness of the Optional Parameters
Parameter_Value Window
The Parameter_Value column can also reference other model parameters.
Parameter_Name               Parameter_Value     Explanation
Processor_Instruction_Set:   MyInstructionSet    Name of the Instruction Set block associated with this Processor
Processor_Registers:         32                  Number of registers
Context_Switch_Cycles:       200                 Number of cycles between transactions: between two tasks, between a DMA request and the next task, or between a DMA return and the currently processing task. This must be a minimum of 10.
Processor_Speed_Mhz:         Processor_Speed     Speed of the processor in MHz
Instruction_Queue_Length:    6                   Length of the input queue, which holds multiple DS (tasks)
Instructions_per_Cycle:      6                   Optional parameter. Supports multiple instructions per cycle
ROB_Size:                    160                 Optional parameter. Specifies the reorder buffer size
I_1    {Processor_Name=Processor_1, Cache_Speed_Mhz=1000.0, Size_KBytes=16.0, Hit_Ratio=0.9999, Words_per_Cache_Access=1, Words_per_Cache_Line=16, Cache_Miss_Name=L2}
D_1    {Cache_Speed_Mhz=500.0, Size_KBytes=64.0, Words_per_Cache_Line=16, Cache_Miss_Name=L_2}
L_2    {Cache_Speed_Mhz=500.0, Size_KBytes=64.0, Words_per_Cache_Line=16, Cache_Miss_Name=Cache_1}
Explanation of the cache lines: the I_1 line shows the cache parameters in full; the D_1 line shows only the minimum required parameters.
I_1                          Name of this cache
Processor_Name=Processor_1   Optional parameter. If the cache is in another processor, enter the name of that processor.
Cache_Speed_Mhz=1000.0       Speed of the cache. Can be different from the Processor speed.
Size_KBytes=16.0             Size of the memory. Used to determine the cache boundary for miss requests.
Hit_Ratio=0.9999             Optional parameter. Used to fix the hit ratio.
Words_per_Cache_Access=1     Optional parameter. Number of data accesses per cycle.
Words_per_Cache_Line=16      Required to identify the end of a line and generate a miss, plus a prefetch.
Outstanding_Req_Count=3      Optional parameter. Number of outstanding requests that can be made to the corresponding memory. Used along with an External cache.
Cache_Type=Load_Store        Optional parameter. Specifies that the cache is a Load_Store cache. Use this only if the pipeline stages do not already specify the Load_Store cache name as an Execution_Location.
Cache_Miss_Name=L2           Next level of memory when there is a miss
The cache can also be set up outside the Processor. In that case, use the keyword "External_" + Cache_Name when setting the cache hierarchy. Such a cache can be a cycle-accurate cache.
External_I_Cache_Name: {}
External_D_Cache_Name: {Outstanding_Req_Count=3}
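To make the cache hierarchy concrete, the following Python sketch parses cache entries in the {Key=Value, ...} form shown above and follows the Cache_Miss_Name links from one level to the next. The function names parse_cache_entry and miss_chain are made up for this illustration; the sketch only reads the text format and says nothing about how the Processor block itself interprets it.

```python
def parse_cache_entry(text):
    """Parse '{Key=Value, Key=Value}' into a dict; numeric values become floats."""
    body = text.strip().strip("{}").strip()
    params = {}
    if not body:
        return params                      # e.g. External_I_Cache_Name: {}
    for item in body.split(","):
        key, _, value = item.strip().partition("=")
        try:
            params[key] = float(value)
        except ValueError:
            params[key] = value            # names such as L_2 stay as strings
    return params


def miss_chain(caches, start):
    """Follow Cache_Miss_Name from 'start' until a name with no entry is reached."""
    chain = [start]
    while chain[-1] in caches:
        nxt = caches[chain[-1]].get("Cache_Miss_Name")
        if nxt is None or nxt in chain:    # stop on a missing link or a loop
            break
        chain.append(nxt)
    return chain


CACHES = {
    "D_1": parse_cache_entry(
        "{Cache_Speed_Mhz=500.0, Size_KBytes=64.0, "
        "Words_per_Cache_Line=16, Cache_Miss_Name=L_2}"),
    "L_2": parse_cache_entry(
        "{Cache_Speed_Mhz=500.0, Size_KBytes=64.0, "
        "Words_per_Cache_Line=16, Cache_Miss_Name=Cache_1}"),
}
```

With the D_1 and L_2 entries from the parameter listing, miss_chain(CACHES, "D_1") gives D_1, then L_2, then Cache_1, which is the order in which a miss is passed down.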
Instruction Set: The Processor_Instruction_Set parameter refers to a separate block called Instruction_Set; in this case the name of the block is MyInstructionSet (see above). It contains information about each execution unit (INT_n, FP_n), where INT means integer and FP means floating point. "begin INT_1 ... end INT_1" defines the instructions for this execution unit. Processor instructions can be grouped if they have the same number of cycles, or the entire instruction set can be listed.
Mnew Min Max ; /* Label */
PROC INT_1 FP_1 ;
begin size_config ; /* Specify the Load store instructions */
Read 3 32 LDR,LDUR ; /* <Command> <PipelineStage number> <Size in bits> <Instruction/Instructions/Execution_Unit[startIndex:endIndex]> */
Write 3 32 STR ;
end size_config ;
begin execUnit_config ; /* Specify the Execution unit queue sizes */
Queue_Size INT_1 2 ;
Queue_Size FP_1 2 ;
end execUnit_config ;
begin INT_1 ; /* Group */
ADD 2 ;
SUB 2 ;
*b 2 ;
MUL 4 ;
DIV 4 ;
LDR 1 ;
LDUR 1 ;
STR 1 ;
end INT_1 ;
begin FP_1 ; /* Group */
FADD 2 ;
FSUB 2 ;
FMUL 4 8 ;
FDIV 4 12 ;
end FP_1 ;
The entry FADD 2 ; means that the instruction "FADD" takes 2 cycles to complete execution (not including the I_Cache access latency and pipeline transfer latency).
The entry FMUL 4 8 ; means that the instruction FMUL takes a random number of cycles between 4 and 8 to complete execution.
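The fixed and ranged cycle entries can be illustrated with a short Python sketch that reads an execution unit group and draws a cycle count for an instruction. The names parse_unit and sample_cycles are hypothetical. The sketch also assumes that a ranged entry is drawn uniformly over whole cycles with both ends included, which the text does not state.

```python
import random


def parse_unit(lines):
    """Map mnemonic -> (min_cycles, max_cycles) from 'NAME min [max] ;' entries."""
    table = {}
    for line in lines:
        # Drop the trailing /* comment */ and the ';' terminator.
        tokens = line.split("/*")[0].replace(";", " ").split()
        if not tokens or tokens[0] in ("begin", "end"):
            continue
        name, lo = tokens[0], int(tokens[1])
        hi = int(tokens[2]) if len(tokens) > 2 else lo   # single value: fixed delay
        table[name] = (lo, hi)
    return table


def sample_cycles(table, mnemonic, rng=random):
    """Draw a cycle count; assumes a uniform draw over the inclusive range."""
    lo, hi = table[mnemonic]
    return rng.randint(lo, hi)


FP_1 = parse_unit([
    "begin FP_1 ; /* Group */",
    "FADD 2 ;",
    "FSUB 2 ;",
    "FMUL 4 8 ;",
    "FDIV 4 12 ;",
    "end FP_1 ;",
])
```

Here sample_cycles(FP_1, "FADD") always returns 2, while sample_cycles(FP_1, "FMUL") returns a value from 4 to 8.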
The Pipeline is a separate parameter window in the Processor block and can vary from two to twenty stages, depending on the processor being modeled.
Stage_Name    Execute_Location   Action   Condition ;
1_PREFETCH    I_1                instr    none ;   // I_Cache access
2_DECODE      I_1                wait     none ;
3_DISPATCH    D_1                issue    6 ;      // from the dispatch stage, the width is set to be a max of 6 uop per cycle
3_EXECUTE     INT                exec     none ;   // instruction execution
3_EXECUTE     INT                wait     none ;
4_STORE       D_1                write    none ;
The Pipeline shown above is the classic four stage pipeline for prefetch, decode, execute and store back results. More advanced pipeline execution can be modeled and references can be made to other processors and external blocks.
There are 4 columns in the pipeline definition. The first column has the stage number followed by a "_" and an identifier. Multiple lines can be defined for a pipeline stage. The name can be any descriptive value. The stage number must be an integer, and the stages must be listed in order. The number of stages must match the Number_of_Pipelines_Stages parameter of the processor.
The second column is the location where the line must be executed. This can be a cache or an execution unit of this block. It can also be an execution unit of another processor, or another custom block that has a path defined in the Routing Table. To learn more about the Routing Table, review the Architecture Setup document or the Advanced Modeling Guide.
The third column specifies the action to be performed. The possible options are instr, read, write, wait and exec. instr represents an instruction access and is a keyword. For a data access, the Action can be a read or a write. When the request needs to wait for a response, the wait action is added. If the pipeline needs to perform an external action, such as accessing a hardware engine or a co-processor, or writing data via an external definition, then add the "task" keyword here. In that case the fourth column, i.e. the Condition column, must specify the destination for the task. All other actions do not use the Condition column.
For an external task, the Execute_Location refers to a Scheduler or other named device outside of this Processor block. The Condition column refers to the Instruction Unit that has the list of instructions to be executed in the external device.
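The four-column layout and the ordering rule for stage numbers can be checked with a small Python sketch. The helper names parse_pipeline, check_stage_order and stage_count are assumptions for this example, and the checks shown are only the ones stated in the text (integer stage numbers, listed in order); the Processor block may apply others.

```python
PIPELINE = """
Stage_Name Execute_Location Action Condition ;
1_PREFETCH I_1 instr none ;   // I_Cache access
2_DECODE   I_1 wait  none ;
3_DISPATCH D_1 issue 6 ;
3_EXECUTE  INT exec  none ;   // instruction execution
3_EXECUTE  INT wait  none ;
4_STORE    D_1 write none ;
"""


def parse_pipeline(text):
    """Return one dict per pipeline line, skipping the header and comments."""
    rows = []
    for raw in text.strip().splitlines():
        tokens = raw.split("//")[0].replace(";", " ").split()
        if len(tokens) != 4 or tokens[0] == "Stage_Name":
            continue
        number, _, label = tokens[0].partition("_")
        rows.append({
            "stage": int(number),          # raises ValueError if not an integer
            "label": label,
            "location": tokens[1],
            "action": tokens[2],
            "condition": tokens[3],
        })
    return rows


def check_stage_order(rows):
    """True if stage numbers never decrease from one line to the next."""
    numbers = [row["stage"] for row in rows]
    return all(a <= b for a, b in zip(numbers, numbers[1:]))


def stage_count(rows):
    """Number of distinct stages, to compare with Number_of_Pipelines_Stages."""
    return len({row["stage"] for row in rows})
```

For the pipeline listed above, the sketch finds six lines spread over four stages, with two lines sharing the label 3_EXECUTE.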
Operation of the Pipeline:
The instructions are received in a Data Structure arriving on the Instr_In port on the left-side. The Data Structure is a task and can contain multiple instructions. The list of instructions supported by this block is listed in the Instruction_Table. The tasks are stored in the Instruction Queue. The length is defined by the parameter Instruction_Queue_Length. The head of the Instruction Queue is sent to the pipeline. The instructions within a task are executed in sequence.
Statistics and Plotting
There are three sets of statistics and a series of Timing Diagrams available as standard for the Processor. The timing diagrams are for the Register Read, Register Write, I_1, D_1, L_2 (if available), INT_1, INT_2 (if available), FP_1 (if available), and FP_2 (if available). The first set consists of the statistics that are added to the Data Structure when the task has completed processing. This update is available at the instr_out port and the Software_Mapper, depending on the origination. The list is:
CYCLES_IN_PROCESSOR = 603.0 (Number of cycles in the processor for this task)
CYCLES_PER_INSTRUCTION = 1.206 (CYCLES_IN_PROCESSOR/size of (A_Instruction))
MHZ_PROCESSOR = 2.0E9 (Final processor speed. This will be different if the clock speed has been modified by an instruction.)
MIPS_IN_PROCESSOR = 1400.560224089636 (Millions of Instructions per second)
TIME_IN_PROCESSOR = 3.57E-7 (Duration of time in the Processor)
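The sample values can be cross-checked with a little arithmetic. The text gives CYCLES_PER_INSTRUCTION as CYCLES_IN_PROCESSOR divided by the number of instructions in A_Instruction, so the instruction count can be recovered from the two reported values. The MIPS relation used below (instructions divided by TIME_IN_PROCESSOR, in millions) is an inference from the sample numbers, not a formula stated in the documentation, and derive_metrics is a hypothetical helper.

```python
def derive_metrics(cycles_in_processor, cycles_per_instruction, time_in_processor):
    """Recover the instruction count and an inferred MIPS figure."""
    # From CYCLES_PER_INSTRUCTION = CYCLES_IN_PROCESSOR / size of (A_Instruction)
    instruction_count = cycles_in_processor / cycles_per_instruction
    # Assumed relation: instructions per second, expressed in millions
    mips = instruction_count / time_in_processor / 1e6
    return instruction_count, mips


# Sample values from the statistics listing
COUNT, MIPS = derive_metrics(603.0, 1.206, 3.57e-7)
```

With the listed values this gives about 500 instructions and about 1400.56 MIPS, which agrees with the reported MIPS_IN_PROCESSOR.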
The second set is the utilization metrics of the caches, registers, pipeline and execution units. There is a difference between Proc (Processor) and Pipeline utilization. The utilization of the pipeline indicates the percentage of time the pipeline is in the Active state. The processor utilization is the total number of instructions processed over time.
Processor_1_D_1_Utilization_Pct_Max = 5.245,
Processor_1_D_1_Utilization_Pct_Mean = 5.245,
Processor_1_D_1_Utilization_Pct_Min = 5.245,
Processor_1_D_1_Utilization_Pct_StDev = 0.0,
Processor_1_FP_1_Utilization_Pct_Max = 1.145,
Processor_1_FP_1_Utilization_Pct_Mean = 1.145,
Processor_1_FP_1_Utilization_Pct_Min = 1.145,
Processor_1_FP_1_Utilization_Pct_StDev = 0.0,
Processor_1_FP_2_Utilization_Pct_Max = 0.57,
Processor_1_FP_2_Utilization_Pct_Mean = 0.57,
Processor_1_FP_2_Utilization_Pct_Min = 0.57,
Processor_1_FP_2_Utilization_Pct_StDev = 0.0,
Processor_1_INT_1_Utilization_Pct_Max = 2.87,
Processor_1_INT_1_Utilization_Pct_Mean = 2.87,
Processor_1_INT_1_Utilization_Pct_Min = 2.87,
Processor_1_INT_1_Utilization_Pct_StDev = 0.0,
Processor_1_INT_2_Utilization_Pct_Max = 1.145,
Processor_1_INT_2_Utilization_Pct_Mean = 1.145,
Processor_1_INT_2_Utilization_Pct_Min = 1.145,
Processor_1_INT_2_Utilization_Pct_StDev = 0.0,
Processor_1_I_1_Utilization_Pct_Max = 4.925,
Processor_1_I_1_Utilization_Pct_Mean = 4.925,
Processor_1_I_1_Utilization_Pct_Min = 4.925,
Processor_1_I_1_Utilization_Pct_StDev = 0.0,
Processor_1_L_2_Utilization_Pct_Max = 4.075,
Processor_1_L_2_Utilization_Pct_Mean = 4.075,
Processor_1_L_2_Utilization_Pct_Min = 4.075,
Processor_1_L_2_Utilization_Pct_StDev = 0.0,
Processor_1_PROC_Utilization_Pct_Max = 2.3733333333333,
Processor_1_PROC_Utilization_Pct_Mean = 2.3733333333333,
Processor_1_PROC_Utilization_Pct_Min = 2.3733333333333,
Processor_1_PROC_Utilization_Pct_StDev = 0.0,
Processor_1_Pipeline_Utilization_Pct_Max = 2.885,
Processor_1_Pipeline_Utilization_Pct_Mean = 2.885,
Processor_1_Pipeline_Utilization_Pct_Min = 2.885,
Processor_1_Pipeline_Utilization_Pct_StDev = 0.0,
Processor_1_Register_Rd_Utilization_Pct_Max = 0.955,
Processor_1_Register_Rd_Utilization_Pct_Mean = 0.955,
Processor_1_Register_Rd_Utilization_Pct_Min = 0.955,
Processor_1_Register_Rd_Utilization_Pct_StDev = 0.0,
Processor_1_Register_Wr_Utilization_Pct_Max = 0.43,
Processor_1_Register_Wr_Utilization_Pct_Mean = 0.43,
Processor_1_Register_Wr_Utilization_Pct_Min = 0.43,
Processor_1_Register_Wr_Utilization_Pct_StDev = 0.0,
The last set is the throughput metrics for the caches, registers, pipeline and execution units. The context switch time is defined in the Processor parameters. The Context_Switch_Time_Pct statistic measures the percentage of time consumed by context switching, that is, by switching between tasks. KB_per_Thread is a measure of the amount of cache needed to complete the processing, which gives an idea of the cache size required. The stall time statistic shows the time the Task spends getting data or making IO calls; this is the time during which the task holds the pipeline but does nothing with it. The Task delay is an average over all the tasks executed on this processor. The L_2 hit ratio field is not currently used; the plan is to add it in the future. KB_per_Thread is also not currently used.
Processor_1_Context_Switch_Time_Pct_Max = 45.095,
Processor_1_Context_Switch_Time_Pct_Mean = 45.095,
Processor_1_Context_Switch_Time_Pct_Min = 45.095,
Processor_1_Context_Switch_Time_Pct_StDev = 0.0,
Processor_1_D_1_Hit_Ratio_Max = 100.0,
Processor_1_D_1_Hit_Ratio_Mean = 13.4920634920635,
Processor_1_D_1_Hit_Ratio_Min = 0.0,
Processor_1_D_1_Hit_Ratio_StDev = 29.3655668615204,
Processor_1_D_1_KB_per_Thread_Max = 0.0,
Processor_1_D_1_KB_per_Thread_Mean = 0.0,
Processor_1_D_1_KB_per_Thread_Min = 0.0,
Processor_1_D_1_KB_per_Thread_StDev = 0.0,
Processor_1_I_1_Hit_Ratio_Max = 100.0,
Processor_1_I_1_Hit_Ratio_Mean = 40.8549783549784,
Processor_1_I_1_Hit_Ratio_Min = 0.0,
Processor_1_I_1_Hit_Ratio_StDev = 44.307741292233,
Processor_1_I_1_KB_per_Thread_Max = 0.0,
Processor_1_I_1_KB_per_Thread_Mean = 0.0,
Processor_1_I_1_KB_per_Thread_Min = 0.0,
Processor_1_I_1_KB_per_Thread_StDev = 0.0,
Processor_1_L_2_Hit_Ratio_Max = 0.0,
Processor_1_L_2_Hit_Ratio_Mean = 0.0,
Processor_1_L_2_Hit_Ratio_Min = 0.0,
Processor_1_L_2_Hit_Ratio_StDev = 0.0,
Processor_1_L_2_KB_per_Thread_Max = 0.0,
Processor_1_L_2_KB_per_Thread_Mean = 0.0,
Processor_1_L_2_KB_per_Thread_Min = 0.0,
Processor_1_L_2_KB_per_Thread_StDev = 0.0,
Processor_1_Stall_Time_Pct_Max = 50.715,
Processor_1_Stall_Time_Pct_Mean = 50.715,
Processor_1_Stall_Time_Pct_Min = 50.715,
Processor_1_Stall_Time_Pct_StDev = 0.0,
Processor_1_Task_Delay_Max = 2.737E-6,
Processor_1_Task_Delay_Mean = 2.2243766233766E-6,
Processor_1_Task_Delay_Min = 1.39E-7,
Processor_1_Task_Delay_StDev = 4.1390570393254E-7,
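For post-processing, statistics of this form can be read into a dictionary and grouped by metric. The Python sketch below assumes each statistic has been placed on one line as "name = value," and that every metric carries the _Max, _Mean, _Min and _StDev suffixes seen in the listing; parse_stats and summarize are hypothetical helper names, not part of VisualSim.

```python
def parse_stats(text):
    """Turn 'name = value,' lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue                       # skip lines without '='
        stats[name.strip()] = float(value.strip().rstrip(","))
    return stats


def summarize(stats, metric):
    """Collect the Max/Mean/Min/StDev values reported for one metric."""
    summary = {}
    for suffix in ("Max", "Mean", "Min", "StDev"):
        key = f"{metric}_{suffix}"
        if key in stats:
            summary[suffix] = stats[key]
    return summary


SAMPLE = """
Processor_1_Task_Delay_Max = 2.737E-6,
Processor_1_Task_Delay_Mean = 2.2243766233766E-6,
Processor_1_Task_Delay_Min = 1.39E-7,
Processor_1_Task_Delay_StDev = 4.1390570393254E-7,
"""

TASK_DELAY = summarize(parse_stats(SAMPLE), "Processor_1_Task_Delay")
```

A metric whose Max, Mean and Min are equal and whose StDev is 0.0, such as the utilization figures in the sample run, was reported with a single value.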
begin INT_1 ; /* Group */
ADD 2 ;
SUB 2 ;
*b 2 ;
MUL 4 ;
DIV 4 ;
LDR 1 ;
STR 1 ;
end INT_1 ;
*branch instructions (*b in this example) represent branch mispredictions.
Detailed Documentation:
Comprehensive information on the processor can be found in the Advanced Modeling Guide.
Block Keywords:
INT_n -- name of integer execution units, 1 through n
FP_n -- name of floating point execution units, 1 through n
Field Details
This is the name of the ArchitectureSetup block that this Processor is associated with. The Architecture_Setup block maintains the routing table and statistics collection. Type is String.
This is a unique name of the Processor block. No other architecture component can have this name. Type is String
Setup Processor parameters. Type is text string.
Define Processor pipeline stage execution. Type is text string.
String attribute. Width of the Processor in bits, which forms a processor word: 16, 32, or 64. Pulldown selection.
Instruction input port. This can be connected to any VisualSim library block, model of a RTOS or custom-code. This can also be connected to a Bus_port or other blocks. The type is general.
Instruction output port. This can be connected to any VisualSim library block, model of a RTOS or custom-code. This can also be connected to a Bus_port or other blocks. The type is general.
Bus output port. This is one of two Bus connections on the left side (East). The type is general.
Bus input port. This is one of two Bus connections on the left side (East). The type is general.
Bus output2 port. This is one of two Bus connections on the left side (East). The type is general.
Bus input2 port. This is one of two Bus connections on the left side (East). The type is general.
Reject output port. When the instruction queue is full, the incoming instruction is rejected and placed on this port. The type is general.
This is an output port through which the Processor sends out interrupts to the DMA. The Processor then continues its operation while the DMA carries out the functions intended for the particular instruction sent out by the Processor. (Future feature, but the ports have been allocated in this version.)
This is an input port. When the DMA completes its task, the task comes back to the Processor on this port, so the Processor knows that the task assigned to the DMA has been completed. (Future feature, but the ports have been allocated in this version.)