Virtual Experiments with Optimal System Resource Allocation in Zynq 7000

The Zynq 7000 Programmable SoC from AMD-Xilinx is an FPGA device that combines processors, application processor units, on-chip memories, memory controllers, twenty interfaces, and programmable logic. The massive processing power, the number of interfaces, and the large logic resources make the system designer's job technically challenging. With systematic planning, users can take full advantage of this complex device and create a cost-competitive, high-performance product.

At Mirabilis Design, we conducted many simulation experiments covering resource distribution across the ARM cores and programmable logic, the number of interfaces handled, and large-scale memory accesses. The purpose was to evaluate the behavior and performance of Zynq 7000 for a variety of applications. Because virtual models offer greater flexibility for experimentation, we have developed a complete, configurable Zynq 7000 Programmable SoC model that enables customers to identify the task distribution across the processing unit and programmable logic that yields better performance while keeping power consumption at an optimum level.

This article discusses how early exploration of the Zynq 7000 All Programmable SoC through modeling and simulation helps users meet their performance requirements effectively.

Early exploration of Zynq 7000 resources for the target application substantially reduces product retries and the possibility of system failure. One of our customers was experiencing difficulty with a radio implemented on Zynq 7000. The design could not achieve the required performance and showed a considerable delay in the receiver module. When they modeled the same requirement in VisualSim, they could see that the task distribution across the on-board resources was the main reason for not meeting the performance requirements. Beyond resolving the existing problem, they could also leave extra room for future software enhancements.

We conducted a couple of simple but crucial analyses on Zynq 7000 using VisualSim. Here we compare a system with a single CAN interface against a system with two CAN interfaces and a USB interface. The design uses ARM Cortex-A9 processor cores and also has a MicroBlaze implemented in the Programmable Logic (PL).

Design Considerations

The proposed system architecture receives data from a single CAN interface. Because the application software is often not ready at this stage, the user can define the behavior flow in a form similar to a flow chart and provide task definitions along with service time details. These details need not be as accurate as those of a software model, because VisualSim generates the profile-based task distribution across the target hardware platform from the abstract definitions. The flow replicates the behavior of the actual application software and also accounts for the data rate and service time. The VisualSim model of the proposed system is shown in Figure 1.

Figure 1 – Zynq 7000 model

The clock speeds of the processors, bus, memories, and interconnects are fully configurable and default to 866 MHz.

The behavior flow of the application is defined in the Behavior flows block. Here the user maps tasks onto the hardware architecture elements. For example, a user can analyze two different mapping approaches: one runs all tasks on the application processor (APU), and the other runs a few tasks on the APU and a few on the Programmable Logic. This gives the system developer more leverage and more confidence in the selected configuration. Figure 2 shows a sample behavior flow with a single CAN interface.

Figure 2 – Behavior Flow
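
The comparison between the two mapping approaches can be illustrated outside the simulator with a short Python sketch. This is not VisualSim code. The task names, cycle counts, and the `busy_time_us` and `makespan_us` helpers are hypothetical, chosen only to show the idea, and the sketch assumes that the APU and PL both run at the 866 MHz default and can work in parallel on independent tasks.

```python
# Hypothetical task list: (name, cycles on the APU, cycles on the PL).
# These numbers are illustrative assumptions, not VisualSim output.
TASKS = [
    ("can_receive", 400, 250),
    ("decode", 1200, 300),
    ("filter", 2000, 500),
    ("log_result", 300, 600),
]

CLOCK_HZ = 866e6  # default clock speed named in the text


def busy_time_us(mapping):
    """Return the busy time in microseconds of each resource for a mapping."""
    busy = {"APU": 0.0, "PL": 0.0}
    for name, apu_cycles, pl_cycles in TASKS:
        resource = mapping[name]
        cycles = apu_cycles if resource == "APU" else pl_cycles
        busy[resource] += cycles / CLOCK_HZ * 1e6
    return busy


def makespan_us(mapping):
    """Lower bound on completion time if the APU and PL work in parallel."""
    return max(busy_time_us(mapping).values())


# Mapping 1: every task on the APU. Mapping 2: a split between APU and PL.
all_on_apu = {name: "APU" for name, _, _ in TASKS}
split = {"can_receive": "APU", "decode": "PL", "filter": "PL", "log_result": "APU"}

for label, mapping in (("all on APU", all_on_apu), ("APU + PL split", split)):
    busy = busy_time_us(mapping)
    print(f"{label}: APU {busy['APU']:.2f} us, PL {busy['PL']:.2f} us, "
          f"bound {makespan_us(mapping):.2f} us")
```

A real model also has to account for task dependencies, bus traffic, and memory contention, which is why the simulation is needed; the sketch only shows why the choice of mapping changes how much work each resource carries.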

The initial system requirement was to have only one input, coming from the CAN interface, with the behavior flow defined in Figure 2. We were also interested in how the system behaves when a few more interfaces are added. Version 2 of the system could have additional data coming from USB, PCIe, CAN, UART, and other interfaces. Version 2 may represent your future requirements, and you still want to make sure that the system does not compromise on performance or power requirements and that it works without losing any critical data. At this point, the user can conduct a very detailed analysis of the target architecture by varying the data rate and the inputs from the various peripherals and interfaces.

In Version 2 of the same model, we implemented a MicroBlaze on the Programmable Logic and added two input interfaces: CAN and USB. The main purpose of adding the MicroBlaze is to address reliability. Single threading improves system reliability: one can place one computationally intensive task thread on each Cortex-A9 and instantiate as many MicroBlaze processors on the PL as needed for the remaining tasks. In our design we instantiated only one MicroBlaze processor.

The behavior flow of Version 2 of the system is shown in Figure 3.

Figure 3: Behavior Flow

A simulation run gives a broader picture of on-board resource utilization, power consumption, and end-to-end latency for each application, and it also identifies possible bottlenecks.

Analysis and Results

We ran the simulation on a 2.6 GHz Windows 8.1 machine with 4 GB of RAM. We simulated 20 microseconds of real time, which took 6.8 seconds of wall-clock time to finish. The model was created within a few hours because VisualSim provides a configurable Zynq 7000 All Programmable SoC model. By default, the Programmable Logic has no implementation, and a module is provided for creating the behavior flow for the user's requirements. The rest of the system, including the IO controllers, APUs with ARM Cortex-A9, on-chip memory, AXI, watchdog, snoop control, and interrupt control, is preconfigured, and users typically do not need to modify it.

We ran simulations for Version 1 of the system, with only one CAN interface, and Version 2 of the system, with two CAN interfaces and a USB interface. Table 1 shows the utilization report for the various on-board resources.

Table 1: Utilization Report

The results shown in Table 1 are quite unexpected. Even though we are running a few additional tasks on the Version 2 architecture, the processor utilization is lower than in Version 1. This indicates that either the processors are rejecting most of the incoming tasks or the stall time is quite high. In fact, the mean stall time was 49.46% for Processor_2 and 29.81% for Processor_1. To overcome the issue with Version 2 of the system, we first increased the memory speed and reduced the memory access time. The outcome was not impressive, however, and many tasks were still being dropped. Because we were calling the MicroBlaze co-processor implemented in the programmable logic from the processor pipeline, this was another potential bottleneck. The MicroBlaze statistics showed that it was consuming many cycles, which delayed the return of tasks to the Processor_1/Processor_2 pipelines. If the cycle time in MicroBlaze is too long, it directly affects the performance of the overall system because the application processors are stalled. The processor stall time before and after the changes is shown in Figure 4.

Figure 4: Processor Stall Time
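
The link between co-processor cycle count and application processor stall time can be sketched with simple arithmetic. The Python snippet below is a minimal illustration, not a reproduction of the VisualSim statistics. It assumes a blocking call, where the processor does its own work and then waits until the co-processor returns. The `stall_fraction` helper and all cycle counts are hypothetical.

```python
def stall_fraction(local_cycles, coproc_cycles, apu_hz, coproc_hz):
    """Fraction of a task's time the application processor spends stalled.

    Assumes a blocking call: the processor does its own work, then waits
    idle until the co-processor returns the result.
    """
    busy_s = local_cycles / apu_hz
    stall_s = coproc_cycles / coproc_hz
    return stall_s / (busy_s + stall_s)


# Hypothetical cycle counts chosen only to illustrate the trend.
before = stall_fraction(local_cycles=1000, coproc_cycles=1000,
                        apu_hz=866e6, coproc_hz=866e6)
after = stall_fraction(local_cycles=1000, coproc_cycles=400,
                       apu_hz=866e6, coproc_hz=866e6)

print(f"stall fraction with a slow co-processor call:   {before:.1%}")
print(f"stall fraction with a faster co-processor call: {after:.1%}")
```

Under this assumption, cutting the cycles spent in the co-processor shrinks the stalled share of each task, so the processor spends more of its time doing useful work.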

After reducing the MicroBlaze clock cycles and increasing the APU clock speed by about 20 MHz, we saw a considerable reduction in the application processor stall times, and the utilization of the processing resources increased in Version 2 of the system. Table 2 shows the processor utilization report after the architecture configuration changes.

Table 2: Resource utilization

Another important factor to consider is the end-to-end latency. The end-to-end latency for Version 1 of the system was a maximum of 3.5 microseconds. In Version 2, with three applications, the maximum latency is 11 to 14 microseconds. Figure 5 shows the latency of the applications running on the two versions of the system.

Figure 5: End to End Latency

As the latency graph in Figure 5 shows, the latency curve increases exponentially in both systems. This tells you that the incoming traffic rate is quite high and that traffic is being buffered at the bus, the memory, or the processing resource itself. If the buffering is at the processing resource, a few tasks may have to wait; if it is at the bus, you may experience extra overhead in arbitrating the data. To overcome this problem, you can reduce the traffic data rate, change the scheduling mechanism, or reconfigure the hardware architecture elements. This way, one can quickly identify how to use Zynq 7000 resources effectively based on the requirements and make sure the device is fully exploited during implementation.
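
The buffering effect described above can be illustrated with a minimal single-resource queue in Python. This is a sketch under simplifying assumptions (fixed arrival period, fixed service time, one FIFO resource), not the VisualSim model, and the rates are hypothetical. With these deterministic inputs the latency grows steadily rather than following the exact curve of Figure 5, but the cause is the same: tasks arrive faster than the resource can serve them.

```python
def fifo_latencies(arrival_period_us, service_time_us, n_tasks):
    """End-to-end latency of each task at a single FIFO resource.

    Tasks arrive at a fixed period and are served one at a time.
    Latency = waiting time in the buffer + service time.
    """
    latencies = []
    free_at = 0.0  # time at which the resource finishes its current task
    for i in range(n_tasks):
        arrival = i * arrival_period_us
        start = max(arrival, free_at)  # wait if the resource is still busy
        free_at = start + service_time_us
        latencies.append(free_at - arrival)
    return latencies


# Hypothetical rates: each task needs 0.5 us of service.
overloaded = fifo_latencies(arrival_period_us=0.4, service_time_us=0.5, n_tasks=10)
sustainable = fifo_latencies(arrival_period_us=0.6, service_time_us=0.5, n_tasks=10)

print("overloaded :", [round(x, 2) for x in overloaded])
print("sustainable:", [round(x, 2) for x in sustainable])
```

Lowering the data rate (a longer arrival period) or shortening the service time brings the queue back to a stable latency, which mirrors the remedies listed above.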

Conclusion

Even though Zynq 7000 is a configurable SoC, it is advantageous, and recommended, to analyze the application on Zynq 7000 beforehand using system-level design. Because the device is highly complex and capable, such analysis lets one take full advantage of the platform for a specific requirement, achieve optimal performance, and keep the hardware footprint minimal. The VisualSim Zynq 7000 All Programmable SoC library enables users to quickly model their logic on the PL and the general-purpose processor unit. Users can run all their test scenarios on the VisualSim Zynq 7000 model before implementing them on real hardware to get the best out of the hardware platform.