Industry Background
Artificial Intelligence (AI) applications must be designed with compute, storage, memory, pipeline, communication interfaces, software, and control in mind. AI application processing can be distributed across multiple cores within a processor, multiple processor boards on a PCIe backbone, computers on an Ethernet network, high-performance computers, or systems across a data center. AI processors also have large memory requirements, access-time limits, a split between analog and digital processing, and a hardware-software partition to decide.
Problem
Architecture exploration of AI applications is complex and involves multiple studies. To begin with, we can target a single problem such as memory access, or we can look at the full processor or system. Most designs start with memory access. There are many options: SRAM vs DRAM, local vs distributed storage, in-memory compute, and caching the back-propagation coefficients vs discarding them.
The second area to evaluate is the bus or network topology. The virtual prototype can have a Network-on-Chip, TileLink, or AMBA AXI bus for the processor internals; PCIe or Ethernet to connect the multi-processor boards and chassis; and Wi-Fi/5G/Internet routers to access the data center.
The third study using the virtual prototype is the compute. This can be modeled as processor cores, multi-processors, accelerators, FPGAs, multiply-accumulate units, and analog processing. The last piece is the interface to sensors, networks, math operations, DMA, custom logic, arbiters, schedulers, and control functions.
Moreover, the architecture exploration of AI processors and systems is challenging because data-intensive task graphs exercise the full capability of the hardware.
Model Construction
At Mirabilis, we use VisualSim for the architectural exploration of AI applications. VisualSim is a graphical discrete-event simulation platform with a large library of AI hardware and software modeling components, which lets users assemble a virtual prototype quickly. The prototype can be used to conduct timing, throughput, power consumption, and quality-of-service trade-offs. Over 20 templates of AI processors and embedded systems are provided to accelerate the development of new AI applications.
Reports generated for trade-off studies of AI systems include response times, throughput, buffer occupancy, average power, energy consumption, and resource efficiency.
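As a rough illustration of how a discrete-event model arrives at reports like these, the following Python sketch simulates a single resource with a first-come-first-served queue. It is not VisualSim code; the function name simulate_fifo, the Report fields, and every numeric value are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Report:
    mean_latency: float     # average time from arrival to completion
    throughput: float       # completed requests per unit of simulated time
    max_queue_depth: int    # peak number of requests waiting or in service
    energy: float           # active and idle energy over the whole run
    average_power: float    # energy divided by simulated time


def simulate_fifo(arrivals, service_time, active_power, idle_power):
    """Single-server FIFO queue (hypothetical model, simulation starts at t=0).

    arrivals must be a sorted list of request arrival times.
    """
    server_free_at = 0.0
    busy_time = 0.0
    latencies = []
    finish_times = []
    max_depth = 0
    for t in arrivals:
        start = max(t, server_free_at)      # wait if the server is busy
        finish = start + service_time
        # Requests still in the system at time t, including this one.
        depth = 1 + sum(1 for f in finish_times if f > t)
        max_depth = max(max_depth, depth)
        finish_times.append(finish)
        latencies.append(finish - t)
        busy_time += service_time
        server_free_at = finish
    total_time = server_free_at
    energy = active_power * busy_time + idle_power * (total_time - busy_time)
    return Report(
        mean_latency=sum(latencies) / len(latencies),
        throughput=len(arrivals) / total_time,
        max_queue_depth=max_depth,
        energy=energy,
        average_power=energy / total_time,
    )


if __name__ == "__main__":
    # Assumed workload: one request per time unit, two time units of service.
    print(simulate_fifo([0.0, 1.0, 2.0, 3.0], 2.0, active_power=2.0, idle_power=0.5))
```

In this assumed workload, requests arrive twice as fast as they can be served, so the latency rises with every request and the queue keeps deepening, which is the typical signature of a bottleneck.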
ADAS Model Construction
To begin with, consider the Advanced Driver Assistance System (ADAS) application for autonomous driving, a form of AI deployment shown in Figure 1. The ADAS application co-exists with a number of other applications on both the computer, or Electronic Control Unit (ECU), and the network. The ADAS task also depends on the sensors and actuators of the existing system to operate correctly.
Early architecture trade-offs can test hypotheses, quickly identify bottlenecks, and optimize the specification to meet the timing, throughput, power, and functional requirements. As Figure 1 shows, the architecture model requires the hardware, network, application tasks, sensors, actuators, and traffic stimulus to gain visibility into the operation of the complete system. Figure 2 shows the implementation of this ADAS logical architecture mapped to the physical architecture.
A useful feature of the architecture model is that all parts of the design are kept separate, so the performance of individual operations can be studied. In Figure 2, the existing tasks, the network with the ECUs, the sensor generation, and the organization of the ADAS logical tasks are each listed separately. Each function in the ADAS task graph is mapped to an ECU.
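The effect of mapping a task graph onto ECUs can be sketched in a few lines of Python. This is a simplified illustration, not the VisualSim model: the task names, execution times, ECU names, and the fixed network delay are all hypothetical, and each ECU is assumed to run one task at a time in the order the tasks are listed.

```python
def schedule(tasks, mapping, network_delay):
    """Compute the finish time of each task for a given task-to-ECU mapping.

    tasks:   dict name -> (exec_time, [predecessor names]), listed in
             topological order (predecessors appear before successors).
    mapping: dict name -> ECU name.
    network_delay: assumed fixed delay when producer and consumer sit on
             different ECUs.
    """
    ecu_free = {}   # time at which each ECU becomes available
    finish = {}
    for name, (exec_time, preds) in tasks.items():
        ecu = mapping[name]
        ready = 0.0
        for p in preds:
            hop = network_delay if mapping[p] != ecu else 0.0
            ready = max(ready, finish[p] + hop)
        start = max(ready, ecu_free.get(ecu, 0.0))
        finish[name] = start + exec_time
        ecu_free[ecu] = finish[name]
    return finish


# Hypothetical ADAS-style task graph (times in milliseconds).
TASKS = {
    "sense":   (1.0, []),
    "detect":  (4.0, ["sense"]),
    "track":   (3.0, ["sense"]),
    "plan":    (2.0, ["detect", "track"]),
    "actuate": (1.0, ["plan"]),
}

if __name__ == "__main__":
    single = {name: "ECU1" for name in TASKS}
    split = dict(single, detect="ECU2")   # move one task to a second ECU
    print(schedule(TASKS, single, 0.5)["actuate"])
    print(schedule(TASKS, split, 0.5)["actuate"])
```

Under these assumptions, moving the detect task to a second ECU lets it run in parallel with track, so the end-to-end latency drops even though two network hops are added.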
ADAS Analysis
When the ADAS model in Figure 2 is simulated, it produces a variety of reports. Figure 3 shows the latency to complete the ADAS tasks and the associated heat dissipated by the battery for this task. Other plots of interest include the measured power, network throughput, battery consumption, CPU utilization, and buffer occupancy.
Processor Model Construction
Designers of AI processors and systems conduct experiments with application type, training vs inference, cost point, power consumption, and size limitations. For example, designers can assign child networks to pipeline stages, trade off deep neural networks (DNNs) vs conventional machine learning algorithms, measure algorithm performance on GPUs, TPUs, AI processors, FPGAs, and conventional processors, evaluate the benefits of melding compute and memory on a chip, compute the power impact of analog techniques that resemble human brain functions, and build SoCs with a partial set of functions targeted at a single application.
The schedule from PowerPoint to the first prototype for new AI processors is extremely short, and the first production sample cannot have any bottlenecks or bugs. Hence, modeling becomes mandatory.
Figure 4 shows the internal view of the Google Tensor Processing Unit (TPU). The block diagram has been translated into an architecture model in Figure 5. The processor receives requests from a host computer via a PCIe interface. MM, TG2, TG3, and TG4 are different request streams from independent hosts. The weights are stored in off-chip DDR3 and fetched into the Weight FIFO. Arriving requests are stored and updated in the Unified Buffer and sent to the Matrix Multiply Unit for processing. Once a request has been processed through the AI pipeline, the result is returned to the Unified Buffer and sent back to the host.
Processor Model Analysis
In Figure 6, you can view the latency and the management of the back-propagation weights in the off-chip DDR3. The latency is measured from the time the host sends a request to the time it receives the response. TG3 and TG4 maintained a low latency until 200 us and 350 us respectively, while MM and TG2 started to buffer early in the simulation. Because there is considerable buffering and the latency keeps increasing for this set of traffic profiles, the current TPU configuration is inadequate for the load and the processing. The higher priority of TG3 and TG4 helped them sustain operation for a longer period.
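The general behavior described here, where higher-priority streams hold their latency while lower-priority streams back up once the offered load exceeds capacity, can be reproduced with a small priority-queue sketch in Python. This is not the TPU model from Figure 5: the stream names, periods, priorities, and service time are assumed values, and the server is modeled as a single non-preemptive resource.

```python
import heapq


def simulate_priority(streams, service_time, horizon):
    """Non-preemptive priority server fed by periodic request streams.

    streams: dict name -> (period, priority); a lower number means a
             higher priority. All values here are illustrative assumptions.
    Returns dict name -> list of latencies in completion order.
    """
    arrivals = []
    for name, (period, prio) in streams.items():
        t = 0.0
        while t < horizon:
            arrivals.append((t, prio, name))
            t += period
    arrivals.sort()

    latencies = {name: [] for name in streams}
    ready = []      # heap of (priority, arrival_time, name)
    now = 0.0
    i = 0
    while i < len(arrivals) or ready:
        if not ready and arrivals[i][0] > now:
            now = arrivals[i][0]            # server idle: jump to next arrival
        while i < len(arrivals) and arrivals[i][0] <= now:
            t, prio, name = arrivals[i]
            heapq.heappush(ready, (prio, t, name))
            i += 1
        prio, t, name = heapq.heappop(ready)    # highest priority, oldest first
        now += service_time
        latencies[name].append(now - t)
    return latencies


if __name__ == "__main__":
    # Offered load = 1.5/4 + 1.5/2 = 1.125, which exceeds the server capacity.
    result = simulate_priority({"high": (4.0, 0), "low": (2.0, 1)}, 1.5, 40.0)
    print("high:", result["high"])
    print("low: ", result["low"])
```

With these assumed numbers the high-priority stream never waits longer than one request already in service, while the latency of the low-priority stream grows steadily over the run.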
Automotive Design Construction
Today’s automotive designs incorporate a number of safety and autonomous driving features that require a significant amount of machine learning and inference. The available time budget determines whether the processing is done at the ECU or sent to a data center. For example, a braking decision must be made locally, while changing the air-conditioning temperature can be sent for remote processing. Both require some amount of artificial intelligence based on the input from sensors and cameras.
Figure 7 is a network block diagram that incorporates the ECU, CAN-FD, Ethernet, and Gateway.
Figure 8 captures a portion of Figure 7 that integrates the CAN-FD network with the high-performance NVIDIA Drive PX, which contains multiple ARM cores and a GPU. The Ethernet/TSN/AVB and Gateway have been removed from the model to simplify the view. In this model, the focus is on understanding the internal behavior of the SoC. The application is an MPEG video capture, processing, and rendering pipeline that is triggered by the camera sensors on the vehicle.
Automotive Design Analysis
Figure 9 shows the statistics for the AMBA bus and the DDR3 memory. You can see the distribution of the workload across multiple masters. The application pipeline can be evaluated for bottlenecks, the tasks with the highest cycle time, the memory usage profile, and the latency of each individual task.
The use cases and traffic patterns are applied to the architecture model assembled as a combination of hardware, RTOS, and networks. A periodic traffic profile is used to model the radars, lidars, and cameras, while the use case can be autonomous driving, chatbot, search, learning, inference, large data manipulation, image recognition, or disease detection. The use case and traffic can be varied for the input rates, data sizes, processing time, priority, dependency, prerequisites, back-propagation loops, coefficients, task graph, and memory accesses. The use case is simulated on the system model by varying these attributes. This generates a variety of statistics and plots, including cache hit-ratio, pipeline utilization, number of requests rejected, watts per instruction or task, throughput, buffer occupancy, and state diagrams.
Figure 10 shows the power consumption of both the system and the silicon. In addition to the heat dissipated, the battery charge consumption rate, and the battery lifecycle change, the model can capture the dynamic power change. The model plots the state activity of each device, the associated instantaneous spikes, and the average power of the system. Getting early feedback on power consumption helps the thermal and mechanical teams design the casing and cooling methods. Most chassis have a maximum power constraint for each board. This early power information can be used to trade off power against performance and to find ways to reduce power consumption.
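The relationship between device state activity, average power, and a per-board power limit can be made concrete with a short Python sketch. The state names, power values, durations, and board budget below are hypothetical and only illustrate the arithmetic; they are not taken from Figure 10.

```python
def power_report(trace, state_power):
    """Summarize one device from its state-activity trace.

    trace:       list of (duration_seconds, state_name) in time order.
    state_power: dict state_name -> power in watts (assumed values).
    Returns (energy_joules, average_power_watts, peak_power_watts).
    """
    total_time = sum(duration for duration, _ in trace)
    energy = sum(duration * state_power[state] for duration, state in trace)
    peak = max(state_power[state] for _, state in trace)
    return energy, energy / total_time, peak


def within_budget(device_reports, board_budget_watts):
    """Check the worst case where every device hits its peak power at once."""
    worst_case = sum(peak for _, _, peak in device_reports)
    return worst_case <= board_budget_watts


if __name__ == "__main__":
    states = {"active": 5.0, "idle": 1.0}
    cpu = power_report([(2.0, "active"), (6.0, "idle"), (2.0, "active")], states)
    gpu = power_report([(5.0, "active"), (5.0, "idle")], states)
    print(cpu, gpu, within_budget([cpu, gpu], board_budget_watts=12.0))
```

Summing the peaks is a deliberately conservative check; a time-aligned trace across devices would give a tighter estimate of the instantaneous spikes.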
Further Exploration Scenarios
The following are some additional examples highlighting the use of the AI architecture model and analysis.
1. An autonomous driving system with a 360-degree laser scanner, stereo camera, fisheye camera, millimeter-wave radar, sonars, or lidars connected to 20 ECUs on multiple IEEE 802.1Q networks linked via gateways. The prototype is used to test feature packages against the OEM's hardware configurations to determine the hardware and network requirements. The response time for an active safety action is the primary criterion.
2. An Artificial Intelligence processor for learning and inference tasks is defined using a Network-on-Chip backbone that is built up with 32 cores, 32 accelerators, 4 HBM2.0, 8 DDR5, multiple DMAs, and full cache coherency. This model experiments with variations of RISC-V, ARM Z1, and a proprietary core. The goal achieved was 40 Gbps on the links while maintaining a low Router frequency and retraining the network routing.
3. A 32-layer deep neural network needed its memory footprint reduced from 40 GB to less than 7 GB without changing the data throughput or response times. The model is set up as a functional flow diagram of the behavior, with the memory accesses for both the processing and the back-propagation. For different data sizes and task graphs, the model determined how much data to discard and evaluated various off-chip DRAM sizing and SSD storage options. The task graph was varied with an arbitrary number of graphs and several inputs and outputs.
4. A general-purpose SoC using ARM processors and an AXI bus for low-cost AI processing. The goal was to get the highest performance per watt while maximizing the memory bandwidth. The multiply-accumulate functions were off-loaded to vector instructions, encryption to an IP core, and the custom algorithms to accelerators. The model was constructed with the explicit purpose of evaluating different cache-memory hierarchies to increase the hit-ratio and different bus topologies to reduce latency.
5. An analog-digital AI processor requires a thorough analysis of the power consumption and an accurate analysis of the throughput achieved. In this model, the non-linear control was modeled in a discrete-event simulator as a series of linear functions to accelerate the simulation. In this case, the functionality was tested to check the behavior and measure true power savings.
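The technique mentioned in item 5, replacing a non-linear function with a series of linear segments, can be sketched in Python as follows. The choice of tanh as the non-linear function, the input range, and the segment counts are assumptions for illustration only; the actual control function in that model is not described here.

```python
import bisect
import math


def make_piecewise(f, lo, hi, segments):
    """Build a piecewise-linear approximation of f on [lo, hi].

    The function is sampled once at evenly spaced breakpoints; each later
    evaluation is a table lookup plus one linear interpolation.
    """
    xs = [lo + (hi - lo) * i / segments for i in range(segments + 1)]
    ys = [f(x) for x in xs]

    def approx(x):
        x = min(max(x, lo), hi)                     # clamp to the modeled range
        i = bisect.bisect_right(xs, x) - 1          # segment containing x
        i = min(max(i, 0), segments - 1)
        x0, x1 = xs[i], xs[i + 1]
        return ys[i] + (ys[i + 1] - ys[i]) * (x - x0) / (x1 - x0)

    return approx


def max_error(f, approx, lo, hi, samples=1001):
    """Largest absolute deviation between f and approx over sampled points."""
    points = [lo + (hi - lo) * k / (samples - 1) for k in range(samples)]
    return max(abs(f(x) - approx(x)) for x in points)


if __name__ == "__main__":
    for n in (4, 16, 64):
        approx = make_piecewise(math.tanh, -3.0, 3.0, n)
        print(n, max_error(math.tanh, approx, -3.0, 3.0))
```

The number of segments is the knob that trades accuracy against simulation speed: more segments track the curve more closely at the cost of a larger table.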
Have a question, or would you like to know more? Reach out to us!