AI Accelerators — Part III: Architectural Foundations - Adi Fuchs

## Metadata
- Author: **Adi Fuchs**
- Full Title: AI Accelerators — Part III: Architectural Foundations
- Category: #articles
- Tags: #ai-hardware #hardware
- URL: https://medium.com/@adi.fu7/ai-accelerators-part-iii-architectural-foundations-3f1f73d61f1f
## Highlights
- one of the main lessons learned from the CISC vs. RISC war is that simplicity, and specifically simplifying instruction decoding, contributes significantly to hardware efficiency; hence RISC has been more favorable (at least for smartphones). ([View Highlight](https://read.readwise.io/read/01hkdmydxq23jz8pwzzsfgnp3g))
- An ISA describes how instructions and operations are encoded by the compiler and are later decoded and executed by the processor. ([View Highlight](https://read.readwise.io/read/01hjsy1r2tk17tyx6a9898j8r6))
- In AI applications, which typically consist of linear algebra and non-linear activations, there is no need for many “exotic” types of operations. Therefore, the ISA can be designed to support a relatively narrow operation scope. ([View Highlight](https://read.readwise.io/read/01hkdn623kvbgtqnpfdswenq93)) (A sketch of what such a narrow op set might look like follows this list.)
- The benefit of using a reduced version of an existing RISC ISA is that some RISC companies (like ARM or SiFive) sell *IPs*: existing processing cores that support the ISA (or some subset of it) and can serve as a baseline for the customized processing core used in the accelerator chip. This way, accelerator vendors can rely on a baseline design that has already been verified and potentially deployed in other systems. It is a more solid alternative to designing a new architecture from scratch, and it is particularly appealing for startups that have limited engineering resources, want the support of an existing processing ecosystem, or want to shorten ramp-up times. ([View Highlight](https://read.readwise.io/read/01hkdn670zxa2scrhtpek6dkr0))
- VLIW architectures consist of a heterogeneous datapath array of arithmetic and memory units. The heterogeneity stems from the differences in timing and supported functionality of each unit: for example, while computing the outcome of a simple logical operation can take 1–2 cycles, a memory operation can take hundreds of cycles. ([View Highlight](https://read.readwise.io/read/01hkdnbdbq0348ayynj7p8j02k))
- VLIW architectures rely on a compiler that combines multiple operations into a single, complex instruction that dispatches data to the units in the datapath array. For example, in AI accelerators, an instruction could point a tensor to a matrix multiply unit and, in parallel, send data portions to a vector unit, a transpose unit, and so on. ([View Highlight](https://read.readwise.io/read/01hkdp2tzvpc532pdt7meykwtb)) (A sketch of such an instruction bundle follows this list.)
- The systolic structure is an efficient way of performing matrix multiplications (which DNN workloads have an abundance of); the partial multiplications and accumulations are performed in parallel and in a pre-determined, orderly fashion. The TPU was the first widespread use of systolic arrays for AI. Consequently, several other companies have integrated systolic execution units in their acceleration hardware, like NVIDIA’s [Tensor Cores](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf). ([View Highlight](https://read.readwise.io/read/01hkdpbb4dqykqy9nrxdsr8qb8)) (A small simulation of a systolic matrix multiply follows this list.)
- The most common class of reconfigurable processors is the *Field-Programmable Gate Array* (FPGA). FPGAs support a wide computational spectrum by enabling *bit-level* configurability: the arithmetic units can be configured to implement functions that operate on numbers of arbitrary widths, and the on-chip memory blocks can be fused to construct memory spaces of varied sizes ([View Highlight](https://read.readwise.io/read/01hkdpdvcdqmebb8vt6wgmwx7c))
- An upside of reconfigurable processors is that they can model chip designs written in hardware description languages (HDLs); this gives companies the ability to test their designs within a few hours instead of taping out chips, a process that can take months or even years. The downside of FPGAs is that fine-grained bit-level configurability is inefficient: typical compilation times can take many hours, and the extra wiring needed takes up significant space and is also energetically wasteful. Therefore, FPGAs are commonly used for prototyping a design before it gets taped out, as the resulting chip is more performant and more efficient than its FPGA equivalent. ([View Highlight](https://read.readwise.io/read/01hkdpfc6bpgsmr3yvh9vqegwh))
- Compared to FPGAs, CGRAs do not support bit-level configurability and typically have more rigid structures and interconnection networks. CGRAs have a high degree of reconfigurability, but at a coarser granularity than FPGAs (they sacrifice the finer-grained bit-level configurability, as it might not be necessary). ([View Highlight](https://read.readwise.io/read/01hke3xgzhjn98vyfkztten516))
- from an energetic point of view, memory access costs are an acute problem in many AI models. Moving the data around to and from the main memory is a few orders of magnitude more costly than doing the actual computation. ([View Highlight](https://read.readwise.io/read/01hkgbfhthtxs1g9vspmcff1zg)) (A back-of-the-envelope illustration follows this list.)
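
To make the "narrow operation scope" highlight concrete, here is a minimal sketch of what a reduced accelerator op set might look like. The `Op` enum and its member names are purely illustrative assumptions, not taken from any real accelerator ISA.

```python
from enum import Enum, auto

class Op(Enum):
    """A hypothetical, deliberately narrow accelerator operation set.

    A handful of linear-algebra and activation primitives covers most DNN
    layers, so no "exotic" CISC-style instructions are needed.
    """
    MATMUL = auto()        # dense matrix multiply
    CONV2D = auto()        # 2D convolution (often lowered to MATMUL)
    ELEMWISE_ADD = auto()  # bias add / residual connections
    RELU = auto()          # non-linear activation
    SOFTMAX = auto()       # output normalization
    LOAD = auto()          # move a tile from off-chip memory into local SRAM
    STORE = auto()         # write a result tile back to off-chip memory
```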
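
For the VLIW highlight, the sketch below shows the idea of a compiler statically packing independent operations into one wide instruction, one operation per functional-unit slot. The slot names (matmul, vector, transpose, load) and the assembly-like strings are hypothetical, for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VLIWBundle:
    """One wide instruction: each field is a slot for a different unit in the
    heterogeneous datapath array. Empty slots are no-ops for that unit."""
    matmul: Optional[str] = None     # matrix-multiply unit slot
    vector: Optional[str] = None     # vector (elementwise) unit slot
    transpose: Optional[str] = None  # transpose unit slot
    load: Optional[str] = None       # memory unit slot

# The compiler, not the hardware, guarantees that the packed operations are
# independent, so all of them can be dispatched in the same cycle.
bundle = VLIWBundle(
    matmul="MMU.mul  t0, t1 -> acc0",
    vector="VEC.relu t2 -> t3",
    load="MEM.ld   hbm_tile_42 -> t6",
)
```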
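
The systolic-array highlight can be illustrated with a small simulation of an output-stationary array computing C = A·B: operand streams are skewed so that A[i, s] and B[s, j] meet at processing element (i, j) on cycle s + i + j, and each PE accumulates its own output element. This is a sketch of the dataflow idea, not the TPU's actual microarchitecture.

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    Rows of A stream in from the left and columns of B stream in from the
    top, each skewed by one cycle per row/column, so the operand pair
    (A[i, s], B[s, j]) reaches PE (i, j) exactly at cycle t = s + i + j.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    cycles = n + m + k - 2  # last operand pair arrives at cycle n + m + k - 3
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j  # which partial product (if any) arrives now
                if 0 <= s < k:
                    C[i, j] += A[i, s] * B[s, j]  # one MAC per PE per cycle
    return C

A = np.arange(6, dtype=np.float64).reshape(2, 3)
B = np.arange(12, dtype=np.float64).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```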
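
Finally, a back-of-the-envelope illustration of the last highlight. The picojoule figures below are rough ballpark numbers assumed for illustration (in the spirit of commonly cited estimates for older process nodes), not measurements from the article; only the ratio matters here.

```python
# Assumed, illustrative energy costs (order of magnitude only):
ENERGY_MAC_PJ = 1.0            # one on-chip multiply-accumulate
ENERGY_DRAM_ACCESS_PJ = 600.0  # one operand fetched from off-chip DRAM

# A naive MAC that fetches both operands from main memory spends far more
# energy moving data than computing with it:
movement = 2 * ENERGY_DRAM_ACCESS_PJ
compute = ENERGY_MAC_PJ
print(f"data movement is roughly {movement / compute:.0f}x the compute energy")
# Keeping operands in on-chip buffers and reusing them (as systolic arrays
# and large local SRAMs do) is what closes this gap.
```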