AI Accelerators — Part IV: The Very Rich Landscape - Adi Fuchs

## Metadata
- Author: **Adi Fuchs**
- Full Title: AI Accelerators — Part IV: The Very Rich Landscape
- Category: #articles
- Tags: #ai-hardware #hardware
- URL: https://medium.com/@adi.fu7/ai-accelerators-part-iv-the-very-rich-landscape-17481be80917
## Highlights
- The takeaways are twofold: (i) Intel believes that the value proposition of AI (in the datacenter) has greatly increased; (ii) Intel believes that AI is so important that it was willing to shift its focus away from the Nervana project, at the cost of millions of acquisition dollars and years of human engineering, and aim for what it believes is a more promising solution. ([View Highlight](https://read.readwise.io/read/01hmjna1edcevqs4wsvb1r5sj9))
- “If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” (Seymour Cray) ([View Highlight](https://read.readwise.io/read/01hmjnbvkmgxw20rjsq7960pet))
- Compared to the A100, Cerebras’ second-generation WSE (WSE-2) houses 123x more cores and 1,000x more memory at 56x the size. While they did not explicitly disclose any power consumption numbers, they are assumed to be in the 15–16kW range, which is about 37–40x more power than the A100’s 400W power envelope. Their approach is that the chip is large enough that even big models can fit in the WSE-2’s 40GB of on-chip memory, which is the same size as the first A100’s *off-chip* memory (later upgraded to 80GB). ([View Highlight](https://read.readwise.io/read/01hmjnxax9a2xkzz7fx44zjyyq))
- The IPU has a “Bulk Synchronous Parallel” execution model: all of the chip’s tiles operate synchronously according to three general execution phases. (i) *Compute:* all tiles perform the mathematical computations specified by their assigned threads. (ii) *Sync:* a phase in which we wait for all tiles to finish their execution. (iii) *Exchange:* all the computed output data is written to the exchange memory and, if needed, is used as input by (potentially) other tiles in the next Compute-Sync-Exchange round (see the barrier sketch after this list). ([View Highlight](https://read.readwise.io/read/01hmjp4q5ccfhb833ckznaewkr))
- Wave Computing, SambaNova, and SimpleMachines are three startups that have presented accelerator chips whose foundations combine two concepts (overviewed in the previous chapter): (i) *reconfigurability:* from a processor taxonomy point of view, their accelerators are classified as *Coarse-Grained Reconfigurable Arrays* (CGRAs), originally suggested in [1996](https://courses.cs.washington.edu/courses/cse591n/07au/papers/Matrix_MirskyDehon1996.pdf) ([View Highlight](https://read.readwise.io/read/01hmm8kvmb2mvr0ksgx0a1npsz))
- The RDU chip contains an array of compute units (called “PCUs”) and scratchpad memory units (called “PMUs”) organized in a 2D-mesh hierarchy interconnected with NoC (network-on-chip) switches. The RDU accesses off-chip memory using a hierarchy of units called AGUs and CUs (a toy model of such a fabric appears after this list). ([View Highlight](https://read.readwise.io/read/01hmm90wf4fd5ks9g72pbh5zpd))
- SimpleMachines was founded in 2017 by a group of academic researchers from the University of Wisconsin. Their group has been exploring reconfigurable architectures that rely on heterogeneous datapaths combining both von Neumann (instruction-by-instruction) and non-von Neumann (i.e., dataflow) execution ([View Highlight](https://read.readwise.io/read/01hmm94ge454xqqyjt3p4mcr6z))
- SimpleMachines tries to expand the model further to support both traditional (von Neumann) and non-traditional datapaths under the concept of “composable computing” (i.e., dividing a configurable fabric into sub-functional datapaths and fusing them to provide a mix of generic and domain-specific acceleration; a loose sketch of the idea follows this list). ([View Highlight](https://read.readwise.io/read/01hmm973b119ytpeajn284v5c6))
- Hailo’s main line of products is based on the Hailo-8 accelerator chip, which supports a throughput of up to 26 TOPS at a typical power consumption of 2.5W (no data was provided on max power; the implied efficiency is worked out after this list). It is sold on printed circuit boards (PCBs) connected via PCIe as an external card attached to an existing computer system. Hailo currently has three offerings that differ in form factor and PCIe interface width. ([View Highlight](https://read.readwise.io/read/01hmm9m32z75c0kafgqfzak2ap))
- [The motivation for the TPU](https://cloud.google.com/blog/products/gcp/quantifying-the-performance-of-the-tpu-our-first-machine-learning-chip) was a study done internally by Google’s scientists, which concluded that the rise in computing demands of Google Voice search alone would soon require them to double their datacenter capacity if they were to rely on traditional CPUs and GPUs. Therefore, in 2015 they started working on their own accelerator chip designed to target their internal DNN workloads, such as speech and text recognition. ([View Highlight](https://read.readwise.io/read/01hmm9reg8ce8pgyd1pw43551x))
- The first-generation TPU was built for inference-only workloads in Google’s datacenters, and combined a systolic array called a “Matrix-Multiply Unit” with a VLIW architecture (a toy systolic-array simulation appears after this list). In later generations, Google’s engineers designed the TPUv2 and TPUv3 to do AI training, using larger matrix-multiply units and adding new tensor units like the transpose-permute unit. ([View Highlight](https://read.readwise.io/read/01hmm9tpjvwz39zn9p08b491k3))
- As such, Google tailored the TPUs to its specific needs; it is not particularly aiming for massive commercialization of TPUs or head-to-head competition with the other companies. Therefore, in 2016 a team of some of the TPU’s architects decided to leave Google to design a new processor with baseline characteristics similar to the TPU’s and commercialize it in a new startup called “Groq.” ([View Highlight](https://read.readwise.io/read/01hmm9xk68xz77cqafjtk4y6bj))
- Given that it takes at least two years from initial planning to a working chip (and that assumes you already have a team, rather than building one at your new startup, and are lucky enough to have a successful tape-out), they operated on very ambitious timelines. ([View Highlight](https://read.readwise.io/read/01hmma1ann6xvpv40xkk38532n))
- Importantly, Gaudi has a built-in engine for Remote DMA (RDMA, or more accurately RDMA over Converged Ethernet, or “RoCE”; let’s not go there at this time). The benefit of RDMA is the ability to perform *direct accesses (with no CPU or operating system involved)* to memory spaces of other systems in the datacenter. ([View Highlight](https://read.readwise.io/read/01hmmajxsvjhs9qcsckpn8z38k))
- One of the main problems of AI is how to find the sweet spot between efficiency and accuracy ([View Highlight](https://read.readwise.io/read/01hmmarrred7aq56rfmzkrv9yn))
- While 32-bit floating-point (FP32) variables are accurate, they cost a lot of computing power, and while 16-bit floating-point (FP16) variables are cheap, they aren’t sufficiently accurate. In 2018, a team of engineers from Google Brain discovered that the main problem with FP16 was that its representation range was not wide enough, so they invented the [“Brain float16”](https://cloud.google.com/tpu/docs/bfloat16) (BF16) standard, which uses fewer bits for the “mantissa” (which determines the smallest fraction you can represent) and more bits for the “exponent” (which determines the largest absolute values you can express, i.e., the number range); a bit-level sketch follows this list. ([View Highlight](https://read.readwise.io/read/01hmmavby500e5s68k4zyj2pzj))
- Habana released its Goya inference performance results [in November of 2019](https://mlcommons.org/en/inference-datacenter-05/) and went head to head with big corporations like NVIDIA and Google, and, importantly, beat Intel’s datacenter inference processor, the NNP-I. A month after the MLPerf results were published, [Intel acquired Habana for 2 billion dollars to replace the existing solutions](https://www.forbes.com/sites/moorinsights/2019/12/16/intel-acquires-habana-labs-for-2b/?sh=761b336b19f9), so it definitely seems that Habana’s MLPerf approach paid off (pun intended — sorry, couldn’t help myself.) ([View Highlight](https://read.readwise.io/read/01hmmb0sc81rb9jdnd5aay379b))
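
To make the Bulk Synchronous Parallel highlight concrete, here is a minimal Python sketch of the Compute-Sync-Exchange loop, with a `threading.Barrier` playing the role of the IPU’s Sync phase. The tile count, the toy workload, and the ring-neighbor exchange pattern are illustrative assumptions, not Graphcore’s actual programming model.

```python
import threading

NUM_TILES = 4
STEPS = 3

# The barrier plays the role of the Sync phase: no tile moves on
# until every tile has reached the same point.
sync = threading.Barrier(NUM_TILES)

# Toy "exchange memory": values published in step t are consumed
# by other tiles in step t + 1.
exchange = [0] * NUM_TILES

def tile(tid: int) -> None:
    local = tid                      # each tile starts with its own data
    for _ in range(STEPS):
        local = local * 2 + 1        # Compute: purely local work
        sync.wait()                  # Sync: wait for all tiles to finish computing
        exchange[tid] = local        # Exchange: publish this tile's output...
        sync.wait()                  # ...and wait until everyone has published
        local = exchange[(tid + 1) % NUM_TILES]  # consume a neighbour's output

threads = [threading.Thread(target=tile, args=(t,)) for t in range(NUM_TILES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(exchange)  # deterministic: no tile ever races ahead of the barrier
```

The second barrier is what makes the model "bulk synchronous": exchange must fully complete before any tile begins the next compute phase, so there are no data races by construction.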
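The RDU highlight describes a checkerboard-like fabric of compute and memory units. Below is a toy Python model of such a 2D mesh; the unit names (PCU, PMU) follow the article, but the alternating layout and nearest-neighbor links are my own illustrative assumptions, not SambaNova’s actual floorplan.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    kind: str   # "PCU" (compute) or "PMU" (scratchpad memory)
    row: int
    col: int
    links: list = field(default_factory=list)  # neighbours via NoC switches

def build_fabric(rows: int, cols: int) -> list:
    # Alternate PCUs and PMUs on a checkerboard, so each compute unit
    # sits next to scratchpad memory.
    grid = [[Unit("PCU" if (r + c) % 2 == 0 else "PMU", r, c)
             for c in range(cols)] for r in range(rows)]
    # Wire up the 2D mesh: each unit links to its 4 nearest neighbours.
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    grid[r][c].links.append(grid[nr][nc])
    return grid

fabric = build_fabric(4, 4)
pcu = fabric[0][0]
print(pcu.kind, "connects to", [u.kind for u in pcu.links])
# A PCU's mesh neighbours are PMUs, so compute always has
# single-hop access to scratchpad memory.
```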
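The “composable computing” highlight can also be sketched loosely in code: split the fabric into sub-datapaths, some generic (programmable) and some fixed-function, then fuse them into one pipeline. The names and structure here are hypothetical illustrations, not SimpleMachines’ API.

```python
from typing import Callable
import numpy as np

# A "datapath" is anything that transforms a tensor.
Datapath = Callable[[np.ndarray], np.ndarray]

def generic(fn: Callable) -> Datapath:
    return fn  # a programmable slice of the fabric: runs arbitrary code

def matmul_unit(W: np.ndarray) -> Datapath:
    return lambda x: x @ W  # a fixed-function slice: only multiplies by W

def fuse(*stages: Datapath) -> Datapath:
    # "Fusing" sub-datapaths: chain them so data streams stage to stage.
    def fused(x: np.ndarray) -> np.ndarray:
        for stage in stages:
            x = stage(x)
        return x
    return fused

# Mix domain-specific acceleration (matmul) with generic compute (ReLU).
W = np.random.randn(8, 4).astype(np.float32)
layer = fuse(matmul_unit(W), generic(lambda x: np.maximum(x, 0)))
print(layer(np.random.randn(2, 8).astype(np.float32)).shape)  # (2, 4)
```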
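The Hailo-8 highlight implies a power-efficiency figure worth making explicit. Note this uses the quoted *typical* power, since max power was not disclosed:

```python
# Efficiency implied by the quoted Hailo-8 numbers.
tops = 26     # peak throughput, TOPS
watts = 2.5   # typical (not max) power, W
print(f"{tops / watts:.1f} TOPS/W")  # -> 10.4 TOPS/W
```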
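The TPU highlight mentions a systolic “Matrix-Multiply Unit.” Below is a minimal cycle-by-cycle Python simulation of an output-stationary systolic array, the textbook form of the idea: operands enter skewed, so `A[i, p]` and `B[p, j]` meet at processing element `(i, j)` at cycle `p + i + j`. This is a generic illustration, not the TPU’s actual microarchitecture.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Toy cycle-by-cycle simulation of an output-stationary systolic array."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    # A[i, p] flows rightward and B[p, j] flows downward; the skewed feed
    # means they pass through PE (i, j) exactly at cycle t = p + i + j,
    # where the PE does one multiply-accumulate into its local C[i, j].
    for t in range(n + m + k - 2):      # enough cycles to drain the array
        for i in range(n):
            for j in range(m):
                p = t - i - j
                if 0 <= p < k:          # an (A, B) pair is passing through
                    C[i, j] += A[i, p] * B[p, j]
    return C

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

Every PE does at most one multiply-accumulate per cycle and only talks to its neighbors, which is exactly what makes systolic arrays so dense and power-efficient for matrix multiplication.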
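Finally, the BF16 trade-off from the highlight above is easy to see at the bit level: BF16 is simply the top 16 bits of an FP32 value, keeping the full 8-bit exponent (so FP32’s dynamic range) but only 7 of the 23 mantissa bits. A minimal NumPy sketch; real hardware typically rounds to nearest-even rather than truncating as done here.

```python
import numpy as np

def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Keep only the top 16 bits of each FP32 value: 1 sign bit, the full
    8-bit exponent, and 7 of the 23 mantissa bits (truncation, for brevity)."""
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_fp32(b: np.ndarray) -> np.ndarray:
    """Re-expand BF16 bit patterns to FP32 by zero-filling the low 16 bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159265], dtype=np.float32)
print(x[0], "->", bf16_bits_to_fp32(fp32_to_bf16_bits(x))[0])
# 3.1415927 -> 3.140625: only ~2-3 decimal digits of precision survive,
# but the exponent is untouched, so BF16 covers FP32's full ~1e-38..3e38
# range, whereas FP16's 5-bit exponent overflows just above 65504.
```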