#ai #computing #llm

# [[Epistemic status]]
#shower-thought #to-digest

# Related
- [[Seeker search augmented conversational bot]]
- [[BlenderBot2]]
- [[GPT3]]
- [[Computing/Intelligence/Machine Learning/Vision Transformer]]
- [[Computing/Intelligence/Machine Learning/Attention]]
- [[Computing/Intelligence/Machine Learning/Reformer]]
- [[Computing/NIPS 2022]]
- [[Computing/Intelligence/Machine Learning/Scalable attention]]

# Transformer

The transformer model is a bit more complex, but let's use the analogy of a group of detectives solving a crime.

**Step 1: Gathering evidence (Input Embedding):** Each detective arrives at the crime scene and picks up a different piece of evidence, such as a knife, a torn piece of fabric, or a footprint. These pieces of evidence are like the input words to the transformer model, each carrying some meaning.

**Step 2: Sharing insights ([[Self attention]]):** The detectives then come together and share their findings. They are not simply stating what they found; they also weigh how important each piece of evidence is given the context of the crime scene. For example, if they are solving a stabbing case, the knife may matter more than the footprint. This is analogous to the self-attention mechanism in transformers, where different parts of the input interact with each other to determine how much attention each word should receive given the overall context.

**Step 3: Pooling resources (Feed-Forward and Layer Stacking):** After each round of discussion, the detectives pool their insights and form a collective understanding of the crime. They then repeat this process (discussion, pooling), building on the insights from previous rounds. This mirrors the feed-forward layers and the stacking of multiple layers in the transformer model, allowing it to progressively refine its understanding of the input.
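The "sharing insights" step above can be sketched as scaled dot-product self-attention. This is a minimal single-head sketch in NumPy; the dimensions, random weights, and function names are illustrative assumptions, not a real library API.

```python
# Minimal sketch of single-head scaled dot-product self-attention.
# Sizes and random weights are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model). Every token attends to every other token."""
    q = x @ w_q                  # queries: what each token is looking for
    k = x @ w_k                  # keys: what each token offers
    v = x @ w_v                  # values: the information actually passed on
    d_k = q.shape[-1]
    # Attention weights: how relevant each token is to each other token,
    # like detectives weighing each piece of evidence in context.
    scores = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return scores @ v            # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # hypothetical toy sizes
x = rng.normal(size=(seq_len, d_model))      # stand-in for embedded tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8): one context-refined vector per input token
```

Stacking this with a small feed-forward layer, repeated several times, gives the "rounds of discussion" of Step 3.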
**Step 4: Making predictions (Output):** After several rounds of discussion and pooling, the detectives are ready to name a suspect. In the transformer model, this corresponds to the output, where the model generates a prediction (e.g., the next word in a sentence) based on its comprehensive understanding of the input.

This analogy simplifies a great deal, but it captures the high-level process a transformer goes through: each part of the input is considered in relation to the others, and this combined understanding is refined over several stages to produce an output.

One way to understand the difference between a Transformer and a Recurrent Neural Network (RNN) is to think of a Transformer as a group discussion, where everyone is present and contributes simultaneously, while an RNN is more like a relay race, where each runner passes the baton to the next and can only contribute while holding it. In other words, a Transformer can consider all the information at once, while an RNN processes information sequentially. This simultaneous processing lets the Transformer handle long-range dependencies more effectively than an RNN.

![[Pasted image 20221018110526.png]]
![[Pasted image 20221018144907.png]]

### Vision Transformer
https://openreview.net/pdf?id=YicbFdNTTy

![[Vision Transformer.png]]

## Challenges
Challenges to address include external memory, computational complexity, controllability, and alignment with human brain function.
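The relay-race vs group-discussion contrast above can be sketched in code. Both toy functions below just average information; the shapes and update rules are assumptions chosen to show the data flow, not a real RNN or Transformer implementation.

```python
# Toy contrast: sequential (RNN-style) vs simultaneous (Transformer-style)
# processing. The arithmetic is trivial on purpose; the data flow is the point.
import numpy as np

def rnn_style(tokens):
    """Relay race: a single hidden state is handed from token to token."""
    h = np.zeros_like(tokens[0])
    for t in tokens:               # strictly sequential: step i waits for i-1
        h = 0.5 * h + 0.5 * t      # pass the baton to the next runner
    return h                       # only the final state summarises everything

def transformer_style(tokens):
    """Group discussion: every token sees every other token at once."""
    x = np.stack(tokens)                                    # (seq_len, d)
    weights = np.full((len(tokens), len(tokens)), 1.0 / len(tokens))
    return weights @ x             # all pairwise interactions in one matmul

tokens = [np.ones(3) * i for i in range(1, 5)]  # toy "sentence" of 4 tokens
h_final = rnn_style(tokens)        # long-range info must survive every handoff
ctx = transformer_style(tokens)    # every position gets the full context directly
print(h_final.shape, ctx.shape)    # (3,) (4, 3)
```

In the RNN loop, information from the first token must survive every intermediate handoff to influence the result, whereas the single matrix multiply gives every position direct access to every other, which is why long-range dependencies are easier for the Transformer.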