# LLaMA: Open and Efficient Foundation Language Models

![rw-book-cover|200x400](https://readwise-assets.s3.amazonaws.com/media/uploaded_book_covers/profile_40759/VjNfqthZNsorEKZCYZaL474wxNx8x66lmadozxTOswY-cover_KuzntIj.png)

## Metadata

- Author: **Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample**
- Full Title: LLaMA: Open and Efficient Foundation Language Models
- Category: #articles
- URL: https://arxiv.org/pdf/2302.13971.pdf

## Highlights

- Pre-normalization [GPT3]. To improve the training stability, we normalize the input of each transformer sub-layer, instead of normalizing the output. We use the RMSNorm normalizing function, introduced by Zhang and Sennrich (2019). ([View Highlight](https://read.readwise.io/read/01h58rhzpbs7qp1jerjqbcr1sp)) (see the RMSNorm/pre-norm sketch below this list)
- SwiGLU activation function [PaLM]. We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020) to improve the performance. We use a dimension of $\frac{2}{3}4d$ instead of $4d$ as in PaLM. ([View Highlight](https://read.readwise.io/read/01h58rj1xta17147k831k9ek15)) (see the SwiGLU sketch below this list)
- Rotary Embeddings [GPTNeo]. We remove the absolute positional embeddings, and instead, add rotary positional embeddings (RoPE), introduced by Su et al. (2021), at each layer of the network. The details of the hyper-parameters for our different models are given in Table 2. ([View Highlight](https://read.readwise.io/read/01h58rj43dm7fk1ff1m926b3w6)) (see the RoPE sketch below this list)
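The first highlight combines two choices: normalize the *input* of each sub-layer rather than its output, and use RMSNorm instead of LayerNorm. A minimal PyTorch sketch of what that could look like, assuming a generic `sublayer` stand-in for attention or the feed-forward block (class names and the `eps` value are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (Zhang and Sennrich, 2019): rescale by the
    RMS of the features with a learned gain; no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x^2) + eps), computed over the feature dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class PreNormBlock(nn.Module):
    """Pre-normalization: the sub-layer sees a normalized input, while the
    residual stream itself is left untouched: x + sublayer(norm(x))."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))
```

With pre-normalization the residual path never passes through a normalization layer, which is the property the highlight credits with improved training stability.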
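For the SwiGLU highlight, the feed-forward block becomes a gated unit, and the hidden width is set to $\frac{2}{3}4d$ so that its three weight matrices hold roughly as many parameters as a standard two-matrix FFN with hidden width $4d$. A sketch under those assumptions (the `w1`/`w2`/`w3` naming and the bias-free `Linear` layers are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """FFN(x) = W2( SiLU(W1 x) * W3 x ), with hidden size 2/3 * 4d so the
    parameter count stays close to a ReLU FFN with hidden size 4d."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * (4 * dim) / 3)               # 2/3 * 4d
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```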
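For the rotary-embedding highlight, position information is injected inside each attention layer by rotating consecutive query/key feature pairs through position-dependent angles, rather than adding an absolute position vector to the token embeddings. A minimal sketch of that rotation in the spirit of Su et al. (2021); the function name and the base frequency of 10000 are assumptions, not taken from this paper:

```python
import torch

def rope_rotate(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq_len, dim), dim even.
    Each feature pair (x[2i], x[2i+1]) at position p is rotated by the angle
    p * base**(-2i/dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    # one frequency per feature pair
    inv_freq = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    # angle for every (position, pair) combination: shape (seq_len, half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # even / odd features
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Queries and keys are rotated before the dot product, so attention scores
# depend only on relative positions, e.g.: q, k = rope_rotate(q), rope_rotate(k)
```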