LLaMA: Open and Efficient Foundation Language Models - Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample

## Metadata
- Author: **Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample**
- Full Title: LLaMA: Open and Efficient Foundation Language Models
- Category: #articles
- URL: https://arxiv.org/pdf/2302.13971.pdf
## Highlights
- Pre-normalization [GPT3]. To improve the training stability, we normalize the input of each transformer sub-layer, instead of normalizing the output. We use the RMSNorm normalizing function, introduced by Zhang and Sennrich (2019). ([View Highlight](https://read.readwise.io/read/01h58rhzpbs7qp1jerjqbcr1sp)) (see the RMSNorm sketch after these highlights)
- SwiGLU activation function [PaLM]. We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020) to improve the performance. We use a dimension of 2/3 · 4d instead of 4d as in PaLM. ([View Highlight](https://read.readwise.io/read/01h58rj1xta17147k831k9ek15)) (see the SwiGLU sketch after these highlights)
- Rotary Embeddings [GPTNeo]. We remove the absolute positional embeddings, and instead, add rotary positional embeddings (RoPE), introduced by Su et al. (2021), at each layer of the network. The details of the hyper-parameters for our different models are given in Table 2. ([View Highlight](https://read.readwise.io/read/01h58rj43dm7fk1ff1m926b3w6)) (see the RoPE sketch after these highlights)