Mistralai/Mistral-Src: Reference Implementation of Mistral AI 7B V0.1 Model. - github.com

## Metadata
- Author: **github.com**
- Full Title: Mistralai/Mistral-Src: Reference Implementation of Mistral AI 7B V0.1 Model.
- Category: #articles
- URL: https://github.com/mistralai/mistral-src/tree/main
## Highlights
- The number of operations in attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
To alleviate this issue, we use sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3); see the mask sketch below.
- Note that tokens outside the sliding window still influence next-word prediction.
At each attention layer, information can move forward by at most W tokens: after two attention layers, information can move forward by 2W tokens, and so on.
For instance, in a sequence of length 16K with a sliding window of 4K, information has propagated to the full sequence length after 4 layers; see the layer-count sketch below.
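A minimal PyTorch sketch of the sliding-window causal mask described in the first highlight, assuming a boolean mask applied before the attention softmax. The function name `sliding_window_mask` and the sequence length of 8 are illustrative choices, not code from the repository, which has its own attention utilities:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: query position i may attend to key positions j
    with i - window < j <= i (causal, at most `window` tokens back)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, row vector
    return (j <= i) & (j > i - window)

# W = 3, as in the highlight above
print(sliding_window_mask(seq_len=8, window=3).int())
```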
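The propagation argument in the second highlight can be checked with a small helper: after n layers a token can receive information from at most n*W positions back, so covering a full sequence needs roughly seq_len / W layers. The helper below is an illustration of that arithmetic, not code from the repository:

```python
import math

def layers_to_cover(seq_len: int, window: int) -> int:
    """Stacked sliding-window layers needed for information from the first
    token to reach the last: n * W must be at least seq_len - 1."""
    return math.ceil((seq_len - 1) / window)

# Example from the highlight: 16K-token sequence, 4K window
print(layers_to_cover(seq_len=16 * 1024, window=4 * 1024))  # -> 4
```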