# Metadata
Source URL:: https://github.com/mistralai/mistral-src/tree/main
---
# mistralai/mistral-src: Reference implementation of Mistral AI 7B v0.1 model.
## Highlights
> [!quote]+ Updated on Sun Oct 01 2023 13:34:41 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
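As a rough illustration of the masking described in this highlight, the sketch below builds a boolean sliding-window causal mask in PyTorch. This is my own minimal example, not code from mistral-src, and it assumes the window of W positions includes the current token (the README's phrasing leaves that convention open).

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: query position i may attend to key positions j
    with i - window < j <= i, i.e. itself plus at most window-1 past tokens.
    (Assumed convention; the README only says "at most W tokens in the past".)"""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, row vector
    return (j <= i) & (j > i - window)

# Example: seq_len=6, window=3 reproduces the W=3 pattern described above.
print(sliding_window_mask(6, 3).int())
```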
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
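To make the propagation arithmetic concrete, here is a tiny helper (my own illustration, not from the repository) that computes the minimum number of layers after which information from the first token can, in principle, reach the last one, assuming each layer moves information forward by at most W positions.

```python
import math

def layers_to_cover(seq_len: int, window: int) -> int:
    """Minimum number of attention layers needed so that information can
    propagate across the whole sequence when each layer advances it by
    at most `window` tokens."""
    return math.ceil((seq_len - 1) / window)

# Matches the highlighted example: a 16K sequence with a 4K window needs 4 layers.
print(layers_to_cover(16 * 1024, 4 * 1024))  # -> 4
```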
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).