# Metadata
Source URL:: https://github.com/mistralai/mistral-src/tree/main
---
# mistralai/mistral-src: Reference implementation of Mistral AI 7B v0.1 model.
## Highlights
> [!quote]+ Updated on Sun Oct 01 2023 13:34:41 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
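As a rough illustration of the masking described in this highlight, the sketch below builds a boolean sliding-window causal mask in PyTorch. This is my own minimal example, not code from mistral-src, and it assumes the window of W positions includes the current token (the README's phrasing leaves that convention open).

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: query position i may attend to key positions j
    with i - window < j <= i, i.e. itself plus at most window-1 past tokens.
    (Assumed convention; the README only says "at most W tokens in the past".)"""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, row vector
    return (j <= i) & (j > i - window)

# Example: seq_len=6, window=3 reproduces the W=3 pattern described above.
print(sliding_window_mask(6, 3).int())
```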
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
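To make the propagation arithmetic concrete, here is a tiny helper (my own illustration, not from the repository) that computes the minimum number of layers after which information from the first token can, in principle, reach the last one, assuming each layer moves information forward by at most W positions.

```python
import math

def layers_to_cover(seq_len: int, window: int) -> int:
    """Minimum number of attention layers needed so that information can
    propagate across the whole sequence when each layer advances it by
    at most `window` tokens."""
    return math.ceil((seq_len - 1) / window)

# Matches the highlighted example: a 16K sequence with a 4K window needs 4 layers.
print(layers_to_cover(16 * 1024, 4 * 1024))  # -> 4
```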
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> Note that tokens outside the sliding window still influence next word prediction.
>At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc.
>For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
> [!quote]+ Updated on Sun Oct 01 2023 13:36:17 GMT-0700
>
> The number of operations of attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length.
>At inference time, this incurs higher latency and smaller throughput due to reduced cache availability.
>To alleviate this issue, we use a sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).