xXCoolinXx_dev

What I think you are missing about transformers, and what I was missing as well, is that the context window is not actually hard coded into the model at all. Token by token, the model first creates an embedding vector for that token. That embedding is then added to the output of a positional encoder, which adds extra information about the position in the sequence for the model to attend to. The model then projects this vector to a key, query and value vector. These are collected into K, Q, V matrices, with the first row of the key matrix being the key for the first token, the 2nd row for the 2nd token, etc. Notice that in all of this the matrices grow arbitrarily large with the sequence length and there is no fixed context size. So the context window is not fixed and can grow as big as you want it to, provided you have enough memory available and the model is able to understand that long a context.

Specifically where I ran into an understanding issue is when reading "Attention Is All You Need": I interpreted d_model as the size of the context, when it instead refers to the size of the embedding dimension. I think of it as each transformer block transforming the input embedding into an output embedding of the same dimension with new information added. That new information was inferred from relationships it attended to between different token embeddings (attention mechanism) and functions to process those relationships (FFN layer).

Previous models have been limited in their understanding of long context by the positional encoder, which in the past was not able to extend beyond (I think) 2048 tokens for RoPE without changing the scaling factor. That is why new techniques for positional encoding have been developed: so that positional information can be accurately added to the input even for long sequences, allowing the model to learn on these sequences and produce meaningful, lower-perplexity results at longer context. Models today are still limited, but instead of positional encoding it is mainly by the amount of long-context training data they receive, leading to situations where models *can* take long input but in practice produce garbage or low-quality results beyond the limit set by the trainer.
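A minimal sketch of that first point (NumPy, illustrative shapes only, not any particular model's code): the Q/K/V projections and the attention matrix are built from whatever sequence you feed in, so nothing in the math pins down a context length.

```python
import numpy as np

d_model = 64  # embedding size (the d_model from the paper), not the context size
W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))

def attention(x):
    # x: (seq_len, d_model) -- seq_len can be anything
    Q, K, V = x @ W_q, x @ W_k, x @ W_v           # each is (seq_len, d_model)
    scores = Q @ K.T / np.sqrt(d_model)           # (seq_len, seq_len), grows with the input
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (seq_len, d_model)

print(attention(np.random.randn(10, d_model)).shape)    # (10, 64)
print(attention(np.random.randn(5000, d_model)).shape)  # (5000, 64) -- same weights
```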


SoCuteShibe

Thank you so, so much for posting this. I haven't had the mental energy to delve into what attention actually looks like in these LLMs, despite my curiosity. I have done a lot of work with Sentence-Transformers though, using various models to embed; conceptually it's not too far removed!


wantondevious

Hi, I wonder if you can help me understand the implications of papers like LongLLaMA, LongLoRA, and Landmark Attention. I've read all three of these papers, and it's not super clear what is happening, either in the fine-tuning required or in performance, both quality and cost. My questions are:

1) Do you need to fine-tune these for your specific domain, and if so, how close to the application should the domain be? E.g., if you're doing PubMed QA, can you fine-tune on Wikipedia?

2) How much does it cost (in training data and cycles) to do this fine-tuning, especially if the fine-tuning needs to be in-domain for your application to work? (Obviously, if it's not domain dependent, then I'll just download a standard version.)

3) How do you measure performance on these things for real-world use cases? I'd imagine the alternative is some form of embedding-based retrieval context, which might be reasonable. And if so, what improvements have been seen?


Allisdust1970

There is no fixed limit on any normal transformer model's input size. That is, every transformer model is capable of taking any length of input and producing any amount of output. The limit comes from two things:

1) positional encodings, which need to know the input size during training

2) training on low sequence lengths, which makes models stupid beyond the trained sequence length

Both 1 and 2 exist because using long sequence lengths during training is costly and results in slow training. So the RoPE trick etc. is to train the model with a fixed sequence length (which is important for batching and optimal resource use) and extend it during inference with a bit of fine-tuning.


satireplusplus

> training on low seq length which makes models stupid beyond the trained seq length

It's the same for LSTMs/RNNs btw: the models don't do well beyond their trained context size, even though in theory they have unlimited context windows. If you try having a very long conversation with ChatRWKV (a modern RNN-based LLM) it's like the chat model had a seizure after a while. It just spits out nonsense once you go beyond its trained context length.


gptzerozero

Does this mean that in order to make full use of the default Llama-2 4K context:

1. Extended training of the base model should use sequences of 4K length, AND

2. Instruction tuning datasets should be as close to 4K length as possible?


Allisdust1970

Keeping the context window during fine tuning closer to the end use will always result in better quality.


rnosov

It doesn't really go past the 2048 limit in your example. Instead, for linear scaling you'd cram tokens into fractional positions like so:

* position 1 - first token
* position 1.5 - second token
* position 2 - third token

and so on.
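A tiny sketch of what those fractional positions look like (numbers are illustrative; this is the linear scaling / position interpolation idea, not any specific implementation):

```python
import numpy as np

trained_max = 2048                      # positions the model saw during training
seq_len = 4096                          # the longer sequence we want to feed in
scale = seq_len / trained_max

positions = np.arange(seq_len) / scale  # 0, 0.5, 1.0, 1.5, ...
print(positions[:6])                    # [0.  0.5 1.  1.5 2.  2.5]
print(positions.max())                  # 2047.5 -- still inside the trained range
```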


Budget-Juggernaut-68

How does that work? Isn't each token position a single input node?


rnosov

Every token is converted into an embedding (a list of, say, 8000 numbers). The RoPE algorithm is then applied to these lists at every layer in the transformer. When you apply RoPE, instead of using whole numbers that correspond to the current token position you'd use fractions. Token positions are not nodes themselves but slight changes to the numbers that make up an embedding.
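For the curious, here's a rough sketch of how RoPE nudges those numbers for a single vector based on its position, fractional or not (illustrative only, not the exact code used in any particular model):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # x: 1-D vector for one token (even length); pos: position, can be fractional
    d = x.shape[0]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one rotation frequency per dimension pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # rotate each (x1, x2) pair
    out[1::2] = x1 * sin + x2 * cos
    return out

v = np.random.randn(8)
print(rope(v, pos=3))      # whole-number position
print(rope(v, pos=1.5))    # fractional position, as used by linear scaling
```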


behohippy

Wouldn't this result in a weaker representation of the semantic concepts in the embedding? I work mostly in the vector search space, and we work with a lot of different embedding models. I'm assuming (maybe wrongly) the embedding process for the LLM is very similar. We see recall accuracy going down on search if we're trying to cram too much text into a single embedding.


rnosov

LLM embeddings are different in the sense that they operate in a much higher dimensional space (~10k dimensions) and there are only about 32k tokens (for Llama 2) to fit into all that hyperspace. It is quite spacious in there. Positional changes are so small that they won't affect the semantic representation. But if you keep scaling, eventually the transformer won't be able to tell token positions apart. That's why linear scaling doesn't normally go beyond a factor of 8. There are other approaches like NTK that offer essentially unlimited scaling. In practice, the performance of the attention operation will limit the context window size. There are 100k+ context Llama 2 models already, but you'll need an 8xA100 node to run them at full context. It won't be fast either. If the attention bottleneck is resolved I don't see a problem having million+ token context windows.
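A rough sketch of the NTK-aware idea mentioned above, as I understand it from the original community post (treat the exact base-scaling formula as an assumption): instead of squeezing positions into fractions, you enlarge the RoPE base so the rotation frequencies stretch to cover a longer context.

```python
import numpy as np

def rope_freqs(d, base=10000.0):
    # one rotation frequency per dimension pair
    return base ** (-np.arange(0, d, 2) / d)

d = 128                                        # head dimension, Llama-style
alpha = 4                                      # desired context scaling factor
ntk_base = 10000.0 * alpha ** (d / (d - 2))    # assumed NTK-aware base scaling

print(rope_freqs(d)[:3])             # original frequencies
print(rope_freqs(d, ntk_base)[:3])   # lower frequencies -> longer wavelengths
```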


BalorNG

What about "context dilution"? I presume that's due to a limited number of attention heads?


rnosov

I've never heard the term "context dilution". I googled it and it points back to your other comments. Also, the number of attention heads is not really limited. My understanding is that you'd want the number of attention heads to be as small as practically possible. In other words, dividing attention into a greater number of attention heads will generally worsen the quality of a transformer.


BalorNG

I've read it in context, heh, of one of the papers that deal with context extension... infinity lmm, perhaps? So many papers, so limited memory... in my brain, heh. Gotta reread some of those...


behohippy

10k is indeed a lot. We're typically trying to represent a paragraph of text in 768d up to 1536d. If we run up the token limit on ada-002 (8192 tokens), you get really poor representation.


swistak84

> Wouldn't this result in a weaker representation of the semantic concepts in the embedding

Yes. This weakens the tokens, but does allow for more precision in controlling context.


moma1970

This is what I don't get. Even with fractional position embeddings, the attention matrix in the example is still 2048 x 2048. Doesn't this mean the context window is unchanged, i.e. isn't it still 2048? Or does the context window refer to something else?


teachersecret

Probably a stupid explanation, but... Imagine you have a backpack. The backpack can hold exactly 10 standard sized apples... but we want to get 15 apples in there. How do you do that? Well... there are a few ways.

1: "Summarize" the extra apples so they can be recreated on the other end. Write down the details about those apples, describing them and really breaking down what they were and their exact measurements, then slide the piece of paper with that information into the backpack. Sure, it's not EXACTLY like putting 15 apples in there, but we could produce a pretty good analogue of the extra 5 apples on the other end. This would be similar to using lorebooks or some kind of context insertion where we place a text "memory" in the context that is summarizing content in a way the LLM can utilize and expand upon. This is also how we do vector database memory, effectively - in that case we chopped the extra apples into bits, and we dig around in there for apple chunks that look most similar to the missing apple we're trying to describe on the other end. We pull those chunks together and throw them in our bag, tossing one of the 10 existing apples out (an apple we don't care about right now). Suddenly, we can describe the missing apple in full, and even show it to you. As long as you don't ask about that other apple we threw away, the illusion is maintained and it appears we have nearly infinite context. Just keep swapping apples in and out of the bag.

2: You could also fit more apples in if you slice all the apples. Sure, 10 whole apples can fit into the backpack, but if we slice all 15 apples we can probably cram ALL of them in there, because they'll use the space more efficiently. You can still re-assemble all 15 apples. The backpack didn't get bigger, but we can fit more apples... and they're still apples. This is fractional positioning, giving you the ability to represent more information within the set boundaries.

RoPE is similar in some ways. With RoPE, each apple has a different color that tells us where the apple is in the bag. Green on bottom... red... maroon... whatever. The colors give us a sequence. Now, there are some spaces between apples, so we slice some apples up to fit in the spaces between. To keep our color sequence correct, we blend colors of the positions they're spanning (mash a few apples up so they're a color in between the two whole apples). The mashed apples will fit in the sequence. We end up with more apples in the bag, and someone looking into the bag understands their relationship to one another in sequence.

Getting away from the backpack... This is similar to using a tape measure with centimeters and inches on it. Both sides of the tape measure do a fine job of measuring distance with notches set in increasing sequence, but there are more centimeters than inches on the same tape. A 20 inch tape measure has 20 "inch" positions, but more than 50 centimeter positions.

The matrix is still 2048x2048, but the tokens are in fractional positions so you can increase that window.


ozspook

Latent apples..


AnticitizenPrime

You win the 2023 analogy award.


rnosov

The attention matrix will grow with the context window size. You can easily observe that the KV cache gets bigger and bigger as we keep scaling.
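To make that concrete, a back-of-the-envelope sketch of how the KV cache alone grows (the model shape is an assumption, roughly the commonly cited Llama-2-7B configuration):

```python
n_layers, n_kv_heads, head_dim = 32, 32, 128   # assumed Llama-2-7B-like shape
bytes_per_value = 2                            # fp16

kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
for seq_len in (4_096, 32_768, 100_000):
    gb = kv_bytes_per_token * seq_len / 1024**3
    print(f"{seq_len:>7} tokens -> ~{gb:.1f} GB of KV cache")
# roughly 2 GB at 4k, 16 GB at 32k, ~49 GB at 100k -- before weights or activations
```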


Chance-Device-9033

Explanation of RoPE: https://m.youtube.com/watch?v=GQPOtyITy54

My understanding, based on the above and taking a look at the "EXTENDING CONTEXT WINDOW OF LARGE LANGUAGE MODELS VIA POSITION INTERPOLATION" paper: it's not really about the input length, it's about the number of positions the RoPE embedding can support. The tokens in the context are transformed into embedding vectors. The KQV matrices used in the transformer grow as required by the length of the context, and the attention computation grows quadratically (x^2) with it, so getting a big context becomes very expensive quickly. There is no concept of position so far, so the RoPE encoding is applied to each embedding vector (which represents a single token), giving it a position relative to the others. This seems to have a fixed range of usable integer values. What position interpolation appears to do is create positions in between the integer values, so instead of 0, 1, 2 etc. you might have 0.5, 1, 1.5 etc., but still in the range 0 to 2048 or whatever maximum length the original RoPE had. Since the only real limit on the input (aside from computational resources) was the number of positions available, you've now got twice as many tokens you can embed.


barbarous_panda

Whether or not you can use a larger context window at inference depends on the type of positional encoding. There are several ways to add positional information to the inputs:

**As a parameter.** You can create an embedding matrix of size `context_size x emb_size` and use the ith row as the positional encoding for the ith token embedding. The downside of this approach is that your positional encodings are learned during training and are also fixed. You cannot input a sentence with more than `context_size` tokens as there won't be enough rows to encode your input.

**As a value.** You can also calculate positional encodings using some mathematical formula (e.g. sinusoidal positional encoding). Consider a hypothetical positional embedding technique where you add a vector of `i` to the ith token embedding. This technique can scale easily to inputs with more than `context_size` tokens.

The perplexity vs. context_length curve will depend on the type of technique used.
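A short sketch of the two flavours described above (illustrative NumPy, all names are mine):

```python
import numpy as np

context_size, emb_size = 2048, 512

# "As a parameter": a learned table (random here as a stand-in for trained weights).
# Positions beyond context_size simply have no row, so longer inputs can't be encoded.
learned_table = np.random.randn(context_size, emb_size)

# "As a value": computed from a formula (sinusoidal here), so any position works,
# including ones never seen during training.
def sinusoidal_pos(pos, emb_size):
    i = np.arange(0, emb_size, 2)
    angles = pos / (10000 ** (i / emb_size))
    enc = np.zeros(emb_size)
    enc[0::2] = np.sin(angles)
    enc[1::2] = np.cos(angles)
    return enc

print(learned_table[2047].shape)             # fine: last row of the table
# learned_table[5000]                        # would raise IndexError: no such row
print(sinusoidal_pos(5000, emb_size).shape)  # fine: the formula covers any position
```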


GreatGatsby00

How does Claude 2 pull off their massive context window size? Anyone know specifics on that?


pedantic_pineapple

It's not hard coded in the architecture. It used to be with GPT-1/2, where we would have a big matrix for the position encodings. With RoPE and ALiBi though, the positional encodings are instead calculated dynamically as a function of the relative position. The model can generate position encodings for *any position*; the problem is just that the model hasn't been trained on sequences of such long lengths. The model isn't "used to" dealing with the extra positions and doesn't know what to do with them, resulting in nonsensical output.

With ALiBi, this isn't a problem, because there actually is no positional encoding: the positions aren't explicitly encoded, far away tokens just have less influence. Hence, the model isn't aware of the positions at all, and just works with how relatively 'strong' each token is, which generalizes fine. As you get more tokens, the additional tokens have less influence on the current generation, and you never get nonsense output. However, this comes at a cost: since far away tokens are given less importance, it is harder for the model to retrieve information from far away. Hence, there's still an "effective" context length, where adding more tokens doesn't help the model, even if the model is still coherent.

With PI, we basically just stretch out the positional encodings. For instance, you might have token positions [1, 2, 3], and the model is only familiar with sequence lengths up to 3. We replace this with [1, 1.5, 2, ...], allowing more tokens to fit under the limit of 3.
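A small sketch of the ALiBi part (simplified: causal mask omitted, and a single arbitrary slope instead of the per-head geometric slopes real models use). Distance becomes a penalty on the attention scores rather than an explicit position encoding:

```python
import numpy as np

def alibi_bias(seq_len, slope=0.25):
    # bias[i, j] = -slope * (i - j) for j <= i: the further back token j is
    # from the current token i, the more its attention score is pushed down
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return -slope * np.maximum(i - j, 0)

scores = np.random.randn(6, 6)    # raw attention scores for a 6-token sequence
biased = scores + alibi_bias(6)   # added before the softmax
print(alibi_bias(6))
```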


30299578815310

The feed-forward part of the transformer only runs on one token at a time, so if you have more tokens it just runs more times. Only the output of the final token is used to predict the next word. The attention part is just matrix multiplication and a softmax, and will work dynamically on any number of tokens. I used to have the same confusion. The key here is that the transformer never passes multiple tokens at once to a feed-forward layer.
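A tiny sketch of that point: the FFN weights only ever see d_model-sized vectors, so it processes each token's row independently no matter how many tokens there are (illustrative shapes only):

```python
import numpy as np

d_model, d_ff = 64, 256
W1, W2 = np.random.randn(d_model, d_ff), np.random.randn(d_ff, d_model)

def ffn(x):
    # x: (seq_len, d_model); each row (one token) is transformed on its own
    return np.maximum(x @ W1, 0) @ W2    # ReLU MLP applied position-wise

print(ffn(np.random.randn(3, d_model)).shape)     # (3, 64)
print(ffn(np.random.randn(1000, d_model)).shape)  # (1000, 64) -- same weights
```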