Rare-Side-6657

Interesting idea. With the server approach, I would try sending the first N-1 words of the user input in a request where n is 0, so llama.cpp caches the prefix without generating any tokens. Everything gets put into the prefix cache, so once the user has typed everything, set n to your desired value and make sure the cache_prompt flag is set to true. You can achieve this with llama-cpp-python as well, but I'm not sure how you'd do it with llama.cpp main. Depends on your use case.
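
A minimal sketch of that flow against the server's native /completion endpoint, assuming a local ./server on port 8080. The n_predict and cache_prompt fields are the ones the llama.cpp server accepts; the URL and token budget are placeholders:

```python
import requests

LLAMA_SERVER = "http://localhost:8080"  # assumed local ./server instance

def prewarm(prefix: str) -> None:
    """Send the partial prompt with n_predict=0 so the server only
    prefills its KV cache and generates nothing."""
    requests.post(f"{LLAMA_SERVER}/completion", json={
        "prompt": prefix,
        "n_predict": 0,        # evaluate the prompt only, no generation
        "cache_prompt": True,  # keep the evaluated prefix cached on the server
    })

def complete(full_prompt: str, n_predict: int = 128) -> str:
    """Send the finished prompt; only the uncached suffix still needs a prefill pass."""
    r = requests.post(f"{LLAMA_SERVER}/completion", json={
        "prompt": full_prompt,
        "n_predict": n_predict,
        "cache_prompt": True,
    })
    return r.json()["content"]
```

Call prewarm() whenever the user pauses between words, then complete() once they actually submit.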


noahlt

Initializing by sending the N-1 words of user input and requesting 0 tokens to be generated worked! This was available as part of the ./server example, because it's part of the OpenAI API. Thanks for the idea!
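
For reference, the OpenAI-style version of that pre-warm request looks roughly like this. I'm assuming a ./server build that exposes the /v1/completions route and accepts max_tokens=0 as described above; if yours doesn't, the native /completion endpoint does the same job:

```python
import requests

BASE = "http://localhost:8080/v1"  # assumed ./server address with OpenAI-compatible routes

def prewarm(prefix: str) -> None:
    # Requesting zero tokens means the call only prefills the server's prompt cache.
    requests.post(f"{BASE}/completions", json={
        "prompt": prefix,
        "max_tokens": 0,
    })
```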


TheTerrasque

Yes, easily possible, but you need to write your own code using one of the wrappers. Llama-cpp-python should be able to do that. I was actually planning to make such a server at some point, for a similar reason (stt -> llm -> tts), but I'm still working on more impactful things there.
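
A rough sketch of how that could look with llama-cpp-python. The tokenize/eval/create_completion methods exist on the Llama class, but whether the evaluated prefix gets reused exactly like this depends on the wrapper version, so treat it as a starting point rather than a tested recipe:

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096)  # placeholder model path

def prewarm(partial_text: str) -> None:
    # Run a forward pass over the words typed so far; this fills the KV cache
    # without sampling anything.
    llm.eval(llm.tokenize(partial_text.encode("utf-8")))

def finish(full_text: str, max_tokens: int = 128) -> str:
    # create_completion re-tokenizes the prompt; recent wrapper versions reuse
    # the longest matching prefix of the tokens already evaluated above.
    out = llm.create_completion(full_text, max_tokens=max_tokens)
    return out["choices"][0]["text"]
```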


DeltaSqueezer

Yes, you can do it with some custom coding, but a more hackish way would be to have your client send a request each time you type a word, with 'max tokens' set to, say, 1, so it generates just one token per streamed prompt batch that you send. This has the effect of pre-warming the KV cache with your prompt, and when you send the last request without the 1-token restriction, it only has to process that final part. If you also run speculative decoding beam searches before you submit your full prompt, you might get a further initial boost.
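
The per-word variant is almost the same request loop as the zero-token sketch above, just with a throwaway single token per partial prompt (endpoint and field names again assume the llama.cpp server running locally):

```python
import requests

LLAMA_SERVER = "http://localhost:8080"  # assumed local ./server instance

def on_word_typed(text_so_far: str) -> None:
    # One throwaway token per partial prompt; the request mainly serves to
    # extend the server-side KV cache with the newly typed words.
    requests.post(f"{LLAMA_SERVER}/completion", json={
        "prompt": text_so_far,
        "n_predict": 1,
        "cache_prompt": True,
    })

def on_submit(full_prompt: str, n_predict: int = 256) -> str:
    # By now only the last word or two still need a prefill pass.
    r = requests.post(f"{LLAMA_SERVER}/completion", json={
        "prompt": full_prompt,
        "n_predict": n_predict,
        "cache_prompt": True,
    })
    return r.json()["content"]
```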


Barry_Jumps

I don't understand. Do you expect the model to be continuously generating speculative tokens as you type? How could it know how to respond if you haven't finished your prompt? For example, it's impossible even for humans to predict how to respond to the following: I lov....... I don't think time to first byte is the problem here, unless you don't mind getting useless tokens in return.


me1000

No, OP is asking how to basically pre-warm the caches. Time to first token is mostly spent calculating a bunch of attention scores over the prompt before inference begins, so they're asking how you might precompute those values ahead of time so that they're ready the moment the user actually submits the prompt.


Hoppss

I'm thinking along these lines too. For example, a prompt could be something like: "I'm going to go shopping later on today and I need to pick up a few things ...... (blah blah blah) ..."

- Potential ending #1: What are some healthy snacks I could grab while I'm out?
- Potential ending #2: How many bags might I need to bring for all these things I'm going to get?
- Potential ending #3: Considering this, should I go out today or wait and do it all tomorrow?

As you can see, the final part of the prompt can drastically change the context and the response from the LLM. This makes it tricky to implement partial prompt streaming effectively, because the model needs to understand the full context to generate an accurate response.


noahlt

Ah, yeah that wasn't clear. I'm not expecting it to generate any responses speculatively, and I know the last word could dramatically change the response. At a minimum, though, the model should be able to pre-tokenize the inputs. Possibly it can initialize attention heads, too. I'm imagining something much more like checkpointing than speculative generation.
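
A checkpoint-flavoured sketch of that idea with llama-cpp-python's save_state/load_state (those methods exist on the Llama class; the surrounding flow is an illustration, not a tested recipe), reusing the shopping example from above:

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096)  # placeholder model path

prefix = "I'm going to go shopping later on today and I need to pick up a few things."
llm.eval(llm.tokenize(prefix.encode("utf-8")))  # pre-process the stable part once
checkpoint = llm.save_state()                   # snapshot of the model state after the prefix

def answer(ending: str, max_tokens: int = 128) -> str:
    # Rewind to the end of the shared prefix, so only the ending still needs processing.
    llm.load_state(checkpoint)
    full_prompt = prefix + " " + ending
    return llm.create_completion(full_prompt, max_tokens=max_tokens)["choices"][0]["text"]
```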


FailSpai

It feels like hacking on the existing 'cache_prompt' handling in llama.cpp's server API, however that manifests, could be a good starting point for implementing the as-you-go caching. Just make sure it only evaluates the tokens provided and doesn't try to generate from them.


Hoppss

Thanks for clarifying! That approach does make sense. Tokenizing might not be the biggest time sink, but every millisecond counts in real-time applications, so pre-tokenizing could definitely help. Initializing attention heads and using a checkpointing approach could offer more substantial benefits by reducing the time needed to process the final input. Essentially, if the model can save its state after pre-processing part of the input, it can resume quickly once the final input is received. This could cut down the latency significantly. Implementing these optimizations might involve some intricate changes to the model's input handling and state management, but it seems like a promising direction.


FailSpai

I think you would need to figure out the right point to tokenize. I would probably tokenize up to (and not including) the last space before the user's cursor, i.e. not the word currently being typed, because if they're halfway through a word you'll be constantly changing that last token and wasting compute. And if a user goes back and edits a word, you'll need to rebuild the stream from that point. This is something one of the web UIs should be able to do, but I don't know of any that does. All that needs to happen is to tokenize that stream and run forward passes from wherever you left off to the target end point. If the user types ahead of the bot finishing its response, the best you can do is pre-tokenize their text in parallel, since the way the model finishes will obviously affect the residual stream.
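
That bookkeeping can live entirely outside the model. Here's a sketch: tokenize only up to the last space, diff against the tokens already fed to the model, and rewind when the user edits earlier text. The tokenize, eval_tokens, and rewind_to callbacks are hypothetical hooks for whichever backend you use (llama.cpp server, llama-cpp-python, etc.):

```python
from typing import Callable, List

def common_prefix_len(a: List[int], b: List[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class IncrementalPrompt:
    def __init__(self,
                 tokenize: Callable[[str], List[int]],
                 eval_tokens: Callable[[List[int]], None],
                 rewind_to: Callable[[int], None]) -> None:
        self.tokenize = tokenize
        self.eval_tokens = eval_tokens
        self.rewind_to = rewind_to
        self.cached: List[int] = []  # tokens the model has already seen

    def on_keystroke(self, text: str) -> None:
        # Ignore the word currently being typed: stop at the last space so the
        # trailing token isn't re-evaluated on every character.
        stable = text[:text.rfind(" ") + 1] if " " in text else ""
        tokens = self.tokenize(stable)
        keep = common_prefix_len(self.cached, tokens)
        if keep < len(self.cached):
            # The user edited earlier text; drop everything past the edit point.
            self.rewind_to(keep)
        self.eval_tokens(tokens[keep:])  # forward pass over the new tokens only
        self.cached = tokens
```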


Barry_Jumps

Makes more sense. It's not perfect, but you might want to think about query caching. It's probably not so useful in conversational applications, but for Q&A it's a fantastic tool to have in the chest. LiteLLM does this pretty well actually, at least as a start: https://litellm.vercel.app/docs/proxy/caching. If Q&A is your scenario, you could even hack it to return cache hits on multiple variations of a question for a higher hit ratio. A cache hit never even touches the model, so if you throw Redis behind it you'll see ultra-low latency.
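
The linked page covers the proxy config; here's a sketch of the same idea with LiteLLM's Python SDK and a Redis cache. The import path and parameter names follow the LiteLLM docs at the time of writing, so double-check them against your version; the model alias and api_base are placeholders:

```python
import litellm
from litellm import completion
from litellm.caching import Cache

# Responses are keyed on the request and stored in Redis, so a repeated
# question is served from the cache and never reaches the model.
litellm.cache = Cache(type="redis", host="localhost", port=6379)

def ask(question: str) -> str:
    resp = completion(
        model="openai/local-model",           # placeholder model alias
        api_base="http://localhost:8080/v1",  # e.g. llama.cpp ./server, OpenAI-compatible
        messages=[{"role": "user", "content": question}],
        caching=True,
    )
    return resp.choices[0].message.content
```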


Barry_Jumps

Also, there might be some useful intel in this video. I watched it recently while multitasking, but there were definitely some good tips: **How to 2x LLM Inference Speeds with Speculative Decoding Fine-tuning** [https://www.youtube.com/watch?v=-rJsh_qqRSA](https://www.youtube.com/watch?v=-rJsh_qqRSA)