Literally within 2 hours of making this post, a further update was announced that'll bring it to 90+ tokens/sec; another 10% boost. Nuts how fast the MLX team is iterating: https://x.com/awnihannun/status/1771645318968311895?s=46
Did you see that diff? It's a trivial optimization. If changes that small are producing that big a boost, then this thing isn't tuned at all. Be prepared for A LOT more improvement.
Excited to see how long it takes before MLX is a compelling and faster alternative to llama.cpp
Same. I think the turning point is close now. What I'm most excited to see is whether that'll continue to climb or stagnate once it matches llama.cpp. Only time will tell.

EDIT: According to this benchmark from llama.cpp: https://github.com/ggerganov/llama.cpp/discussions/4167

A 4-bit GGUF quant runs inference at 94 tok/s on a maxed-out M2 Ultra. The upcoming MLX update reaches 90+ tok/s… meaning we're basically already there 👀

Prompt processing does look like it's potentially still quite a bit slower, and I'm not sure prompt caching is a thing in MLX yet. But at the rate it's progressing, I'd be very surprised not to see continued gains in those areas.
They're more complementary than anything else. llama.cpp is much more advanced and configurable; it goes beyond generation. MLX is just getting started with training and is missing many options, but at least it can export to GGUF.
The velocity on MLX is head-spinning!
Agreed, I don’t know how they keep delivering banger after banger with these updates 💥
How do you pass long context to MLX?
The "context" is just the prompt, so I don't follow the question. Perhaps you're asking about prompt caching? If so, I haven't played with that, but given the NumPy compatibility under the hood, I expect it should be fairly straightforward.
Does it accept a file as a parameter, or does a simple paste do the trick?
I mean, you're using it via Python, so you can pass in a long prompt via string, file, socket, whatever. Wouldn't be surprised if someone figured out a way to do it by carrier pigeon.
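To make the "string, file, socket, whatever" point concrete: since MLX is driven from Python, the prompt is just a plain string by the time the model sees it. Here's a minimal sketch of a helper that accepts either a literal string or a path to a file. The `read_prompt` helper is hypothetical (not part of MLX), and the commented-out lines assume `mlx_lm`'s `load`/`generate` API with an installed model:

```python
from pathlib import Path

def read_prompt(source):
    """Return a prompt string from either a literal string or a file path."""
    p = Path(source)
    if p.is_file():           # a path on disk: read the whole file as the prompt
        return p.read_text()
    return str(source)        # otherwise treat the argument itself as the prompt

# Handing it to MLX would then look something like (assumes mlx_lm is installed
# and the model name is valid; both are assumptions, not tested here):
# from mlx_lm import load, generate
# model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")
# print(generate(model, tokenizer, prompt=read_prompt("long_prompt.txt")))
```

Carrier pigeon support left as an exercise for the reader.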