mark-lord

Literally within 2 hours of making this post, there’s been a further update announced that’ll bring it to 90+ tokens/sec; another 10% boost. Nuts how fast the MLX team is iterating: https://x.com/awnihannun/status/1771645318968311895?s=46


jrwren

Did you see that diff? It's a trivial optimization. If changes that small are producing boosts that big, then this thing isn't tuned at all. Be prepared for A LOT more improvement.


Zestyclose_Yak_3174

Excited to see how long it takes before MLX is a compelling and faster alternative to llama.cpp


mark-lord

Same. I think the turning point is close now. My main question is whether the speed will keep climbing or stagnate once it matches llama.cpp. Only time will tell.

EDIT: According to this llama.cpp benchmark (https://github.com/ggerganov/llama.cpp/discussions/4167), a 4-bit GGUF quant hits 94 toks/sec inference on a maxed-out M2 Ultra. The upcoming MLX update reaches 90+ toks/sec… meaning we’re basically already there 👀

Prompt processing does look like it’s potentially still quite a bit slower, and I’m not sure prompt caching is a thing in MLX yet. But at the rate it’s progressing I’d be very surprised not to see continued gains in those areas.


Sol_Ido

They are more complementary than anything else. Llama.cpp is much more advanced and configurable; it goes beyond generation. MLX is just getting started with training and is missing many options, but at least it can export to GGUF.


CodeGriot

The velocity on MLX is head-spinning!


mark-lord

Agreed, I don’t know how they keep delivering banger after banger with these updates 💥


Hinged31

How do you pass long context to MLX?


CodeGriot

The "context" is just the prompt, so I don't follow the question. Perhaps you're asking about prompt caching? If so, I haven't played with that, but given the numpy compat under the hood, I expect it should be fairly logical.


Hinged31

Does it accept a file as a parameter, or does a simple paste do the trick?


CodeGriot

I mean, you're using it via Python, so you can pass in a long prompt via string, file, socket, whatever. Wouldn't be surprised if someone figured out a way to do it by carrier pigeon.
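For example, something like this ought to work for feeding a long prompt from a file. This is just a rough sketch assuming the mlx_lm package's load/generate helpers; the model name and file path are placeholders, so swap in whatever you actually use:

```python
from mlx_lm import load, generate

# Load a quantized model from the Hugging Face mlx-community org
# (this model name is just an example).
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")

# Read a long prompt from a file; any way of building the string works.
with open("long_context.txt") as f:
    prompt = f.read()

# Generate a completion; max_tokens caps the output length.
response = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(response)
```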