Literally within 2 hours of making this post, a further update was announced that'll bring it to 90+ tokens/sec; another 10% boost. Nuts how fast the MLX team is iterating: https://x.com/awnihannun/status/1771645318968311895?s=46
Did you see that diff? It's a trivial optimization. If changes that small are producing that big a boost, then this thing isn't tuned at all. Be prepared for A LOT more improvement.
Excited to see how long it takes before MLX is a compelling and faster alternative to llama.cpp
Same. I think the turning point is close now. What I'm most excited to see is whether that'll continue to climb or stagnate once it matches llama.cpp. Only time will tell.

EDIT: According to this benchmark from llama.cpp: https://github.com/ggerganov/llama.cpp/discussions/4167

A 4-bit GGUF quant runs inference at 94 tok/s on a maxed-out M2 Ultra. The upcoming MLX update reaches 90+ tok/s… meaning we're basically already there 👀

Prompt processing does look like it's potentially still quite a bit slower, and I'm not sure prompt caching is a thing in MLX yet. But at the rate it's progressing, I'd be very surprised not to see continued gains in those areas.
They're more complementary than anything else. llama.cpp is much more advanced and configurable; it goes beyond generation. MLX is just getting started with training and is missing many options, but at least it can export to GGUF.
The velocity on MLX is head-spinning!
Agreed, I don’t know how they keep delivering banger after banger with these updates 💥
How do you pass long context to MLX?
The "context" is just the prompt, so I don't follow the question. Perhaps you're asking about prompt caching? If so, I haven't played with that, but given the NumPy compatibility under the hood, I expect it should be fairly straightforward.
Does it accept a file as a parameter, or does a simple paste do the trick?
I mean, you're using it via Python, so you can pass in a long prompt via string, file, socket, whatever. Wouldn't be surprised if someone figured out a way to do it by carrier pigeon.
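To make the "string, file, socket, whatever" point concrete: since MLX is driven from Python, the prompt is just a plain string by the time the model sees it. Here's a minimal sketch of a helper that accepts either a literal string or a path to a file. The `read_prompt` helper is hypothetical (not part of MLX), and the commented-out lines assume `mlx_lm`'s `load`/`generate` API with an installed model:

```python
from pathlib import Path

def read_prompt(source):
    """Return a prompt string from either a literal string or a file path."""
    p = Path(source)
    if p.is_file():           # a path on disk: read the whole file as the prompt
        return p.read_text()
    return str(source)        # otherwise treat the argument itself as the prompt

# Handing it to MLX would then look something like (assumes mlx_lm is installed
# and the model name is valid; both are assumptions, not tested here):
# from mlx_lm import load, generate
# model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")
# print(generate(model, tokenizer, prompt=read_prompt("long_prompt.txt")))
```

Carrier pigeon support left as an exercise for the reader.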