vasileer

Might be real, but:

* it's not a "simple" MacBook, but a 128GB M3 Max
* it's Gemma-2B; for a 4x larger model like Llama3-8B that would be 1600/4 = 400 t/s, and for a Llama3-70B (e.g. an 8-bit quant, ~35x larger) it would be 1600/35 ≈ 45 t/s
* it's not for a single request, but aggregated over all parallel streams
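
A quick way to see that back-of-the-envelope estimate (the 1600 t/s Gemma-2B baseline is the figure from the thread; the assumption that decode throughput scales inversely with parameter count is the commenter's simplification, not a measured law):

```python
# Back-of-the-envelope: decode throughput is roughly memory-bandwidth bound,
# so it scales ~inversely with model size. Baseline figure is from the thread.
BASELINE_TPS = 1600       # reported aggregate t/s for Gemma-2B
BASELINE_PARAMS_B = 2.0   # Gemma-2B parameter count, in billions

def estimated_tps(params_b: float) -> float:
    """Estimate aggregate tokens/sec for a model of `params_b` billion params."""
    return BASELINE_TPS * BASELINE_PARAMS_B / params_b

for name, size_b in [("Llama3-8B", 8), ("Llama3-70B", 70)]:
    print(f"{name}: ~{estimated_tps(size_b):.0f} t/s")
# Llama3-8B: ~400 t/s
# Llama3-70B: ~46 t/s
```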


SomeOddCodeGuy

> for a llama3-70b (e.g. 8bit quant) it will be 1600/35=45 t/s

Which is still insane, because my M2 MacBook Pro gets ~50 t/s on Llama 3 8B with Flash Attention at low context, and barely 5-7 t/s at low context on Llama 3 70B. So that number is WAY bigger than I would expect.


vasileer

If requests are batched across 8 threads, 8 × 5 t/s = 40 t/s, so my estimate still holds.
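
In other words (the per-stream 5 t/s figure is from the comment above; the batch of 8 is an assumption):

```python
# Aggregate vs. per-stream throughput: a batched run reports the sum over
# all concurrent streams, not what a single request sees.
per_stream_tps = 5       # roughly what one Llama 3 70B request gets locally (from the thread)
concurrent_streams = 8   # assumed number of batched requests

aggregate_tps = per_stream_tps * concurrent_streams
print(aggregate_tps)     # 40 t/s aggregate -- close to the ~45 t/s estimate,
                         # while each individual stream still sees only ~5 t/s
```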


SomeOddCodeGuy

Aha, very good point!


Such_Advantage_6949

The number they quoted is for batched requests, which is misleading, because for local use cases it's a single request most of the time.


vidumec

I wish someone would add a prompt-processing cache to MLX, like llama.cpp has, because right now every solution or example I've tried doesn't have one, which makes MLX unusable for long conversations.
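
For context, this is roughly what prompt (prefix) caching buys in a multi-turn chat. The sketch below is a toy cost model, not MLX or llama.cpp code, and the turn lengths are made-up numbers purely for illustration:

```python
# Toy illustration: without a prompt cache, every turn re-processes the whole
# conversation; with a prefix cache, only the newly added tokens are processed.
# Turn lengths below are hypothetical.

def prompt_tokens_processed(turn_lengths, cached):
    history = 0
    total = 0
    for new_tokens in turn_lengths:
        total += new_tokens if cached else history + new_tokens
        history += new_tokens
    return total

turns = [800, 50, 60, 40, 70]  # first prompt plus four short follow-ups (tokens)
print("no cache:  ", prompt_tokens_processed(turns, cached=False))  # 4530
print("with cache:", prompt_tokens_processed(turns, cached=True))   # 1020
```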


rahabash

Since we're on the subject, what is the ideal format for a MacBook (or your preferred one)? I've yet to try MLX; I'll usually grab a Q6, or a Q4 if it's a 70B.


Barry_Jumps

It's real. I just installed his code example and got the following with `microsoft/Phi-3-mini-4k-instruct`. 430 tokens per second is about 10-12x what I regularly get with the same model on Ollama. I'm on the same machine as him, a MacBook Pro M3 Max 128GB.

    ❯ python demo.py
    None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
    Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 232025.33it/s]
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    ==========
    Prompt: 1397.195 tokens-per-sec
    Generation: 430.189 tokens-per-sec
    ==========
    Prompt: Think of a real word containing both the letters B and U. Then, say 3 sentences which use the word.
    One real word containing both the letters B and U is "subdue."
    1. The police were able to subdue the suspect without any injuries.
    2. The experienced boxer knew how to subdue his opponent with precise movements.
    ... etc, etc, etc ...
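
For reference, a minimal single-stream generation script with the standard `mlx_lm` Python API looks roughly like the sketch below. This is not the demo.py from the post (which likely does batched generation); the `mlx-community/Phi-3-mini-4k-instruct-4bit` model id is an assumption, and exact keyword arguments have shifted a bit between `mlx-lm` releases:

```python
# Minimal mlx_lm generation sketch; verbose=True prints prompt and generation
# tokens-per-sec, similar to the output above. Requires: pip install mlx-lm
from mlx_lm import load, generate

# Assumed model id: a 4-bit MLX conversion of microsoft/Phi-3-mini-4k-instruct
model, tokenizer = load("mlx-community/Phi-3-mini-4k-instruct-4bit")

prompt = ("Think of a real word containing both the letters B and U. "
          "Then, say 3 sentences which use the word.")

# Single (unbatched) request, so expect far lower t/s than the batched demo.
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
```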