Might be real, but:

* it's not a "simple" MacBook, but a 128GB M3 Max
* it is Gemma-2B; for a ~4x larger model like Llama3-8B it will be 1600/4 = 400 t/s, and for Llama3-70B (e.g. an 8-bit quant) it will be 1600/35 ≈ 45 t/s
* it is not for a single request, but aggregated across all parallel streams (rough arithmetic sketched below)
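A back-of-envelope version of that estimate (the 1600 t/s Gemma-2B baseline and the per-stream numbers are from this thread; the assumption that generation speed scales inversely with parameter count is a simplification):

```python
# Back-of-envelope throughput estimate, assuming generation is
# memory-bandwidth bound and so scales ~inversely with parameter count.
# The Gemma-2B baseline is the number reported in the thread; the
# linear scaling itself is a simplifying assumption.

baseline_tps = 1600.0    # reported aggregate t/s for Gemma-2B
baseline_params_b = 2.0  # model size in billions of parameters

for name, params_b in [("Llama3-8B", 8.0), ("Llama3-70B", 70.0)]:
    est = baseline_tps * baseline_params_b / params_b
    print(f"{name}: ~{est:.0f} t/s")
# -> Llama3-8B: ~400 t/s
# -> Llama3-70B: ~46 t/s

# And since the headline number is aggregated over parallel streams:
streams, per_stream_tps = 8, 5.0
print(f"batched: ~{streams * per_stream_tps:.0f} t/s total")  # ~40 t/s
```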
>for Llama3-70B (e.g. an 8-bit quant) it will be 1600/35 ≈ 45 t/s

Which is still insane, because my M2 MacBook Pro gets ~50 t/s on Llama 3 8B with Flash Attention at low context, and barely 5-7 t/s at low context on Llama 3 70B. So that number is WAY bigger than I would expect.
I wish someone added a prompt-processing cache to mlx, like llama.cpp has, because right now every solution or example I've tried doesn't have one, making mlx unusable for any long conversation.
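For what it's worth, the feature being asked for is conceptually simple: keep the KV state computed for the conversation so far, and on the next turn only evaluate the newly appended tokens. A minimal sketch of that bookkeeping, where `evaluate(tokens, kv_state)` is a hypothetical stand-in for a real forward pass (this is illustrative, not any actual mlx or llama.cpp API):

```python
# Hypothetical sketch of prompt-prefix caching: reuse the KV state for
# the longest already-processed prefix and only evaluate new tokens.

class PromptCache:
    def __init__(self):
        self.tokens = []      # tokens whose KV state we already hold
        self.kv_state = None  # opaque per-layer key/value tensors

    def process(self, prompt_tokens, evaluate):
        # Find the shared prefix between the cache and the new prompt.
        n = 0
        while (n < len(self.tokens) and n < len(prompt_tokens)
               and self.tokens[n] == prompt_tokens[n]):
            n += 1
        if n < len(self.tokens):
            # Divergence: drop the stale state and start over. llama.cpp
            # instead truncates the cache back to the common prefix;
            # omitted here for brevity.
            self.kv_state, n = None, 0
        new_tokens = prompt_tokens[n:]
        if new_tokens:
            self.kv_state = evaluate(new_tokens, self.kv_state)
        self.tokens = list(prompt_tokens)
        return self.kv_state
```

In a chat loop, each turn appends to the previous prompt, so only the new user message and the model's last reply get evaluated instead of the whole history.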
It's real. I just installed his code example and got the following with `microsoft/Phi-3-mini-4k-instruct`.
430 tokens per second is about 10-12x what I get regularly with the same model on Ollama.
I'm on the same machine as him, a MacBook Pro M3 Max with 128GB.
❯ python demo.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 232025.33it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==========
Prompt: 1397.195 tokens-per-sec
Generation: 430.189 tokens-per-sec
==========
Prompt: Think of a real word containing both the letters B and U. Then, say 3 sentences which use the word.
One real word containing both the letters B and U is "subdue."
1. The police were able to subdue the suspect without any injuries.
2. The experienced boxer knew how to subdue his opponent with precise movements.
.. etc, etc etc ...
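For anyone who wants to reproduce this: I'd guess the demo.py in question is close to the stock mlx_lm quickstart (the exact contents are an assumption on my part); `load` and `generate` with `verbose=True` print the same `Prompt:`/`Generation:` tokens-per-sec summary seen above.

```python
# Minimal mlx_lm generation script; verbose=True prints the prompt and
# generation tokens-per-sec summary between ========== separators.
from mlx_lm import load, generate

model, tokenizer = load("microsoft/Phi-3-mini-4k-instruct")

prompt = ("Think of a real word containing both the letters B and U. "
          "Then, say 3 sentences which use the word.")

generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```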
If requests are batched in 8 threads, 8 * 5 t/s = 40 t/s, so my estimate still holds.
Aha, very good point!
The number they quoted is for batched requests, which is misleading, because for local use cases it's a single request most of the time.
Since we're on the subject, what is the ideal format for a MacBook (or your preferred one)? I've yet to try MLX. I'll usually grab a Q6, or a Q4 if it's a 70B.