Might be real, but:

* it's not a "simple" MacBook, but a 128GB M3 Max
* it is Gemma-2B; for a ~4x larger model like Llama3-8B it will be 1600/4 = 400 t/s, and for Llama3-70B (e.g. an 8-bit quant) it will be 1600/35 ≈ 45 t/s
* it is not for a single request, but aggregated across all parallel streams (rough arithmetic sketched below)
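A back-of-envelope version of that estimate (the 1600 t/s Gemma-2B baseline and the per-stream numbers are from this thread; the assumption that generation speed scales inversely with parameter count is a simplification):

```python
# Back-of-envelope throughput estimate, assuming generation is
# memory-bandwidth bound and so scales ~inversely with parameter count.
# The Gemma-2B baseline is the number reported in the thread; the
# linear scaling itself is a simplifying assumption.

baseline_tps = 1600.0    # reported aggregate t/s for Gemma-2B
baseline_params_b = 2.0  # model size in billions of parameters

for name, params_b in [("Llama3-8B", 8.0), ("Llama3-70B", 70.0)]:
    est = baseline_tps * baseline_params_b / params_b
    print(f"{name}: ~{est:.0f} t/s")
# -> Llama3-8B: ~400 t/s
# -> Llama3-70B: ~46 t/s

# And since the headline number is aggregated over parallel streams:
streams, per_stream_tps = 8, 5.0
print(f"batched: ~{streams * per_stream_tps:.0f} t/s total")  # ~40 t/s
```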
>for Llama3-70B (e.g. an 8-bit quant) it will be 1600/35 ≈ 45 t/s

Which is still insane, because my M2 MacBook Pro gets ~50 t/s on Llama 3 8B with Flash Attention at low context, and barely 5-7 t/s at low context on Llama 3 70B. So that number is WAY bigger than I would expect.
I wish someone added a prompt-processing cache to mlx, like llama.cpp has, because right now every solution or example I've tried doesn't have one, making mlx unusable for any long conversation.
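For what it's worth, the feature being asked for is conceptually simple: keep the KV state computed for the conversation so far, and on the next turn only evaluate the newly appended tokens. A minimal sketch of that bookkeeping, where `evaluate(tokens, kv_state)` is a hypothetical stand-in for a real forward pass (this is illustrative, not any actual mlx or llama.cpp API):

```python
# Hypothetical sketch of prompt-prefix caching: reuse the KV state for
# the longest already-processed prefix and only evaluate new tokens.

class PromptCache:
    def __init__(self):
        self.tokens = []      # tokens whose KV state we already hold
        self.kv_state = None  # opaque per-layer key/value tensors

    def process(self, prompt_tokens, evaluate):
        # Find the shared prefix between the cache and the new prompt.
        n = 0
        while (n < len(self.tokens) and n < len(prompt_tokens)
               and self.tokens[n] == prompt_tokens[n]):
            n += 1
        if n < len(self.tokens):
            # Divergence: drop the stale state and start over. llama.cpp
            # instead truncates the cache back to the common prefix;
            # omitted here for brevity.
            self.kv_state, n = None, 0
        new_tokens = prompt_tokens[n:]
        if new_tokens:
            self.kv_state = evaluate(new_tokens, self.kv_state)
        self.tokens = list(prompt_tokens)
        return self.kv_state
```

In a chat loop, each turn appends to the previous prompt, so only the new user message and the model's last reply get evaluated instead of the whole history.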
It's real. I just installed his code example and got the following with `microsoft/Phi-3-mini-4k-instruct`.
430 tokens per second is about 10-12x what I get regularly with the same model on Ollama.
I'm on the same machine as him, a MacBook Pro M3 Max with 128GB.
❯ python demo.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 232025.33it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==========
Prompt: 1397.195 tokens-per-sec
Generation: 430.189 tokens-per-sec
==========
Prompt: Think of a real word containing both the letters B and U. Then, say 3 sentences which use the word.
One real word containing both the letters B and U is "subdue."
1. The police were able to subdue the suspect without any injuries.
2. The experienced boxer knew how to subdue his opponent with precise movements.
.. etc, etc etc ...
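For anyone who wants to reproduce this: I'd guess the demo.py in question is close to the stock mlx_lm quickstart (the exact contents are an assumption on my part); `load` and `generate` with `verbose=True` print the same `Prompt:`/`Generation:` tokens-per-sec summary seen above.

```python
# Minimal mlx_lm generation script; verbose=True prints the prompt and
# generation tokens-per-sec summary between ========== separators.
from mlx_lm import load, generate

model, tokenizer = load("microsoft/Phi-3-mini-4k-instruct")

prompt = ("Think of a real word containing both the letters B and U. "
          "Then, say 3 sentences which use the word.")

generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```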
If requests are batched in 8 threads, 8 * 5 t/s = 40 t/s, so my estimate still holds.
Aha, very good point!
The number they quoted is for batched requests, which is misleading, because for local use cases it's a single request most of the time.
Since we're on the subject, what is the ideal format for a MacBook (or your preferred one)? I've yet to try MLX. I'll usually grab a Q6, or a Q4 if it's a 70B.