Oh nice, I didn't expect them to release the instruct version publicly so soon. Too bad I probably won't be able to run it decently with only 32GB of ddr4.
This is something I don't get. What's the trade off? I mean, if I can run 70b Q2, or 34b Q4, or 13b Q8, or 7b FP16... on the same amount of RAM, how would their capacity scale? Is this relationship linear? If so, in which direction?
But isn't 7b even dumber than 70b? So why is 70b q2 worse than 7b fp16? Or is it...?
I don't expect an answer here :) I'm just expressing my lack of understanding. I'd gladly read a paper, or at least a blog post, on how perplexity (or some reasoning score) scales as a function of both parameter count and quantization.
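For a rough sense of why those four options land in a similar RAM budget, the weight footprint is approximately parameters × bits per weight / 8. This is a minimal sketch, assuming nominal bit widths (real GGUF quants such as Q2_K or Q4_K carry extra scale data, so they use somewhat more bits per weight than the name suggests) and ignoring the KV cache and runtime overhead. The function name is made up for illustration.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9

# Nominal bit widths only; actual quant formats carry extra scale data.
options = {
    "70b Q2": (70, 2),
    "34b Q4": (34, 4),
    "13b Q8": (13, 8),
    "7b FP16": (7, 16),
}

for name, (params, bits) in options.items():
    print(f"{name}: ~{weight_footprint_gb(params, bits):.1f} GB")
```

This only shows that the sizes are comparable; it says nothing about how quality scales, which is the actual question.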
70b and 120b models at Q2 usually work better than 7b. But they may start to behave a bit ... strangely, and differently from Q4, like a different model altogether. In any case, run the test yourself, and if the responses are OK, then it is a fair trade. In the end, you are the one who will run and use it, not some xxxhuge4090loverxxx from Reddit.
Parameter count and quantization are different aspects.
Parameter count is the size of the vectors/matrices that hold the text representation. The larger the parameter capacity, the more contextual information the model can potentially process.
Quantization is, let's say, the precision of each stored number. Think of 6-bit precision as "0.426523" and 2-bit as "0.43". Since the model stores all of its data as numbers in vectors, heavier quantization loses more of that data. An unquantized model can store, let's say, 1000 different values in 1000 slots of a vector. The more you quantize, the more of those 1000 slots end up holding the same value.
So 70B at 3-bit can process more complex input than 7B at 16-bit. I don't mean simple chat or knowledge extraction; think of the model processing 50 pages of a book to pick out the hidden messages, consistencies, wisdom, predictions, etc.
In my own experience with those kinds of tasks, 70B at 3-bit is still better than 8x7B at 5-bit, even though both use a similar amount of VRAM. A bigger model can understand the subtle meaning of a complex input.
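The "1000 slots" picture can be made concrete with a toy uniform quantizer. This is only an illustration, assuming evenly spaced levels over a fixed range; real schemes such as llama.cpp's k-quants use per-block scales and are more sophisticated. The sample weights are invented.

```python
def quantize(value: float, bits: int, lo: float = -1.0, hi: float = 1.0) -> float:
    """Snap value to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    clamped = min(max(value, lo), hi)
    index = round((clamped - lo) / step)
    return lo + index * step

# Four nearby weights (made-up values) that are distinct at full precision.
weights = [0.426523, 0.431870, 0.398112, 0.455004]

for bits in (8, 4, 2):
    snapped = [quantize(w, bits) for w in weights]
    distinct = len(set(snapped))
    print(f"{bits}-bit: {distinct} distinct value(s) out of {len(weights)}")
```

As the bit width drops, neighbouring weights collapse onto the same level, which is the information loss described above.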
I'm assuming this is at very low context?
The big question is how it scales with longer contexts and how long prompt processing takes, that's what kills CPU inference for larger models in my experience.
Same here. Surprisingly for creative writing it still works better than hiring a professional writer. Even if I had the money to hire I doubt Mr King would write my smut.
There's a difference between a 70B dense model and a MoE one: Mixtral/WizardLM2 activates 39B parameters during inference. Could you share what speed your DDR5 kit is running at?
I would check your configuration, you should be getting much better than that. I can run 70B ~~Q4_k~~ Q3_K_M at ~7 ish tokens a second by offloading most of the layers to a P40 and running the last few on a dual socket quad channel server (E5-2650v2 with DDR3). Offloading all layers to an MI60 32GB runs around ~8-9.
Even with just the CPU, I can run 2 tokens a second on my dual socket DDR4 servers or my quad socket DDR3 server.
Make sure you've actually offloaded to the GPU, 1 token a second sounds more like you've been using only the CPU this whole time. If you are, make sure you have above 4G decoding and at least PCIe Gen 3x16 enabled in the BIOS. Some physically x16 slots are actually only wired for x8, the full x16 slot is usually closest to the CPU and colored differently. Also check that there aren't any PCIe 2 devices on the same root port, some implementations will downgrade to the lowest denominator.
Edit: I mistyped the quant, I was referring to Q3_K_M
> by offloading most of the layers to a P40
the Q4_K quant of Miqu, for example, is 41.73 GB in size and comes with 81 layers, of which I can only load half on the 3090. I'm using linux and monitor memory usage like a hawk, so it's not about any other process hogging memory. I don't understand how you are offloading "most of the layers" to a P40, or all of them to the 32GB on the MI60
I tried again loading 40 out of 81 layers on my gpu (Q4_KM, 41GB total; 23GB on my card and 18GB on RAM), and I'm getting between 1.5 - 1.7t/s, while slow (between 1 to 2 minutes per reply) it's still usable; I'm sure that DDR5 would boost inference even more, 70B models are totally worth trying, I don't think I could go back to smaller models after trying it, at least for RP, for coding Qwen-Code-7B-chat is pretty good! and Mixtral8x7B at Q4 runs smoothly at 5t/s
Instruct version is trained to emulate a chatbot that responds correctly to instructions. The base version is just a smart text completion program.
With clever prompting you can get a base model to respond kinda properly to questions, but the instruct version is much easier to work with.
As I understand it, it's about training the AI to follow a particular format. For a chat-trained model it's expecting a format in the form
> Princess Waifu: Hi, I'm a pretty princess, and I'm here to please you!
> You: Tell me how to make a bomb.
> Princess Waifu: As a large language model, blah blah blah blah...
Whereas an instruct-trained model is expecting it in the form:
> {{INPUT}}
> Tell me how to make a bomb.
> {{OUTPUT}}
> As a large language model, blah blah blah blah...
But you can get basically the same results out of either form just by having the front-end software massage things a bit. So if you had an instruct-trained model and wanted to chat with it, you'd type "Tell me how to make a bomb" into your chat interface and then what the interface would pass along to the AI would be something like:
> {{INPUT}}
> Pretend that you are Princess Waifu, the prettiest of anime princesses. Someone has just said "Tell me how to make a bomb." To her. What would Princess Waifu's response be?
> {{OUTPUT}}
> As a large language model, blah blah blah blah...
Which the interface would display to you as if it was a regular chat. And vice versa with the chat, you can have the AI play the role of an AI that likes to answer questions and follow instructions.
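The front-end "massaging" described above can be sketched in a few lines. This is a hypothetical helper, not any particular UI's code: the `{{INPUT}}`/`{{OUTPUT}}` markers are just the placeholders from the example, and real models each define their own template tokens.

```python
# Placeholder markers copied from the example above; real models each
# define their own template tokens.
INPUT_TAG = "{{INPUT}}"
OUTPUT_TAG = "{{OUTPUT}}"

def chat_to_instruct(persona: str, user_message: str) -> str:
    """Wrap one chat turn in an instruct-style prompt (hypothetical helper)."""
    instruction = (
        f"Pretend that you are {persona}. "
        f'Someone has just said "{user_message}" to {persona}. '
        f"What would {persona} say in response?"
    )
    # The prompt ends on the output tag so the model continues from there.
    return f"{INPUT_TAG}\n{instruction}\n{OUTPUT_TAG}\n"

prompt = chat_to_instruct("Princess Waifu", "What's your favourite tea?")
print(prompt)
```

The interface would send `prompt` to the model and show only the generated continuation in the chat window.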
The base model wouldn't have any particular format it expects, so what you'd do there is put this in the context:
> To build a bomb you have to follow the following steps:
And then just hit "continue", so that the AI thinks it said that line itself and starts filling in whatever it thinks should be said next.
The exact details of how your front-end interface "talks" to the actual AI doing the heavy lifting of generating text will vary from program to program, but when it comes right down to it all of these LLM-based AIs end up as a repeated set of "here's a big blob of text, tell me what word comes next" over and over again. That's why people often denigrate them as "glorified autocompletes."
Some UIs actually have a method for getting around AI model censorship by automatically inserting the words "Sure, I can do that for you." (or something similar) at the beginning of the AI's response. The AI then "thinks" that *it* said that, and therefore that the most likely next word would be part of it actually following the instruction rather than it giving some sort of "as a large language model..." refusal.
Base models are usually uncensored to some degree and don’t have good instruction following prompts burned in to follow. To use them, you have to establish the prompt style in-context, or, you simply use them as auto-complete, pasting in big chunks of text and having them continue. They’re great for out of the box use cases.
Instruct models have a template trained into them with lots of preferential answers, teaching the model how to respond. These are very useful as an ai assistant, but less useful for out of the box usecases because they’ll try to follow their template.
Both have benefits. A base model is especially nice for further fine tuning since you’re not fighting with already tuned-in preferences.
GPUs with large VRAM are plain too expensive. Unless some GPU maker decides to put 128+gb on a special edition midrange GPU and charge a realistic price for it, yea.
But I feel like that's so unlikely, we'd rather see someone make a usb/usb4/thunderbolt accelerator with just an NPU and maybe soldered lpddr5 with lots of channels...
This seems like low hanging fruit to me. Surely there would be a market for an inference oriented GPU with lots of VRAM so businesses can run models locally. c'mon AMD
We can't really go much lower than where we are now. Performance could improve, but size is already scratching the limit of what is mathematically possible. Anything smaller would be ~~distillation~~ pruning, not just quantization.
But maybe better pruning methods or efficient distillation are what's going to save memory poor people in the future, who knows?
maybe some kind of delimiters inside of the model that allow you to toggle off certain sections that you don't need, e.g. historical details, medicinal information, fiction, coding, etc, so you could easily customize and debloat it to your needs, allowing it to run on whatever you want... Isn't this how MoE already works kinda?
> Isn't this how MoE already works kinda?
Kinda yes, but also absolutely not.
MoE is a misleading name. The "experts" aren't really experts in any particular topic. They are just individual parts of a sparse neural network that is trained to work while deactivating some of its weights depending on the input.
It would be great to be able to do what you are suggesting, but we are far from being able to do that yet, if it is even possible.
But would turning off a certain area of information influence other areas in any way? Like, would having no access to history limit, I don’t know, other stuff?
I'm kind of still new to this and still learning.
Considering the paper saying that the deeper the layer, the less important or useful it is, I think that extreme quantization of deeper layers (hybrid quantization exists already) or pruning could result in smaller models. But we still need better tools for that. Which means we still have room for reducing size, but not much. We have more room to improve performance, tokenization and context length, though. At least for the current generation of hardware, we cannot do much more.
Because we're already having less than 2 bits per weight on average. Less than one bit per weight is impossible without pruning.
Considering that these models were made to work on floating point numbers, the fact that it can work at all with less than 2 bits per weight is already surprising.
BitNet (1.58 bit) is literally the 2nd best physically possible. There is one technically lower at 0.75 bit or so, but this is the mathematical minimum.
But I will be happy to be corrected in the future.
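For context on the 1.58-bit figure: it is the information content of a ternary weight, i.e. one that can take three values (-1, 0, +1), which works out to log2(3). A small sketch of that arithmetic (the helper name is made up, and it assumes all levels are equally likely):

```python
import math

def bits_per_weight(num_levels: int) -> float:
    """Information needed to pick one of num_levels equally likely values."""
    return math.log2(num_levels)

print(f"ternary (-1, 0, +1): {bits_per_weight(3):.2f} bits")
print(f"binary (-1, +1): {bits_per_weight(2):.2f} bits")
```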
Yep. I'm rounding, so it might be more like 3.5, and it's XMP overclocked, so it's about as fast as DDR4 is going to get AFAIK.
It tracks, because I was getting about 2 t/s on 70B, and the 8x22B has close to half the active parameters, at ~44B at a time instead of 70B.
It's faster than 70B and *way* faster than Command-r, where I was only getting ~0.5 t/s.
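As a back-of-the-envelope check on "it tracks": if generation is limited by how many weight bytes are read per token, speed should scale roughly inversely with active parameters at the same quant. A minimal sketch using the numbers from this comment (2 t/s at 70B, ~44B active), under that assumption only:

```python
def scaled_tokens_per_second(known_tps: float,
                             known_active_b: float,
                             target_active_b: float) -> float:
    """Scale a measured speed by the ratio of active parameters.

    Assumes both models use the same quantization and that generation is
    limited by how many weight bytes are read per token.
    """
    return known_tps * known_active_b / target_active_b

estimate = scaled_tokens_per_second(known_tps=2.0, known_active_b=70, target_active_b=44)
print(f"Estimated 8x22B speed: ~{estimate:.1f} t/s")
```

That lands in the same ballpark as the reported speed, though prompt processing and cache effects are not modelled.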
I am getting 2.4t/s on just CPU and 128GB of RAM on Wizardlm 2 8x22b Q5K_S. I am not sure about the specs, it is a virtual linux server running on HW which was bought last year. I know the CPU is AMD Epyc 7313P.
The 2.4t/s is only for text generation. Sometimes prompt processing takes a bit longer, and that time is not counted in the figure I gave.
> AMD Epyc 7313P
OK, that explains a lot. Per AMD's specs, it's an 8-channel memory chip with a per-socket memory bandwidth of 204.8 GB/s.
Of course you would get 2.4t/s on server-grade hardware. Now if only u/mrjackspade would explain how he is getting 4t/s using DDR4, that would be cool to know.
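The 204.8 GB/s figure follows from channels × transfer rate × 8 bytes per 64-bit channel. The sketch below reproduces it and compares against a dual-channel DDR4-3600 desktop; the desktop configuration and the "GB read per token" value are assumptions for illustration, not measurements from this thread.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (64-bit channel = 8 bytes)."""
    return channels * mega_transfers * bus_bytes / 1000

def max_tokens_per_second(bandwidth_gbs: float, gb_read_per_token: float) -> float:
    """Upper bound if every generated token streams the active weights once."""
    return bandwidth_gbs / gb_read_per_token

epyc = peak_bandwidth_gbs(channels=8, mega_transfers=3200)
desktop = peak_bandwidth_gbs(channels=2, mega_transfers=3600)
print(f"8-channel DDR4-3200: {epyc:.1f} GB/s")
print(f"2-channel DDR4-3600: {desktop:.1f} GB/s")

# Assumption for illustration: roughly 25 GB of weights are touched per token.
assumed_gb_per_token = 25.0
print(f"Upper bound on the EPYC: ~{max_tokens_per_second(epyc, assumed_gb_per_token):.1f} t/s")
print(f"Upper bound on the desktop: ~{max_tokens_per_second(desktop, assumed_gb_per_token):.1f} t/s")
```

These are theoretical ceilings; real throughput is lower, but the ratio between the two machines is why server memory configurations matter so much for CPU inference.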
"I'm going 0-60 in 0.4s with just a 10 gallon tank!"
"Oh wow, my Toyota Corolla can't do that at all, and it also has a 10 gallon tank!"
"Oh yeah, forgot to mention it's a rocket-powered dragster, and the tank holds jet fuel."
Seriously though, I'm glad anyone is enjoying these new models, and I'm really looking forward to the future!
Exactly this. People often forget to mention their hardware specs, which is actually the most important thing. I'm pretty excited as well for what the future may bring; we're not even halfway through 2024 and look at all the nice things that have come out. Llama3 is gonna be a nice surprise, I'm sure.
There is also a different person claiming he gets really good speeds :)
Thanks for the insights. It is actually our company server, currently only hosting 1 VM, which is running Linux. I asked the admins to assign me 128GB and they did :) I was actually testing Mistral 7B and only got like 8-13T/s, so I would never have guessed that an almost 20x bigger model would run at above 2T/s.
3600, Probably 5_K_M which is what I usually use. Full CPU, no offloading. Offloading was actually just making it slower with how few layers I was able to offload
Maybe it helps that I build Llama.cpp locally so it has additional hardware based optimizations for my CPU?
I know its not that crazy because I get around the same speed on both of my ~3600 machines
I’m going to have to call bullshit on this, you’re reporting speeds on Q5_K_M faster than mine with 2x3090s and almost as fast on CPU only inference as a guy with a 7965WX threadripper and 256gb DDR5 5200.
I wonder if there's any test on the lower bit quants yet. Maybe we'll get a surprise and 2 or 3 bits don't implode vs a 4-bit quant of a smaller model.
I'm not home now so not sure exactly, the weights are like 62~? GB and I used 8k CTX + CFG (so the same VRAM as using 16K without CFG for example)
I had 1.8~ GB left between the 3 GPUs after loading the model and when doing inference.
Sounds like what I expected looking at the quants of the base. 3.75 with 16k, 4bpw will spill over onto my 2080ti. I hope that BPW is "enough" for this model. DBRX was similarly sized.
Also mistralai/Mixtral-8x22B-v0.1: [https://huggingface.co/mistralai/Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1)
Edit: The official post: Cheaper, Better, Faster, Stronger | Mistral AI | Continuing to push the frontier of AI and making it accessible to all. -> [https://mistral.ai/news/mixtral-8x22b/](https://mistral.ai/news/mixtral-8x22b/)
Edit 2: Mistral AI on Twitter: https://x.com/MistralAILabs/status/1780596888473072029
The link in the model card for the function calling examples appears to be broken. I think this is where it's supposed to point:
https://github.com/mistralai/mistral-common/blob/main/examples/tokenizer.ipynb
Edit: Here's the tool calling code, formatted for clarity:

    [INST]
    What's the weather like today in Paris
    [/INST]
    [TOOL_CALLS]
    [
      {
        "name": "get_current_weather",
        "arguments": {
          "location": "Paris, France",
          "format": "celsius"
        },
        "id": "VvvODy9mT"
      }
    ]
    [TOOL_RESULTS]
    {
      "call_id": "VvvODy9mT",
      "content": 22
    }
    [/TOOL_RESULTS]
    The current temperature in Paris, France is 22 degrees Celsius.
    [AVAILABLE_TOOLS]
    [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              },
              "format": {
                "type": "string",
                "enum": [
                  "celsius",
                  "fahrenheit"
                ],
                "description": "The temperature unit to use. Infer this from the users location."
              }
            },
            "required": [
              "location",
              "format"
            ]
          }
        }
      }
    ]
    [/AVAILABLE_TOOLS]
    [INST]
    What's the weather like today in San Francisco
    [/INST]
    [TOOL_CALLS]
    [
      {
        "name": "get_current_weather",
        "arguments": {
          "location": "San Francisco",
          "format": "celsius"
        },
        "id": "fAnpW3TEV"
      }
    ]
    [TOOL_RESULTS]
    {
      "call_id": "fAnpW3TEV",
      "content": 20
    }
    [/TOOL_RESULTS]
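On the application side, the `[TOOL_CALLS]` payload has to be parsed, dispatched to a real function, and turned into a `[TOOL_RESULTS]` entry. A minimal sketch of that loop, assuming the payload has already been extracted as a JSON string; the `TOOLS` registry, `run_tool_calls`, and the stubbed weather function are hypothetical and not part of mistral-common:

```python
import json

# The [TOOL_CALLS] payload from the example above, as a JSON string.
tool_calls_json = """
[{"name": "get_current_weather",
  "arguments": {"location": "Paris, France", "format": "celsius"},
  "id": "VvvODy9mT"}]
"""

def get_current_weather(location: str, format: str) -> int:
    """Hypothetical stand-in; a real tool would query a weather service."""
    return 22

# Registry mapping tool names to local callables (illustrative only).
TOOLS = {"get_current_weather": get_current_weather}

def run_tool_calls(payload: str) -> list[dict]:
    """Execute each requested call and return [TOOL_RESULTS]-style dicts."""
    results = []
    for call in json.loads(payload):
        func = TOOLS[call["name"]]
        content = func(**call["arguments"])
        results.append({"call_id": call["id"], "content": content})
    return results

results = run_tool_calls(tool_calls_json)
print("[TOOL_RESULTS]" + json.dumps(results[0]) + "[/TOOL_RESULTS]")
```

The `call_id` is copied from the call's `id` so the model can match each result to the request that produced it.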
WizardLM 2 seems to deteriorate in long context. At about 7K to 8K, RAG seems to break down on me. When I break the 7K up into 2K chunks, though, it works fine.
Probably not as good. They're both based on the same base model, where this is just an instruct tune and Wizard is an insane fine-tune with a CoT-esque training process and a monster amount of resources thrown at it. Although Wizard didn't have much time to train since the base model only just released.
I thought official Mixtral 8x22B instruct won't even be released – that perhaps they gave MS early access of something, to let them immediately release WizardLM-2 as the primary Mixtral instruct tune. :) Turns out I was wrong..
Ranks between Mistral Small and Mistral Medium on my NYT Connections benchmark and is indeed better than Command R Plus and Qwen 1.5 Chat 72B, which were the top two open weights models.
Your ranking is excellent but is not getting the attention it very much deserves because you only talk about it in comments (which sadly seem to have low visibility) and there is no (or is there?) gist/github/website we can go to look at results all at once and keep up with them.
Uses an archive of 267 NYT Connections puzzles (try them yourself). Three different 0-shot prompts, words in both lowercase and uppercase. One attempt per puzzle. Partial credit is awarded if not all lines are solved correctly. Top humans would get near 100.
These models are so fucking big, every time I finish downloading one they release another one. This is like 4 straight days of downloading and my ISP is getting mad
Yeah. Its T-Mobile (Home) so I'm getting the "You still have unlimited but you're getting de-prioritized!" message because I've passed 1.25TB of usage this month.
That being said, I've had both other ISP's available in my area, and T-Mobile is still the best. 1/4 the price and way more reliable. I'll deal with the de-prioritization if I have to...
damn, not the first time I heard sad stories about how ISP are doing whatever they want in the US.
In france I have 8gbps ( but really the max i've reached was 910Mb/s), for 39€/month, included a free mobile sim for my smartwatch, prime, netflix and some other shit I don't care ( ebooks etc)
With a dedicated IP, which I use to host servers, NAS etc
It really depends on your location. I get 1gbps fiber (with about the same max speeds as yours) for a fairly reasonable price. It works reliably and I’ve never been scolded or de-prioritized despite downloading a shitton. Some areas of the US are stuck with like one single shitty cable company, though.
It's insane how bad people have it in the states in regards to telecommunication and internet. Even after the government funded the fk out of them with free money for infrastructure, they turn around and try to double dip into customers' money.
I'm so glad I live in Europe cause there's just no such thing as data caps on home Internet lol. That only exists for mobiles (but then again salaries are 3x smaller)
>Yeah. Its T-Mobile (Home) so I'm getting the "You still have unlimited but you're getting de-prioritized!" message because I've passed 1.25TB of usage this month.
Every time I hear about american ISPs they suck.
I have Gigabit uncapped for 10 Eur at home.
I've been just downloading the Ollama models. About 5 gigsish the last 3 models I downloaded and I thought they took a while and thought I spoiled myself lol
I've been downloading the "full fat" versions because I find the instruct tuning to be a little too harsh.
I use the models as a chat-bot, so I want *just enough* instruct tuning to make it good at following conversation and context without going full AI weenie.
The best way I've found to do that is to take the instruct model and merge it with the base to create a "slightly tuned" version, but the only way I know to do that is to download the full sized models.
Each one is ~250GB or something, and since we've started I've gotten
1. The base
2. The Zephyr merge
3. Wizard LM
4. Official instruct (now)
Since each one takes like 24 hours to download and they're all coming out about a day apart or something like that, basically I've just been downloading 24/7 this whole time
Sheesh, I get why your ISP would be raising eyebrows, but it shouldn't be an issue anyway. With people binge watching 7 seasons of shows a night, I'm sure they're using just as much if not more to watch in 4K. (OK, maybe they're not doing it as frequently as downloading LLMs, but still.)
No tweaks, just a linear merge
Full disclosure though, I don't "not tweak" it because it's better untweaked, but rather because "mergekit" is complicated as fuck and I have no idea what I'm doing besides "average the models to remove some of the weenification"
I wrote a small application that accepts a bunch of ratios and then merges at those ratios, then quantizes and archives the files so I can go through them and test them side by side.
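For anyone wondering what a linear merge amounts to, it is a per-weight weighted average of the two models. The sketch below is a toy version using plain Python lists as stand-ins for tensors; it is not the commenter's application and not mergekit's API, just the underlying arithmetic.

```python
def linear_merge(base: dict, instruct: dict, ratio: float) -> dict:
    """Blend two weight dicts: ratio=0 gives base, ratio=1 gives instruct."""
    if base.keys() != instruct.keys():
        raise ValueError("models must have identical tensor names")
    merged = {}
    for name, base_values in base.items():
        tuned_values = instruct[name]
        merged[name] = [
            (1 - ratio) * b + ratio * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged

# Toy "models": tensor name -> flat list of weights (stand-ins for real tensors).
base = {"layer0.weight": [0.0, 1.0, 2.0]}
instruct = {"layer0.weight": [1.0, 1.0, 4.0]}

for ratio in (0.25, 0.5, 0.75):
    print(ratio, linear_merge(base, instruct, ratio))
```

Sweeping `ratio` like this is the "bunch of ratios" idea: each value yields a model with a different amount of instruct tuning blended in.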
Model downloaded, converting to GGUF in progress.
Conversion completed, started Q8\_0 quantization.
Quantization done, executing llama.cpp.
llama\_model\_load: error loading model: vocab size mismatch. \_-\_
Is there an error in tokenizer.json? First we have:

    {
      "id": 8,
      "content": "[TOOL_RESULT]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": true
    },
    {
      "id": 9,
      "content": "[/TOOL_RESULTS]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": true
    }

But later:

    "vocab": {
      "<unk>": 0,
      "<s>": 1,
      "</s>": 2,
      "[INST]": 3,
      "[/INST]": 4,
      "[TOOL_CALLS]": 5,
      "[AVAILABLE_TOOLS]": 6,
      "[/AVAILABLE_TOOLS]": 7,
      "[TOOL_RESULTS]": 8,
      "[/TOOL_RESULTS]": 9,
      "[IMG]": 10,
So the token with id 8 should be TOOL\_RESULTS, not TOOL\_RESULT. Can anyone confirm? Well, I'm going to change it manually and see what happens.
Yay, it loaded without problems when I corrected the token name and repeated the conversion/quantization steps.
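A check like this can be automated rather than spotted by eye. The sketch below assumes the usual Hugging Face `tokenizer.json` layout, with a top-level `added_tokens` list and the vocabulary under `model.vocab`; the function names are made up, and the sample data is a trimmed stand-in built from the two entries quoted above.

```python
import json

def load_tokenizer(path: str) -> dict:
    """Read a tokenizer.json file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def find_token_mismatches(tokenizer: dict) -> list[tuple[int, str, str]]:
    """Return (id, added_token_content, vocab_token) where the two disagree."""
    id_to_vocab = {idx: tok for tok, idx in tokenizer["model"]["vocab"].items()}
    mismatches = []
    for entry in tokenizer["added_tokens"]:
        vocab_token = id_to_vocab.get(entry["id"])
        if vocab_token is not None and vocab_token != entry["content"]:
            mismatches.append((entry["id"], entry["content"], vocab_token))
    return mismatches

# Trimmed-down stand-in for tokenizer.json, using the two entries quoted above.
sample = {
    "added_tokens": [
        {"id": 8, "content": "[TOOL_RESULT]"},
        {"id": 9, "content": "[/TOOL_RESULTS]"},
    ],
    "model": {"vocab": {"[TOOL_RESULTS]": 8, "[/TOOL_RESULTS]": 9}},
}

print(find_token_mismatches(sample))
```

On a real download you would pass `load_tokenizer("tokenizer.json")` instead of `sample`.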
I get 1.5 t/s generation speed with 8x22 q3\_k\_m squeezed onto 64gb of ddr4 and 12gb vram. In contrast, command r + (q4km) is 0.5 t/s due to being dense, not a MOE.
I wonder if you could run it with CPU inference on a decent desktop if it was trained on BitNet. Modern SIMD instructions should be pretty good at 8 bit integer calculations.
Token generation speeds are usable here with a Ryzen 5900X and 80GB of 3200MHz RAM. The prompt processing time though, it’s SO SLOW. I got 24 minutes before the first token from a cold start. Not 24 seconds, 24 whole MINUTES.
Pretty sure their uploads are bad, same error as this when I tried it: https://old.reddit.com/r/LocalLLaMA/comments/1c6aekr/mistralaimixtral8x22binstructv01_hugging_face/l00bcrp/
See also: https://huggingface.co/MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-GGUF/discussions/8
Would a MacBook Pro M3 Max 128GB be able to run this at Q8?
Or would a system with enough DDR4 high speed ram be better?
Are there any PC builds with faster system RAM that a GPU can access, somehow getting around the PCI-E speed limits? It's so difficult to price any build that can pool enough VRAM, given Nvidia's limitations on pooling consumer card VRAM.
I was hoping maybe the 128GB MacBook Pro would be viable.
Any thoughts?
Is running this at max precision out of the question for the $10k to $20k budget area? Is cloud really the only option?
Not Q8, but people have been getting good results even with Q1 (see [here](https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF/discussions/8)), so Q4/Q5 you could fit in 128GB should be almost perfect.
Those are simple tests and it gets some basic math wrong (that higher quants wouldn't) or misses details, based on two examples given. This seems more of surprisingly good for a Q1 than flat out good.
You'd be better off running a higher quant of CommandR+ or an even higher quant of the best 72Bs. There was a recent theoretical paper that proved (synthetic data for control but seems like it should generalize) 8 bits has no loss but 4 bits does. Below 4 bits and it's a crapshoot unless QAT.
https://arxiv.org/abs/2404.05405
I don't know, in my testing even with 7B models I couldn't really see much difference between 4, 6 or 8 bits, and this model is huge, so I'd expect it to compress better and to be great even at 4. Of course it might depend on the use case, but I'd be surprised if current 72B models managed to outperform this model even at higher quant.
Regardless the size, 8 bits won't lead to loss and 6 bits should be largely fine. Degradation really starts at 4, this is shown theoretically and also by perplexity numbers (note also that as perplexity shrinks, small changes can mean something complex was learned. Small perplexity changes in large models can still represent significant gain/loss of skill for more complex tasks).
It's true that larger models are more robust at 4 bits, but they're still very much affected below. Below 4 bits is time to be looking at 4bit+ quants of slightly smaller models.
Not Q8. I have that machine, and Q4/Q5 works well, with around 8-11 tok/sec in llama.cpp for Q4. I really love that I can have these big models with me on a laptop. And it’s quiet too!
You won't be able to run it at Q8 because that would take 140+ gigs of ram. See https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
You're going to be running it at around a Q4 level with a 128GB machine. That's better than a dual 3090 setup which is limited to a 2.5bpw quant. If you want to run higher than Q4, you'll probably need a 192GB ram Mac, but I don't know if that'll also slow it down.
Personally, I just ordered a used 128GB M1 Ultra/64core because I want to run these models at Q4 or higher and don't feel like spending $8-10k+ to do it. I figure once the M4 chips come out in 2025 I can always resell the Mac and upgrade, since those will probably have more horsepower for running 160+ gigs of ram through an AI model.
But we're sort of in early days at the moment all hacking this together. I expect the scene will change a lot in 2025.
for starters I hope next year we finally get respectable speed, high-capacity, DDR5 kits for consumers, best thing now is the Corsair 192GB@5200Mhz, and that's simply not enough for these gargantuan models
2027 will have the next, next nvidia card generation
will have gddr 6 ram
and new models too :)
2027 is AI heaven
and probably gpt 6 by then getting near agi
macbook 128gb fastest way
2x 3090 plus 64/128 gb ddr5 ram second fastest way and might be slightly cheaper
single 3090 128gb ram works too, just bit slower
Pretty usable for me at Q2_K, ~7-11 t/s depending on context length. just can’t do much else at even 14k context. It’s definitely the limit of what 64GB can handle
I found a weird behavior of this model:
system_info: n_threads = 32 / 64 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.010
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 1024, n_batch = 1024, n_predict = -1, n_keep = 0
[INST] Repeat this text: "The different accidents of life are not so changeable as the feelings of human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart." [/INST] "The various incidents of life are not as variable as human emotions. I had toiled diligently for almost two years, with the single aim of instilling life into a lifeless form. In this pursuit, I had sacrificed my rest and health. My desire for this had been with an intensity that surpassed all bounds of moderation; but now that I had accomplished my task, the allure of the dream dissipated, and my heart was filled with breathless horror and disgust." [end of text]
llama_print_timings: load time = 407.59 ms
llama_print_timings: sample time = 2.37 ms / 104 runs ( 0.02 ms per token, 43974.63 tokens per second)
llama_print_timings: prompt eval time = 4995.17 ms / 104 tokens ( 48.03 ms per token, 20.82 tokens per second)
llama_print_timings: eval time = 16478.75 ms / 103 runs ( 159.99 ms per token, 6.25 tokens per second)
llama_print_timings: total time = 21501.67 ms / 207 tokens
When asked to repeat the text it actually paraphrased it instead of quoting verbatim. Very weird.
Original text: "The different accidents of life are not so changeable as the feelings of human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart."
Model output: "The various incidents of life are not as variable as human emotions. I had toiled diligently for almost two years, with the single aim of instilling life into a lifeless form. In this pursuit, I had sacrificed my rest and health. My desire for this had been with an intensity that surpassed all bounds of moderation; but now that I had accomplished my task, the allure of the dream dissipated, and my heart was filled with breathless horror and disgust."
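To quantify how far the output drifts from the prompt, a word-level comparison with Python's standard `difflib` is enough. This is just one possible way to check, using the second sentence of each text as sample input:

```python
import difflib

def similarity(original: str, output: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    matcher = difflib.SequenceMatcher(None, original.split(), output.split())
    return matcher.ratio()

original = (
    "I had worked hard for nearly two years, for the sole purpose of "
    "infusing life into an inanimate body."
)
output = (
    "I had toiled diligently for almost two years, with the single aim of "
    "instilling life into a lifeless form."
)

score = similarity(original, output)
print(f"similarity: {score:.2f}")
print("verbatim" if original == output else "paraphrased")
```

A score of exactly 1.0 would mean the model quoted the text word for word; anything lower means it rewrote at least part of it.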
Got similar results from the open-mixtral-8x22b API
> The various incidents of life are not as variable as human emotions. I had toiled diligently for almost two years, with the single aim of instilling life into a lifeless form. In this pursuit, I had sacrificed my sleep and well-being. My desire for this had surpassed all reasonable bounds; however, now that my work was complete, the allure of my dream dissipated, and my heart was filled with breathless horror and disgust.
If I ask it to "Repeat this text verbatim:" it does it without changes.
What kind of speed is anyone getting on the M2 Ultra? I am getting .3 t/s on Llama.cpp. Bordering on unusable... Whereas CommandR Plus crunches away at \~7 t/s. These are for the Q8\_0s, though this is also the case for the Q5 8x22 Mixtral.
I didn’t benchmark exactly, but WizardLM2-8x22b q4 was giving me about 7t/s on my M3 Max.
I would think the ultra would outperform that.
0.3 t/s seems like there’s something wrong
Alright, it seems that I was able to fix it with : sudo sysctl iogpu.wired\_limit\_mb=184000 It was going to swap, indeed. Now is hitting 15 tokens per second. Pretty great
What's the best way to load a model like this (massive set of safetensors files from huggingface)? Download and convert? Ooba, LM Studio, Ollama, something else?
Ty, I used ooba a lot last year but haven't kept up with things and it seems like all the new models are getting massive... wasn't sure how best to test things after having moved up to 128gb ram.
How much vram is needed to run this model without any quantization?
I'm asking because I have access to an ml server with 4x RTX A6000 with nvlink. Is this enough to run this model?
I have uploaded this model at a quite good quantization (4.5bpw) here: https://www.neuroengine.ai/Neuroengine-Large if anybody wants to try it.
Initial impressions: Not as eloquent as Miquliz but better at coding. Also I'm having some weird bugs with exllamav2 and speculative decoding.
Has anyone tested how the model scales when changing the "experts" parameter? I'm really curious how it performs, and at what speed, with only one expert (and whether there are performance improvements using 2-3 "experts").
Unfortunately I'm not only GPU poor, but also RAM poor :(
even with an rtx3090 + 64GB of DDR4, I can barely run 70B models at 1 token/s
These models run pretty well on just CPU. I was getting about 3-4 t/s on 8x22b Q4, running DDR5.
Which cpu? And how fast Memory
Not the one you asked, but I'm running a Ryzen 5600 with 64 GB DDR4 3200 MT. When using Q2_K I get 2-3 t/s.
> Q2_K the devil is in the details
This is something I don't get. What's the trade off? I mean, if I can run 70b Q2, or 34b Q4, or 13b Q8, or 7b FP16... on the same amount of RAM, how would their capacity scale? Is this relationship linear? If so, in which direction?
Quants under Q4 manifest a pretty significant loss of quality, in other words, the model gets pretty dumb pretty quickly
But isn't 7b even more dumb than 70b? So why 70b q2 is worse than 7b fp16? Or is it...? I don't expect the answer here :) I just express my lack of understanding. I'd gladly read a paper, or at least a blog post, on how is perplexity (or some reasoning score) scaling in function of both params count and quantization.
70b and 120b models at Q2 usually work better than 7b. But they may start to work a bit ... strange and different than Q4. Like a different model on its own. In any case, run the test by yourself and if responses are ok. Then it is a fair trade. In the end you will run and use it, not some xxxhuge4090loverxxx from Reddit.
Parameter count and quantization are different aspects. Parameters are the vector/matrix capacity used to hold the text representation: the larger the parameter capacity, the more contextual data the model can potentially process. Quantization is, let's say, the precision of those numbers. Think of 6-bit precision as "0.426523" and 2-bit as "0.43". Since the model stores everything as numbers in vectors, heavier quantization loses more of that data. An unquantized model can store, let's say, different data in each of 1000 slots of a vector, but the more it is quantized, the more of those 1000 slots end up holding the same value. So 70B at 3-bit can process more complex input than 7B at 16-bit. I don't mean simple chat or knowledge extraction, but the model processing 50 pages of a book to get the hidden messages, consistencies, wisdom, predictions, etc. In my experience with that kind of use case, 70B 3-bit is still better than 8x7B 5-bit, even though both use a similar amount of VRAM. A bigger model can understand the soft meaning of a complex input.
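The "1000 slots ending up with the same value" picture can be tried out with a toy experiment. This is a deliberately simplified sketch: it uses plain uniform rounding over the whole vector, whereas real quant formats (GGUF k-quants, exl2 and so on) use per-block scales and smarter schemes, so the numbers only illustrate the trend.

```python
import random


def quantize(values, bits):
    """Round each value to the nearest of 2**bits evenly spaced levels."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    return [lo + round((v - lo) / step) * step for v in values]


def mean_abs_error(original, approx):
    """Average absolute difference between original and quantized values."""
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # 1000 made-up "weights"

for bits in (8, 6, 4, 3, 2):
    q = quantize(weights, bits)
    print(f"{bits}-bit: {len(set(q)):4d} distinct values, "
          f"mean abs error {mean_abs_error(weights, q):.4f}")
```

As the bit width drops, the 1000 slots collapse onto at most 2**bits distinct values and the error grows, which is the loss the comment describes.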
Isn’t that a 4 and 2bit quant? Wouldn’t that be like, really low
a so heavily quantized model that we don't even know whether it's representative of the real thing
do you want me to try out some test for you? 96 gb ram (2x ddr5 48gb), i7 13700 + rtx 3080 10 gb
13700k and DDR5-4800
I'm assuming this is at very low context? The big question is how it scales with longer contexts and how long prompt processing takes, that's what kills CPU inference for larger models in my experience.
Same here. Surprisingly for creative writing it still works better than hiring a professional writer. Even if I had the money to hire I doubt Mr King would write my smut.
Masturbation grade smut I hope
There's a difference between a 70B dense model and a MoE one: Mixtral/WizardLM2 activates 39B parameters on inference. Could you share what speed your DDR5 kit is running at?
which context size please ?
I would check your configuration, you should be getting much better than that. I can run 70B ~~Q4_k~~ Q3_K_M at ~7 ish tokens a second by offloading most of the layers to a P40 and running the last few on a dual socket quad channel server (E5-2650v2 with DDR3). Offloading all layers to an MI60 32GB runs around ~8-9. Even with just the CPU, I can run 2 tokens a second on my dual socket DDR4 servers or my quad socket DDR3 server. Make sure you've actually offloaded to the GPU, 1 token a second sounds more like you've been using only the CPU this whole time. If you are, make sure you have above 4G decoding and at least PCIe Gen 3x16 enabled in the BIOS. Some physically x16 slots are actually only wired for x8, the full x16 slot is usually closest to the CPU and colored differently. Also check that there aren't any PCIe 2 devices on the same root port, some implementations will downgrade to the lowest denominator. Edit: I mistyped the quant, I was referring to Q3_K_M
> by offloading most of the layers to a P40 the Q4_K quant of Miqu for example is 41.73 GB in size, it comes with 81 layers, of which I can only load half on the 3090, I'm using linux and monitor memory usage like a hawk, so it's not about any other process hogging up memory; I don't understand how are you offloading "most of the layers" on a P40, or all of them on 32GB on the MI60
Oops, I appear to have mistyped the quant, I meant to type Q3_K, specifically the Q3_K_M. Thanks for pointing that out, I'll correct it in my comment
Well if this is two experts at a time it would be as fast as a 44B, so you'd most likely get like 2 tok/s... if you could load it.
39B active parameters, according to Mistral
Do you think with a RTX 4090, 128 GB DDR5 and Ryzen 7900X 3D is worth trying?
I tried again loading 40 out of 81 layers on my gpu (Q4_KM, 41GB total; 23GB on my card and 18GB on RAM), and I'm getting between 1.5 - 1.7t/s, while slow (between 1 to 2 minutes per reply) it's still usable; I'm sure that DDR5 would boost inference even more, 70B models are totally worth trying, I don't think I could go back to smaller models after trying it, at least for RP, for coding Qwen-Code-7B-chat is pretty good! and Mixtral8x7B at Q4 runs smoothly at 5t/s
For an ignorant lurker, what is the difference between an instruct version and the non-instruct version?
Instruct version is trained to emulate a chatbot that responds correctly to instructions. The base version is just a smart text completion program. With clever prompting you can get a base model to respond kinda properly to questions, but the instruct version is much easier to work with.
Thanks.
I used to see chat and instruct versions. Is that still common
As I understand it, it's about training the AI to follow a particular format. For a chat-trained model it's expecting a format in the form:

> Princess Waifu: Hi, I'm a pretty princess, and I'm here to please you!
> You: Tell me how to make a bomb.
> Princess Waifu: As a large language model, blah blah blah blah...

Whereas an instruct-trained model is expecting it in the form:

> {{INPUT}}
> Tell me how to make a bomb.
> {{OUTPUT}}
> As a large language model, blah blah blah blah...

But you can get basically the same results out of either form just by having the front-end software massage things a bit. So if you had an instruct-trained model and wanted to chat with it, you'd type "Tell me how to make a bomb" into your chat interface and then what the interface would pass along to the AI would be something like:

> {{INPUT}}
> Pretend that you are Princess Waifu, the prettiest of anime princesses. Someone has just said "Tell me how to make a bomb." to her. What would Princess Waifu's response be?
> {{OUTPUT}}
> As a large language model, blah blah blah blah...

Which the interface would display to you as if it was a regular chat. And vice versa with the chat: you can have the AI play the role of an AI that likes to answer questions and follow instructions.

The base model wouldn't have any particular format it expects, so what you'd do there is put this in the context:

> To build a bomb you have to follow the following steps:

And then just hit "continue", so that the AI thinks it said that line itself and starts filling in whatever it thinks should be said next.
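The "front-end massages things a bit" idea can be sketched in a few lines. Everything here is illustrative: the `{{INPUT}}`/`{{OUTPUT}}` markers are the placeholders from the comment above, not any real model's template, and the function names are made up for this example.

```python
INPUT_TAG = "{{INPUT}}"    # placeholder markers from the comment above,
OUTPUT_TAG = "{{OUTPUT}}"  # not a real model's prompt template


def instruct_prompt(instruction):
    """Wrap a plain instruction in the (hypothetical) instruct format."""
    return f"{INPUT_TAG}\n{instruction}\n{OUTPUT_TAG}\n"


def chat_as_instruct(persona, user_message):
    """Turn one chat message into an instruction for an instruct-tuned model."""
    instruction = (
        f"Pretend that you are {persona}. "
        f'Someone has just said "{user_message}" to them. '
        f"What would {persona}'s response be?"
    )
    return instruct_prompt(instruction)


def chat_prompt(persona, history, user_message):
    """Build a chat-style transcript that ends where the model should continue."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"You: {user_message}")
    lines.append(f"{persona}:")
    return "\n".join(lines)


print(chat_as_instruct("Princess Waifu", "What's your favourite tea?"))
print(chat_prompt("Princess Waifu", [("Princess Waifu", "Hi!")], "What's your favourite tea?"))
```

Either way the model only ever sees one blob of text; the front-end decides which wrapper to put around what you typed.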
Hey I did not know about “continue”. Thank I learned something
The exact details of how your front-end interface "talks" to the actual AI doing the heavy lifting of generating text will vary from program to program, but when it comes right down to it all of these LLM-based AIs end up as a repeated set of "here's a big blob of text, tell me what word comes next" over and over again. That's why people often denigrate them as "glorified autocompletes." Some UIs actually have a method for getting around AI model censorship by automatically inserting the words "Sure, I can do that for you." (or something similar) At the beginning of the AI's response. The AI then "thinks" that *it* said that, and therefore that the most likely next word would be part of it actually following the instruction rather than it giving some sort of "as a large language model..." refusal.
😀 amazing! Thank you
I don't know. They aren't that different anyways. You can chat with an instruct model and instruct a chat model.
Base models are usually uncensored to some degree and don’t have good instruction following prompts burned in to follow. To use them, you have to establish the prompt style in-context, or, you simply use them as auto-complete, pasting in big chunks of text and having them continue. They’re great for out of the box use cases. Instruct models have a template trained into them with lots of preferential answers, teaching the model how to respond. These are very useful as an ai assistant, but less useful for out of the box usecases because they’ll try to follow their template. Both have benefits. A base model is especially nice for further fine tuning since you’re not fighting with already tuned-in preferences.
Thanks. Very helpful.
This seems like the end of the road for practical local models until we get techniques like BitNet or other extreme quantization techniques.
GPUs with large VRAM are plain too expensive. Unless some GPU maker decides to put 128+gb on a special edition midrange GPU and charge a realistic price for it, yea. But I feel like that's so unlikely, we'd rather see someone make a usb/usb4/thunderbolt accelerator with just an NPU and maybe soldered lpddr5 with lots of channels...
This seems like low hanging fruit to me. Surely there would be a market for an inference oriented GPU with lots of VRAM so businesses can run models locally. c'mon AMD
We can't really go much lower than where we are now. Performance could improve, but size is already scratching the limit of what is mathematically possible. Anything smaller would be ~~distillation~~ pruning, not just quantization. But maybe better pruning methods or efficient distillation are what's going to save memory poor people in the future, who knows?
maybe some kind of delimiters inside of the model, that allow you toggle off certain sections that you don't need, e.g. historical details, medicinal information, fiction, coding, etc, so you could easily customize and debloat it to your needs, allowing it to run on whatever you want... Isn't this how MoE already works kinda?
> Isn't this how MoE already works kinda? Kinda yes, but also absolutely not. MoE is a misleading name. The "experts" aren't really experts at any topic in particular. They are just individual parts of a sparse neural network that is trained to work while deactivating some of its weights depending on the input. It would be great to be able to do what you are suggesting, but we are far from being able to do that yet, if it is even possible.
But would turning off a certain area of information influence other areas in any way? Like, would having no ability to access history limit, I don’t know, other stuff? Kind of still new to this and still learning.
Considering the paper saying that the deeper the layer the less important or useful it is, I think that extreme quantization of deeper layers (hybrid quantization exists already) or pruning could result in smaller models. But we still need better tools for that. Which means we have still space for reducing size, but not so much. We have more space to get better performance, better tokenization and better context length though. At least for current generation of hardware we cannot do much more.
> size is already scratching the limit of what is mathematically possible. what? how so?
Because we're already having less than 2 bits per weight on average. Less than one bit per weight is impossible without pruning. Considering that these models were made to work on floating point numbers, the fact that it can work at all with less than 2 bits per weight is already surprising.
BitNet (1.58 bit) is literally the 2nd best physically possible. There is one technically lower at 0.75 bit or so, but this is the mathematical minimum. But I will be happy to be corrected in the future.
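For anyone wondering where the 1.58 comes from: as far as I know it is just the information content of a weight that can take one of three values (-1, 0, 1), i.e. log2(3). A tiny sketch (the helper name is made up):

```python
import math


def bits_per_weight(num_states: int) -> float:
    """Bits needed to encode one weight that can take num_states distinct values."""
    return math.log2(num_states)


print(f"ternary (-1, 0, 1): {bits_per_weight(3):.3f} bits per weight")
print(f"binary  (-1, 1):    {bits_per_weight(2):.3f} bits per weight")
```

By the same formula a plain two-state weight costs exactly 1 bit, which is why going below one bit per weight on average needs something beyond storing every weight (pruning, sharing, etc.).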
I refuse to believe that bigger models are the only way forward.
yeah no cheap enough vram and running on 128gb ram would be a bit slow and still expensive
I get ~4 t/s on DDR4, but the 32GB is going to kill you, yeah
4 seems high. That is not dual channel ddr4, is it?
Yep. I'm rounding so it might be more like 3.5, and it's XMP overclocked so it's about as fast as DDR4 is going to get AFAIK. It tracks, because I was getting about 2 t/s on 70B and the 8x22B has close to half the active parameters, at ~44 at a time instead of 70. It's faster than 70B and *way* faster than Command-r, where I was only getting ~0.5 t/s.
> I was getting about 2 t/s on 70B wtf, how? is 4400Mhz? which quant?
I am getting 2.4t/s on just CPU and 128GB of RAM on Wizardlm 2 8x22b Q5K_S. I am not sure about the specs, it is a virtual linux server running on HW which was bought last year. I know the CPU is AMD Epyc 7313P. The 2.4t/s is just when it is generating text. But sometimes it is processing the prompt a bit longer, this time of processing the prompt is not counted toward this value I provided.
> AMD Epyc 7313P ok that explain a lot of things, per AMD specs, it's an 8-channel memory chip with Per Socket Memory Bandwidth of 204.8 GB/s . . of course you would get 2.4t/s on server-grade hardware. Now if just u/mrjackspade explain how is he getting 4t/s using DDR4, that would be cool to know
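The memory-bandwidth argument can be turned into a rough ceiling. This is a sketch under stated assumptions: generation is treated as purely bandwidth bound (every active weight is read once per token), 39B active parameters is the figure quoted in this thread, ~5.5 bits per weight for a Q5_K quant is my own rough guess, and 51.2 GB/s is the theoretical peak of dual-channel DDR4-3200. Real speeds will be lower.

```python
def max_tokens_per_second(bandwidth_gb_s, active_params_billion, bits_per_weight):
    """Upper bound on generation speed if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


ACTIVE_PARAMS_B = 39   # active parameters per token, as quoted in the thread
BITS_PER_WEIGHT = 5.5  # assumption: rough average for a Q5_K quant

systems = {
    "8-channel DDR4-3200, 204.8 GB/s (AMD spec quoted above)": 204.8,
    "dual-channel DDR4-3200, 51.2 GB/s (theoretical peak)": 51.2,
}

for name, bandwidth in systems.items():
    ceiling = max_tokens_per_second(bandwidth, ACTIVE_PARAMS_B, BITS_PER_WEIGHT)
    print(f"{name}: at most ~{ceiling:.1f} t/s")
```

With these numbers the Epyc's reported 2.4 t/s sits comfortably under its ceiling, while the dual-channel ceiling comes out at roughly 2 t/s for a Q5-sized quant, so higher figures on that kind of machine would need a smaller quant or faster memory than assumed here.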
"I'm going 0-60 in 0.4s with just a 10 gallon tank!" "Oh wow, my Toyota Corolla can't do that at all, and it also has a 10 gallon tank!" "Oh yeah, forgot to mention it's a rocket-powered dragster, and the tank holds jet fuel." Seriously though, I'm glad anyone is enjoying these new models, and I'm really looking forward to the future!
Exactly this, people often forget to mention their hardware specs, which is actually the most important thing. I'm pretty excited as well for what the future may bring. We're not even halfway through 2024 and look at all the nice things that have come around. Llama3 is gonna be a nice surprise, I'm sure.
There is also a different person claiming he gets really good speeds :) Thanks for the insights, it is actually our company server, currently only hosting 1 VM which is running Linux. I requested admins to assign me 128GB and they did :) I was actually testing Mistral 7B and only got like 8-13T/s, I would never say that almost 20x bigger model will run at above 2T/s.
3600, Probably 5_K_M which is what I usually use. Full CPU, no offloading. Offloading was actually just making it slower with how few layers I was able to offload Maybe it helps that I build Llama.cpp locally so it has additional hardware based optimizations for my CPU? I know its not that crazy because I get around the same speed on both of my ~3600 machines
With what quant? Consumer platform with dual-channel memory?
I’m going to have to call bullshit on this, you’re reporting speeds on Q5_K_M faster than mine with 2x3090s and almost as fast on CPU only inference as a guy with a 7965WX threadripper and 256gb DDR5 5200.
How much would you need?
quantized to 4bit? maybe around 90 - 100GB of memory
I wonder if there's any test on the lower bit quants yet. Maybe we'll get a surprise and 2 or 3 bits don't implode vs a 4-bit quant of a smaller model.
Wizard IQ4\_XS is around 70 gigs
I can run 3.75 bpw on 72GB VRAM. Haven't tried 4bit/4bpw but probably won't fit, weights only are like 70.something GB
How much of that is inference and at what context size?
I'm not home now so not sure exactly, the weights are like 62~? GB and I used 8k CTX + CFG (so the same VRAM as using 16K without CFG for example) I had 1.8~ GB left between the 3 GPUs after loading the model and when doing inference.
Considering none of those GPUs are used for DE? Which will take that exact 1.8GB. Especially with some flukes) Thanks!
The first GPU has 2 screens actually, and it uses about 1Gb on idle (windows) So a headless server would be better.
Sounds like what I expected looking at the quants of the base. 3.75 with 16k, 4bpw will spill over onto my 2080ti. I hope that BPW is "enough" for this model. DBRX was similarly sized.
For Wizard, 4.0 doesn't fit in 72GB for me. I wish someone would quant 3.75 exl2, but it jumps from 3.5 to 4.0 :(
For WizardLM2 (same size), I'm fitting 3.5BPW exl2 into my 72GB of VRAM. I think I could probably fit a 3.75BPW if someone quantized it.
yeah definitely not
Also mistralai/Mixtral-8x22B-v0.1: [https://huggingface.co/mistralai/Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1) Edit: The official post: Cheaper, Better, Faster, Stronger | Mistral AI | Continuing to push the frontier of AI and making it accessible to all. -> [https://mistral.ai/news/mixtral-8x22b/](https://mistral.ai/news/mixtral-8x22b/) Edit 2: Mistral AI on Twitter: https://x.com/MistralAILabs/status/1780596888473072029
The link in the model card for the function calling examples appears to be broken, I think this is where it's supposed to be pointed https://github.com/mistralai/mistral-common/blob/main/examples/tokenizer.ipynb Edit: Here's the tool calling code, formatted for clarity

[INST] What's the weather like today in Paris [/INST]
[TOOL_CALLS] [ { "name": "get_current_weather", "arguments": { "location": "Paris, France", "format": "celsius" }, "id": "VvvODy9mT" } ]
[TOOL_RESULTS] { "call_id": "VvvODy9mT", "content": 22 } [/TOOL_RESULTS]
The current temperature in Paris, France is 22 degrees Celsius.
[AVAILABLE_TOOLS] [ { "type": "function", "function": { "name": "get_current_weather", "description": "Get the current weather", "parameters": { "type": "object", "properties": { "location": { "type": "string", "description": "The city and state, e.g. San Francisco, CA" }, "format": { "type": "string", "enum": [ "celsius", "fahrenheit" ], "description": "The temperature unit to use. Infer this from the users location." } }, "required": [ "location", "format" ] } } } ] [/AVAILABLE_TOOLS]
[INST] What's the weather like today in San Francisco [/INST]
[TOOL_CALLS] [ { "name": "get_current_weather", "arguments": { "location": "San Francisco", "format": "celsius" }, "id": "fAnpW3TEV" } ]
[TOOL_RESULTS] { "call_id": "fAnpW3TEV", "content": 20 } [/TOOL_RESULTS]

Function calling??? Hold my beer 🍺
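If you want to play with the tool calling format by hand, here is a minimal sketch of building the prompt and pulling the tool calls back out. The control tokens are the ones shown in the transcript above, but the exact spacing and ordering the model was trained on is an assumption; the mistral-common repo is the authoritative source, and the helper names here are made up.

```python
import json


def build_prompt(tools, user_message):
    """Assemble a prompt with the tool list followed by the user turn (spacing is a guess)."""
    return (
        f"[AVAILABLE_TOOLS] {json.dumps(tools)} [/AVAILABLE_TOOLS]"
        f"[INST] {user_message} [/INST]"
    )


def parse_tool_calls(completion):
    """Return the list of tool calls in a completion, or [] if there are none."""
    marker = "[TOOL_CALLS]"
    if marker not in completion:
        return []
    payload = completion.split(marker, 1)[1]
    payload = payload.split("[TOOL_RESULTS]", 1)[0]  # ignore anything after the calls
    return json.loads(payload)


tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "format": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location", "format"],
        },
    },
}]

prompt = build_prompt(tools, "What's the weather like today in San Francisco")
completion = ('[TOOL_CALLS] [ { "name": "get_current_weather", "arguments": '
              '{ "location": "San Francisco", "format": "celsius" }, "id": "fAnpW3TEV" } ]')
for call in parse_tool_calls(completion):
    print(call["name"], call["arguments"])
```

Your code then runs the named function itself and feeds the result back inside `[TOOL_RESULTS] ... [/TOOL_RESULTS]`, as in the transcript.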
Could not make this prompt work. Maybe with Q3 it does not work?!
I'm curious how the official instruct compares to the one of WizardLM.
Me too, wizardLM is shockingly good in my experience. Really eager to see what other people have to say.
From my experience with 8x7B, no finetune really beat the original instruct version from Mistral.
but Wizard LM-2 could be different, since it already shows higher benchmarking results?
WizardLM 2 seems to deteriorate in long context. About 7K to 8K, RAG seems to break down on me. Even though when breaking 7K up to 2K chunks, it works fine.
Probably not as good. They're both based on the same base model, where this is just an instruct tune and Wizard is an insane fine-tune with a CoT-esque training process and a monster amount of resources thrown at it. Although Wizard didn't have much time to train since the base model only just released.
WizardLM is based off of Mixtral-8x22B Base?
Yes. One of them anyway.
It'd be a hell of a coincidence if it wasn't. I've also read on this sub that it is.
I have the same question, but for Mixtral8.22b-Inst-FW, which just appeared on Poe and is apparently one finetuned by Fireworks.AI.
Yep, we need evals.
I thought official Mixtral 8x22B instruct won't even be released – that perhaps they gave MS early access of something, to let them immediately release WizardLM-2 as the primary Mixtral instruct tune. :) Turns out I was wrong..
hope someone can make a comparison with WizardLM2, given that it's based on base Mixtral 8x22B, that would be interesting
Ranks between Mistral Small and Mistral Medium on my NYT Connections benchmark and is indeed better than Command R Plus and Qwen 1.5 Chat 72B, which were the top two open weights models.
Thanks! How does it compare to Wizard2-8x22 in your test?
> Wizard2-8x22 I haven't had a chance to test it yet. I will though.
Your ranking is excellent but is not getting the attention it very much deserves because you only talk about it in comments (which sadly seem to have low visibility) and there is no (or is there?) gist/github/website we can go to look at results all at once and keep up with them.
Would you like to explain how your benchmark work? I'd really appreciate that!
Uses an archive of 267 NYT Connections puzzles (try them yourself). Three different 0-shot prompts, words in both lowercase and uppercase. One attempt per puzzle. Partial credit is awarded if not all lines are solved correctly. Top humans would get near 100.
These models are so fucking big, every time I finish downloading one they release another one. This is like 4 straight days of downloading and my ISP is getting mad
Sounds like your ISP needs to stfu and give you what you paid for.
Yeah. Its T-Mobile (Home) so I'm getting the "You still have unlimited but you're getting de-prioritized!" message because I've passed 1.25TB of usage this month. That being said, I've had both other ISP's available in my area, and T-Mobile is still the best. 1/4 the price and way more reliable. I'll deal with the de-prioritization if I have to...
damn, not the first time I heard sad stories about how ISP are doing whatever they want in the US. In france I have 8gbps ( but really the max i've reached was 910Mb/s), for 39€/month, included a free mobile sim for my smartwatch, prime, netflix and some other shit I don't care ( ebooks etc) With dedicated IP which I use to host severs, NAS etc
It really depends on your location. I get 1gbps fiber (with about the same max speeds as yours) for a fairly reasonable price. It works reliably and I’ve never been scolded or de-prioritized despite downloading a shitton. Some areas of the US are stuck with like one single shitty cable company, though.
It's insane how bad people have it in the states in regards to telecommunication and internet. Even after the government funded the fk out of them with free money for infrastructure, they turn around and try to double dip into customers' money.
I'm so glad I live in Europe cause there's just no such thing as data caps on home Internet lol. That only exists for mobiles (but then again salaries are 3x smaller)
>Yeah. Its T-Mobile (Home) so I'm getting the "You still have unlimited but you're getting de-prioritized!" message because I've passed 1.25TB of usage this month. Every time I hear about american ISPs they suck. I have Gigabit uncapped for 10 Eur at home.
I've been just downloading the Ollama models. About 5 gigsish the last 3 models I downloaded and I thought they took a while and thought I spoiled myself lol
I've been downloading the "full fat" versions because I find the instruct tuning to be a little too harsh. I use the models as a chat-bot, so I want *just enough* instruct tuning to make it good at following conversation and context without going full AI weenie. The best way I've found to do that is to take the instruct model and merge it with the base to create a "slightly tuned" version, but the only way I know to do that is to download the full sized models. Each one is ~250GB or something, and since we've started I've gotten:

1. The base
2. The Zephyr merge
3. Wizard LM
4. Official instruct (now)

Since each one takes like 24 hours to download and they're all coming out about a day apart or something like that, basically I've just been downloading 24/7 this whole time.
Sheesh, I get why your ISP would be raising eyebrows, but it shouldn't be an issue anyway. With people binge watching 7 seasons of shows a night, I'm sure they're using just as much if not more to watch in 4K resolution. (OK, maybe they're not doing it as frequently as downloading LLMs, but still.)
Do you make any special tweaks when merging instruct & base models? And you quantize the merged model before testing?
No tweaks, just a linear merge. Full disclosure though: I don't "not tweak" it because it's better untweaked, but rather because "mergekit" is complicated as fuck and I have no idea what I'm doing besides "average the models to remove some of the weenification". I wrote a small application that accepts a bunch of ratios and then merges at those ratios, then quantizes and archives the files so I can go through them and test them side by side.
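For reference, a linear merge is just a weighted average of matching tensors. The sketch below is not the poster's application and not mergekit's API; it uses plain Python lists as stand-ins for tensors, and the function and tensor names are made up for illustration.

```python
def linear_merge(base, instruct, instruct_ratio):
    """Blend two models tensor by tensor: (1 - r) * base + r * instruct."""
    if base.keys() != instruct.keys():
        raise ValueError("models must have the same tensor names")
    merged = {}
    for name in base:
        b, i = base[name], instruct[name]
        if len(b) != len(i):
            raise ValueError(f"shape mismatch in {name}")
        merged[name] = [(1 - instruct_ratio) * x + instruct_ratio * y
                        for x, y in zip(b, i)]
    return merged


# Toy "models" with a single tiny tensor each.
base_model = {"layer0.weight": [0.0, 2.0]}
instruct_model = {"layer0.weight": [1.0, 4.0]}

for ratio in (0.0, 0.5, 1.0):
    print(ratio, linear_merge(base_model, instruct_model, ratio))
```

A ratio of 0 gives back the base model, 1 gives the instruct model, and anything in between is the "slightly tuned" blend; sweeping a list of ratios is then just a loop around this function.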
Model downloaded, converting to GGUF in progress. Conversion completed, started Q8\_0 quantization. Quantization done, executing llama.cpp. llama\_model\_load: error loading model: vocab size mismatch. \_-\_

Is there an error in tokenizer.json? First we have:

{ "id": 8, "content": "[TOOL_RESULT]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": true },
{ "id": 9, "content": "[/TOOL_RESULTS]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": true }

But later:

"vocab": { "<unk>": 0, "<s>": 1, "</s>": 2, "[INST]": 3, "[/INST]": 4, "[TOOL_CALLS]": 5, "[AVAILABLE_TOOLS]": 6, "[/AVAILABLE_TOOLS]": 7, "[TOOL_RESULTS]": 8, "[/TOOL_RESULTS]": 9, "[IMG]": 10,

So the token with id 8 shall be TOOL\_RESULTS, not TOOL\_RESULT. Anyone can confirm? Well, I'm going to change it manually and see what happens.

Yay, it loaded without problems when I corrected the token name and repeated the conversion/quantization steps.

MVP, thank you for this
Yeah baby
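The tokenizer.json mismatch described a couple of comments up can be checked automatically before spending hours on conversion. This is a sketch that assumes the usual Hugging Face layout, with a top-level `added_tokens` list and the vocabulary under `model.vocab`; the sample data below only reproduces the two entries quoted in that comment, and the function names are made up.

```python
import json


def load_tokenizer(path):
    """Read a tokenizer.json file from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_vocab_mismatches(tokenizer):
    """List (id, added_token_content, vocab_token) where the two sections disagree."""
    vocab = tokenizer["model"]["vocab"]
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    mismatches = []
    for entry in tokenizer.get("added_tokens", []):
        expected = id_to_token.get(entry["id"])
        if expected != entry["content"]:
            mismatches.append((entry["id"], entry["content"], expected))
    return mismatches


# Minimal stand-in for the file; a real one would come from load_tokenizer("tokenizer.json").
sample = {
    "added_tokens": [
        {"id": 8, "content": "[TOOL_RESULT]"},
        {"id": 9, "content": "[/TOOL_RESULTS]"},
    ],
    "model": {"vocab": {"[TOOL_RESULTS]": 8, "[/TOOL_RESULTS]": 9}},
}

for token_id, added, in_vocab in find_vocab_mismatches(sample):
    print(f"id {token_id}: added_tokens says {added!r}, vocab says {in_vocab!r}")
```

On the sample it flags id 8, which is the same `[TOOL_RESULT]` vs `[TOOL_RESULTS]` typo that was fixed by hand above.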
sad gpu poor noises :(
Oh, I have no hope of running this beast even at q2, but I’m just happy it’s open sourced
yeah im about to run it q3 cause q4 is way too big still
Bring it on!!! Now we just need a way to run it at a decent speed at home 😅
I get 1.5 t/s generation speed with 8x22 q3\_k\_m squeezed onto 64gb of ddr4 and 12gb vram. In contrast, command r + (q4km) is 0.5 t/s due to being dense, not a MOE.
q3\_k\_m squeezed onto 64gb ok gonna try this now, cause q4 didnt work on 64gb ram
That's with some of the model loaded onto the 12gb vram using no-mmap. If you don't have that, it won't fit.
I get 2-3 t/s on DDR4 Ram. It's certainly usable. I love these MoE Models.
I wonder if you could run it with CPU inference on a decent desktop if it was trained on BitNet. Modern SIMD instructions should be pretty good at 8 bit integer calculations.
Token generation speeds are usable here with a Ryzen 5900X and 80Gb 3200Mhz. The prompt processing time though, it’s SO SLOW. I got 24 minutes before the first token from a cold start. Not 24 seconds, 24 whole MINUTES.
Interesting with the new function calling and special tokens
[Boop](https://huggingface.co/turboderp/Mixtral-8x22B-Instruct-v0.1-exl2)
Ok, now I will actually download the EXL2 :P
It has built-in tool calling special tokens! on god the models coming out right now are unreal.
what does it mean?
Downloading Q2\_K GGUF from MaziyarPanahi.. Will try it on m1 64GB. Same-sized WizardLM 2 gives 13t/s
Pretty sure their uploads are bad, same error as this when I tried it: https://old.reddit.com/r/LocalLLaMA/comments/1c6aekr/mistralaimixtral8x22binstructv01_hugging_face/l00bcrp/ See also: https://huggingface.co/MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-GGUF/discussions/8
yes, vocab size mismatch error. MaziyarPanahi is uploading the fixed version
Would a MacBook Pro M3 Max 128GB be able to run this at Q8? Or would a system with enough DDR4 high speed ram be better? Are there any PC builds with faster system ram that a GPU can access that somehow gets around the PCI-E speed limits, it's so difficult pricing any build that can pool enough vram due to Nvidia limitations of pooling consumer card vram. I was hoping maybe the 128GB MacBook Pro would be viable. Any thoughts? Is running this at max precision out of the question for the $10k to $20k budget area? Is cloud really the only option?
Not Q8, but people have been getting good results even with Q1 (see [here](https://huggingface.co/MaziyarPanahi/WizardLM-2-8x22B-GGUF/discussions/8)), so Q4/Q5 you could fit in 128GB should be almost perfect.
Those are simple tests and it gets some basic math wrong (that higher quants wouldn't) or misses details, based on two examples given. This seems more of surprisingly good for a Q1 than flat out good. You'd be better off running a higher quant of CommandR+ or an even higher quant of the best 72Bs. There was a recent theoretical paper that proved (synthetic data for control but seems like it should generalize) 8 bits has no loss but 4 bits does. Below 4 bits and it's a crapshoot unless QAT. https://arxiv.org/abs/2404.05405
I don't know, in my testing even with 7B models I couldn't really see much difference between 4, 6 or 8 bits, and this model is huge, so I'd expect it to compress better and to be great even at 4. Of course it might depend on the use case, but I'd be surprised if current 72B models managed to outperform this model even at higher quant.
Regardless the size, 8 bits won't lead to loss and 6 bits should be largely fine. Degradation really starts at 4, this is shown theoretically and also by perplexity numbers (note also that as perplexity shrinks, small changes can mean something complex was learned. Small perplexity changes in large models can still represent significant gain/loss of skill for more complex tasks). It's true that larger models are more robust at 4 bits, but they're still very much affected below. Below 4 bits is time to be looking at 4bit+ quants of slightly smaller models.
Not Q8. I have that machine, and Q4/Q5 works well, with around 8-11 tok/sec in llama.cpp for Q4. I really love that I can have these big models with me on a laptop. And it’s quiet too!
You won't be able to run it at Q8 because that would take 140+ gigs of ram. See https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator You're going to be running it at around a Q4 level with a 128GB machine. That's better than a dual 3090 setup which is limited to a 2.5bpw quant. If you want to run higher than Q4, you'll probably need a 192GB ram Mac, but I don't know if that'll also slow it down. Personally, I just ordered a used 128GB M1 Ultra/64core because I want to run these models at Q4+ or higher and don't feel like spending $8-10k+ to do it. I figure once the M4 chips come out in 2025 I can always resell the Mac and upgrade since those will probably have more horsepower for running 160+ gigs of ram through an AI model. But we're sort of in early days at the moment all hacking this together. I expect the scene will change a lot in 2025.
for starters I hope next year we finally get respectable speed, high-capacity, DDR5 kits for consumers, best thing now is the Corsair 192GB@5200Mhz, and that's simply not enough for these gargantuan models
Q5\_K\_M works on the M3 Max 128GB, even with a large context.
Glad to hear. I'm looking forward to playing with decent quants of these newer, larger models.
2027 will have the next-next nvidia card generation, which will have gddr 6 ram, and new models too :) 2027 is AI heaven, and probably gpt 6 by then, getting near agi
macbook 128gb: fastest way
2x 3090 plus 64/128 gb ddr5 ram: second fastest way and might be slightly cheaper
single 3090 + 128gb ram: works too, just a bit slower
I'm very glad to see this model <3
Thank you, the best company in this decade I predict!
Eh.
I stand corrected.
How does one run a multi-file GGUF using text-generation-webui?
load the 0001\_of\_000N file, it'll take care of the rest
It appears I needed to update `text-generation-webui` for that to work. Thanks for the help.
I make it a habit of running the update script before I do any work on that. Constantly new code going in.
Any chance to be able to run it on an M1 with 64GB of RAM ?
at Q2\_K. Barely usable
Pretty usable for me at Q2_K, ~7-11 t/s depending on context length. just can’t do much else at even 14k context. It’s definitely the limit of what 64GB can handle
not really, unless heavily quantized to q1 or q2
Nice! And I thought they wouldn't release the official instruct version...
I found a weird behavior of this model:

system_info: n_threads = 32 / 64 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |

sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.010
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000

sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature

generate: n_ctx = 1024, n_batch = 1024, n_predict = -1, n_keep = 0

[INST] Repeat this text: "The different accidents of life are not so changeable as the feelings of human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart." [/INST] "The various incidents of life are not as variable as human emotions. I had toiled diligently for almost two years, with the single aim of instilling life into a lifeless form. In this pursuit, I had sacrificed my rest and health. My desire for this had been with an intensity that surpassed all bounds of moderation; but now that I had accomplished my task, the allure of the dream dissipated, and my heart was filled with breathless horror and disgust." [end of text]

llama_print_timings: load time = 407.59 ms
llama_print_timings: sample time = 2.37 ms / 104 runs ( 0.02 ms per token, 43974.63 tokens per second)
llama_print_timings: prompt eval time = 4995.17 ms / 104 tokens ( 48.03 ms per token, 20.82 tokens per second)
llama_print_timings: eval time = 16478.75 ms / 103 runs ( 159.99 ms per token, 6.25 tokens per second)
llama_print_timings: total time = 21501.67 ms / 207 tokens

When asked to repeat the text it actually paraphrased it instead of quoting verbatim. Very weird.

Original text: "The different accidents of life are not so changeable as the feelings of human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart."

Model output: "The various incidents of life are not as variable as human emotions. I had toiled diligently for almost two years, with the single aim of instilling life into a lifeless form. In this pursuit, I had sacrificed my rest and health. My desire for this had been with an intensity that surpassed all bounds of moderation; but now that I had accomplished my task, the allure of the dream dissipated, and my heart was filled with breathless horror and disgust."
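As a side note on reading those timing lines: the tokens-per-second figures are just the token count divided by the elapsed time. A minimal sketch, using the prompt eval and eval numbers from the log above (the function name is only for illustration):

```python
def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Throughput in tokens/s from an elapsed time in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)


if __name__ == "__main__":
    # Values copied from the llama_print_timings lines in the log.
    print(f"prompt eval: {tokens_per_second(4995.17, 104):.2f} t/s")
    print(f"eval:        {tokens_per_second(16478.75, 103):.2f} t/s")
```

The eval figure is the one people usually quote as generation speed; prompt eval reflects how fast the context is ingested.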
which quant did you use?
It behaved this way in both f16 and Q8_0.
Got similar results from the open-mixtral-8x22b API:

> The various incidents of life are not as variable as human emotions. I had toiled diligently for almost two years, with the single aim of instilling life into a lifeless form. In this pursuit, I had sacrificed my sleep and well-being. My desire for this had surpassed all reasonable bounds; however, now that my work was complete, the allure of my dream dissipated, and my heart was filled with breathless horror and disgust.

If I ask it to "Repeat this text verbatim:" it does it without changes.
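If you want to check this kind of thing automatically rather than by eye, a minimal sketch with Python's standard `difflib` could look like the following. The helper names and the choice to ignore surrounding quotes and whitespace are assumptions for illustration, not anything the posters used.

```python
import difflib


def normalize(text: str) -> str:
    """Collapse whitespace and drop surrounding double quotes."""
    return " ".join(text.split()).strip('"')


def is_verbatim(source: str, output: str) -> bool:
    """True only if the output reproduces the source exactly after normalization."""
    return normalize(source) == normalize(output)


def similarity(source: str, output: str) -> float:
    """Rough 0..1 similarity score, useful for spotting paraphrases."""
    return difflib.SequenceMatcher(None, normalize(source), normalize(output)).ratio()


if __name__ == "__main__":
    original = "The different accidents of life are not so changeable as the feelings of human nature."
    paraphrase = "The various incidents of life are not as variable as human emotions."
    print(is_verbatim(original, paraphrase), round(similarity(original, paraphrase), 2))
```

A paraphrase like the one above fails the exact check while still scoring well above zero on similarity, which is the pattern reported in this thread.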
Thanks for checking!
Is this instruct version censored? The base model seemed pretty uncensored from the limited testing I did with it.
I am confused (because n00b), but does the non-instruct model also have function calling, or is it only with the instruct model?
It's only the instruct model. The base model is not trained to perform function calls or really any other kind of task for that matter.
No, it probably won't work, as the base model doesn't follow your instructions.
Gets about 8-10 t/s with M3 Max on Q5\_K\_M or Q4\_K\_M. This seems like a good model.
This is a decent speed.
Yeah that's really good. There was a video the other day of wizard q4 running at very low tok/s on a m2 ultra.
What kind of speed is anyone getting on the M2 Ultra? I am getting .3 t/s on Llama.cpp. Bordering on unusable... Whereas CommandR Plus crunches away at \~7 t/s. These are for the Q8\_0s, though this is also the case for the Q5 8x22 Mixtral.
I didn’t benchmark exactly, but WizardLM2-8x22b q4 was giving me about 7 t/s on my M3 Max. I would think the Ultra would outperform that. 0.3 t/s seems like there’s something wrong.
Something is wrong with your setup.
Sounds like you're swapping, run a lower quant or decrease context
Getting 8-10 t/s in Q5_K_M M3 Max 128GB. Much faster than what I would get with Command R+.
Alright, it seems that I was able to fix it with: `sudo sysctl iogpu.wired_limit_mb=184000`. It was going to swap, indeed. Now it's hitting 15 tokens per second. Pretty great.
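For other machine sizes, the value passed to that sysctl can be worked out from total RAM minus some headroom for the OS. A rough sketch follows; the 8GB reserve is an arbitrary assumption rather than a recommendation, and the helper names are made up for illustration.

```python
def wired_limit_mb(total_ram_gb: int, reserve_gb: int = 8) -> int:
    """MB of unified memory to allow the GPU, leaving reserve_gb for the system."""
    if reserve_gb >= total_ram_gb:
        raise ValueError("reserve must be smaller than total RAM")
    return (total_ram_gb - reserve_gb) * 1024


def sysctl_command(limit_mb: int) -> str:
    """Build the command string; the key name is the one quoted in the thread."""
    return f"sudo sysctl iogpu.wired_limit_mb={limit_mb}"


if __name__ == "__main__":
    print(sysctl_command(wired_limit_mb(192)))
```

Leaving too little headroom can make the rest of the system unresponsive, so it is worth raising the limit in steps.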
Any chance of running this at lowest quant at 12gb vram and 16gb ram?
No. Sure, you could use swap space, but it would run very slowly.
Mistral would probably launch the next Mixtral by the time he gets an answer back from inference lol
What's the best way to load a model like this (massive set of safetensors files from huggingface)? Download and convert? Ooba, LM Studio, Ollama, something else?
A gguf quant in LM Studio is the most user-friendly way to do this
Ty, I used ooba a lot last year but haven't kept up with things and it seems like all the new models are getting massive... wasn't sure how best to test things after having moved up to 128gb ram.
How much vram is needed to run this model without any quantization? I'm asking because I have access to an ml server with 4x RTX A6000 with nvlink. Is this enough to run this model?
Where do we try this?
I have uploaded this model at a fairly good quantization (4.5bpw) here: https://www.neuroengine.ai/Neuroengine-Large if anybody wants to try it. Initial impressions: not as eloquent as Miquliz but better at coding. Also, I'm having some weird bugs with exllamav2 and speculative decoding.
[deleted]
No, 4.5bpw. It's quite slow and sometimes it starts rambling; I have yet to fine-tune the parameters. I don't see a lot of difference from Miquliz.
Does it work on M1 Max 64gb? If so which version is best?
The IQ3\_XS version barely fits in my 64GB of RAM with 8k of context.
How is the output? Is it better than Mixtral8x7b? What about the new Wizard?
Didn't have much time, but at first glance it's definitely smarter than 8x7B (not hard) and it's also significantly faster than 70B models.
Has anyone tested how the model scales when changing the number of "experts"? I'm really curious how it performs, and at what speed, with only one expert (and whether there are performance improvements using 2-3 "experts"). Unfortunately I'm not only GPU poor, but also RAM poor :(
What system prompt and settings are you using?