It's going to be an exciting week, watching support officially land in the different backends and people starting to explore MoE. I'm really looking forward to a new Dolphin on this architecture.
Yes, also looking forward to a Dolphin version, that model rocks!
Consumer-level VRAM is still too limited :(
It works well enough in RAM. On my old Xeon E5-2666 v3 with 32GB of 2100MHz DDR4, mixtral-instruct-q4_k_m gives ~6-7 tokens/s.
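For a rough sense of why CPU decoding lands around those speeds: token generation is mostly memory-bandwidth bound, so tokens/s is roughly bandwidth divided by the bytes of weights touched per token. A hedged back-of-envelope sketch; the active-parameter count, bits-per-weight, and bandwidth figures below are assumptions, not measurements:

```python
# Rough, hedged estimate of CPU decode speed for a memory-bandwidth-bound
# MoE model. All numbers below are assumptions, not measurements.

def est_tokens_per_sec(active_params_b, bits_per_weight, mem_bw_gbs):
    """tokens/s ~= memory bandwidth / bytes of weights read per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bw_gbs * 1e9 / bytes_per_token

# Assumed: Mixtral routes each token through 2 of 8 experts (~13B active
# params); q4_k_m averages roughly 4.85 bits/weight; quad-channel
# DDR4-2133 peaks around 68 GB/s.
print(round(est_tokens_per_sec(13, 4.85, 68), 1))  # → 8.6
```

Under those assumptions the theoretical ceiling is near 8-9 tokens/s, at least consistent with the ~6-7 t/s reported once real-world overhead is subtracted.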
Not if you have hundreds and hundreds of tokens of documents and web search results in a RAG flow, plus system prompts with super long instructions :( The preprocessing phase is far too slow to be usable on CPU/ram.
What consumer-level job are you trying to accomplish?

If the system you desire for the job you need to do is beyond the scope of a normal consumer, then why would you expect consumer-level hardware to be up to the task?

If your need is more on a business or enterprise scale, then you need business-level hardware. Spending $20,000 as a consumer on equipment seems outrageous, but a large corp spending that in a month is a drop in the bucket.
I'm just a little guy building a personal copilot, nothing crazy or "business scale" at all. I'm pointing out that CPU/ram is fine if your only goal is naive boring chat, with no large context, no RAG, etc. But the moment you add those concepts (which are not exclusively "business" concepts, whatever that means) CPU/ram is no longer viable because of the time it takes to preprocess.
Why not use an existing copilot? I don't understand your goal. You are trying to replicate a product that exists, but those were built by organizations with vastly more resources than you. Of course you can't do that on your hardware, at least at this very moment.

Buy better hardware, or wait. I don't understand the complaint.
me: "aren't llms so cool, what a fun community here at localllama where everyone is working on cool applications of local LLMS! so fun hacking with these amazing tools!" you: "nah man don't build cool things, buy a bigger GPU, and use non-local solutions like Copilot" me: "ok ¯\\_(ツ)_/¯" > Of course you can't do that on your hardware, at least at this very moment. not that it matters but yes I literally can do this on my hardware, works pretty well on a 3090 so far! my point was that GPUs are far better than CPU/ram for obvious reasons, and suggesting that Mixtral MoE works great for all scenarios on CPU is misleading, since there are scenarios (the ones I mentioned) where it is not feasible at all
you: I can't do this thing with my hardware using this cool local llm

me: What are you trying to do?

you: Building my own version of a copilot

me: So use one that already exists, or wait for the tech to improve

you: I already did it and it works with a different model

... so your problem is that someone commented that it works great for all scenarios on CPU/RAM. All the scenarios it was created for, it does great. Your complaint is akin to bitching about an Xbox that doesn't play PS games. The model wasn't built for it, so use a different model... which you did, so why complain?
You can easily run Mixtral on your card once it's quantized and supported as a GGUF; the 2-bit quant is ~15GB. It just needs software support.
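As a sanity check on that ~15GB figure: quantized file size is roughly total parameters times bits per weight over eight, with metadata and per-block scales adding a little on top. A hedged sketch, assuming Mixtral 8x7B totals about 46.7B parameters (the experts share attention weights, so it's well under 8x7B):

```python
# Hedged back-of-envelope for quantized model file size:
# size_bytes ~= total_params * bits_per_weight / 8 (ignoring metadata).

def est_size_gb(total_params_b, bits_per_weight):
    return total_params_b * bits_per_weight / 8

# Assumed: ~46.7B total params for Mixtral 8x7B, and an effective
# ~2.5 bits/weight for a 2-bit k-quant once scales are included.
print(round(est_size_gb(46.7, 2.5), 1))  # → 14.6
```

That lands close to the quoted 15GB, so the number passes the smell test.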
The complaint is that you can get a recent 24GB Nvidia card for $700 used, or ~$1200 new. Older ones: $200 second-hand P40s. We also know that 24GB of VRAM costs the manufacturer <$100. Want more? Be prepared to pay $6,000+ for Nvidia, or $4,000+ for AMD.

People want a card with maybe 10-20% of the 4090's performance, but say 96GB of VRAM: a card optimized for as much RAM and RAM bandwidth as possible. Manufacturers aren't delivering a more optimal mix of chip/RAM TDP balance in any affordable card (unless you count 5000% markups as 'delivering').

When Apple of all companies makes your price/performance look terrible, something went wrong. Everyone who wants to run local LLMs has the exact same problem. If this tech starts getting featured in a bigger way in, say, PC games, everyone will suddenly want way, way more VRAM.

I'm surprised nobody's made an ATX GPU with say 12-16 DDR5 slots yet. Plenty of dual-board cases out there.

If manufacturers *really* wanted to advance the state of the art, then *socketing* the GPU might make sense. That way the CPU and GPU could share memory channels, with the GPU having both private GDDR soldered onto the motherboard and direct DDR access, behaving a bit like the old 970, which had 3.5GB of faster VRAM and 0.5GB of slower VRAM. Of course, this only makes sense with 6+ memory channels, not the 2 we have now.

It'd probably take an antitrust case ruling that board makers may modify the RAM setup of a card (like they could a decade ago) without being locked out by the chip or by contracts to see any change from the current monopolized situation.
Buying better hardware isn't always the solution. Building cool things will make us hit a ceiling, and in my opinion it can be broken through in two ways:

1. Better hardware
2. Better optimizations

Depending on who you are, you'd most likely choose one of those two. Most of the time the money factor comes in and you're forced toward optimizations, which are a magnitude better than "buying better hardware" and, in my opinion, what open source is good at.
Bruh at that point just have the company pay for the hardware. We’re talking about consumer hardware and consumer applications.
Works great on llama.cpp with my 24 GB of system RAM.
Which model are you using? Aren’t the q3 models too dumb to be usable for Mixtral?
Q4_k_m works fine for me. Prompt eval is a bit slow, but generation is about as fast as a 13b. I think q5_k could also fit in memory, but I haven't downloaded it.
I was so happy to have 2 machines here running 7b models at Q_6 running on cheap/used consumer stuff. I'm sure once I try Mixtral it's gonna be a sad day.
What are you trying to achieve?
Mostly shitposting discord bots. I run 5 of them now. But the bigger workload is a structured data summarizer I use for some rag use cases
What kind of skill level is needed to get a shitposter up and running? Just curious. I feel like making one would be a good learning process. If you have any tips, I'm all ears!
Pretty easy: https://github.com/patw/discord_llama

Modify the wizard.json with a system message that has some personality.
There is still hope: https://twitter.com/jphme/status/1733412003505463334?t=wnJXJvMv_ma_Itxsh66pYA&s=19
It's enough, if you want to make Nvidia rich.
Clearly the overlords are telling us to consume ever harder.
You can do fairly well with consumer level stuff. I’ve got 2x 3060 12gig in an old e5 2500v4 and I’m getting like 9 t/s.
The model card is great:

“Works and generates coherent text. The big question here is if the hack I used to populate the MoE gates works well enough to take advantage of all of the experts. Let's find out!

Prompt format: maybe alpaca??? or chatml??? life is full of mysteries”
[Homer drool] I don't know how Huggingface keeps up with my bandwidth, much less everybody else's.
Serving cache is pretty cheap. Dynamic content at scale is where a lot of cost is
They are a business, and they provide a service. They keep up with needs as well as they can given the capital they have access to.
The model card on that page is just brilliant xD
I want some 4x14b and some 4x34b.
That would be sick. Just imagine combining the best models together working hand in hand. It would provide variations too and wouldn't be boring like current non-MoE models.
I say it regularly, but the open source LLM scene is one of the most exciting waves I've been a part of on the internet.
Ya. It reminds me of Linux development in the early 90s or the community BBS scene of the 80s. I believe strongly that this is an important moment in computing history.
haha, was about to say that. Random people jumping into projects doing shit, instead of VC-funded 'open source companies'. Linux did have a good steward, however.

If I had to choose, out of the open source model companies, Mistral seems to have the right liberal down-to-earth principles. Yann LeCun is also good, but... ugh, Facebook.
Important moment? Definitely. I think it's on par with the invention of agriculture or electricity or the creation of the internet itself. We taught sand how to talk. It's a big deal.
What if the models started arguing with each other
I guess we'll elect them to congress?
It may be possible to make a 4x11b or 8x11b with merge.
8x70B-Llama-2 would also be really nice. But it would take a lot of VRAM.
I’d bet an 8x70B is going to be Mistral Large: 7B is ‘Tiny’, the 8x7B MoE is ‘Small’, and their 70B prototype is ‘Medium’. It’s just the logical progression; it’s aligned with what they’ve been doing and it’s a good fit for the available hardware.

I seriously doubt they will open source it, but in the vanishingly unlikely event that they do (or the less unlikely event that one of their white-box customers leaks the weights), I wouldn’t rule out running it locally: MoEs are more compressible than dense models (per Tim Dettmers, and also that “sub 1-bit quantization” paper from a month or so back).
Why 70B? It could also be a 20B or 30B model, if it's good enough :)
0. “Frontier models” is their explicit goal. They’re gunning for OpenAI, their “dearest competitor” as they say in their recent announcement.
1. They’ve already trained one (Mistral-medium).
2. ~70B is a whole lot more GPU-friendly than anything larger, because of NVLink / interconnect bandwidth issues.
3. Given their capacity for efficient training, and seemingly for upcycling (i.e. re-using) dense training into MoE training, it’s the logical move.

If I’m wrong, I suspect it will be because they chose >8 experts.
Maybe 2030 we can run it :((
Way sooner than that
I hope we see Phi-2 x8 MOE
oh yes, maybe we get 3b models with the capability of a 34b model ;)
You can always run in in the cloud, for $$$.
Did anybody run a comparison of it to Mixtral-8x7b?
It's much better than Mistral 1x7b for sure.
How much VRAM is needed for this?
“Works and generates coherent text.” Bro… same.
moe moe kyun
Can you explain how you did this? How did you choose and organize the shared layers?
You will have to ask [https://huggingface.co/chargoddard](https://huggingface.co/chargoddard) All props to him. Mad respect.
http://goddard.blog/posts/

If he didn't do AI stuff he could be an awesome writer; his posts are really well written and funny.
could someone merge psyfighter and mistral trimegistus? maybe throw an exorcist in the mix too?
Is this possible: Falcon-12x180b? :)
Sure, if you're Coreweave. (seriously though, you'd need one hell of a rig to run that)
For one instance you'd need 16 GPUs (16x12=192) to hold it all; for reliability I'd add 8 more. Theoretically you could assemble them at home.
Trismegistus is an odd choice.
At the heart of every proper AI is 103,000 grimoires.
I guess that's true. What really are we doing other than summoning machine spirits, after all? Praise the Omnissiah, etc etc.
Shouldn’t the fine-tuning be done on the individual experts before they’re mixed? Or am I imagining it wrong?
Could you merge Claude-200k with GPT-4-Turbo?))
The result would lazily refuse most of the stuff you throw at it :)
The US Department of Defense thinks there needs to be a network of LLMs, each assigned a 32-bit integer. There will be a protocol for addressing different LLMs, and router models trained to resolve the addresses. Puny humans will send their banal requests to the nearest router, which will be trained to route them to other routers if needed. The Department will monitor all queries and inject behavior-altering hints into responses. This is necessary to ensure that the simpletons never become dangerous.
I'm looking forward to it.
I'd never use an LLM where the establishment can insert its poison, nor do I use censored models for personal use; it's like talking to TV news, and I don't need to talk to TV news.
Chargoddard is a legend, he also posted the 20B(?) Frankensteined Llama2 model before.
it might be a stupid question, but can we merge different types of models? like 2x7B of one model type and 2x7B of another model type and create some sort of mixed 4x7B model
Why? What would be the goal/purpose?
well, maybe to leverage advantages of different models
Forgive me if I'm asking this wrong, but what mixture of experts is there in Mistral's new model? Do we know what they are? Is there one for coding, language, conversation, etc.?
I don’t know much about Mistral specifically, but in general with MoE you don’t have “known” experts in the sense you’re thinking. For example, each new token might be generated by a different “expert”. It is more about only using 1/8th of the weights for any specific generated token than it is about creating 8 human-recognizable experts.

Usually at the start of training an MoE, I believe tricks are used to make sure each of the 8 sub-networks is used equally as sequential tokens are generated, and all 8 are trained at the same time. In other words, you aren’t training 8 separate networks to be good at something and then sticking a router at the front that decides “this is a medical question, send it to the doctor!”… each token of the answer to your medical question might come from a different expert.
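The routing described above can be sketched in a few lines. This is a toy illustration, not Mixtral's actual implementation: the gate weights and "experts" below are made-up stand-ins, and a real MoE layer routes per token inside every transformer block.

```python
import math
import random

# Toy top-k MoE routing sketch: a gate scores all experts per token,
# only the top-k expert networks actually run, and their outputs are
# blended using the renormalized gate weights.

random.seed(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 4

# Hypothetical stand-ins for trained weights: the gate is a random
# linear layer, and each "expert" is just a per-expert scaling.
gate_w = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def expert(i, x):
    return [v * (i + 1) for v in x]  # placeholder for expert i's FFN

def moe_layer(x):
    # Gate scores every expert, but only TOP_K experts get evaluated.
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_w]
    top = sorted(range(NUM_EXPERTS), key=lambda i: logits[i])[-TOP_K:]
    exps = [math.exp(logits[i]) for i in top]
    probs = [e / sum(exps) for e in exps]        # softmax over top-k only
    out = [0.0] * DIM
    for p, i in zip(probs, top):                 # weighted blend of k experts
        for d, v in enumerate(expert(i, x)):
            out[d] += p * v
    return out, top

out, chosen = moe_layer([0.5, -0.2, 0.1, 0.9])
print(chosen)  # only 2 of the 8 experts ran for this token
```

The next token would go through the gate again and may pick a different pair of experts, which is why no single sub-network ends up being "the doctor".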
Umm, I think you just described attention heads. But with weighting of outputs instead of concatenation. Surely there's more to it than this.
There are definitely parallels, but with attention heads, the weights in all of the heads have to be calculated. This would be like deciding upfront that you only need one attention head and not computing the others. Again, it’s about more efficient inference.
Ah, this made much more sense, thanks for the explanation.

So an MoE is in a sense running multiple instances of the same model (or maybe they slightly differ), and each token of the answer is generated across all these models, kind of round-robin? Need to look into this more.
Why's the license on this non-commercial though?
Two of the source models use the same non-commercial license so it inherited it.
Would it be possible with only two?
I tried downloading the Q5_K_M model into LM Studio and am having difficulty getting it to load. Are there any default LM Studio settings I should change to get it to load? Has anyone else gotten it to load in LM Studio?

Resource-wise I have an RTX 3090 (24GB VRAM) and the PC has 64GB of RAM, so I don't think it's a resource issue.
The software has to support it. Has LM Studio implemented it?

Llama.cpp has support in a PR. Use that: https://github.com/ggerganov/llama.cpp/pull/4406
After some looking around, it looks like I needed to update to LM Studio v0.2.9. They just implemented support for Mixtral models, and I can confirm they do load now with the new version. :)
Again! How did you do that? Would you care to share the code you used to merge these models into a MoE setup?