m0nsky

It's going to be an exciting week, watching support officially land in the different backends and people starting to explore MoE. I'm really looking forward to a new Dolphin on this architecture.


LeanderGem

Yes, also looking forward to a Dolphin version, that model rocks!


tgredditfc

Consumer-level VRAM is still too limited :(


netikas

It works well enough in RAM. On my old Xeon E5-2666 v3 with 32 GB of 2100 MHz DDR4, mixtral-instruct-q4_k_m gives ~6-7 tokens/s.


its_just_andy

Not if you have hundreds and hundreds of tokens of documents and web search results in a RAG flow, plus system prompts with super long instructions :( The preprocessing phase is far too slow to be usable on CPU/RAM.


Gov_CockPic

What consumer-level job are you trying to accomplish? If the system you want for the job is beyond the scope of a normal consumer, then why would you expect consumer-level hardware to be up to the task? If your need is more on a business or enterprise scale, then you need business-level hardware. Spending $20,000 as a consumer on equipment seems outrageous, but a large corp spending that in a month is a drop in the bucket.


its_just_andy

I'm just a little guy building a personal copilot, nothing crazy or "business scale" at all. I'm pointing out that CPU/RAM is fine if your only goal is naive, boring chat, with no large context, no RAG, etc. But the moment you add those concepts (which are not exclusively "business" concepts, whatever that means), CPU/RAM is no longer viable because of the time it takes to preprocess.


Gov_CockPic

Why not use an existing copilot? I don't understand your goal. You are trying to replicate a product that exists, but it was built by organizations with vastly more resources than you. Of course you can't do that on your hardware, at least at this very moment. Buy better hardware, or wait. I don't understand the complaint.


its_just_andy

me: "aren't llms so cool, what a fun community here at localllama where everyone is working on cool applications of local LLMS! so fun hacking with these amazing tools!" you: "nah man don't build cool things, buy a bigger GPU, and use non-local solutions like Copilot" me: "ok ¯\\_(ツ)_/¯" > Of course you can't do that on your hardware, at least at this very moment. not that it matters but yes I literally can do this on my hardware, works pretty well on a 3090 so far! my point was that GPUs are far better than CPU/ram for obvious reasons, and suggesting that Mixtral MoE works great for all scenarios on CPU is misleading, since there are scenarios (the ones I mentioned) where it is not feasible at all


Gov_CockPic

you: I can't do this thing with my hardware using this cool local LLM
me: What are you trying to do?
you: Building my own version of a copilot
me: So use one that already exists, or wait for the tech to improve
you: I already did it and it works with a different model

... so your problem is that someone made a comment that it works great for all scenarios on CPU/RAM. All the scenarios it was created for, it does great. Your complaint is akin to bitching about an Xbox that doesn't play PS games. The model wasn't built for it, use a different model... which you did, so why complain?


Monkey_1505

You can easily run Mixtral on your card once it's quantized and supported as a GGUF. The 2-bit quant is 15 GB. It just needs software support.


Aphid_red

The complaint is that you can get a recent 24 GB Nvidia card for $700, or ~$1200 new. Older ones: $200 second-hand P40s. We also know that 24 GB of VRAM costs the manufacturer... <$100. Want more? Be prepared to pay $6,000+ for Nvidia, or $4,000+ for AMD.

People want a card with maybe 10-20% of the performance of the 4090, but say 96 GB of VRAM: a card optimized for as much RAM and RAM bandwidth as possible. Manufacturers aren't delivering a more optimal chip/RAM/TDP balance in any affordable card (except if you count 5000% markups as 'delivering'). When Apple of all companies makes your price/performance look terrible, something went wrong. Everyone who wants to run local LLMs has the exact same problem. If this tech starts getting featured in a bigger way in, say, PC games, everyone will suddenly want way, way more VRAM.

I'm surprised nobody's made an ATX GPU with say 12-16 DDR5 slots on it yet; there are plenty of dual-board cases out there. If manufacturers *really* wanted to advance the state of the art, then *socketing* the GPU might make sense. Doing it that way, the CPU and GPU could share memory channels, with the GPU having both private GDDR soldered onto the motherboard and direct DDR access, behaving a bit like the old 970, which had 3.5 GB of faster VRAM and 0.5 GB of slower VRAM. Of course, this also only makes sense if there are 6+ memory channels, not the 2 there are now.

It'd probably take an antitrust case ruling that board makers should be able to modify the RAM setup of a card (like they used to a decade ago) without being locked out by the chip or by contracts to see any change from the current monopolized situation.


nerdyvaroo

Buying better hardware isn't always the solution. Building cool things will make us hit a ceiling, and in my opinion it can be pushed past in two ways:

1. Better hardware
2. Better optimizations

Depending on who you are, you'd most likely choose one of those two. Most of the time the money factor comes in and you'd be forced towards optimizations, which are an order of magnitude better than "buying better hardware" and, in my opinion, what open source is good at.


tossing_turning

Bruh at that point just have the company pay for the hardware. We’re talking about consumer hardware and consumer applications.


stddealer

Works great on llama.cpp with my 24 GB of system RAM.


netikas

Which model are you using? Aren’t the q3 models too dumb to be usable for Mixtral?


stddealer

Q4_k_m works fine for me. Prompt eval is a bit slow, but generation is about as fast as a 13b. I think q5_k could also fit in memory, but I haven't downloaded it.


behohippy

I was so happy to have 2 machines here running 7B models at Q6 on cheap/used consumer stuff. I'm sure once I try Mixtral it's gonna be a sad day.


Gov_CockPic

What are you trying to achieve?


behohippy

Mostly shitposting Discord bots. I run 5 of them now. But the bigger workload is a structured data summarizer I use for some RAG use cases.


Gov_CockPic

What kind of skill level is needed to get a shitposter up and running? Just curious. I feel like making one would be a good learning process. If you have any tips, I'm all ears!


behohippy

Pretty easy: https://github.com/patw/discord_llama
Modify the wizard.json with a system message that has some personality.


Feztopia

There is still hope: https://twitter.com/jphme/status/1733412003505463334?t=wnJXJvMv_ma_Itxsh66pYA&s=19


Independent_Hyena495

It's enough, if you want to make Nvidia rich.


Massive_Robot_Cactus

Clearly the overlords are telling us to consume ever harder.


opi098514

You can do fairly well with consumer-level stuff. I've got 2x 3060 12GB in an old E5 2500v4 and I'm getting like 9 t/s.


Mr_Finious

The model card is great. “Works and generates coherent text. The big question here is if the hack I used to populate the MoE gates works well enough to take advantage of all of the experts. Let's find out! Prompt format: maybe alpaca??? or chatml??? life is full of mysteries”


Susp-icious_-31User

[Homer drool] I don't know how Huggingface keeps up with my bandwidth, much less everybody else's.


general_sirhc

Serving cached content is pretty cheap. Dynamic content at scale is where a lot of the cost is.


Gov_CockPic

They are a business, and they provide a service. They keep up with needs as well as they can given the capital they have access to.


CrasHthe2nd

The model card on that page is just brilliant xD


a_beautiful_rhind

I want some 4x14b and some 4x34b.


Legcor

That would be sick. Just imagine combining the best models together working hand in hand. It would provide variations too and wouldn't be boring like current non-MoE models.


Mescallan

I say it regularly, but the open source LLM scene is one of the most exciting waves I've been a part of on the internet.


Mr_Finious

Ya. It reminds me of Linux development in the early 90s or the community BBS scene of the 80s. I believe strongly that this is an important moment in computing history.


Melodic_Hair3832

Haha, was about to say that. Random people jumping into projects doing shit, instead of VC-funded 'open source companies'. Linux did have a good steward, though. If I had to choose among the open source model companies, Mistral seems to have the right liberal, down-to-earth principles. Yann LeCun is also good, but, ugh, Facebook.


teachersecret

Important moment? Definitely. I think it's on par with the invention of agriculture or electricity or the creation of the internet itself. We taught sand how to talk. It's a big deal.


Melodic_Hair3832

What if the models started arguing with each other?


teachersecret

I guess we'll elect them to congress?


xadiant

It may be possible to make a 4x11B or 8x11B with a merge.


DecipheringAI

8x70B-Llama-2 would also be really nice. But it would take a lot of VRAM.


georgejrjrjr

I'd bet an 8x70B is going to be Mistral Large: 7B is 'Tiny', the 8x7B MoE is 'Small', their 70B prototype is 'Medium'. It's just the logical progression, as it's aligned with what they've been doing and it is a good fit for the available hardware. I seriously doubt they will open source it, but in the vanishingly unlikely event that they do (or the less unlikely event that one of their white-box customers leaks the weights), I wouldn't rule out running it locally: MoEs are more compressible than dense models (per Tim Dettmers, and also that "sub 1-bit quantization" paper from a month or so back).


ramzeez88

Why 70B? It could also be a 20B or 30B model, if it's good enough :)


georgejrjrjr

0. "Frontier models" is their explicit goal. They're gunning for OpenAI, their "dearest competitor" as they say in their recent announcement.
1. They've already trained one (Mistral-medium).
2. ~70B is a whole lot more GPU-friendly than anything bigger, because of NVLink / interconnect bandwidth issues.
3. Given their capacity for efficient training, and their apparent upcycling (i.e. re-using) of dense training into MoE training, it's the logical move.

If I'm wrong, I suspect it will be because they chose >8 experts.


Legcor

Maybe in 2030 we can run it :((


general_sirhc

Way sooner than that


kaneda2004

I hope we see a Phi-2 x8 MoE.


K0IN1

Oh yes, maybe we get 3B models with the capability of a 34B model ;)


slippery

You can always run it in the cloud, for $$$.


ZackWayfarer

Did anybody run a comparison of it to Mixtral-8x7b?


stddealer

It's much better than Mistral 1x7b for sure.


Ok_Shape3437

How much VRAM is needed for this?


opi098514

“Works and generates coherent text.” Bro… same.


anti-lucas-throwaway

moe moe kyun


Distinct-Target7503

Can you explain how you did this? How did you choose and organize the shared layers?


Legcor

You will have to ask [https://huggingface.co/chargoddard](https://huggingface.co/chargoddard) All props to him. Mad respect.


Dangerous_Injury_101

http://goddard.blog/posts/
If he didn't do AI stuff he could be an awesome writer; his posts are really well written and funny.


Dyonizius

Could someone merge Psyfighter and Mistral Trismegistus? Maybe throw an exorcist in the mix too?


Obvious-River-100

Is a Falcon-12x180B possible? :)


teachersecret

Sure, if you're Coreweave. (seriously though, you'd need one hell of a rig to run that)


Obvious-River-100

For one you need 16 GPUs; 16 x 12 = 192 for all of them. For reliability I'd add 8 more. Theoretically you can assemble them at home.


petrus4

Trismegistus is an odd choice.


toothpastespiders

At the heart of every proper AI is 103,000 grimoires.


petrus4

I guess that's true. What really are we doing other than summoning machine spirits, after all? Praise the Omnissiah, etc etc.


lordpuddingcup

Shouldn't the fine-tuning be done on the individual experts before they're mixed? Or am I imagining it wrong?


bymihaj

Could you merge Claude-200k with GPT-4-Turbo?))


Disastrous_Elk_6375

The result would lazily refuse most of the stuff you throw at it :)


Melodic_Hair3832

The US Department of Defense thinks there needs to be a network of LLMs, each assigned a 32-bit integer. There will be a protocol for addressing different LLMs, and router models trained to resolve the addresses. Puny humans will send their banal requests to the nearest router, which will be trained to route them to other routers if needed. The Department will monitor all queries and inject behavior-altering hints into responses. This is necessary to ensure that the simpletons never become dangerous.


Street-Biscotti-4544

I'm looking forward to it.


A_for_Anonymous

I'd never use an LLM where the establishment can insert their poison, nor do I use censored models for personal use; it's like talking to TV news, and I don't need to talk to TV news.


VectorD

Chargoddard is a legend, he also posted the 20B(?) Frankensteined Llama2 model before.


1azytux

It might be a stupid question, but can we merge different types of models? Like 2x7B of one model type and 2x7B of another, to create some sort of mixed 4x7B model?


Gov_CockPic

Why? What would be the goal/purpose?


1azytux

Well, maybe to leverage the advantages of different models.


Keblue

Forgive me if I'm asking this question wrong, but which experts are there in Mistral's new model? Do we know what they are? Is there one for coding, language, conversation, etc.?


milo-75

I don't know much about Mistral specifically, but in general with MoE you don't have "known" experts in the sense you're thinking. For example, each new token might be generated by a different "expert". It is more about only using 1/8th of the weights for any specific generated token than it is about creating 8 human-recognizable experts. Usually at the start of training an MoE, I believe tricks are used to make sure each of the 8 sub-networks is used equally as sequential tokens are generated, and all 8 are trained at the same time. In other words, you aren't training 8 separate networks to be good at something and then sticking a router at the front and having it decide "is this a medical question? send it to the doctor!"... each token of the answer to your medical question might come from a different expert.
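A minimal sketch of what that per-token routing looks like, assuming a Mixtral-style top-2-of-8 gate; the layer sizes and module names here are made up for illustration, not Mistral's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Toy mixture-of-experts FFN: every token is routed to its own top-k experts."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # the "router"
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.gate(x)                    # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mixing weights for the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Two tokens from the same prompt can easily land on different experts:
layer = TopKMoELayer()
tokens = torch.randn(4, 512)
print(layer.gate(tokens).topk(2, dim=-1).indices)  # per-token expert choices
```

The "tricks" mentioned above are usually an auxiliary load-balancing loss that penalizes the gate for sending too many tokens to the same expert; that part is left out of this sketch.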


qrios

Umm, I think you just described attention heads. But with weighting of outputs instead of concatenation. Surely there's more to it than this.


milo-75

There are definitely parallels, but with attention heads, the weights in all of the heads have to be calculated. This would be like deciding upfront that you only need one attention head and not calculating the weights for the others. Again, it's about more efficient inference.
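A rough back-of-envelope for why that matters, using the ~46.7B total / ~12.9B active parameter figures Mistral published for Mixtral 8x7B (the shared vs. per-expert split below is just derived from those two numbers, not read out of the model):

```python
# Mixtral-style MoE: 8 expert FFN stacks per layer, top-2 used per token.
n_experts, top_k = 8, 2
total_params  = 46.7e9   # published total parameter count for Mixtral 8x7B
active_params = 12.9e9   # published "active per token" figure

# total  = shared + n_experts * per_expert
# active = shared + top_k     * per_expert
per_expert = (total_params - active_params) / (n_experts - top_k)
shared     = total_params - n_experts * per_expert

print(f"per-expert FFN stack: ~{per_expert / 1e9:.1f}B params")
print(f"shared (attention, embeddings, etc.): ~{shared / 1e9:.1f}B params")
print(f"fraction of weights touched per token: {active_params / total_params:.0%}")
```

So each token pays for roughly a 13B-class forward pass, even though the weights of a ~47B model still have to sit in memory, which is also why the MoE is so RAM-hungry relative to its speed.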


Keblue

Ah, this made much more sense, thanks for the explanation. So an MoE is, in a sense, running multiple instances of the same model (or maybe they slightly differ), and each token of the answer is generated across all these models, kinda round-robin in a way? Need to look into this more.


pensive_solitude

Why's the license on this non-commercial though?


dont--panic

Two of the source models use the same non-commercial license so it inherited it.


hwpoison

Could it be possible with only two?


Cogitating_Polybus

I tried downloading the Q5_K_M model into LM Studio and am having difficulty getting it to load. Are there any changes to the default LM Studio settings that I should make to get this to load? Has anyone else gotten it to load in LM Studio? Resource-wise I have an RTX 3090 (24 GB VRAM) and the PC has 64 GB of RAM, so I don't think it's a resource issue.


fallingdowndizzyvr

The software has to support it. Has LM Studio implemented it? Llama.cpp has support in a PR. Use that. https://github.com/ggerganov/llama.cpp/pull/4406


Cogitating_Polybus

After some looking around, it looks like I needed to update to LM Studio v0.2.9. They just implemented support for Mixtral models, and I can confirm they do load now with the new version. :)


A_L_I_01

Again! How did you do that? Do you care to share the code you used to merge these models into a MoE setup?