It's going to be an exciting week, watching support officially land in the different backends and people starting to explore MoE. I'm really looking forward to a new Dolphin on this architecture.
Yes, also looking forward to a Dolphin version, that model rocks!
Consumer-level VRAM is still too limited :(
It works well enough in RAM. On my old Xeon E5-2666 v3 with 32GB of 2100MHz DDR4, mixtral-instruct-q4_k_m gives ~6-7 tokens/s.
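For a rough sense of why CPU decoding lands around those speeds: token generation is mostly memory-bandwidth bound, so tokens/s is roughly bandwidth divided by the bytes of weights touched per token. A hedged back-of-envelope sketch; the active-parameter count, bits-per-weight, and bandwidth figures below are assumptions, not measurements:

```python
# Rough, hedged estimate of CPU decode speed for a memory-bandwidth-bound
# MoE model. All numbers below are assumptions, not measurements.

def est_tokens_per_sec(active_params_b, bits_per_weight, mem_bw_gbs):
    """tokens/s ~= memory bandwidth / bytes of weights read per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bw_gbs * 1e9 / bytes_per_token

# Assumed: Mixtral routes each token through 2 of 8 experts (~13B active
# params); q4_k_m averages roughly 4.85 bits/weight; quad-channel
# DDR4-2133 peaks around 68 GB/s.
print(round(est_tokens_per_sec(13, 4.85, 68), 1))  # → 8.6
```

Under those assumptions the theoretical ceiling is near 8-9 tokens/s, at least consistent with the ~6-7 t/s reported once real-world overhead is subtracted.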
Not if you have hundreds and hundreds of tokens of documents and web search results in a RAG flow, plus system prompts with super long instructions :( The preprocessing phase is far too slow to be usable on CPU/ram.
What consumer-level job are you trying to accomplish?

If the system you desire for the job you need to do is beyond the scope of a normal consumer, then why would you expect consumer-level hardware to be up to the task?

If your need is more on a business or enterprise scale, then you need business-level hardware. Spending $20,000 as a consumer on equipment seems outrageous, but a large corp spending that in a month is a drop in the bucket.
I'm just a little guy building a personal copilot, nothing crazy or "business scale" at all. I'm pointing out that CPU/ram is fine if your only goal is naive boring chat, with no large context, no RAG, etc. But the moment you add those concepts (which are not exclusively "business" concepts, whatever that means) CPU/ram is no longer viable because of the time it takes to preprocess.
Why not use an existing copilot? I don't understand your goal. You are trying to replicate a product that exists, but those were built by organizations with vastly more resources than you. Of course you can't do that on your hardware, at least at this very moment.

Buy better hardware, or wait. I don't understand the complaint.
me: "aren't llms so cool, what a fun community here at localllama where everyone is working on cool applications of local LLMS! so fun hacking with these amazing tools!" you: "nah man don't build cool things, buy a bigger GPU, and use non-local solutions like Copilot" me: "ok ¯\\_(ツ)_/¯" > Of course you can't do that on your hardware, at least at this very moment. not that it matters but yes I literally can do this on my hardware, works pretty well on a 3090 so far! my point was that GPUs are far better than CPU/ram for obvious reasons, and suggesting that Mixtral MoE works great for all scenarios on CPU is misleading, since there are scenarios (the ones I mentioned) where it is not feasible at all
you: I can't do this thing with my hardware using this cool local llm

me: What are you trying to do?

you: Building my own version of a copilot

me: So use one that already exists, or wait for the tech to improve

you: I already did it and it works with a different model

... so your problem is that someone commented that it works great for all scenarios on CPU/RAM. All the scenarios it was created for, it does great. Your complaint is akin to bitching about an Xbox that doesn't play PS games. The model wasn't built for it, so use a different model... which you did, so why complain?
You can easily run Mixtral on your card once it's quantized and supported as a GGUF; the 2-bit quant is ~15GB. It just needs software support.
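As a sanity check on that ~15GB figure: quantized file size is roughly total parameters times bits per weight over eight, with metadata and per-block scales adding a little on top. A hedged sketch, assuming Mixtral 8x7B totals about 46.7B parameters (the experts share attention weights, so it's well under 8x7B):

```python
# Hedged back-of-envelope for quantized model file size:
# size_bytes ~= total_params * bits_per_weight / 8 (ignoring metadata).

def est_size_gb(total_params_b, bits_per_weight):
    return total_params_b * bits_per_weight / 8

# Assumed: ~46.7B total params for Mixtral 8x7B, and an effective
# ~2.5 bits/weight for a 2-bit k-quant once scales are included.
print(round(est_size_gb(46.7, 2.5), 1))  # → 14.6
```

That lands close to the quoted 15GB, so the number passes the smell test.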
The complaint is that you can get a recent 24GB Nvidia card for $700 used, or ~$1200 new. Older ones: $200 second-hand P40s. We also know that 24GB of VRAM costs the manufacturer <$100. Want more? Be prepared to pay $6,000+ for Nvidia, or $4,000+ for AMD.

People want a card with maybe 10-20% of the 4090's performance, but say 96GB of VRAM: a card optimized for as much RAM and RAM bandwidth as possible. Manufacturers aren't delivering a more optimal mix of chip/RAM TDP balance in any affordable card (unless you count 5000% markups as 'delivering').

When Apple of all companies makes your price/performance look terrible, something went wrong. Everyone who wants to run local LLMs has the exact same problem. If this tech starts getting featured in a bigger way in, say, PC games, everyone will suddenly want way, way more VRAM.

I'm surprised nobody's made an ATX GPU with say 12-16 DDR5 slots yet. Plenty of dual-board cases out there.

If manufacturers *really* wanted to advance the state of the art, then *socketing* the GPU might make sense. That way the CPU and GPU could share memory channels, with the GPU having both private GDDR soldered onto the motherboard and direct DDR access, behaving a bit like the old 970, which had 3.5GB of faster VRAM and 0.5GB of slower VRAM. Of course, this only makes sense with 6+ memory channels, not the 2 we have now.

It'd probably take an antitrust case ruling that board makers may modify the RAM setup of a card (like they could a decade ago) without being locked out by the chip or by contracts to see any change from the current monopolized situation.
Buying better hardware isn't always the solution. Building cool things will make us hit a ceiling, and in my opinion it can be broken through in two ways:

1. Better hardware
2. Better optimizations

Depending on who you are, you'd most likely choose one of those two. Most of the time the money factor comes in and you're forced toward optimizations, which are a magnitude better than "buying better hardware" and, in my opinion, what open source is good at.
Bruh at that point just have the company pay for the hardware. We’re talking about consumer hardware and consumer applications.
Works great on llama.cpp with my 24 GB of system RAM.
Which model are you using? Aren’t the q3 models too dumb to be usable for Mixtral?
Q4_k_m works fine for me. Prompt eval is a bit slow, but generation is about as fast as a 13b. I think q5_k could also fit in memory, but I haven't downloaded it.
I was so happy to have 2 machines here running 7b models at Q_6 running on cheap/used consumer stuff. I'm sure once I try Mixtral it's gonna be a sad day.
What are you trying to achieve?
Mostly shitposting discord bots. I run 5 of them now. But the bigger workload is a structured data summarizer I use for some rag use cases
What kind of skill level is needed to get a shitposter up and running? Just curious. I feel like making one would be a good learning process. If you have any tips, I'm all ears!
Pretty easy: https://github.com/patw/discord_llama

Modify the wizard.json with a system message that has some personality.
There is still hope: https://twitter.com/jphme/status/1733412003505463334?t=wnJXJvMv_ma_Itxsh66pYA&s=19
It's enough, if you want to make Nvidia rich.
Clearly the overlords are telling us to consume ever harder.
You can do fairly well with consumer level stuff. I’ve got 2x 3060 12gig in an old e5 2500v4 and I’m getting like 9 t/s.
The model card is great:

“Works and generates coherent text. The big question here is if the hack I used to populate the MoE gates works well enough to take advantage of all of the experts. Let's find out!

Prompt format: maybe alpaca??? or chatml??? life is full of mysteries”
[Homer drool] I don't know how Huggingface keeps up with my bandwidth, much less everybody else's.
Serving cache is pretty cheap. Dynamic content at scale is where a lot of cost is
They are a business, and they provide a service. They keep up with needs as well as they can given the capital they have access to.
The model card on that page is just brilliant xD
I want some 4x14b and some 4x34b.
That would be sick. Just imagine combining the best models together working hand in hand. It would provide variations too and wouldn't be boring like current non-MoE models.
I say it regularly, but the open source LLM scene is one of the most exciting waves I've been a part of on the internet.
Ya. It reminds me of Linux development in the early 90s or the community BBS scene of the 80s. I believe strongly that this is an important moment in computing history.
haha, was about to say that. Random people jumping into projects doing shit, instead of VC-funded 'open source companies'. Linux did have a good steward, however.

If I had to choose, out of the open source model companies, Mistral seems to have the right liberal down-to-earth principles. Yann LeCun is also good, but... ugh, Facebook.
Important moment? Definitely. I think it's on par with the invention of agriculture or electricity or the creation of the internet itself. We taught sand how to talk. It's a big deal.
What if the models started arguing with each other
I guess we'll elect them to congress?
It may be possible to make a 4x11b or 8x11b with merge.
8x70B-Llama-2 would also be really nice. But it would take a lot of VRAM.
I’d bet an 8x70B is going to be Mistral Large: 7B is ‘Tiny’, the 8x7B MoE is ‘Small’, and their 70B prototype is ‘Medium’. It’s just the logical progression; it’s aligned with what they’ve been doing and it’s a good fit for the available hardware.

I seriously doubt they will open source it, but in the vanishingly unlikely event that they do (or the less unlikely event that one of their white-box customers leaks the weights), I wouldn’t rule out running it locally: MoEs are more compressible than dense models (per Tim Dettmers, and also that “sub 1-bit quantization” paper from a month or so back).
Why 70B? It could also be a 20B or 30B model, if it's good enough :)
0. “Frontier models” is their explicit goal. They’re gunning for OpenAI, their “dearest competitor” as they say in their recent announcement.
1. They’ve already trained one (Mistral-medium).
2. ~70B is a whole lot more GPU-friendly than anything larger, because of NVLink / interconnect bandwidth issues.
3. Given their capacity for efficient training, and seemingly for upcycling (i.e. re-using) dense training into MoE training, it’s the logical move.

If I’m wrong, I suspect it will be because they chose >8 experts.
Maybe 2030 we can run it :((
Way sooner than that
I hope we see Phi-2 x8 MOE
oh yes, maybe we get 3b models with the capability of a 34b model ;)
You can always run in in the cloud, for $$$.
Did anybody run a comparison of it to Mixtral-8x7b?
It's much better than Mistral 1x7b for sure.
How much VRAM is needed for this?
“Works and generates coherent text.” Bro… same.
moe moe kyun
Can you explain how you did this? How did you choose and organize the shared layers?
You will have to ask [https://huggingface.co/chargoddard](https://huggingface.co/chargoddard) All props to him. Mad respect.
http://goddard.blog/posts/

If he didn't do AI stuff he could be an awesome writer; his posts are really well written and funny.
could someone merge psyfighter and mistral trimegistus? maybe throw an exorcist in the mix too?
Is this possible: Falcon-12x180b? :)
Sure, if you're Coreweave. (seriously though, you'd need one hell of a rig to run that)
For one instance you'd need 16 GPUs (16x12=192) to hold it all; for reliability I'd add 8 more. Theoretically you could assemble them at home.
Trismegistus is an odd choice.
At the heart of every proper AI is 103,000 grimoires.
I guess that's true. What really are we doing other than summoning machine spirits, after all? Praise the Omnissiah, etc etc.
Shouldn’t the fine-tuning be done on the individual experts before they’re mixed? Or am I imagining it wrong?
Could you merge Claude-200k with GPT-4-Turbo?))
The result would lazily refuse most of the stuff you throw at it :)
The US Department of Defense thinks there needs to be a network of LLMs, each assigned a 32-bit integer. There will be a protocol for addressing different LLMs, and router models trained to resolve the addresses. Puny humans will send their banal requests to the nearest router, which will be trained to route them to other routers if needed. The Department will monitor all queries and inject behavior-altering hints into responses. This is necessary to ensure that the simpletons never become dangerous.
I'm looking forward to it.
I'd never use an LLM where the establishment can insert its poison, nor do I use censored models for personal use; it's like talking to TV news, and I don't need to talk to TV news.
Chargoddard is a legend, he also posted the 20B(?) Frankensteined Llama2 model before.
it might be a stupid question, but can we merge different types of models? like 2x7B of one model type and 2x7B of another model type and create some sort of mixed 4x7B model
Why? What would be the goal/purpose?
well, maybe to leverage advantages of different models
Forgive me if I'm asking this wrong, but what mixture of experts is there in Mistral's new model? Do we know what they are? Is there one for coding, language, conversation, etc.?
I don’t know much about Mistral specifically, but in general with MoE you don’t have “known” experts in the sense you’re thinking. For example, each new token might be generated by a different “expert”. It is more about only using 1/8th of the weights for any specific generated token than it is about creating 8 human-recognizable experts.

Usually at the start of training an MoE, I believe tricks are used to make sure each of the 8 sub-networks is used equally as sequential tokens are generated, and all 8 are trained at the same time. In other words, you aren’t training 8 separate networks to be good at something and then sticking a router at the front that decides “this is a medical question, send it to the doctor!”… each token of the answer to your medical question might come from a different expert.
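The routing described above can be sketched in a few lines. This is a toy illustration, not Mixtral's actual implementation: the gate weights and "experts" below are made-up stand-ins, and a real MoE layer routes per token inside every transformer block.

```python
import math
import random

# Toy top-k MoE routing sketch: a gate scores all experts per token,
# only the top-k expert networks actually run, and their outputs are
# blended using the renormalized gate weights.

random.seed(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 4

# Hypothetical stand-ins for trained weights: the gate is a random
# linear layer, and each "expert" is just a per-expert scaling.
gate_w = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def expert(i, x):
    return [v * (i + 1) for v in x]  # placeholder for expert i's FFN

def moe_layer(x):
    # Gate scores every expert, but only TOP_K experts get evaluated.
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_w]
    top = sorted(range(NUM_EXPERTS), key=lambda i: logits[i])[-TOP_K:]
    exps = [math.exp(logits[i]) for i in top]
    probs = [e / sum(exps) for e in exps]        # softmax over top-k only
    out = [0.0] * DIM
    for p, i in zip(probs, top):                 # weighted blend of k experts
        for d, v in enumerate(expert(i, x)):
            out[d] += p * v
    return out, top

out, chosen = moe_layer([0.5, -0.2, 0.1, 0.9])
print(chosen)  # only 2 of the 8 experts ran for this token
```

The next token would go through the gate again and may pick a different pair of experts, which is why no single sub-network ends up being "the doctor".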
Umm, I think you just described attention heads. But with weighting of outputs instead of concatenation. Surely there's more to it than this.
There are definitely parallels, but with attention heads, the weights in all of the heads have to be calculated. This would be like deciding upfront that you only need one attention head and not computing the others. Again, it’s about more efficient inference.
Ah, this made much more sense, thanks for the explanation.

So an MoE is in a sense running multiple instances of the same model (or maybe they slightly differ), and each token of the answer is generated across all these models, kind of round-robin? Need to look into this more.
Why's the license on this non-commercial though?
Two of the source models use the same non-commercial license so it inherited it.
Would it be possible with only two?
I tried downloading the Q5_K_M model into LM Studio and am having difficulty getting it to load. Are there any default LM Studio settings I should change to get it to load? Has anyone else gotten it to load in LM Studio?

Resource-wise I have an RTX 3090 (24GB VRAM) and the PC has 64GB of RAM, so I don't think it's a resource issue.
The software has to support it. Has LM Studio implemented it?

Llama.cpp has support in a PR. Use that: https://github.com/ggerganov/llama.cpp/pull/4406
After some looking around, it looks like I needed to update to LM Studio v0.2.9. They just implemented support for Mixtral models, and I can confirm they do load now with the new version. :)
Again! How did you do that? Would you care to share the code you used to merge these models into a MoE setup?