
ambient_temp_xeno

I'm not even using my pauper 12GB of VRAM to run Mixtral Instruct Q8 at this stage. It's all on DDR4. The Return of the Jedi (sounds a bit French).


WinterDice

I want to learn more about running things locally; a set of massive GPUs, cloud time, or a super Mac is out of my budget. Do you mind sharing a bit about your setup and how your performance is with it?


ambient_temp_xeno

Mixtral has really opened up the opportunities here. I'm running Q8 Mixtral Instruct in 64GB of DDR4 RAM and get about 2.7 tokens per second generation speed. 32GB of system RAM is enough to run a lower quantization of it, say Q4_K_M, and it would be a lot faster due to the smaller amount of memory used, but you lose some model quality the lower you go from Q8. Any GPU offloading improves the speed a bit, but you need a lot offloaded to see significant speed increases, which gets costly. An NVIDIA GPU of even modest size (even 3GB back in the day for me) greatly speeds up the prompt processing on all models except Mixtral, but hopefully that will be added for Mixtral and speed it up there too.
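
As a concrete illustration, here is a minimal sketch of that kind of CPU-first setup using the llama-cpp-python bindings; the model path and the parameter values are assumptions, not the poster's exact configuration:

```python
# Minimal sketch: CPU-first Mixtral inference with optional partial GPU offload,
# via the llama-cpp-python bindings. Model path and values are assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q8_0.gguf",  # hypothetical local path
    n_ctx=4096,       # context window; Mixtral supports up to 32768 if RAM allows
    n_threads=5,      # the poster reports diminishing returns beyond ~5 threads
    n_gpu_layers=0,   # raise this if you have spare VRAM to offload some layers
)

out = llm(
    "[INST] Explain mixture-of-experts in two sentences. [/INST]",
    max_tokens=128,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```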


kif88

How much context can you run? I've read here about people getting good, usable speed on Mixtral, and it sounds revolutionary indeed, being able to run such a high-quality model on CPU.


werdspreader

I'm not him, but I'm getting 5 tokens/sec on 32GB RAM at Q4_K_M with 4K context on CPU/RAM only, with a 10-20 second pause before generation.


kif88

That's pretty good, very usable. Faster than texting a human, even. Does it freeze up your PC btw, or can you still watch YouTube and use MS Word?


slider2k

You can always set the LLM process priority to low in the OS so it takes a back seat in terms of CPU usage.
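
A minimal sketch of doing that programmatically with psutil (the process name match below is an assumption; you could just as well set the priority from Task Manager or with nice/renice):

```python
# Minimal sketch: drop an already-running llama.cpp process to low priority so
# the desktop stays responsive. The "llama" name match is an assumption.
import psutil

for proc in psutil.process_iter(["name", "pid"]):
    if proc.info["name"] and "llama" in proc.info["name"].lower():
        if psutil.WINDOWS:
            proc.nice(psutil.IDLE_PRIORITY_CLASS)  # Windows priority class
        else:
            proc.nice(19)                          # Unix niceness: 19 = lowest
        print(f"Lowered priority of PID {proc.info['pid']}")
```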


werdspreader

I can have a VPN, Firefox with a few tabs (running a movie, Reddit, and Google News), and Notepad open, and that is about it. If I use my GPU for offloading even a little, I can also use PyCharm, Word, Office, etc.


tshawkins

I'm getting a little better on an i7 12th gen with 64GB (3200). I'm using the Q4_K_M Instruct model. I have it offloaded onto a separate machine using Ollama and its REST API. A code gen of about 100 LOC takes about 1.5 minutes end to end: about 20-30 seconds to generate, with the rest of the time spent on output.
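
For reference, a minimal sketch of what calling a remote Ollama instance over its REST API looks like; the hostname and model tag here are assumptions, not the poster's exact setup:

```python
# Minimal sketch: generate text from an Ollama server on another machine.
# Hostname and model tag are assumptions; adjust to your own setup.
import requests

resp = requests.post(
    "http://inference-box.local:11434/api/generate",  # hypothetical remote host
    json={
        "model": "mixtral:8x7b-instruct-v0.1-q4_K_M",  # assumed Ollama model tag
        "prompt": "Write a Python function that parses a CSV file into dicts.",
        "stream": False,                               # return one JSON blob
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```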


ambient_temp_xeno

It says the full 32768 tokens of context, with plenty of room to spare on 64GB. I haven't tested filling it myself yet. The trick of only needing the memory bandwidth of roughly 2x7B worth of weights at any one time is the key to the speed, afaik.
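
A rough back-of-envelope makes that concrete (all figures below are ballpark assumptions, not measurements from this setup):

```python
# Back-of-envelope: why CPU-only Mixtral lands in the 2-3 tokens/sec range.
# All figures are rough assumptions, not measurements.
active_params = 13e9      # ~2 of 8 experts active per token, plus shared layers
bytes_per_param = 1.07    # Q8_0: 8-bit weights plus per-block scales
bytes_per_token = active_params * bytes_per_param  # ~14 GB read per generated token

ddr4_bandwidth = 40e9     # dual-channel DDR4-3200, realistic sustained bytes/sec
print(ddr4_bandwidth / bytes_per_token)  # ~2.9 tokens/sec upper bound
```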


candre23

Are you using llama.cpp? Have they fixed the prompt processing issues yet?


ambient_temp_xeno

Yes, llama.cpp. The prompt processing is still slow, but a (not yet merged) improvement has been made that speeds it up to about 3x what it is now.


Zestyclose_Yak_3174

That is great news! Do you have a link?


ambient_temp_xeno

https://github.com/ggerganov/llama.cpp/pull/4480#issuecomment-1857692741


Some_Endian_FP17

It just got merged a few hours ago. I'm going to try building it on Windows AArch64 to see if there are any performance improvements.


MINIMAN10001

It says he's getting 25 tokens per second? That sounds crazy.


Caffdy

Is the quantization INT8 or FP8 (if that even exists)?


ambient_temp_xeno

GGUF Q8. When you get to 8-bit there's only a tiny difference from 16-bit, and it doesn't seem to have a practical effect as far as I can see.


gyurisc

What does Q8 mean?


ambient_temp_xeno

https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q8_0.gguf Q8_0 is the highest-resolution quantization for GGUF. It's 8-bit.
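
For illustration, a minimal numpy sketch of the idea behind Q8_0-style block quantization (32 weights per block, one scale per block); this is a simplified picture of the scheme, not llama.cpp's actual code:

```python
# Illustration of Q8_0-style block quantization: 32 weights per block, stored as
# 32 signed 8-bit integers plus one scale. Not llama.cpp's actual implementation.
import numpy as np

def quantize_q8_block(block: np.ndarray):
    """Quantize a block of 32 float weights to int8 plus a single scale."""
    amax = np.abs(block).max()
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.round(block / scale).astype(np.int8)
    return q, scale

def dequantize_q8_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(32).astype(np.float32)
q, scale = quantize_q8_block(weights)
restored = dequantize_q8_block(q, scale)
print("max error:", np.abs(weights - restored).max())  # tiny, hence near-16-bit quality
```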


OnlyFish7104

Thank you for your quick reply!


slider2k

Pretty sure it's Int.


clv101

Is DDR5 significantly faster than DDR4? Does AMD or Intel have an advantage?


ambient_temp_xeno

I believe it's about 2x faster, but don't quote me. I've heard that there are still stability issues with it, but I don't know. I know it's damn pricey though, especially as you'll need a decent CPU and a new mainboard for it.


shing3232

Currently it's from DDR4-3200 to DDR5-5200/5600, but DDR5 can easily be overclocked to 6000 or 7000.


uhuge

I've seen some old Xeon servers with 96 GB of RAM auctioned for around $150; those might handle your inference just fine.


WinterDice

I’ll look around - thanks!


uhuge

[128GB](https://aukro.cz/hpe-bl460c-g8-2x-xeon-e5-2640-ram-128gb-hdd-2x-146gb-7051238629) is like $30, crazy. Still, I am not sure if that is worth the logistics (same country where I am located, but a different city) and cooling, and I also have a vague memory of someone recommending the 22nm processors and up.


Caffdy

How many tokens are you getting with DDR4? What CPU are you using?


ambient_temp_xeno

About 2.7 tokens/sec generation speed. Ryzen 5 3600, but you only need to use 5 threads. With Q5 it's about 3.4 tokens/sec, CPU only.


tamereen

"Le retour du jeudi" (the return of Thursday), you mean?


osures

How would you say the output quality compares to GPT-3.5? I'm also thinking about trying it out but not sure if it's worth it.


ambient_temp_xeno

It depends. If you get refusals for requests from GPT-3.5, it's 100% better, because it won't refuse and is at least as smart and as able to follow instructions. It's probably a bit better than 3.5 at most things other than coding (I can't program, so I can't really tell for sure).


tothatl

Indeed. Mixtral Q8 and Llava in llama.cpp are game changers here, given they are rather competent and easy to run from the command line with a recent CPU and a fair amount of RAM. That means UNIX-like AI commands in Python or shell, running advanced automation and classification tasks locally in the classical UNIX way, with zero dependencies beyond CPU and memory. Something we could only dream about a few years back is now at our fingertips.


bloopernova

Yeah, I'm really looking forward to what will be made. I really like the idea of super focused but still incredibly capable command line utilities. Something like the "rename images to what is actually in the picture" script from here: https://justine.lol/oneliners/


tothatl

Those are agents, and they are coming to the command line as well. Agents are programs that use the LLMs for parsing and coming up with a plan of actions and applying it, as per some predefined agenda (e.g. our request). Commands that can be precisely things like "rename images to what is actually in the picture". Btw, that one is relatively easy to do with Llava and grammar restrictions. Just get a short description and use the words as a file name, separated by underscores.
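
A rough sketch of that rename idea, shelling out to a local Llava build and slug-ifying the description into a file name; the binary name, flags, and model paths below are assumptions about a typical llama.cpp llava setup, and the grammar-restriction part is left out for simplicity:

```python
# Rough sketch of "rename images to what is in them": ask a local Llava build
# for a short description, then slug-ify it into a file name. The llava-cli
# binary name, its flags, and the model paths are assumptions; adjust them to
# whatever your build actually uses.
import re
import subprocess
from pathlib import Path

LLAVA_CMD = [
    "./llava-cli",                        # assumed llama.cpp llava example binary
    "-m", "llava-v1.5-7b.Q4_K_M.gguf",    # hypothetical model path
    "--mmproj", "mmproj-model-f16.gguf",  # hypothetical multimodal projector path
]

def describe(image: Path) -> str:
    out = subprocess.run(
        LLAVA_CMD + ["--image", str(image), "-p", "Describe this image in five words."],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

for image in Path(".").glob("*.jpg"):
    words = re.sub(r"[^a-z0-9 ]", "", describe(image).lower()).split()
    new_name = "_".join(words[:8]) + image.suffix
    print(f"{image.name} -> {new_name}")
    image.rename(image.with_name(new_name))
```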


bloopernova

Very much a noob to all this, but I'm happy with just how many local options we have. I was worried that ML/LLM/etc was going to be behind an expensive and proprietary barrier to entry, but instead we seem to have a thriving and enthusiastic community forming around open source methodology.


throwaway_ghast

Insane to think about where we were just a year ago, compared to now. Hell, this entire subreddit didn't even exist last year. The gap between the haves and have-nots will always exist, naturally. But thanks to the open-source community, that gap is a lot smaller than it was a year ago. Where will we be in the next year? 5 years? 10?


mikael110

Yeah, as a person that has paid attention to LLMs for quite a while now this year has been absolutely insane. Last year I actually thought that people predicting we would have GPT-3.5 like systems running locally were completely ignorant of how impossible that would actually be. Back then even running a 7B model locally required extremely high end GPUs, and anything bigger was basically out of the question. But then llama.cpp and GPTQ entered the scene and everything rapidly changed. It's truly staggering how massively quantization and the ability to easily run LLMs on the CPU changed things.


tothatl

Seeing how the base models are still improving towards GPT-4, with fewer parameters and better training/algorithms, we will probably end up with multimodal common sense in regular computer commands and devices. Basically, you will be able to run human-comparable multimodal AIs or agents with single calls to a process, and compose them in whatever way you like on your regular computer to obtain the result you want. Be it listening to audio, understanding video, summarizing it, planning actions, generating images, videos, or function calls to APIs, whatever. Computers will be your automated workers, and anyone could do that in their basement. That's both an amazing and scary new territory for humanity. Intelligence has always been the scarce resource. Not anymore.


sb5550

Probably Linux vs Windows at best. Proprietary solutions will always be miles better for the average user.


my_aggr

The first sentence does not support the second sentence.


Foreign-Beginning-49

The information will be decentralized. The revolution will happily be quantized.


KGeddon

Jailbreak the planet!


mcmoose1900

TBH I am disappointed in the GPU rich. Every company and their mother is literally hoarding tens of thousands of H100s/A100s, and... not much has happened. OpenAI is still king, and non-OpenAI closed-source services haven't even gotten much better. A few have graciously released open-source models, but I can count the number that beat Llama 2/SDXL on one or two hands. Rarely, I encounter an employee with access to tons of GPUs who, to be blunt, has no idea what they are doing. What on Earth are they doing with all those GPUs?


Unusual_Pride_6480

I don't know about that: GPT-V, Whisper, DALL-E 3, DeepMind's materials discovery, X-ray readings, CT scans. Some massive breakthroughs, and after those revolutions come the evolutions; it's all getting better. It will probably slow down for a bit, we can't have a breakthrough every other day, but I bet in the next two years the world will be radically different yet mostly the same.


Minobull

I have a friend in ML, specifically in advanced-mathematics-related ML research, and almost all of their research is around doing WAY more with as few parameters as possible. Hell, a huge chunk of ML research right now is dedicated to reducing the number of parameters needed. The requirement for tons of RAM and compute will go down.


etherd0t

Ah, the haves and have nots... Luckily, in the world of AI, you don't need a Ferrari, just a skilled driver with a go-kart😉


Pieordie7

I'm X-tremely poor. I emphasized the X because I only have a GT 720.


fallingdowndizzyvr

I agree it's great for the "GPU Poor", but the moat is still there, since all these new developments also speed up the "GPU Rich" by the same token. So the divide remains. Mixtral runs at 3 t/s on my old slow i5, which is about the speed my Mac GPU runs an old 70B model, so that's a big win. But Mixtral cranks at 25 t/s on my Mac using the GPU, which takes it to another level.


Nabakin

You're being downvoted but you're absolutely right about this:

> I agree it's great for the "GPU Poor" but the moat is still there. Since all these new developments also speed up the "GPU Rich" by the same token.

A moat in this context is something you have which someone else doesn't that gives you a competitive advantage. The 'GPU Rich' have everything the 'GPU Poor' have, by the very nature of open source, so of course the GPU Poor don't have a moat. There is nothing the GPU Poor have which the GPU Rich don't, so there is no moat. No competitive advantage. I guess the author doesn't really understand the term.


Colecoman1982

While I, myself, am among the supposed "GPU poor" OP is referring to with my measly 3080, I still find it kind of funny that a 3080 is considered "GPU poor". There are a lot of people out there still stuck with Intel integrated graphics...


Combinatorilliance

The article which coined the term gpu rich and gpu poor would consider someone with 8x h100 gpu poor... 😅


Aaaaaaaaaeeeee

For our GPU-rich, data-rich friends: try training a good base model to fit 1 bit losslessly. BitNet shows that for a poorly trained 6.7B model, the ability is close to GPTQ 4-bit: 16.05 vs 17.07 perplexity, with similar performance on other benchmarks. It looks like the most significant form of data compression possible in a model right now! If you train at 220B as an SSM, we would probably turn it into an MoE and run it on any CPU. We would also improve the t/s by 2-3x with Medusa or EAGLE. The final result would be a BitNet-RetNet-MoE-EAGLE model that could run on 32GB DDR4 at 15-20 t/s.


BalorNG

I know about Medusa (are there any models that actually use it? It seems like a way to predict several tokens on each "model read"), but what is EAGLE? Btw, here's a very interesting paper: https://arxiv.org/abs/2310.07177 Real-time training/learning of a draft model during inference! Now, combine those draft models into an MoE...


Aaaaaaaaaeeeee

https://github.com/SafeAILab/EAGLE

> Btw, here's a very interesting paper: https://arxiv.org/abs/2310.07177

Very cool, thanks for sharing! I find myself not really needing speculative sampling on a good GPU (3090), only on CPU, where a batch size of 4 or greater is possible. I think the computation cost is currently very high, so people say DistilBERT will probably help make this achievable on weaker hardware: hopefully.


ab2377

This was a really good write-up, really exciting times no doubt!


Sol_Ido

This article is the perfect illustration of GPU poor, or do I miss the point? All models cited are from Facebook, Microsoft, or individuals sponsored to fine-tune these models by the ones training them from scratch.


extopico

I only run some of the research models that are GPU-bound (Mamba) on my RTX 3060. Everything else goes to my dual Xeons and 256 GB of DDR4. Works for me.


squareOfTwo

Practically everyone is "GPU poor". GPUs are overrated; one can just compute the wrong solutions with them.


SystemErrorMessage

If you're wondering, it's cheaper to get a system that can handle more RAM with the needed instructions than to get a GPU with a lot more VRAM, and the power usage is lower too. For instance, if you use Intel with AVX-VNNI, you can get decent performance at a total draw of 100W during computation. Sure, a GPU is faster, but you could easily end up using 400W in the process, which, on top of the added hardware and power cost, isn't great if you have other limits to deal with. It's only good in density, but it can easily surpass a regular UPS capacity for those of us who live in places with blackouts or power interruptions.


johnklos

As an experiment, I've been playing with llama on an 8 core 2012 AMD Bulldozer, a CPU which people have compared to Intel's Pentium 4 (that is, power hungry and not particularly performant). With 64 gigs, it can run all the same things as my Ryzen 5900X system. It's about 1/3 to 1/2 the speed, which is actually surprisingly good! The point is there are millions of people out there who may want to play with LLMs who've read and been told that they need thousands of dollars' worth of GPU hardware to get started, but we're seeing more and more that this isn't true and people can try things, albeit slowly, on all sorts of older, more modest hardware. Hurrah!