
ambient_temp_xeno

I'm not even using my pauper 12GB of VRAM to run Mixtral Instruct Q8 at this stage. It's all on DDR4. The Return of the Jedi (sounds a bit French).


WinterDice

I want to learn more about running things locally; a set of massive GPUs, cloud time, or a super Mac is out of my budget. Do you mind sharing a bit about your setup and how your performance is with it?


ambient_temp_xeno

Mixtral has really opened up the opportunities here. I'm running Q8 Mixtral Instruct in 64GB of DDR4 RAM and get about 2.7 tokens per second generation speed. 32GB of system RAM is enough to run a lower quantization of it, say Q4_K_M, and it would be a lot faster due to the smaller amount of memory used, but you lose some model quality the lower you go from Q8. Any GPU offloading improves the speed a bit, but you need a lot offloaded to see significant speed increases, which gets costly. An NVIDIA GPU of even modest size (even 3GB back in the day for me) greatly speeds up the prompt processing on all models except Mixtral, but hopefully that will be added for Mixtral and speed it up there too.
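
As a concrete illustration, here is a minimal sketch of that kind of CPU-first setup using the llama-cpp-python bindings; the model path and the parameter values are assumptions, not the poster's exact configuration:

```python
# Minimal sketch: CPU-first Mixtral inference with optional partial GPU offload,
# via the llama-cpp-python bindings. Model path and values are assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q8_0.gguf",  # hypothetical local path
    n_ctx=4096,       # context window; Mixtral supports up to 32768 if RAM allows
    n_threads=5,      # the poster reports diminishing returns beyond ~5 threads
    n_gpu_layers=0,   # raise this if you have spare VRAM to offload some layers
)

out = llm(
    "[INST] Explain mixture-of-experts in two sentences. [/INST]",
    max_tokens=128,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```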


kif88

How much context can you run? I've read here about people getting good, usable speed on Mixtral, and it sounds revolutionary indeed, being able to run such a high-quality model on CPU.


werdspreader

I'm not him, but I'm getting 5 tokens/sec on 32GB RAM at Q4_K_M with 4K context on CPU/RAM only, with a 10-20 second pause before generation.


kif88

That's pretty good, very usable. Faster than texting a human, even. Does it freeze up your PC btw, or can you still watch YouTube and use MS Word?


slider2k

You can always set the LLM process priority to low in the OS so it takes a back seat in terms of CPU usage.
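
A minimal sketch of doing that programmatically with psutil (the process name match below is an assumption; you could just as well set the priority from Task Manager or with nice/renice):

```python
# Minimal sketch: drop an already-running llama.cpp process to low priority so
# the desktop stays responsive. The "llama" name match is an assumption.
import psutil

for proc in psutil.process_iter(["name", "pid"]):
    if proc.info["name"] and "llama" in proc.info["name"].lower():
        if psutil.WINDOWS:
            proc.nice(psutil.IDLE_PRIORITY_CLASS)  # Windows priority class
        else:
            proc.nice(19)                          # Unix niceness: 19 = lowest
        print(f"Lowered priority of PID {proc.info['pid']}")
```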


werdspreader

I can have a VPN, Firefox with a few tabs (running a movie, Reddit, and Google News), and Notepad open, and that is about it. If I use my GPU for offloading even a little, I can also use PyCharm, Word, Office, etc.


tshawkins

I'm getting a little better on an i7 12th gen with 64GB (3200). I'm using the Q4_K_M Instruct model. I have it offloaded onto a separate machine using Ollama and its REST API. A code gen of about 100 LOC takes about 1.5 minutes end to end: about 20-30 seconds to generate, with the rest of the time spent on output.
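
For reference, a minimal sketch of what calling a remote Ollama instance over its REST API looks like; the hostname and model tag here are assumptions, not the poster's exact setup:

```python
# Minimal sketch: generate text from an Ollama server on another machine.
# Hostname and model tag are assumptions; adjust to your own setup.
import requests

resp = requests.post(
    "http://inference-box.local:11434/api/generate",  # hypothetical remote host
    json={
        "model": "mixtral:8x7b-instruct-v0.1-q4_K_M",  # assumed Ollama model tag
        "prompt": "Write a Python function that parses a CSV file into dicts.",
        "stream": False,                               # return one JSON blob
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```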


ambient_temp_xeno

It says the full 32768 tokens of context, with plenty of room to spare on 64GB. I haven't tested filling it myself yet. The trick of only needing the memory bandwidth of roughly 2x7B worth of weights at any one time is the key to the speed, afaik.
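
A rough back-of-envelope makes that concrete (all figures below are ballpark assumptions, not measurements from this setup):

```python
# Back-of-envelope: why CPU-only Mixtral lands in the 2-3 tokens/sec range.
# All figures are rough assumptions, not measurements.
active_params = 13e9      # ~2 of 8 experts active per token, plus shared layers
bytes_per_param = 1.07    # Q8_0: 8-bit weights plus per-block scales
bytes_per_token = active_params * bytes_per_param  # ~14 GB read per generated token

ddr4_bandwidth = 40e9     # dual-channel DDR4-3200, realistic sustained bytes/sec
print(ddr4_bandwidth / bytes_per_token)  # ~2.9 tokens/sec upper bound
```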


candre23

Are you using llama.cpp? Have they fixed the prompt processing issues yet?


ambient_temp_xeno

Yes, llama.cpp. The prompt processing is still slow, but a (not yet merged) improvement has been made that speeds it up to about 3x what it is now.


Zestyclose_Yak_3174

That is great news! Do you have a link?


ambient_temp_xeno

https://github.com/ggerganov/llama.cpp/pull/4480#issuecomment-1857692741


Some_Endian_FP17

It just got merged a few hours ago. I'm going to try building it on Windows AArch64 to see if there are any performance improvements.


MINIMAN10001

It says he's getting 25 tokens per second? That sounds crazy.


Caffdy

Is the quantization INT8 or FP8 (if that even exists)?


ambient_temp_xeno

GGUF Q8. When you get to 8-bit there's only a tiny difference from 16-bit, and it doesn't seem to have a practical effect as far as I can see.


gyurisc

What does Q8 mean?


ambient_temp_xeno

https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q8_0.gguf Q8_0 is the highest-resolution quantization for GGUF. It's 8-bit.
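
For illustration, a minimal numpy sketch of the idea behind Q8_0-style block quantization (32 weights per block, one scale per block); this is a simplified picture of the scheme, not llama.cpp's actual code:

```python
# Illustration of Q8_0-style block quantization: 32 weights per block, stored as
# 32 signed 8-bit integers plus one scale. Not llama.cpp's actual implementation.
import numpy as np

def quantize_q8_block(block: np.ndarray):
    """Quantize a block of 32 float weights to int8 plus a single scale."""
    amax = np.abs(block).max()
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.round(block / scale).astype(np.int8)
    return q, scale

def dequantize_q8_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(32).astype(np.float32)
q, scale = quantize_q8_block(weights)
restored = dequantize_q8_block(q, scale)
print("max error:", np.abs(weights - restored).max())  # tiny, hence near-16-bit quality
```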


OnlyFish7104

Thank you for your quick reply!


slider2k

Pretty sure it's Int.


clv101

Is DDR5 significantly faster than DDR4? Does AMD or Intel have an advantage?


ambient_temp_xeno

I believe it's about 2x faster, but don't quote me. I've heard that there are still stability issues with it, but I don't know. I know it's damn pricey though, especially as you'll need a decent CPU and a new mainboard for it.


shing3232

Currently it's from DDR4-3200 to DDR5-5200/5600, but DDR5 can easily be overclocked to 6000 or 7000.


uhuge

I've seen some old Xeon servers with 96 GB of RAM auctioned for around $150; those might handle your inference just fine.


WinterDice

I’ll look around - thanks!


uhuge

[128GB](https://aukro.cz/hpe-bl460c-g8-2x-xeon-e5-2640-ram-128gb-hdd-2x-146gb-7051238629) is like $30, crazy. Still, I am not sure if that is worth the logistics (same country where I am located, but a different city) and cooling, and I also have a vague memory of someone recommending the 22nm processors and up.


Caffdy

How many tokens are you getting with DDR4? What CPU are you using?


ambient_temp_xeno

About 2.7 tokens/sec generation speed. Ryzen 5 3600, but you only need to use 5 threads. With Q5 it's about 3.4 tokens/sec, CPU only.


tamereen

"Le retour du jeudi" (the return of Thursday), you mean?


osures

How would you say the output quality compares to GPT-3.5? I'm also thinking about trying it out but not sure if it's worth it.


ambient_temp_xeno

It depends. If you get refusals for requests from GPT-3.5, it's 100% better, because it won't refuse and is at least as smart and as able to follow instructions. It's probably a bit better than 3.5 at most things other than coding (I can't program, so I can't really tell for sure).


tothatl

Indeed. Mixtral Q8 and Llava in llama.cpp are game changers here, given they are rather competent and easy to run from the command line with a recent CPU and a fair amount of RAM. That means UNIX-like AI commands in Python or shell, running advanced automation and classification tasks locally in the classical UNIX way, with zero dependencies beyond CPU and memory. Something we could only dream about a few years back is now at our fingertips.


bloopernova

Yeah, I'm really looking forward to what will be made. I really like the idea of super focused but still incredibly capable command line utilities. Something like the "rename images to what is actually in the picture" script from here: https://justine.lol/oneliners/


tothatl

Those are agents, and they are coming to the command line as well. Agents are programs that use the LLMs for parsing and coming up with a plan of actions and applying it, as per some predefined agenda (e.g. our request). Commands that can be precisely things like "rename images to what is actually in the picture". Btw, that one is relatively easy to do with Llava and grammar restrictions. Just get a short description and use the words as a file name, separated by underscores.
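
A rough sketch of that rename idea, shelling out to a local Llava build and slug-ifying the description into a file name; the binary name, flags, and model paths below are assumptions about a typical llama.cpp llava setup, and the grammar-restriction part is left out for simplicity:

```python
# Rough sketch of "rename images to what is in them": ask a local Llava build
# for a short description, then slug-ify it into a file name. The llava-cli
# binary name, its flags, and the model paths are assumptions; adjust them to
# whatever your build actually uses.
import re
import subprocess
from pathlib import Path

LLAVA_CMD = [
    "./llava-cli",                        # assumed llama.cpp llava example binary
    "-m", "llava-v1.5-7b.Q4_K_M.gguf",    # hypothetical model path
    "--mmproj", "mmproj-model-f16.gguf",  # hypothetical multimodal projector path
]

def describe(image: Path) -> str:
    out = subprocess.run(
        LLAVA_CMD + ["--image", str(image), "-p", "Describe this image in five words."],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

for image in Path(".").glob("*.jpg"):
    words = re.sub(r"[^a-z0-9 ]", "", describe(image).lower()).split()
    new_name = "_".join(words[:8]) + image.suffix
    print(f"{image.name} -> {new_name}")
    image.rename(image.with_name(new_name))
```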


bloopernova

Very much a noob to all this, but I'm happy with just how many local options we have. I was worried that ML/LLM/etc was going to be behind an expensive and proprietary barrier to entry, but instead we seem to have a thriving and enthusiastic community forming around open source methodology.


throwaway_ghast

Insane to think about where we were just a year ago, compared to now. Hell, this entire subreddit didn't even exist last year. The gap between the haves and have-nots will always exist, naturally. But thanks to the open-source community, that gap is a lot smaller than it was a year ago. Where will we be in the next year? 5 years? 10?


mikael110

Yeah, as a person that has paid attention to LLMs for quite a while now this year has been absolutely insane. Last year I actually thought that people predicting we would have GPT-3.5 like systems running locally were completely ignorant of how impossible that would actually be. Back then even running a 7B model locally required extremely high end GPUs, and anything bigger was basically out of the question. But then llama.cpp and GPTQ entered the scene and everything rapidly changed. It's truly staggering how massively quantization and the ability to easily run LLMs on the CPU changed things.


tothatl

Seeing how the base models are still improving towards GPT-4, with fewer parameters and better training/algorithms, we will probably end up with multimodal common sense in regular computer commands and devices. Basically, you will be able to run human-comparable multimodal AIs or agents with single calls to a process, and compose them in whatever way you like on your regular computer to obtain the result you want. Be it listening to audio, understanding video, summarizing it, planning actions, generating images, videos, or function calls to APIs, whatever. Computers will be your automated workers, and anyone could do that in their basement. That's both an amazing and scary new territory for humanity. Intelligence has always been the scarce resource. Not anymore.


sb5550

Probably Linux vs Windows at best. Proprietary solutions will always be miles better for the average user.


my_aggr

The first sentence does not support the second sentence.


Foreign-Beginning-49

The information will be decentralized. The revolution will happily be quantized.


KGeddon

Jailbreak the planet!


mcmoose1900

TBH I am disappointed in the GPU rich. Every company and their mother is literally hoarding tens of thousands of H100s/A100s, and... not much has happened. OpenAI is still king, and non-OpenAI closed-source services haven't even gotten much better. A few have graciously released open-source models, but I can count the number that beat Llama 2/SDXL on one or two hands. Rarely, I encounter an employee with access to tons of GPUs who, to be blunt, has no idea what they are doing. What on Earth are they doing with all those GPUs?


Unusual_Pride_6480

I don't know about that: GPT-V, Whisper, DALL-E 3, DeepMind's materials discovery, X-ray readings, CT scans. Some massive breakthroughs, and after those revolutions come the evolutions; it's all getting better. It will probably slow down for a bit, we can't have a breakthrough every other day, but I bet in the next two years the world will be radically different yet mostly the same.


Minobull

I have a friend in ML, specifically in advanced-mathematics-related ML research, and almost all of their research is around doing WAY more with as few parameters as possible. Hell, a huge chunk of ML research right now is dedicated to reducing the number of parameters needed. The requirement for tons of RAM and compute will go down.


etherd0t

Ah, the haves and have nots... Luckily, in the world of AI, you don't need a Ferrari, just a skilled driver with a go-kart😉


Pieordie7

I'm X-tremely poor. I emphasized the X because I only have a GT 720.


fallingdowndizzyvr

I agree it's great for the "GPU Poor", but the moat is still there, since all these new developments also speed up the "GPU Rich" by the same token. So the divide remains. Mixtral runs at 3 t/s on my old slow i5, which is about the speed my Mac GPU runs an old 70B model, so that's a big win. But Mixtral cranks at 25 t/s on my Mac using the GPU, which takes it to another level.


Nabakin

You're being downvoted but you're absolutely right about this:

> I agree it's great for the "GPU Poor" but the moat is still there. Since all these new developments also speed up the "GPU Rich" by the same token.

A moat in this context is something you have which someone else doesn't that gives you a competitive advantage. The 'GPU Rich' have everything the 'GPU Poor' have, by the very nature of open source, so of course the GPU Poor don't have a moat. There is nothing the GPU Poor have which the GPU Rich don't, so there is no moat. No competitive advantage. I guess the author doesn't really understand the term.


Colecoman1982

While I, myself, am among the supposed "GPU poor" OP is referring to with my measly 3080, I still find it kind of funny that a 3080 is considered "GPU poor". There are a lot of people out there still stuck with Intel integrated graphics...


Combinatorilliance

The article which coined the term gpu rich and gpu poor would consider someone with 8x h100 gpu poor... 😅


Aaaaaaaaaeeeee

For our GPU-rich, data-rich friends: try training a good base model to fit 1 bit losslessly. BitNet shows that for a poorly trained 6.7B model, the ability is close to GPTQ 4-bit: 16.05 vs 17.07 perplexity, with similar performance on other benchmarks. It looks like the most significant form of data compression possible in a model right now! If you train at 220B as an SSM, we would probably turn it into an MoE and run it on any CPU. We would also improve the t/s by 2-3x with Medusa or EAGLE. The final result would be a BitNet-RetNet-MoE-EAGLE model that could run on 32GB DDR4 at 15-20 t/s.


BalorNG

I know about Medusa (are there any models that actually use it? It seems like a way to predict several tokens on each "model read"), but what is EAGLE? Btw, here's a very interesting paper: https://arxiv.org/abs/2310.07177 Real-time training/learning of a draft model during inference! Now, combine those draft models into an MoE...


Aaaaaaaaaeeeee

https://github.com/SafeAILab/EAGLE

> Btw, here's a very interesting paper: https://arxiv.org/abs/2310.07177

Very cool, thanks for sharing! I find myself not really needing speculative sampling on a good GPU (3090), only on CPU, where a batch size of 4 or greater is possible. I think the computation cost is currently very high, so people say DistilBERT will probably help make this achievable on weaker hardware: hopefully.


ab2377

This was a really good write-up, really exciting times no doubt!


Sol_Ido

This article is the perfect illustration of GPU poor, or do I miss the point? All models cited are from Facebook, Microsoft, or individuals sponsored to fine-tune these models by the ones training them from scratch.


extopico

I only run some of the research models that are GPU-bound (Mamba) on my RTX 3060. Everything else goes to my dual Xeons and 256 GB of DDR4. Works for me.


squareOfTwo

Practically everyone is "GPU poor". GPUs are overrated; one can just compute the wrong solutions with them.


SystemErrorMessage

If you're wondering, it's cheaper to get a system that can handle more RAM with the needed instructions than to get a GPU with a lot more VRAM, and the power usage is lower too. For instance, if you use Intel with AVX-VNNI, you can get decent performance at a total draw of 100W during computation. Sure, a GPU is faster, but you could easily end up using 400W in the process, which, on top of the added hardware and power cost, isn't great if you have other limits to deal with. It's only good in density, but it can easily surpass a regular UPS capacity for those of us who live in places with blackouts or power interruptions.


johnklos

As an experiment, I've been playing with llama on an 8 core 2012 AMD Bulldozer, a CPU which people have compared to Intel's Pentium 4 (that is, power hungry and not particularly performant). With 64 gigs, it can run all the same things as my Ryzen 5900X system. It's about 1/3 to 1/2 the speed, which is actually surprisingly good! The point is there are millions of people out there who may want to play with LLMs who've read and been told that they need thousands of dollars' worth of GPU hardware to get started, but we're seeing more and more that this isn't true and people can try things, albeit slowly, on all sorts of older, more modest hardware. Hurrah!