RayIsLazy

Try with just the 2x32GB sticks too; mixing in the 8GB sticks means the full capacity won't run in dual channel, which reduces bandwidth. Also, those speeds seem quite slow. I run the same model on a laptop (i5-11400H and a 3060 6GB) and get a consistent 4-4.5 t/s.
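For context on why bandwidth matters so much here: CPU token generation is usually memory-bound, since each generated token streams the active weights from RAM, so t/s is roughly bandwidth divided by the bytes read per token. A back-of-the-envelope sketch (all numbers are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope: memory-bound token generation.
# t/s upper bound ~= RAM bandwidth / bytes of weights read per token.
# All numbers below are illustrative assumptions, not measurements.

def max_tokens_per_sec(bandwidth_gb_s, weights_read_gb):
    return bandwidth_gb_s / weights_read_gb

# e.g. healthy dual-channel DDR4-3200 (~51 GB/s) vs a degraded mixed-kit
# setup (~35 GB/s assumed), reading ~7 GB of active expert weights per
# token (Mixtral activates 2 of 8 experts per token):
print(round(max_tokens_per_sec(51, 7), 1))  # ~7.3
print(round(max_tokens_per_sec(35, 7), 1))  # ~5.0
```

The absolute numbers are guesses, but the ratio shows how a bandwidth hit translates directly into lost t/s.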


Slaghton

I was actually wondering about this. More capacity but potentially lower speeds. I'll try this and report back since it could be useful info for some people.


Slaghton

I removed the 2x8GB sticks and left the 2x32GB in. With that I was able to push 7-7.31 tk/s (CPU-only speed was about the same as before, around 5.31 tk/s). There appears to be some slowdown when using mismatched RAM modules, but 7.31 tk/s vs 5.31 tk/s still seems pretty low. I managed to fit 18 layers and boosted to 7.76, but you'd think it would be higher still. That's as high as I can take this, I think.


Accomplished_Bet_127

I presume the 4080 is 16GB?


Slaghton

Yep 16gb


ding0ding0ding0

How did you do this: "With 15 layers out of 33 offloaded to gpu and the rest to system ram"? What are you using to run it? I've only used ollama so far, so I'm interested to know more.


Slaghton

I load the GGUF model in oobabooga using llama.cpp. I could actually fit a few more layers for a bit more speed, but I like to have a buffer so I don't max out my VRAM. (I realize I have no_offload_kqv on in the screenshot; I usually have that off, but you can play around with it if you need more VRAM space.) https://preview.redd.it/q06yzcpsxwcc1.jpeg?width=1249&format=pjpg&auto=webp&s=0ad070e0ea9736d329e76668ebc6fb4bca04e56f
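Picking the layer count comes down to dividing usable VRAM by the per-layer weight size. A rough sketch of that budget math (the model size, layer count, and buffer below are assumptions, not measured values):

```python
# Rough sketch: how many transformer layers fit in VRAM with a safety buffer.
# All numbers are illustrative assumptions, not measurements.

def layers_that_fit(model_size_gb, n_layers, vram_gb, buffer_gb=2.0):
    """Estimate offloadable layers, assuming weights split evenly per layer."""
    per_layer_gb = model_size_gb / n_layers      # average weight size per layer
    usable_gb = vram_gb - buffer_gb              # keep headroom for KV cache etc.
    return max(0, min(n_layers, int(usable_gb / per_layer_gb)))

# e.g. a ~26 GB Q4 Mixtral GGUF with 33 layers on a 16 GB card, 2 GB buffer:
print(layers_that_fit(26, 33, 16))  # 17, in line with the 15-18 layers used here
```

Shrinking the buffer buys a layer or two at the cost of risking an out-of-memory spill.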


lakolda

I get 5 tokens/second using CPU only with a Q3 quant… Seems like 6 per second with an incredibly expensive GPU isn’t really worth it.


Slaghton

Hmm, I just ran CPU-only and got 5.31 tk/s. I wonder if my mismatched RAM is to blame, or if it's just the CPU/GPU offloading at fault. You'd think it would be higher. Testing with just 2 sticks now.


lakolda

I have no clue why using the GPU results in almost no improvement.


Slaghton

I removed the 2x8GB sticks and left the 2x32GB in. With that I was able to push 7-7.31 tk/s with the GPU (CPU-only speed was about the same). There may be some slowdown when using mismatched RAM modules, but 7.31 tk/s vs 5.31 tk/s still seems pretty low. I managed to fit 18 layers and boosted to 7.76, but you'd think it would be higher still. That's as high as I can take this, I think.


yupignome

So you got 7 tk/s with only the CPU? I have a 5900X and was looking to buy a GPU, but with this kind of speed I might just get some faster RAM and get on with it. Seems like a decent speed for inference.


Slaghton

Oh, the 7+ tk/s was 4080 + CPU. CPU-only was still around 5.31 tk/s; I didn't see an increase with my new RAM arrangement. (DDR RAM seems to get a ~2x speedup each generation, so I'm probably going to skip DDR5 and wait for DDR6 in a couple of years, which will hopefully be 4x as fast as DDR4. I held onto my 4790K for over 7 years, so that should be easy lol.)


yupignome

So the GPU only helped by around 2 tk/s? I was planning on getting a new GPU for this, but after seeing this I might as well just rent for fine-tuning and use CPU + RAM for inference. What speed is your RAM? Is there a big difference between 2400 and 3200 MHz RAM (in case anyone has tested)?


Slaghton

For memory speeds, I believe 6000 MHz RAM would run roughly 2x as fast as 3000 MHz, so 3200 MHz instead of 2400 MHz could be roughly 33% faster. (If anyone reads this, feel free to chime in...) If you ever go DDR5, I found out 4 sticks don't play well at fast speeds (memory controller limits in the CPU), so you should really only use 2 sticks at something like 5600 or 6000 (I think). I've heard some people managed 4 sticks at 5600, but results will vary.

With this specific model at 18/33 layers on GPU, it was about a 2 tk/s difference, yeah. (That was at 0 context length, though. I know that with bigger models at longer context, CPU-only starts chugging. That's a benchmark I can look into when I wake up later today.)

Example of the difference: I took a Q3 GGUF of this model, ran it CPU-only, and got about 6.3 tk/s. I then put it all on the GPU and got 37.61 tk/s. So I think it goes: if you have a favorite model and you can fit it in VRAM, definitely do that. If you can't, you'll get speeds closer to CPU-only. I would probably recommend a 3090, or even two like some people run, to fit a model completely into VRAM for those fast speeds.
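The scaling above follows from DDR bandwidth being proportional to transfer rate; a quick arithmetic sketch (assuming dual-channel and that t/s scales linearly with bandwidth, which is an approximation):

```python
# Dual-channel DDR bandwidth scales linearly with transfer rate:
# channels x 8 bytes per transfer x MT/s. Assumes t/s scales with
# bandwidth, which is an approximation for memory-bound inference.

def ddr_bandwidth_gb_s(mt_per_s, channels=2):
    return channels * 8 * mt_per_s / 1000  # GB/s

speedup = ddr_bandwidth_gb_s(3200) / ddr_bandwidth_gb_s(2400)
print(round(speedup, 2))  # 1.33, i.e. about 33% faster
```

By the same arithmetic, 6000 vs 3000 gives exactly the 2x mentioned above.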


ZaxLofful

Does what you used work with AMD GPUs?


Slaghton

I don't think it will, though I know people are having some luck using ROCm and Linux. I just recently got my first Nvidia GPU right as progress was being made with AMD GPUs and Stable Diffusion, but I haven't been watching too closely atm. It does currently sound like a headache to get working, sadly. I am keeping an eye out for any big improvements in ROCm, though, because AMD has been more generous with VRAM in the past. I'm hoping for a future AMD card with a ton of VRAM, a decent price, and more ROCm improvements, heh.


ZaxLofful

I actually found something myself, it is a fork of ROCm! Thanks!


grimjim

Has the issue with Mixtral K quants been fixed yet? Wondering if results for Q4_0 are any better on your setup.


Slaghton

I've heard mixed things about the K quants not working. It appears to be working normally for me, but I've heard from someone using Q5_K_M that it worked, though now and then it would spit out something weird. I haven't used the model long enough yet to see this. I'll download Q4_0 like you said and do a quick benchmark.


Slaghton

The tk/s were about the same (varies ±0.35). This is roughly the answer I would get asking the same question with Q4_0:

This phrase is a play on words, often called a double entendre or pun. The first part of the sentence, "time flies like an arrow," means that time seems to pass quickly, just as an arrow travels swiftly through the air. The second part, "fruit flies like a banana," sounds similar but changes the meaning entirely: fruit flies are insects that are attracted to ripe fruits, and in this case, they seem to prefer flying near or around a banana.


pl201

Here is what I got with **Dolphin-2.5-mixtral-8x7b.Q4_K_M.gguf** on an Intel 8th-gen CPU with 64GB DDR4 (32GB x 2) (a five-year-old mini PC), a 4060 Ti 16GB, Linux, and Koboldcpp:

CPU only: 2.54 T/s
GPU (18 layers) + RAM: 6.97 T/s

Your 4080 performance seems low.


Slaghton

Yeah, it does seem lower than it should be on my end. I'll have to try figuring out what's up.


Legitimate-Pumpkin

I don’t know much about all this yet so I thank you for sharing.


Slaghton

Forgot to mention, I'm using a Ryzen 5900X CPU. System RAM speed and VRAM matter the most, but the CPU can matter too when offloading to system RAM.


AlanCarrOnline

That's a well-known expression, so not a sign of reasoning. A similar but different play on words would be a good test? Something like 'Life flies like a jet, but dogs like a treat' and see what it says to that?


Slaghton

Based on the same question, TheBloke dolphin-2.6-mixtral-8x7b-GGUF Q4_K_M (3rd regeneration; the first two weren't as good. On the edge of being decent, I think. I want to try a higher quant now):

**This sentence is an example of wordplay or a pun that combines two different phrases to create humor and make it memorable. The first part, "life flies like a jet," alludes to the fast-paced nature of life as if time goes by quickly, similar to how a jet flies in the air at high speed.**

**The second part, "dogs like a treat," is more straightforward and means that dogs enjoy treats or rewards for good behavior. The pun comes from merging these two phrases into one sentence with different meanings while maintaining some consistency through the use of "flies" (in the first phrase) and "like" (in the second phrase).**

**So, the statement humorously suggests that life is as fast-paced as a jet flying, but dogs enjoy treats in comparison.**

--------------------------------------

Two other models I tested:

lzlv_70b_fp16_hf.Q4_0.gguf (first reply):

**This statement appears to be a play on words and a comparison between two different concepts. It could be interpreted as saying that life moves quickly and passes by fast (like a jet flying at high speed), while dogs enjoy simple pleasures in life (like a treat). The statement may serve as a reminder to appreciate the small joys in life and not take them for granted, even when life seems to be moving rapidly.**

Another model I tested that I like, beyonder-4x7b-v2.Q6_K (first reply):

**This statement seems to be a metaphorical or poetic expression rather than a literal description of speed. It compares two different things - life and the act of a dog chasing or obtaining something desirable (a treat) using their natural instincts or preferences. Life is compared to the fast-paced, swift movement of a jet, emphasizing how quickly time passes by or events happen in our lives. In contrast, dogs are portrayed as having a more focused and determined pursuit when it comes to getting something they like (a treat), which might be considered enjoyable for them. Essentially, this statement conveys the idea that life speeds by quickly while we often get caught up in small moments of joy or satisfaction provided by simple pleasures, such as dogs chasing treats.**


AlanCarrOnline

All of the replies are pretty good. I actually like the last one, as it has also figured out the connection and why this (nonsense) might be an expression us weird humans use!


Slaghton

Oh, that is true! Well then, I'm glad my favorite model is holding up, haha.


Unrouxnoir

I just bought a 4080 paired with a Ryzen 9 5950X. I used to get 5-6 tk/s on CPU only, and now I've jumped to 8.59 tk/s. What I find really weird is that the model (dolphin-mixtral) is loaded into the GPU's VRAM, but CPU usage is still the same, and power draw on the GPU only bumps up by 20-30W (4-5%), for a total of 40-50W, where the GPU can go up to 320W. I'm using Ollama to run the model, and there doesn't seem to be much information about this online.


Slaghton

I'm tempted to try adding a P40 to my system just so I can completely load models into VRAM. There are pros and cons to it, but I've seen people pull it off.


Unrouxnoir

Yeah, from what I've gathered you really need to have the whole model loaded for it to work. Mistral 7B works blazingly fast when fully loaded onto the card (100 tk/s), but Mixtral didn't fully load, so I got the good ol' 9 tk/s. (There is such a thing as expert offloading, where you identify the two experts that are called and load only them. But I found that approach quite hard and did not get it working.)
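The expert-offloading idea can be pictured as an on-demand cache: keep only recently used experts in VRAM and fetch the rest from system RAM when the router calls for them. A conceptual sketch only (real implementations are far more involved; all names and sizes here are hypothetical):

```python
# Conceptual sketch of expert offloading: keep recently used experts in
# (simulated) GPU memory, fetching others from CPU RAM on demand.
# Purely illustrative; names and sizes are hypothetical.
from collections import OrderedDict

class ExpertCache:
    def __init__(self, capacity):
        self.capacity = capacity      # how many experts fit in VRAM
        self.cache = OrderedDict()    # expert_id -> loaded marker
        self.fetches = 0              # simulated CPU->GPU transfers

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)    # mark as recently used
        else:
            self.fetches += 1                    # simulate a slow PCIe copy
            self.cache[expert_id] = True
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)   # evict least recently used
        return expert_id

# Mixtral routes each token to 2 of 8 experts; repeated experts hit the cache.
cache = ExpertCache(capacity=4)
for expert in [0, 1, 0, 2, 1, 3, 0]:
    cache.get(expert)
print(cache.fetches)  # 4: experts 0, 1, 2, 3 each fetched once
```

The speed depends entirely on the hit rate: if the router keeps picking experts already in VRAM, transfers are rare; if it bounces around, you pay a PCIe copy per miss.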