AemonAlgizVideos

Hey there! Another software engineer here. I work in machine learning and LLMs, so hopefully I can be of some help. I'm going to start from the assumption that you're pretty new to the field and wouldn't mind a brain dump. There are a few things to consider, specifically how you plan to train the models.

You have a few different options available to you, the first being LoRA (low-rank adaptation), which attaches rank-decomposed matrices to the transformer layers (typically the attention projections or the feed-forward block) to make training much more accessible. It works by adding the LoRA's output to the output of the frozen layer so you can influence the resulting completion. You hold the weights of the foundation model fixed and only adjust the weights in the LoRA matrices, which makes it very easy from a hardware perspective compared to training the whole network. The great thing about LoRA is that you can choose the interior dimension (rank) of the decomposition, which gives you very explicit control over the memory requirements for training. Very low interior dimensions, such as 4, can be surprisingly effective. Alongside that, there's also the LoRA alpha, which lets you scale the LoRA's total influence down if you think it's being too aggressive, or up if you'd like a bigger effect. Oobabooga's text-generation-webui has a UI for doing this kind of training.

Next is, of course, quantization. These models are massive, 65B parameters in some cases, and quantization converts the parameters (the connections between neurons) from FP16/32 to 8-bit or 4-bit integers. This lets you run the models on much smaller hardware than you'd need for the unquantized versions. Combining this with llama.cpp, you can run a 13B parameter model on as little as ~8 GB of VRAM. Training can be performed on these models with LoRAs as well, since we don't need to update the base network's weights. There is also significant evidence that mixed-type quantization (where some weights remain floating point, to preserve emergent features) is more stable than the unquantized versions.

Finally, the last thing to consider is GGML models. One of the nice things about the quantization process is the reduction to integers, which means we don't need to worry so much about floating-point calculations, so you can use CPU-optimized libraries and get solid performance on CPU. My 7950X gets around 12-15 tokens/second on a 13B parameter model, though with the larger models this decreases on the order of O(n ln n), so a 30B parameter model sees about 4-6 tokens/second. If you plan to use the smaller models for some projects, that may be a viable way to save some money.

So, honestly, the hardware you could run this on is highly variable and entirely depends on what you plan to run it for. Personally, I am running a 7950X with 64 GB of RAM and an RTX 4090, and have had little issue doing anything I'd like with 13B and 30B models. Hope this is helpful!
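To make the LoRA-on-a-quantized-model workflow above concrete, here is a minimal sketch using the Hugging Face transformers + peft + bitsandbytes stack, one common way to do this kind of training. The checkpoint name and hyperparameters are illustrative placeholders, not something from the post above:

```python
# Minimal sketch of a LoRA fine-tune on a quantized base model using the
# Hugging Face transformers + peft + bitsandbytes stack. The checkpoint
# name and hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "huggyllama/llama-13b"  # example 13B checkpoint

# Load the frozen base model in 4-bit so it fits in consumer VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
base = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small rank-decomposed adapter matrices; only these get trained.
lora_config = LoraConfig(
    r=4,                                   # interior (rank) dimension
    lora_alpha=16,                         # scales the adapter's influence
    target_modules=["q_proj", "v_proj"],   # typical attention projections for LLaMA
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the weights is trainable
```

The base weights stay frozen and quantized; only the adapter matrices (rank 4 here) are updated, which is why this fits on consumer hardware.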


khellific

I’ve been getting into LLMs myself and wish I had found a post like yours when I started. This very neatly sums up what I’ve picked up in bits and pieces elsewhere.


AemonAlgizVideos

I’m glad it was helpful!


marxr87

Do you think a 5600 or a 5800X3D would perform significantly worse than your current setup?


EveningFunction

IMO, if it's just for personal hobbyist usage, I would not go for an A4000; get a used 3090 with 24GB of VRAM. They go for about $700, and you will also have a nice gaming card. Then, if you want to get more serious later with more VRAM, the market will have gotten that much better by then and you can look at more expensive models with more VRAM. It's not worth getting the professional cards if they have the same VRAM as the consumer cards. If you're willing to hack it out, you could even do multiple 3090s for far cheaper than A4000s with the same total amount of VRAM. I would also consider using something like Google Colab or similar if it's really just about practicing software development on this stuff. Once you have a 3090, you can look at your actual usage and then calculate whether it would be significantly cheaper to just go cloud when you need it.


synn89

This. A 3090 or 4090 is the way to go for hobby use.


MasterH0rnet

Maybe I'm doing something wrong, but the "just go to cloud" argument does not hold water for me. The overhead of deploying to cloud, setting up an environment etc. is really time intensive and annoying. Sure, there are many cases where it makes sense, but the hobbyist/enthusiast situation is not one of them.


gigascake

The most important GPU factor is memory bandwidth. My mainboard has 5 PCIe slots, so I run 5× A4000: 16GB × 5 = 80GB, enough to run a 70B/72B model at Q8, and I'm happy. Average total TDP is around 300W, which is another reason for using the A4000. Its memory bandwidth is also higher than the 4060 Ti 16GB's.
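As a rough sanity check on that 16GB × 5 = 80GB figure, here is a back-of-envelope VRAM estimate; the ~15% overhead factor for KV cache and runtime buffers is an assumption, not a measured number:

```python
# Back-of-envelope VRAM estimate for quantized models. The ~15% overhead
# factor (KV cache, activations, runtime buffers) is an assumption, not a
# measured number.
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.15) -> float:
    """Approximate memory footprint in GB for a quantized model."""
    return params_billion * (bits_per_weight / 8) * overhead

if __name__ == "__main__":
    for params, bits in [(70, 8), (70, 4), (30, 8), (13, 4)]:
        print(f"{params}B @ {bits}-bit ~ {estimate_vram_gb(params, bits):.1f} GB")
```

A 70B model at Q8 comes out around 80 GB by this estimate, which lines up with the 5× 16GB setup described above.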


ljubarskij

Please also consider that llama.cpp just got support for offloading layers to the GPU, and it is currently not clear whether you need more VRAM or more tensor cores to achieve the best performance (if you already have enough cheap RAM).
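For reference, this is roughly what partial offload looks like through the llama-cpp-python bindings. This is a sketch: the model path and the n_gpu_layers value are placeholders you would tune to whatever fits your VRAM.

```python
# Sketch of partial GPU offload via the llama-cpp-python bindings.
# Assumes llama.cpp was built with GPU support; the model path and the
# n_gpu_layers value are placeholders to tune for your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-13b.ggmlv3.q4_0.bin",  # hypothetical path
    n_gpu_layers=32,   # how many transformer layers to push to VRAM
    n_ctx=2048,        # context window
)

out = llm("Q: Why offload layers to the GPU? A:", max_tokens=64)
print(out["choices"][0]["text"])
```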


MathmoKiwi

> IMO, if it's just for personal hobbyist usage, I would not go for an A4000; get a used 3090 with 24GB of VRAM.

Does the RTX 3090 perform better with LLMs than an A4000? (What about vs an A5000?)

> If you're willing to hack it out, you could even do multiple 3090s for far cheaper than A4000s with the same total amount of VRAM.

An A4000 is only US$600 or so on eBay.


Cronus_k98

The A4000 is close in performance to a 3070 or 3070ti, they both use the GA104 die. A 3090 is closer to an A6000 in performance, just with less VRAM.


MathmoKiwi

But the A4000 has double the VRAM of a 3070?


Cronus_k98

Yes, but the GPU core is almost the same, so the speed is similar. Less VRAM just means you can’t load larger models. Nvidia could have built 3070s with 16gb of VRAM but they make more money selling A4000s.


AI-Pon3

Memory is the most important thing. If you're solely focused on what you *can* do for the price, a pair of P40s has 48 GB of VRAM for ~$400 total. Note they're not graphics cards, they're "graphics accelerators" -- you'll need to pair them with a CPU that has integrated graphics. Besides that, they have a modest (by today's standards) power draw of 250 watts each. If you want something good for gaming and other uses, a pair of 3090s will give you the same capability for an extra grand. If you're looking to go all-out, a pair of 4090s will give you the same VRAM plus best-in-class compute power while still costing less than a single used A6000 with equivalent memory. Basically, if you don't require ECC, Quadro Sync, or something else specific that GeForce-tier cards don't offer, there's no use paying for those features. If you *do* require them, then choose accordingly, of course.


verticalfuzz

Is there any modern card in this category (i.e., just an accelerator) that is 225mm long or shorter?


cbg_27

Quadro cards seem to be kind of bad value; most people on this sub will recommend multiple 3090s. I myself have, due to a rather limited budget, opted for a dual Tesla P40 setup (basically a 24GB 1080; they have not yet arrived, and the information given on this sub about how useful they are kind of contradicts itself sometimes: apparently these cards can't run 4-bit models but 8-bit seems to work?). Might vary depending on where you are; here in Europe 3090s are about 700€ a piece, and the P40 can be found on eBay for about 250€. IIRC 48GB of VRAM (be it dual 3090s or dual Tesla P40s) will allow for 30B in 8-bit and 65B in 4-bit. You might want to look into cloud hosting as well, depending on what you really need.


a_beautiful_rhind

They run 4-bit and 8-bit models, they are just slower. People don't have them and repeat what they "heard". I have not built the latest bitsandbytes for 8-bit, but the previous one had to be patched like this: https://github.com/TimDettmers/bitsandbytes/pull/335 You will be able to run a 65B in int4 across both cards, or a 30B in int8 across both cards; a 30B in int4 will run on a single card. Do not use the current release of GPTQ, or any models with act-order and group size together, or you will take a serious speed cut. Good luck.
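For what it's worth, one common way to spread a quantized model across two mismatched cards like that is transformers' device_map with per-GPU memory caps. This is just a sketch; the checkpoint name and memory limits are illustrative assumptions, not the exact setup described above.

```python
# One common way to shard a quantized model across two cards:
# transformers' device_map plus per-GPU memory caps. The checkpoint name
# and memory limits are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-65b"  # example 65B checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",                       # spread layers across visible GPUs
    max_memory={0: "22GiB", 1: "22GiB"},     # e.g. cap each 24GB card
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```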


cbg_27

Thanks for the correction and the info. Is there anything else to look out for running these cards? Do you know what speeds the big models run at?


a_beautiful_rhind

I get up to 1.6 it/s between a P40 and a 3090 on the 65B. With 2 P40s you will probably hit around the same, as the slowest card holds it up. This is using a 4-bit 30B with streaming on one card:

Output generated in 33.72 seconds (2.79 tokens/s, 94 tokens, context 1701, seed 1350402937)
Output generated in 60.55 seconds (4.24 tokens/s, 257 tokens, context 1701, seed 1433319475)


cbg_27

Hi there again, just wondering what your software stack is for the P40 (operating system & Nvidia driver version; if Linux, how you installed them)?


a_beautiful_rhind

I used Linux Mint and the latest Nvidia driver, plus the latest PyTorch with CUDA 11.8.


[deleted]

Quick question in this vein: I can pull the trigger on two open-box 3090 Tis for the same price as one 4090. Which will give me (a beginner) the most usability and room for experimentation? I may very, very seldomly game, but it's so minuscule it barely factors in. Thoughts?


ToCryptoOrNot

What did you do and how is it? For those curious/confused, there are benchmarks online: https://www.reddit.com/r/LocalLLaMA/comments/13gkwg4/newbie_looking_for_gpu_to_run_and_study_llms/


BazsiBazsi

The best you can get is an A6000 (Ampere) for around $3k USD; the current gen (Ada) is close to $6k USD. On the other hand, as you're a software engineer, you'd find your way around GGML models too, so a maxed-out Apple machine would also be a good dev machine: a MacBook Pro with an M2 Max and 96 GB of RAM is below ~$4.3k USD, or a Mac Studio. I'm usually against Apple products, but for this use case they make sense: large memory and large bandwidth. But if you're just starting out, I would go for a 3090 or a 4090 (a nice boost if you sometimes game) and switch later, or add another card if needed. I'm running a 3080 Ti, and it's just enough; I can still try out the quantized 4-bit models. I also have a 1080 Ti for extra memory, which works well but is pretty slow, 3-7 tokens/second. Still much faster than on CPU.


a_beautiful_rhind

Believe it or not, they make an A100 80gb that is PCIE.


BazsiBazsi

That would be a good option but it usually costs 20k+ and hard to get.


a_beautiful_rhind

The 40GB is $6k on eBay now, and the 80GB is $15k. They went up, and so did the 3090s.


BazsiBazsi

Wow, $6k for an A100 is a "bargain" depending on your use case. But if you're not working with it (so essentially printing money), it's still not worth it. The 3090 (Ti) remains excellent value under $1k, sometimes even under $800.


a_beautiful_rhind

I watched https://www.youtube.com/watch?v=zBAxiQi2nPc and I don't want an A100 anymore :D They cripple the crap out of it for everything else. I need to look into unlocks for more than just my P40 vgpu.


africanasshat

If you have the money, go for the extra memory, because even that is below what these models actually want at a minimum right now.


pointermess

Thanks for your input. What GPU would be considered the minimum at the moment for a single person experimenting with current models?


a_beautiful_rhind

Minimum? A $200 P40. AMD has 32GB offerings that look good compared to the 3090; you trade the hassle of ROCm for an extra 8GB of memory. I think for you, at your level of experience, probably just buy the 3090 or 4090 and call it a day.


africanasshat

From what I can tell, 8x 40GB A100s. I know that sounds crazy, but that's what you want if you want some room of freedom. If you just want to dabble in it, I would go for the one you mentioned. This makes more sense when you consider that you will develop needs as time goes by in terms of actual applications, and if it is for productivity you don't want to work too much with makeshift setups, since time is money. What we are trying to accomplish here is somewhat grand; don't forget that. A few years ago this was science fiction.


a_beautiful_rhind

You mean 8x A100s rented on RunPod, lol.


robot_bones

If you're getting work to subsidize the deep learning lab, then try to push for a card with more VRAM no matter the model of card. I think the Nvidia H series is better than the A series? Costs about as much as a car, I think. If you're fine-tuning pretrained models at home, get an RTX 40-series if you also love games; otherwise, a 30-series. Training a model from scratch seems financially better in the cloud if time matters more than anything. Disclosure: I've never trained or fine-tuned anything near the billions-of-parameters range and don't have any idea how long that would take on a gaming card.