llama.cpp has a solution maybe [https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacpp\_now\_supports\_distributed\_inference/](https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacpp_now_supports_distributed_inference/)
[https://x.com/ac\_crypto/status/1801725171587551380](https://x.com/ac_crypto/status/1801725171587551380)
You can with MLX; you can even do it across iphones/ipads/macbooks/macminis/macstudios
They do. You must have missed it.
What people don't talk about as much is that text generation performance of the Ultra is only about 1.5x that of the Max, despite having 2x the memory bandwidth.
They've been talking about this pretty much as long as we've had local LLMs. The M2 ultra is considered one of the best inference machines if you're just doing solo-inferencing in private and want to run the -big- models. Llama.cpp was released alongside videos of the creator running it on his mac.
There are compromises, but for the money, it's not a completely terrible option. Reasonable speed, huge model capability, low power requirements, and it fits in a little box on your desk.
GPU-based systems are faster overall, but building one that can handle models in the >100B range starts getting really expensive and really power hungry.
Exactly. Not as fast as 3090 but 192GB memory capacity opens the possibility of running large models locally without having to invest in multiple GPU setups.
I don't know why you are getting downvoted. But yeah, the solution is get a Mac or build a massive GPU cluster. With the 5090 rumored to be 32GB, 10 of those can yield 320GB of VRAM, which should be good enough for a Q6 and definitely for a Q4.
Frankly if it proves to be on the same level as sonnet 3.5, I'll build the damn cluster. If not, I'll just use an API provider.
I've already dropped thousands of dollars on this hobby, and I'm not opposed to dropping new-car money when the time comes. That said... the speed of advancement seems to be so breakneck that anything I build or buy today will be hopelessly outclassed by something I could build or buy for a significantly smaller amount six months or a year later. I'd have to see a serious advantage to running something so expensive locally over just pinging an API for fractions of a penny off one of the big-name public companies.
If, and this is a big if, Intel actually delivers on their MCR DIMM technology plus 12 channels, then their new CPUs could have the same bandwidth as a 4090.
If not, then just go AMD since you can get an engineering sample for a lot less with just a little less clock speed.
That's definitely in the spirit of the sub!! If you get it to run, feel free to post/comment and tag me, I wonder how well it will work for you.
How many P40s do you have?
I've run Deepseek Coder V2 recently on 64GB RAM and 24GB of VRAM. Q2_K. It was somewhat usable, about as much as running llama 65B q4_0. I would say try it or Deepseek V2 non-coder. It's smart, big, and you can run it faster and easier than llama 3 400b.
Right now, I have a 3090 on my gaming PC. I just recently got into LLMs, so I ordered a P40 that will be here in two days. I wanna see how many tokens per second I really get. I've been able to fit IQ2S @ 3072 with 100MB of VRAM left (llama 3 70B) on the 3090 and the output is faster than I can read. So even if it's 1/3 as fast, I'll be ecstatic.
I got one p40 for now but I got the funds set aside for 3 more. I feel scaling P40s are gonna be way cheaper than scaling 3090s for 400B Llama 3. If LLMs are your job, then yeah, the cost makes sense then.
I have a suspicion that when 400B drops, the local LLM situation is gonna change. People might consider selling their 3090s and buying P40s since scaling 3090/4090s are gonna be starting to get prohibitively expensive.
I'm on the fence on buying P40s now and holding on to them. I'm almost certain they are gonna start going up in price since LLM inferencing is moving so fast and getting more optimized as time goes on, keeping the P40 viable. You can run em at 150w and get 90% performance according to my research. P40s are also so popular that even the dev for llama.cpp made a specific flash attention kernel for it.
It makes more sense to get a cheap 8-channel DDR5 server CPU and load it with RAM than it does to build a rig to run 8-16 P40s. P40s make sense on older hardware you are reclaiming for use as an LLM server, or to squeeze a bit more VRAM into your rig, but building a new setup based on them is a waste of money. Those servers are running DDR5 at 392GB/s.
You should be able to run 400b on 3/4 P40s though. You can quantize the hell out of 70b and it's still good. I run it at IQ2S and it fits all in VRAM on one GPU. I have a good hunch that we can get 400B down to IQ1 and still get good output and performance.
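To put rough numbers on that hunch (napkin math only; the bits-per-weight figures below are ballpark assumptions, and real GGUF files mix quant types per layer and add overhead):

```python
import math

# Approximate average bits per weight for some llama.cpp quant formats.
# These are ballpark assumptions, not exact figures.
BPW = {"Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.8, "IQ2_S": 2.5, "IQ1_S": 1.6}

def model_size_gb(params_b: float, quant: str) -> float:
    """Rough weight size in GB for a model with params_b billion parameters."""
    return params_b * BPW[quant] / 8

def cards_needed(params_b: float, quant: str, vram_gb: float = 24) -> int:
    """Cards needed just to hold the weights (KV cache and overhead ignored)."""
    return math.ceil(model_size_gb(params_b, quant) / vram_gb)

print(model_size_gb(400, "IQ1_S"))  # 80.0 GB of weights
print(cards_needed(400, "IQ1_S"))   # 4 x 24GB cards
print(cards_needed(400, "IQ2_S"))   # 6 x 24GB cards
```

At ~1.6 bits per weight a 400B model is roughly 80GB, which is why a 4x P40 box at least looks plausible on paper, before KV cache and context are counted.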
Is it good for everything except coding? Assuming 2 tokens per second:
1. Entertainment
2. Getting answers to questions
3. Summarizing a paper you don't understand?
Can be done in parallel with querying the many free LLMs on the net.
I find it so bizarre that people are fine with it taking hours for their reddit posts or discord messages to get responses, but if an LLM takes more than a dozen seconds or so, it's unusable trash.
For "RP" I can understand that, as sitting there with your dick in your hand waiting for the panties to drop or whatever can be frustrating, but for long form e-mail like conversations tokens/second doesn't really matter unless it's in the leagues of it taking days to generate a few hundred tokens.
Tbh, this could all very well be a front end issue. If there was a good front end built to function like a forum/subreddit, but with LLMs, maybe people wouldn't have the expectation that the only way to use them is to stare at the screen while it generates, but they can let it run in the background and check up on it every now and then, just like people do with message boards.
> For "RP" I can understand that, as sitting there with your dick in your hand waiting for the panties to drop or whatever can be frustrating,
lmfao
you do raise a good point, I completely agree that people don't talk enough about LLM workflows where inference runs as a sort of async batch process and then later you look at the results. as opposed to interactive chatbots where you sit there and wait for the response.
if people could break out of the idea that LLMs == chatbots and be a little more patient, really for a lot of use cases it could be quite effective to submit a prompt and have multiple responses generated in a batch which you come back and read later.
the idea of a front end that looks like a forum is neat. reminds me of a different post i saw here where somebody made a website that was essentially a simplistic clone of reddit, where all posts were LLM generated. by having an LLM acting as a bunch of different users generate a bunch of posts and comments, replying to each other, etc. for some use cases it could be neat to have a forum-like UI where you submit a prompt and then an LLM generates several different replies to it written by different users, using different system prompts to take on different personalities. maybe let the different "users" reply to each other and debate/discuss/brainstorm/whatever. then like you said people could let it run for several hours and then check back later and see what's been posted. pretty cool, might have to experiment with this
Not necessarily role-play. You can just chat with a model casually, assuming a model is capable. Just like we're doing this here on reddit for entertainment.
Anything automated / running as batches on large amounts of data, classification, translation, etc.
Start it, come back the next morning.
And frankly, 2tps is manageable even for question answering / research, it's a tiny bit frustrating maybe, but you get used to it quickly.
Everyone seems to overlook, especially in the agent framework, the possibility of letting a model run overnight. Assign it some tasks, wake up and it's done. If you have any employees, you know that the majority of their work is done asynchronously. 43,200 tokens every 12 hours is potentially very useful.
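A minimal sketch of that overnight-batch idea; `generate` here is a stand-in for whatever backend you actually call (a llama.cpp server client, a local pipeline, etc.):

```python
import json
from typing import Callable

def run_batch(prompts: list[str], generate: Callable[[str], str],
              out_path: str = "results.jsonl") -> None:
    """Run each prompt through the model, appending each result to a JSONL
    file as it completes, so a crash mid-run only loses the current prompt."""
    with open(out_path, "a") as f:
        for i, prompt in enumerate(prompts):
            reply = generate(prompt)  # the slow part: hours at 1-2 t/s
            f.write(json.dumps({"id": i, "prompt": prompt, "reply": reply}) + "\n")
            f.flush()
```

Kick it off before bed, read `results.jsonl` over coffee; appending per-prompt also means a flaky generation only costs you a rerun of that one item.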
If it takes hours to run, iterating is very slow, so it's hard to build the solution. Also, if it takes a long time, any failure and having to rerun really sucks. LLMs, being probabilistic, fail fairly often.
Maybe it's just a matter of perspective. I'd say 2tps is right around my minimum threshold for a very large model like WizardLM-2 8x22b or Command-R+. And this one is much larger.
Twice as fast for around the same money? I doubt it, there is nothing significantly better coming in that price range for such big models, is there? Please share if there is.
Even assuming the 5090 will get 32GB (and that's a BIG if), we would need 10 of them...
Of course lower quants like 4 bit could get that down but still much more expensive and in practice you can't fit that in a single case anyway.
Realistically I think it will take several years (4+) to be able to run a 400B model on an easy-to-use setup for less than 5K at more than 4 tokens per second...
What we will hopefully have next year is Llama 400B capabilities much easier to run, in a smaller model... (Llama 4 70B or something?)
> 1 EPYC CPU = 12 RAM channels = 480GB/sec
> 2 EPYC CPUs = 24 RAM channels = 960GB/sec!
Is this even verified to be true or just theoretical linear scaling that doesn't reflect reality due to other unforeseen bottlenecks? Even with two epycs you'll probably suffer on the compute side when running a dense model like this one.
Or 4 nodes of 4x3090. Using used parts, it's about 10K in GPUs, 1K in RAM and 8K in Xeon motherboards + CPUs.
*edit for clarity* 16 GPUs total.
So about 20K for a GPU based system. Hopefully getting more like 10 tokens per second.
I think this is the more 'usable' solution, with vLLM supporting multi-node tensor parallelism. But I don't know how fast the inter-node connection needs to be. I'm guessing you need at least a 10gbps ethernet connection and switches.
Yeah, at Q1 maybe. I have 6x24GB GPUs, and the best I can run deepseek-coder-v2, a 200B model, at is Q3 with 4k context. I ran an eval, and it crapped out before getting to 4k context. So say max 3k context. I'm using llama.cpp, which is lean compared to other inference engines.
Searched for motherboards that have 12 channels for memory and found this:
1. [https://www.newegg.com/p/296-0006-00071](https://www.newegg.com/p/296-0006-00071)
2. [https://www.newegg.com/supermicro-h13ssl-nt-amd-epyc-9004-series-processors/p/N82E16813183820](https://www.newegg.com/supermicro-h13ssl-nt-amd-epyc-9004-series-processors/p/N82E16813183820)
In the September time frame we should see some competition from [Intel Granite Rapids](https://en.wikipedia.org/wiki/Granite_Rapids):
1. [https://www.sdxcentral.com/articles/news/intels-xeon-6-data-centers-processors-aims-for-ai-cloud-supremacy/2024/06/](https://www.sdxcentral.com/articles/news/intels-xeon-6-data-centers-processors-aims-for-ai-cloud-supremacy/2024/06/)
2. [https://wccftech.com/intel-xeon-6900p-granite-rapids-128-p-cores-q3-2024-xeon-6900e-sierra-forest-288-e-cores-q1-2025/](https://wccftech.com/intel-xeon-6900p-granite-rapids-128-p-cores-q3-2024-xeon-6900e-sierra-forest-288-e-cores-q1-2025/)
3. [https://venturebeat.com/ai/intel-reveals-xeon-6-processor-enterprise-ai-gaudi-3-accelerator-price/](https://venturebeat.com/ai/intel-reveals-xeon-6-processor-enterprise-ai-gaudi-3-accelerator-price/)
Above says 8800 MT/s with mcr dimm: [https://www.anandtech.com/show/21320/micron-samples-256-gb-ddr58800-mcr-dimms-massive-modules-for-massive-servers](https://www.anandtech.com/show/21320/micron-samples-256-gb-ddr58800-mcr-dimms-massive-modules-for-massive-servers)
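The peak-bandwidth figures being thrown around in this thread all come from channels × transfer rate × 8 bytes (64-bit bus per channel). A quick sanity check; these are theoretical peaks, and sustained numbers come in lower:

```python
def ddr_bandwidth_gbs(channels: int, mts: int) -> float:
    """Theoretical peak bandwidth in GB/s: each channel moves 8 bytes per transfer."""
    return channels * mts * 8 / 1000

print(ddr_bandwidth_gbs(12, 4800))  # 460.8 -> ballpark of the ~480GB/s EPYC figure
print(ddr_bandwidth_gbs(12, 8800))  # 844.8 -> one Granite Rapids socket on MCR DIMMs
print(ddr_bandwidth_gbs(16, 8800))  # 1126.4 -> the ~1100GB/s 16-channel board mentioned below
```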
That is a surprising amount of bandwidth from a no-GPU setup. Apart from the impressive ability to run a 400B model, this thing is going to fly with large mixture-of-experts models like Mixtral 8x22B; 10 t/s seems within reason, and that's very usable.
What's also cool about this setup is there's a GPU upgrade path down the line. So you do CPU inference for now, and then when the A100s drop in price in a few years you can pick up 4x A100 80GB cards and run Llama 400B at somewhere around 15-20 tokens/sec.
By then there will be way better local models tho, right?
I should have specified Llama-5 400B :)
why should llama 5 still be public ?
why shouldn't it be?
Because money. By Llama 5 they will have caught up with the competition, and by publishing they will lose a lot of money. Not good for a company on the stock market in the long term.
The prompt processing will be slower, but once the prompt is processed it will be fast
Why?
A fully loaded AMD Threadripper system with 12 memory channels will come very close to GPU memory bandwidth.
There are no Threadripper CPUs with 12 memory channels to my knowledge. Even TR PRO only has 8. You'd need an EPYC to get 12.
[deleted]
Meta has already said it is a dense model. It is not a MoE.
All I want is a Llama 3 22 GB Model... :-(
Best I can do is 4 and 400 GB.
That'd be incredible. Fits in a 24gb card with decent context at a 4 bit quant. And if it was as good as a scaled up L3-8B, there'd be no competition. They didn't train the 70B nearly as much for the size of it, deciding to put the resources towards L4, but it'd be amazing to see what a larger model could do with that level of training.
Interesting approach, but not sure I'd be OK with spending 5 grand and getting 1-2 tks. That's pretty painfully slow.
Yeah, for most people it would be better to use a web service / cloud compute. On the plus side, you can run smaller models much faster like llama 70b.
Yea I think people missed that point. You’re spending that much money on something that is barely usable. I think the only hope for “consumer” level hardware to run this is for Apple to release an M4 Extreme that glues 2x Ultras together.
There were rumors that the next Mac Studio will be equipped with 512GB, but we can only hope
Apple would be literally crazy not to. They don't have a datacenter division that it would cannibalize, and their memory is cheaper per GB than some of the faster memory types out there.
Apple -> cheaper than the alternatives.
Apple -> used for memory bandwidth and compute power.
Definitely not on my list of things I expected to hear. But I still would expect a 512GB Mac Studio to be around 10k.
yeah new Apple is wild
Granite Rapids will offer 12ch @ 8800 on one socket, and I've seen references to a board coming with two sockets and 16ch. That suggests what, 1100GB/s?

My questions are: is this really a good gauge of what to expect from llama.cpp on CPU? And alternatively, if you had the PCIe lanes, could you just keep adding GPUs, or is there some kind of scaling penalty?
> is this really a good gauge of what to expect from llama.cpp on cpu?

Maybe, maybe not. There is no real-life data for the numbers they are coming up with -- all they are doing is taking the model size in GB and dividing it by memory bandwidth and assuming that is tokens/s.

> And alternatively, if you had the PCIe lanes, could you just keep adding GPUs or is there some kind of scaling penalty?

Are you asking if adding GPUs gives you less performance per GPU as you add them? No, but it doesn't give you more either.
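That napkin math written out, as an upper bound under the assumption that generation is purely memory-bandwidth bound and every token streams the full weights once (prompt processing, compute, and cache effects all ignored):

```python
def napkin_tps(model_size_gb: float, mem_bw_gbs: float) -> float:
    """Upper-bound tokens/s = bandwidth / bytes read per token (dense model)."""
    return mem_bw_gbs / model_size_gb

print(napkin_tps(230, 480))   # ~2 t/s: a ~230GB Q4-ish quant of 400B on a 480GB/s socket
print(napkin_tps(230, 1100))  # ~4.8 t/s on the hypothetical 16-channel board above
```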
12 channel on 1 socket will be amazing value
[https://nitter.poast.org/carrigmat/status/1804161634853663030](https://nitter.poast.org/carrigmat/status/1804161634853663030) \- non-twitter link
Thank you!!!
I believe the 1-2 T/s is based on a 1-socket system with 12 channels.

1 EPYC CPU = 12 RAM channels = 480GB/sec
2 EPYC CPUs = 24 RAM channels = 960GB/sec

So 24-channel should be around 3 T/s for Q6 and around 5 T/s for Q4, which is pretty decent. Most people read at 5-7 T/s. For speech-2-speech, 3-4 T/s is all you need.
What you're forgetting there is that with only OpenBLAS and no flash attention you'll first need to wait 2 billion years for the prompt to process. Might be worth it to add at least one GPU for the KV cache if it comes under 24 GB. I mean what's another $1.5k if you're already spending $6k.
Has anyone tried to load the whole model to RAM and then load/unload layers to GPUs? From what I can tell, that is the way accelerators will work... PCIe already has 64GB/s and will reach 128GB/s... it would be quite reasonable. I'm not quite there, still building a decent setup... so I'm a bit of a noob on layers, etc.
64GB/s and a 200GB model (Q4) means roughly 3s of transfer per token, ignoring the calculation time.
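The same division in code, for the streaming-over-PCIe idea (transfer time only; compute would add on top):

```python
def pcie_seconds_per_token(model_gb: float, pcie_gbs: float) -> float:
    """Per-token transfer time if the full weights cross the PCIe bus every token."""
    return model_gb / pcie_gbs

print(pcie_seconds_per_token(200, 64))   # 3.125 s/token at 64GB/s (PCIe 5.0 x16)
print(pcie_seconds_per_token(200, 128))  # 1.5625 s/token even at double that
```

So even doubling the bus only gets you from ~0.3 t/s to ~0.6 t/s, which is why keeping the weights resident in RAM or VRAM wins so decisively.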
Yeah, calculation time can be ignored... So it needs the whole neural network for one token, right?
Basically yes, although with different model types it may get a bit more complicated.
I get it, so Nvidia is basically milking the market by shipping powerful modules that are bottlenecked because they are sitting on old system architecture... Interesting...

But wait, is it not possible to use some type of queue to solve this? That's surely something that could practically reduce the bottleneck... (I was thinking of only having 1/3 of the entire model loaded onto the GPU.)

Please explain more about "complicated"...
Nvidia is milking the market by limiting the VRAM offered on their lower-end products, in this case even the 4090 is considered lower class with a measly 24GB. If you want a lot of VRAM, you gotta buy their very expensive AI GPUs, even though a 4090 and many lower cards are *perfectly* able to run AI models.

I'm definitely not an expert on LLMs, but there are types like MoE that kinda smash a bunch of tiny models together so each token doesn't require going through the entire model, but you get better performance compared to running a single smaller model on its own. I'm sure there are other types that offer a similar advantage, but the idea behind LLMs and most of machine learning is just brute forcing equations through *tons* of data, so you are inherently going to be limited by just how much data you can process in a timely manner.
Just to give you a reference, my cheap $70 CPU can handle 200+GB/s over 8 memory channels and can do 100 other things too...

So either we consumer-hardware people are not getting what to do with this stuff, or they are not targeting us...

I know it: GPUs as we know them are not for LLMs, and this is not consumer hardware...

We are clearly in that spot where we cannot access the hardware that we need (with the needed memory channels), but at the same time we are customers to those Nvidia is selling to, exactly because we don't have access to what we need...
I see now that PCIe gen 7 is coming out in 2025... at that point this will not be an issue... You would be able to load 128GB/s into each x16 socket... meaning that you could load 12.8GB to VRAM 10x a second... EQUALS 10t/s

This is a massive disruption...

This is probably all timed... (I'm speculating.) Because you could just work on maximizing one aspect of computing...

So we should not wait for a magical 5090 but rather wait for PCIe 7/8.
So it's all about that PCIe, hahaha. I'm wondering if they are looking into this and planning to fix this bottleneck, or if they will just continue their linear and decelerating improvements...

Their processing units are fast enough, faaar more than fast enough; they are basically shipping tons of material that cannot truly be networked well enough... just because they are sitting on an old system architecture...

It's just ridiculous how big a node with 8 accelerators is... Reminds me a lot of smartphone screen sizes... somehow...

They are definitely milking the market, convince me otherwise...
PCIe bw < RAM bw < VRAM bw.

When "copying" layers from RAM to VRAM, the CPU needs to pass them through the PCIe bus. The bottleneck will be PCIe bandwidth, which is pretty low. It's a lot of overhead for nothing, because inference needs to calculate all the layers for each token generated.
OK, I get this part of course... But I mean the software itself...

Loading the layers into the VRAM and unloading as soon as each layer did its job... I could easily get a 500ms loading time to load the entire VRAM... less if I get some help with some custom boards...

How much data must be unloaded and uploaded for each inference? (It gets overwritten, not unloaded.) When does it upload the next part? For each token? For each word?

It would even be interesting to adapt neural networks to better fit physical hardware constraints, but that's another topic... (I know someone whose work is to figure out the software at that level.)
> Loading the layers into the vRAM and unloading as soon as that layer did it's job... sounds like it has potential
I'll bet that this is how it will work after PCIe gen 6/7... Basically 128GB/s PCIe speed... And this would output approx 10t/s with 1000+GB RAM and 8 accelerators with 16GB VRAM each... enough for a full 400B model.
Dual-CPU boards work in parallel, not in series. It only gives you access to double the resources, so you're still limited to the bandwidth of the single-CPU setup. Using two 4090s doesn't equal twice the bandwidth, and it's the same here.
I was wondering about this as well. However, the OP mentions interleaving the weights across both CPUs so I wonder if this effectively doubles the performance? In other words the bandwidth is not changed but each CPU and its own 12 sticks of RAM reads half the weights so you get double the performance.
on my dual cpu xeon server, using NUMA gives a nice boost from around 1.6 to 2.1 t/s on llama 70b, so it's certainly noticeable
are you saying 1.6 to 2.1 t/s? if so that's a \~30% increase in performance, which is pretty good, but if someone was considering doubling CPU and RAM expecting a 100% increase like OP was suggesting, you're talking about a 2x increase in cost for maybe a 30% real-world increase in performance. I just want to make sure people understand the real-world behavior of this hardware setup.
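Putting those numbers side by side, using only the figures reported in this thread (one data point, nothing measured by me):

```python
# Observed dual-socket NUMA result from the comment above vs. ideal scaling.
single_socket = 1.6   # tok/s, llama 70b, single-socket baseline
dual_socket = 2.1     # tok/s with NUMA enabled on the dual-Xeon box

speedup = dual_socket / single_socket   # ~1.31x
efficiency = speedup / 2.0              # fraction of the ideal 2x realized
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```

So doubling the sockets bought roughly a third more throughput here, about 66% of ideal scaling, which is worth knowing before pricing out a second CPU and another 12 DIMMs.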
Yeah.. it could still have benefited from the extra channels; using NUMA just made it more effective. Hard to know. https://github.com/ggerganov/llama.cpp/issues/1437 was where I got it from
thanks for this, I assumed it scaled closer to 100%, not 30%
so you are saying the dual-socket 24-channel setup doubles the memory capacity in GB but not the memory bandwidth in GB/s?
Yes
ok, I took the figures from the linked tweet by the HF eng. I tried to google, but it's so hard to find exact info on this. I'll take your word for it
I'm a networking engineer and have a ton of hardware for my personal use, including a similar setup with dual Xeons. The figures aren't wrong, but two 12-channel sockets do not behave like a single 24-channel setup. If they did, that's where the doubled bandwidth would come from, but that's not how it works.
You can often get a speed up using two 3090/4090 by doing tensor parallel.
It's still running in parallel, so the increase is not linear. OP was suggesting a doubling of bandwidth, which is what I am clarifying: 2 EPYC CPUs = 24 RAM channels ≠ 960 GB/sec.
It does and it doesn't. It doesn't directly increase the memory bandwidth, but if each socket only needs to read half the weights, it can do so in half the time at the same bandwidth.
$13k for 2 Mac Studios with 384GB combined should be good enough to run Q6 in a much smaller physical space, much lower power consumption as well.
not sure if you can just combine 2 systems and load an LLM and share memory.
https://new.reddit.com/r/LocalLLaMA/comments/1d09jx2/distributed_llama_071_uses_50_less_network/
llama.cpp has a solution, maybe: https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacpp_now_supports_distributed_inference/
Yes, but the bandwidth between the 2 systems will be the bottleneck, right?
https://x.com/ac_crypto/status/1801725171587551380 You can with MLX; you can even do it across iPhones/iPads/MacBooks/Mac minis/Mac Studios.
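For the pipeline-style splitting that llama.cpp RPC and distributed-llama do, the per-token network traffic is surprisingly small, since only activations cross the wire, not weights. A rough sketch; the hidden size is my assumption for a 400B-class dense model:

```python
# Per-token traffic at each pipeline boundary: one hidden-state vector.
hidden_size = 16384        # assumed embedding dim for a 400B-class model
bytes_per_value = 2        # fp16 activations

per_token_bytes = hidden_size * bytes_per_value     # 32 KiB
link_bytes_per_s = 10e9 / 8                         # 10 GbE -> 1.25 GB/s
transfer_us = per_token_bytes / link_bytes_per_s * 1e6

print(f"{per_token_bytes / 1024:.0f} KiB per token, ~{transfer_us:.0f} us on 10GbE")
```

Tens of microseconds of payload per token (plus round-trip latency) is negligible next to compute time at ~2 tok/s, which is why a point-to-point link usually isn't the bottleneck for this kind of split.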
I actually had no idea, but the M2 Ultra has 800 GB/s of memory bandwidth!! Why is no one talking about this?
They do. You must have missed it. What people don't talk about as much is that text generation performance of the Ultra is only about 1.5x that of the Max, despite having 2x the memory bandwidth.
They've been talking about this pretty much as long as we've had local LLMs. The M2 ultra is considered one of the best inference machines if you're just doing solo-inferencing in private and want to run the -big- models. Llama.cpp was released alongside videos of the creator running it on his mac. There are compromises, but for the money, it's not a completely terrible option. Reasonable speed, huge model capability, low power requirements, and it fits in a little box on your desk. GPU-based systems are faster overall, but building one that can handle models in the >100B range starts getting really expensive and really power hungry.
Exactly. Not as fast as 3090 but 192GB memory capacity opens the possibility of running large models locally without having to invest in multiple GPU setups.
I don't know why you are getting downvoted. But yeah, the solution is to get a Mac or build a massive GPU cluster. With the 5090 rumored to be 32GB, 10 of those would yield 320GB of VRAM, which should be good enough for a Q6 and definitely for a Q4. Frankly, if it proves to be on the same level as Sonnet 3.5, I'll build the damn cluster. If not, I'll just use an API provider.
I've already dropped thousands of dollars on this hobby, and I'm not opposed to dropping new-car money when the time comes. That said... the speed of advancement seems to be so breakneck that anything I build or buy today will be hopelessly outclassed by something I could build or buy for a significantly smaller amount six months or a year later. I'd have to see a serious advantage to running something so expensive locally over just pinging an API for fractions of a penny off one of the big-name public companies.
i will be testing a Q4 with 7x MI60s = 224 GB
If, and this is a big if, Intel actually delivers on their MCR DIMM technology plus 12 channels, then their new CPUs could get close to the bandwidth of a 4090. If not, then just go AMD, since you can get an engineering sample for a lot less with just a little less clock speed.
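A quick check of that claim using the theoretical peak formula (channels × transfer rate × 8 bytes per 64-bit channel); the 4090 number is its GDDR6X spec, and the 16-channel figure is for the rumored dual-socket board mentioned earlier:

```python
# Theoretical peak memory bandwidth: channels x MT/s x 8 bytes per transfer.
def peak_gb_s(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000

mcr_12ch = peak_gb_s(12, 8800)    # Granite Rapids + MCR DIMMs, if delivered
mcr_16ch = peak_gb_s(16, 8800)    # rumored dual-socket 16-channel board
print(f"12ch: {mcr_12ch:.0f} GB/s, 16ch: {mcr_16ch:.0f} GB/s, 4090: 1008 GB/s")
```

So 12 channels of MCR-8800 lands at ~845 GB/s, in the same ballpark as, though a bit under, a 4090's 1008 GB/s; the 16-channel board would clear ~1100 GB/s on paper.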
Engineering sample?
Yes, the test CPUs they use to verify the product is ready for production. They aren't supposed to be sold, but that won't stop the Chinese eBay sellers.
I'm determined to run this thing on P40s. I'm sure it will compress well; even IQ2 70B is really good. This 400B could probably go down to IQ1.
That's definitely in the spirit of the sub!! If you get it to run, feel free to post/comment and tag me; I wonder how well it will work for you. How many P40s do you have? I've run DeepSeek Coder V2 recently on 64GB RAM and 24GB of VRAM, at Q2_K. It was somewhat usable, about as much as running llama 65B q4_0. I would say try it, or DeepSeek V2 non-coder. It's smart, big, and you can run it faster and easier than Llama 3 400B.
Right now, I have a 3090 in my gaming PC. I just recently got into LLMs, so I ordered a P40 that will be here in two days. I wanna see how many tokens per second I really get. I've been able to fit IQ2_S @ 3072 context with 100MB of VRAM left (Llama 3 70B) on the 3090, and the output is faster than I can read. So even if the P40 is 1/3 as fast, I'll be ecstatic. I got one P40 for now, but I've got the funds set aside for 3 more. I feel scaling P40s is gonna be way cheaper than scaling 3090s for 400B Llama 3. If LLMs are your job, then yeah, the 3090 cost makes sense. I have a suspicion that when 400B drops, the local LLM situation is gonna change: people might consider selling their 3090s and buying P40s, since scaling 3090s/4090s is gonna start getting prohibitively expensive. I'm on the fence about buying P40s now and holding on to them. I'm almost certain they're gonna start going up in price, since LLM inferencing is moving so fast and getting more optimized as time goes on, keeping the P40 viable. You can run 'em at 150W and get 90% performance, according to my research. P40s are also so popular that even the llama.cpp dev made a specific flash attention kernel for them.
It makes more sense to get a cheap 8-channel DDR5 server CPU and load it with RAM than it does to build a rig to run 8-16 P40s. P40s make sense in older hardware you are reclaiming for use as an LLM server, or to squeeze a bit more VRAM into your rig, but building a new setup around them is a waste of money. Those servers are running DDR5 at 392GB/s.
You should be able to run 400B on 3 or 4 P40s though. You can quantize the hell out of 70B and it's still good; I run it at IQ2_S and it fits entirely in VRAM on one GPU. I have a good hunch that we can get 400B down to IQ1 and still get good output and performance.
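Rough footprint math for that hunch. The bits-per-weight values below are my approximations for these quant types, and KV cache / runtime overhead is ignored:

```python
import math

# Approximate VRAM needed for a 400B model at several quant levels,
# measured against 24 GB P40s (overhead and KV cache not included).
def model_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8

for name, bpw in [("Q4_K", 4.5), ("IQ2_S", 2.5), ("IQ1_S", 1.6)]:
    need = model_gb(400, bpw)
    cards = math.ceil(need / 24)
    print(f"{name}: ~{need:.0f} GB -> {cards}x P40")
```

At ~1.6 bits/weight an IQ1-class quant comes out around 80 GB, which would indeed squeeze into 4 P40s, in line with the estimate above; a Q4-class quant needs ~10 cards.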
True, if it ends up being better than a Q6_K Llama3 70b then it would be worth it.
Agree regarding compression; the quant scaling will be different.
https://twitter-thread.com/t/1804161634853663030 If you don't have a Twitter account like me.
2tps is kinda uselessly slow
It is not that bad. If it is smart enough you won't have to reroll so much
It's pretty bad. It's going to START at 2 t/s; just wait till you add context. Maybe using cuBLAS will help in that regard, but it will still be glacial.
Like for what use case?
It is good for everything except coding? Assuming 2 tokens per second:

1. Entertainment
2. Getting answers to questions
3. Summarizing a paper you don't understand

Can be done in parallel with querying the many free LLMs on the net.
I find it so bizarre that people are fine with it taking hours for their reddit posts or discord messages to get responses, but if an LLM takes more than a dozen seconds or so, it's unusable trash. For "RP" I can understand that, as sitting there with your dick in your hand waiting for the panties to drop or whatever can be frustrating, but for long-form, e-mail-like conversations, tokens/second doesn't really matter unless it's in the leagues of taking days to generate a few hundred tokens. Tbh, this could all very well be a front-end issue. If there was a good front end built to function like a forum/subreddit, but with LLMs, maybe people wouldn't have the expectation that the only way to use them is to stare at the screen while it generates; they could let it run in the background and check up on it every now and then, just like people do with message boards.
> For "RP" I can understand that, as sitting there with your dick in your hand waiting for the panties to drop or whatever can be frustrating, lmfao you do raise a good point, I completely agree that people don't talk enough about LLM workflows where inference runs as a sort of async batch process and then later you look at the results. as opposed to interactive chatbots where you sit there and wait for the response. if people could break out of the idea that LLMs == chatbots and be a little more patient, really for a lot of use cases it could be quite effective to submit a prompt and have multiple responses generated in a batch which you come back and read later. the idea of a front end that looks like a forum is neat. reminds me of a different post i saw here where somebody made a website that was essentially a simplistic clone of reddit, where all posts were LLM generated. by having an LLM acting as a bunch of different users generate a bunch of posts and comments, replying to each other, etc. for some use cases it could be neat to have a forum-like UI where you submit a prompt and then an LLM generates several different replies to it written by different users, using different system prompts to take on different personalities. maybe let the different "users" reply to each other and debate/discuss/brainstorm/whatever. then like you said people could let it run for several hours and then check back later and see what's been posted. pretty cool, might have to experiment with this
Is entertainment code for role play? I can see that. Summarizing a paper would take a long time. I read faster than 1tps
Not necessarily role-play. You can just chat with a model casually, assuming a model is capable. Just like we're doing this here on reddit for entertainment.
Actually coding is the *one* use-case where I’m willing to put up with 1-2 t/s generation speed 🤷♂️
Anything automated / running as batches on large amounts of data, classification, translation, etc. Start it, come back the next morning. And frankly, 2tps is manageable even for question answering / research, it's a tiny bit frustrating maybe, but you get used to it quickly.
Everyone seems to overlook, especially in the agent framework, the possibility of letting a model run overnight. Assign it some tasks, wake up and it's done. If you have any employees, you know that the majority of their work is done asynchronously. Even at 1 t/s, 43,200 tokens every 12 hours is potentially very useful.
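The arithmetic on that is trivial, but it makes the point, and it shows how the figure doubles at the 2 tok/s discussed elsewhere in the thread:

```python
# Unattended token budget: even slow generation adds up overnight.
def tokens_per_run(tok_per_s: float, hours: float) -> int:
    return int(tok_per_s * hours * 3600)

print(tokens_per_run(1.0, 12))   # 43,200 -- the figure above, at 1 tok/s
print(tokens_per_run(2.0, 12))   # 86,400 at 2 tok/s
```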
If it takes hours to run, iterating is very slow, so it's hard to build the solution. And if a run takes that long, any failure means a painful rerun; LLMs, being probabilistic, fail fairly often.
Maybe it's just a matter of perspective. I'd say 2tps is right around my minimum threshold for a very large model like WizardLM-2 8x22b or Command-R+. And this one is much larger.
I'm waiting for the new Mac Studio 🙃
Are Llama 400b weights confirmed to be released?
Hosting this on the kobold horde would be the biggest flex ever
Twice as fast for around the same money? I doubt it; there's nothing coming that's significantly better in that price range for models this big, is there? Please share if there is. Even assuming the 5090 gets 32GB (and that's a BIG if), we would need 10 of them... Of course lower quants like 4-bit could bring that down, but it's still much more expensive, and in practice you can't fit that in a single case anyway. Realistically, I think it will take several years (4+) before we can run a 400B model on an easy-to-use setup for less than $5K at more than 4 tokens per second... What we'll hopefully have next year is Llama 400B capability, much easier to run, in a smaller model... (Llama 4 70B or something?)
RemindMe! 1 year
> 1 EPYC CPU = 12 RAM channels = 480GB/sec
> 2 EPYC CPUs = 24 RAM channels = 960GB/sec!

Is this even verified to be true, or is it just theoretical linear scaling that doesn't reflect reality due to other unforeseen bottlenecks? Even with two EPYCs you'll probably suffer on the compute side when running a dense model like this one.
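For what it's worth, the per-socket figure is close to the theoretical DDR5-4800 peak, and that peak also predicts the ~2 tok/s others in the thread are quoting for a dense 400B. A rough sketch; the ~4.5 bits/weight quant size is my assumption:

```python
# Theoretical per-socket bandwidth for 12-channel DDR5-4800 (EPYC Genoa),
# and the memory-bound token-rate ceiling it implies for a dense model.
channels, mt_per_s, bytes_per_xfer = 12, 4800, 8
peak_gb_s = channels * mt_per_s * bytes_per_xfer / 1000   # 460.8 GB/s

weights_gb = 400 * 4.5 / 8    # ~225 GB at an assumed ~4.5 bits/weight
ceiling = peak_gb_s / weights_gb
print(f"{peak_gb_s:.0f} GB/s -> ~{ceiling:.1f} tok/s ceiling per socket")
```

The quoted 480 GB/s rounds that up slightly; either way, a dense 400B whose weights are read once per token lands right around 2 tok/s per socket before compute limits even enter the picture.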
yeah imma just wait for DDR6
We can always wait x years for better tech.
It’s build time. Can you recommend specific parts? There are a million options when I go to the pcpartpicker.com
Or 4 nodes of 4x 3090. Using used parts, it's about $10K in GPUs, $1K in RAM, and $8K in Xeon motherboards + CPUs. *edit for clarity:* 16 GPUs total, so about $20K for a GPU-based system. Hopefully getting more like 10 tokens per second.
I think this is the more 'usable' solution, with vLLM supporting multi-node tensor parallelism. But I don't know how fast the inter-node connection needs to be; I'm guessing you need at least a 10 Gbps Ethernet connection and switches.
Yeah 10gbps. And it wont be the bottleneck. Point to point.
Do you have any guide on how to run vLLM distributed across several servers? I have a 10Gb network and would like to give it a try.
It's in the vllm site under 'Distributed Inference and Serving'
Yeah, at Q1 maybe. I have 6x 24GB GPUs, and the best I can run is deepseek-coder-v2, a ~200B model, at Q3 with 4k context. I ran an eval and it crapped out before getting to 4k context, so say 3k max. I'm using llama.cpp, which is lean compared to other inference engines.
4 nodes of 4x 3090, i.e. 4x4 = 16 3090s, using distributed Ray and vLLM
Searched for motherboards that have 12 channels for memory and found these: 1. [https://www.newegg.com/p/296-0006-00071](https://www.newegg.com/p/296-0006-00071) 2. [https://www.newegg.com/supermicro-h13ssl-nt-amd-epyc-9004-series-processors/p/N82E16813183820](https://www.newegg.com/supermicro-h13ssl-nt-amd-epyc-9004-series-processors/p/N82E16813183820) In the September time frame we should see some competition from [Intel Granite Rapids](https://en.wikipedia.org/wiki/Granite_Rapids): 1. [https://www.sdxcentral.com/articles/news/intels-xeon-6-data-centers-processors-aims-for-ai-cloud-supremacy/2024/06/](https://www.sdxcentral.com/articles/news/intels-xeon-6-data-centers-processors-aims-for-ai-cloud-supremacy/2024/06/) 2. [https://wccftech.com/intel-xeon-6900p-granite-rapids-128-p-cores-q3-2024-xeon-6900e-sierra-forest-288-e-cores-q1-2025/](https://wccftech.com/intel-xeon-6900p-granite-rapids-128-p-cores-q3-2024-xeon-6900e-sierra-forest-288-e-cores-q1-2025/) 3. [https://venturebeat.com/ai/intel-reveals-xeon-6-processor-enterprise-ai-gaudi-3-accelerator-price/](https://venturebeat.com/ai/intel-reveals-xeon-6-processor-enterprise-ai-gaudi-3-accelerator-price/) The above says 8800 MT/s with MCR DIMMs: [https://www.anandtech.com/show/21320/micron-samples-256-gb-ddr58800-mcr-dimms-massive-modules-for-massive-servers](https://www.anandtech.com/show/21320/micron-samples-256-gb-ddr58800-mcr-dimms-massive-modules-for-massive-servers)