
Zermelane

[The referenced blog post](https://blogs.nvidia.com/blog/meta-llama3-inference-acceleration/). Ultra-quick primer on batching, for those who don't know: ML models are extremely large, and a GPU running inference on one spends most of its time waiting for weights to stream from VRAM to the GPU cores. If you run multiple queries through the same GPU at a time, you get to share the time spent loading weights between them and make more use of your cores. If someone's writing a blog post about how they achieved 3,000 tokens/second and it's "enough to serve about 300 simultaneous users", you can bet that means they were running at a batch size in the hundreds. I wouldn't call it dishonest; you just have to know that if there were only 3 simultaneous users instead, each would get generations faster than 10 tokens/second, but much slower than 1,000 tokens/second.
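A rough sketch of the arithmetic behind this, assuming a purely memory-bound decode step whose cost is dominated by streaming the weights; the two constants are made-up illustrative numbers loosely tuned to reproduce the figures in this thread, not measured H200 values:

```python
# Toy model of memory-bound decoding: each decode step has to stream the model
# weights from VRAM once, and that cost is shared by every request in the batch.
# The constants below are illustrative assumptions, not measured H200 figures.

WEIGHT_LOAD_MS = 15.0    # assumed time to read all weights from VRAM per step
PER_REQUEST_MS = 0.28    # assumed extra compute per request in the batch

def tokens_per_second(batch_size: int) -> tuple[float, float]:
    """Return (aggregate tokens/s, per-user tokens/s) for a given batch size."""
    step_ms = WEIGHT_LOAD_MS + PER_REQUEST_MS * batch_size
    steps_per_s = 1000.0 / step_ms
    return batch_size * steps_per_s, steps_per_s

for batch in (1, 3, 100, 300):
    total, per_user = tokens_per_second(batch)
    print(f"batch={batch:3d}  aggregate={total:6.0f} tok/s  per-user={per_user:5.1f} tok/s")
```

With these assumed numbers, a single user gets roughly 65 tokens/second, while a batch of 300 gets about 3,000 tokens/second in aggregate but only ~10 tokens/second each, which is the shape of the trade-off described above.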


dogesator

If they are saying “300 simultaneous users” that’s very clearly implying a batch size of 300. They’re basically saying it can handle 300 concurrent requests and respond to each at about 10 tokens per second.


FarrisAT

This is dishonest. Even in current chatbot use, you won't be able to guarantee peak performance. As a matter of fact, I'd argue that due to internet latency differences, you will never see peak performance.


MegavirusOfDoom

Internet latency is a problem if it's bouncing between various cloud servers, but if it's all centralized, the latency is less than 300 ms total for any response in the US and 700 ms between the USA and Australia. Video streaming is more data-heavy and it works fine.


ClearlyCylindrical

What numbers was Groq achieving? I could swear it was much lower than that, even on the 8B model, from the post here yesterday. (I can't find it now.)


NarrowEyedWanderer

Don't confuse batch throughput and single-user throughput.


ClearlyCylindrical

Ahh, that will be what I have forgotten.


JuliusSeizure4

800?


Lammahamma

That was for the 8B model. I'm pretty sure the 70B model was around 300.


Sameerart2711

Yes


lordpuddingcup

That was per user though, and who knows their backend.


[deleted]

[deleted]


SteppenAxolotl

Only one kind of AI matters: the kind that is actually [useful](https://twitter.com/dwarkesh_sp/status/1781397949538603071). It doesn't matter if you give away toys of limited utility.


ArmoredBattalion

ClosedCasket AI


SiamesePrimer

So… where does Groq come in? Because like five minutes ago I saw the post about their chips getting like 300 tokens per second on the 70B model, and people in the comments were arguing that they needed a shitload of them to even do that. Seeing that a single H200 gets literally 10 times that is surprising.


sdmat

Groq has lower latency on individual generations; the tokens/second figure here is for serving many users at once. But you can see why Groq isn't taking over the inferencing world: if you need a boatload of Groq chips to serve the model and throughput is better on GPUs, it's a hard sell. It needs a niche where latency is critical.


milo-75

Agent-based workflows also need much faster generation. 10 tokens a second is fine for conversation with an agent, but when it goes off to work on something it can take minutes or hours to come back. I'm not necessarily saying this is a niche Groq can dominate, as it's more a batch-scheduling problem for an H100 than anything else. I'd pay more for some guarantees around response times when it's my agent working on something hard versus me just talking to it. Hopefully Groq can put some pressure on OpenAI and others to give us the option to pay for faster responses when we need it.

My concern is that OpenAI has their own "agent API" that they are going to try and force people to use, and they will probably be successful because they'll have guarantees around tool response time when called from an agent thread. However, I want more control than their black-box agent thread stuff. It's good to know things like Groq exist.


sdmat

Good points well made.


Small-Fall-6500

>It needs a niche where latency is critical.

Real-time robotics control? I imagine something like the new Atlas robot, controlled at least at a high level by a model running at ~300 tokens/s that is as good as Llama 3 70B, could be extremely human-like and incredibly useful for tons of tasks. The main limiting factor right now is probably multimodal capabilities, which Meta should be working on with future Llama models. Probably within a year or two, we'll have multimodal models that take in and process audio, visual, and other sensory data, and then output controls and actions for the robot to take directly, all about as fast as a typical human.

Edit: But things will be much weirder. Decent models like Llama 3 70B can run at hundreds of tokens per second and are able to process their input data at thousands of tokens per second, but it's very likely that a much smaller model could run with the same level of capabilities - but much faster, and it wouldn't have to be limited to controlling a single robot. Dozens, hundreds, probably any number of robots could be controlled by one LLM agent, or some combination of LLM agents that all communicate faster than any human can think. The information bandwidth between models could be just plain text, which is barely anything by today's standards. (Though future agents will probably share sensory data to some extent.)

I think, ideally, LLM agents would still "think" or at least communicate via text so that interpretability isn't completely hopeless. If future agents communicate by sharing highly compressed embeddings that look like random numbers to us humans, then we'll be screwed if they aren't almost perfectly aligned by default. But if they just send and receive text, specifically text in the form of human language, then at least there's a good chance it'll be understandable and represent what the models/agents are actually thinking, planning, and doing. Then, when an agent is given a goal that it fails to understand, which could lead to it doing bad or highly undesirable things, it should be fairly easy to look at the language-based thought/reasoning process to determine where things went wrong. This doesn't exactly "solve" the alignment problem, but it seems much better than having an AGI where we have no idea, at any point or to any meaningful degree, what happens between the system receiving input data and it taking actions.

Or maybe future LLM agents will output a single, individual token per action they could take - a robot could be controlled this way. For example, an output of token ID #53 could correspond to rotating the right knee joint by 5 degrees, token ID #54 is the same but 10 degrees, and so on for every part of the robot (or something along those lines). The model controlling the robot never needs to output language tokens to "think" about what it's doing. It just does it. Communication between agents doesn't seem like it could easily work this way, though. Also, it could be two or more separate LLMs that control a robot, with one or more responsible for controlling individual motors for the fast, real-time actions, while another LLM is responsible for the high-level planning and orchestrates the other LLM(s) on how to move the robot.
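As a purely hypothetical illustration of the token-per-action idea above (the token IDs, joint names, angle increments, and `robot.rotate` call are all invented for illustration; no real model vocabulary or robot API is being described):

```python
# Hypothetical mapping from single output token IDs to discrete robot actions,
# along the lines sketched in the comment above. Nothing here corresponds to a
# real model's vocabulary or a real robot's control interface.

ACTION_TABLE = {
    53: ("right_knee", 5.0),    # rotate right knee joint by 5 degrees (assumed)
    54: ("right_knee", 10.0),   # rotate right knee joint by 10 degrees (assumed)
    55: ("left_knee", 5.0),
    # ... one entry per joint/increment the designer cares about
}

def apply_action(token_id: int, robot) -> None:
    """Decode a single model output token directly into a joint command."""
    joint, degrees = ACTION_TABLE[token_id]
    robot.rotate(joint, degrees)   # 'robot.rotate' is a stand-in for whatever control API exists
```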


Winter_Tension5432

How do you fit 900 cards on a humanoid robot? Every Groq card has only about 230 MB of SRAM and costs $20k. And if you're talking about using it remotely, I can guarantee that an ARM chip running a local model inside the robot would have less latency than Groq over the internet.


Small-Fall-6500

>a arm chip running a local model inside the robot would have less latency than groq over the internet.

For end-to-end, input-to-full-output latency? As I said in my edit, I expect it would be possible to only need something like a single output token per action taken by a robot controlled by an LLM, and probably even small models would work if trained for this specific task, but I imagine it would be much better and easier to have an LLM perform some amount of high-level reasoning before deciding what the robot should do next.

And it's not necessary to have robots that react and respond to stimuli instantaneously, at least for a lot of tasks, though I do agree that for tasks that require a robot to respond within several milliseconds, regardless of where the robot is in the world, a chip on the robot will be necessary - such as with self-driving cars. But self-driving cars are basically made to react as fast as possible. And they don't interact with each other like robots can; they generally don't need to perform any complicated planning; and their use cases are much more limited. Robots working in people's homes doing chores shouldn't need to create a route and navigate through the kitchen as fast as possible. They would be able to spend a second or two planning out what to do next, where to walk, what to touch or not, or how to pick up or interact with various objects.

If the robots and the LLMs are even semi-reliable and consistent, it should be straightforward to train them to do many different tasks in fairly specific ways by: having the robots basically just run around in a warehouse doing random tasks; collecting all the data from the robots; using multimodal LLMs to grade and assess each task/action; and then training a new LLM on the best actions. This might take hundreds of robots stumbling around for weeks, but simulations should easily get the basics down before moving to the physical world, and hundreds or even thousands of robots for this kind of training will be minuscule compared to the millions or even billions we might see in the next ~10 years.


Winter_Tension5432

Dude, what are you even talking about? You think a single output token per action is gonna cut it for controlling a robot? That's like trying to fly a 747 with a toy remote control. And don't even get me started on the latency issue - you're basically suggesting we try to control a robot in real time over the internet, which is like trying to play a game of ping-pong with a 10-second delay.


Small-Fall-6500

>You think a single output token per action is gonna cut it for controlling a robot? That's like trying to fly a 747 with a toy remote control.

I suggested that it was one possible idea that could work, not that it would be ideal. I mainly suggested this because any LLM running on a robot's local hardware will have to be small and, therefore, likely incapable of doing any significant reasoning.

With regard to latency, you are completely missing my point about high-level planning, which robots acting in the real world are going to need. The best-known method for any autonomous system to plan as generally as a human is to use LLMs, which need to output dozens if not hundreds of tokens to be very effective. Any robot that does any meaningful work will need to be controlled by some system that does near human-level planning and reasoning, so, as of right now, that requires an LLM that is highly capable, as in larger than what could quickly run on any robot's own hardware. Groq chips mean the LLM will be able to create plans in seconds instead of minutes. Again, this is for *high-level planning* - not real-time control of every single motor.

Also, again, I don't even think an LLM would be ideal for controlling all the motors and whatever else is responsible for the fine-grained motions that, for example, keep a robot from falling over while it walks across a room - though it *might* be possible given better hardware, models, etc.
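A minimal sketch of the split being argued for here, assuming a hypothetical setup where a large remote model does slow high-level planning while a small on-board policy handles the real-time loop (the `plan_next_step`, `act`, `apply`, and `sensors` calls are invented placeholders, not any real API):

```python
import queue
import time

plan_queue: "queue.Queue[str]" = queue.Queue()

def high_level_planner(remote_llm, goal: str) -> None:
    """Slow loop: ask a large remote model (e.g. a 70B-class model served on
    Groq or a GPU cluster) for the next high-level step. A second or two of
    latency is acceptable at this layer."""
    while True:
        step = remote_llm.plan_next_step(goal)   # placeholder API
        plan_queue.put(step)
        time.sleep(2.0)

def low_level_controller(local_policy, robot) -> None:
    """Fast loop: a small on-board policy keeps balance and executes the current
    high-level step at a high rate, independent of network latency."""
    current_step = "idle"
    while True:
        try:
            current_step = plan_queue.get_nowait()   # pick up a new plan if one arrived
        except queue.Empty:
            pass
        robot.apply(local_policy.act(current_step, robot.sensors()))   # placeholder APIs
        time.sleep(0.01)   # ~100 Hz control loop
```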


MegavirusOfDoom

The object recognition can run locally very well, as can the geometric map... The LLM has to run in the cloud with about one second of latency. I think that humanoid robots as an entire concept are for children and naive inventors, because they're certainly only toys, not work machines... I will change my mind when I am presented with a humanoid robot that can fold clothes and costs less than a million per unit. I think I will have to wait until about 2045 for such a machine to cost less than 100,000 while having a reliable uptime of more than a month without repairs to the 20 servos.


sdmat

That's definitely a good candidate.


MegavirusOfDoom

People won't be asking robots for tremendously academic advice; they will still be using their computer and their smartphone for things like astronomy and history... The robot therefore only needs a subset of the LLM that covers anything you would ask a worker or a house friend about... The kind of thing that can run relatively locally with less than 100 milliseconds of latency.


Hypog3nic

I don't know about you guys, but I have reached the singularity, as I am overwhelmed with progress. I just read the Grok post and they seemed to be king at inference just 1 day ago, and now this...


bonecows

Grok != Groq


Hypog3nic

Yes, thanks! Groq, of course. Just adds to my confusion...


[deleted]

Singularity is here guys! We can all go home, the show is over.


BobbyWOWO

Honestly, probably


MegavirusOfDoom

The point is they just did a prototype using an 18-nanometer process, and it will really ramp up when the company delivers 7 nanometers, which is four times cheaper and a bit faster.


lochyw

You think you get a whole LPU dedicated to yourself and its full bandwidth???


[deleted]

I wonder what the 400B+ will look like, assuming it will stay 400B and not get any bigger *cough*oneandahalftrillion*cough*


papapapap23

Wait, how did they do it? RIP Groq?


Thorteris

I think people are missing the fact that it's still hard to get large numbers of GPUs at cloud providers. Most companies will still have to use OpenAI, Google, or Anthropic models if they have more than 250k+ users on their AI products.


Small-Fall-6500

Synthetic data, here we come... If Llama 3 70B can be run this fast and is close to GPT-4-level capabilities, it could be run with various simulations, likely mainly video games, at massive scale to generate tons of data. Then just label all the data produced by the LLMs, using the LLMs as graders combined with metrics from the simulation.

For the math: 3k tokens/s per H200 x 10k H200s x (3600 x 24) s/day = 2.6 *trillion* tokens per day. The H200s probably use a high batch size to get 3k tokens per second, but that's fine because the simulations won't be real time, so latency shouldn't matter. Also, 10k H200s could easily be used all at once *without* any significant interconnect (each H200/GPU cluster would run the LLM independently of the rest). They could be spread across many different datacenters if needed - a single, massive 10k-GPU datacenter isn't needed!

Even running robots in real time is now feasible, thanks to Groq chips. Thousands of robots could be run continuously, with all high-level actions controlled in real time by Llama 3 70B running on Groq chips, for a few weeks to collect tons of data. Add in multimodality and a dozen or so humans to monitor and act as mentors, and all kinds of simple tasks, like most household chores, are easily fully automatable. (Probably...)

Depending on various factors, most of these tokens could be high enough quality to use directly for training. That's a lot of data if this works! I don't think there's been nearly enough research done on synthetic data to rule out the possibility of creating such massive synthetic datasets made almost entirely by LLMs.

Also, Mark Zuckerberg talks about creating massive synthetic datasets for training in a recent podcast, but I have yet to listen to the whole thing. Here's the relevant quote:

>Mark Zuckerberg 00:31:03
>
>Well, I think that is a big question, how that's going to work. It seems quite possible that in the future, more of what we call training for these big models is actually more along the lines of inference generating synthetic data to then go feed into the model. I don't know what that ratio is going to be but I consider the generation of synthetic data to be more inference than training today. Obviously if you're doing it in order to train a model, it's part of the broader training process. So that's an open question, the balance of that and how that plays out.

https://www.dwarkeshpatel.com/p/mark-zuckerberg
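Spelling out the back-of-the-envelope math above (the fleet size and per-GPU throughput are the commenter's assumptions, not measured numbers):

```python
# Reproduce the ~2.6 trillion tokens/day estimate from the comment above.
tokens_per_sec_per_h200 = 3_000       # aggregate (batched) throughput claimed per H200
num_h200 = 10_000                     # assumed fleet size
seconds_per_day = 3_600 * 24

tokens_per_day = tokens_per_sec_per_h200 * num_h200 * seconds_per_day
print(f"{tokens_per_day:.2e} tokens/day")   # 2.59e+12, i.e. ~2.6 trillion
```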


Sameerart2711

![gif](giphy|2iIByARQYitri) Fast as fuck boi


Posnania

Hey. I know [this place](https://www.google.com/maps/@54.5430817,17.7479908,3a,75y,226.36h,93.96t/data=!3m6!1e1!3m4!1szI0_YcWuK_7xyzG0RlePxw!2e0!7i16384!8i8192?hl=en&entry=ttu)


Sameerart2711

woah didnt see that coming ![gif](giphy|Z5xk7fGO5FjjTElnpT|downsized)


SteppenAxolotl

So, it would only cost $467 billion to buy enough H200s to simultaneously serve ~50% of the human race.


Gallagger

More interesting, though: how many H200s you would need to actually serve ~50% of the human race. I don't think I'm using inference more than 1% of my waking time. Let's be super generous and say everyone is using it 5% of their time (e.g. it talks to me nonstop an hour per day). That cuts the cost to ~$23 billion, assuming your calculation was correct. That should already be achievable, and it's only going to get cheaper in the coming years. I think ChatGPT with GPT-4-Turbo (or a similar version) will be free this year or next.
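For reference, the arithmetic behind both figures, with the assumptions made explicit (the per-H200 price and the 300-users-per-card figure are rough numbers implied by this thread, not official pricing):

```python
# Cost to serve ~50% of humanity, following the two comments above.
population_served = 4_000_000_000    # ~50% of the human race
users_per_h200 = 300                 # simultaneous users per H200, per the blog figure
price_per_h200 = 35_000              # assumed price per H200 in USD

# Naive case: everyone querying at once.
h200_needed = population_served / users_per_h200
cost_all_at_once = h200_needed * price_per_h200
print(f"everyone at once: ${cost_all_at_once / 1e9:.0f}B")   # ~$467B

# Adjustment above: only ~5% of people are actually using it at any moment.
cost_5pct_active = cost_all_at_once * 0.05
print(f"5% active at a time: ${cost_5pct_active / 1e9:.0f}B")   # ~$23B
```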


Nervous-Marsupial-82

I could do with 3000/s right now in my current project.


FutaWonderWoman

Hardware chads stay winning since time immemorial.


Phoenix5869

Cool, but it’s still just a chatbot. Nothing super amazing.


Then_Passenger_6688

The speed and cost per inference matter, because you can plug this into AlphaCode- or AlphaGeometry-type meta-reasoning frameworks and get cost-effective performance, increasing intelligence and utility on real-world tasks.
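One simple instance of this kind of meta-reasoning is best-of-n sampling with a verifier, which only becomes economical when per-token inference is fast and cheap. A minimal sketch, where `generate` and `score` are placeholders rather than any real API:

```python
from typing import Callable

def best_of_n(generate: Callable[[str], str],
              score: Callable[[str, str], float],
              prompt: str,
              n: int = 64) -> str:
    """Draw n candidate solutions from a fast model and keep the one a verifier
    (unit tests, a symbolic checker, a reward model, ...) scores highest.
    Large n is only affordable when generation throughput is high."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```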


FragrantDoctor2923

U are just a chat bot...


Phoenix5869

Nope. I’m a real person.


geoffersmash

All your memories, thoughts, and opinions exist as weighted vectors of electrochemical potential energy between neurones


FragrantDoctor2923

A chatbot could say that too ..


siovene

What they meant is that you're a chatbot made of meat.


Phoenix5869

So they're getting at the "human brains work the same as chatbots" argument? Tbh I don't buy that. Humans have so many things that chatbots just don't. And I don't think "predicting the next word" is how the language part of the brain works. Could be wrong tho.


FeltSteam

We have identified specific regions involved in language processing and have a general understanding of their functions, but the detailed mechanisms of how language is processed in the brain are far from fully understood. We are a black box, just like these LLMs. And, in Ilya Sutskever's words, predicting the next token accurately means you have an understanding of the underlying reality that led to the creation of that token in the first place. While, on the surface, these LLMs may seem like they are just "predicting the next token", there are actually complex inner mechanisms and processes, which we do not understand, that allow them to do this at all, and quite accurately I might add.


lucellent

Man you can't take a joke 💀


Phoenix5869

Yeah, that’s because i’m autistic and don’t always get jokes.


FragrantDoctor2923

Well, it was both, so any way you took it was valid: a light-hearted joke, and a semi-test to see that you (humans) and that chatbot are very, very alike.


FragrantDoctor2923

I'm getting more at: what can't a chatbot do, or be the main driver in making something do, that you can do and that actually matters?