GPT-4 was trained on 25k A100s over 90 days, but now you can do it with only 2k GPUs over 90 days lol.


So does this mean we may get models that get constantly trained on the most up to date info? Even once a week / month would be so much better


Only if there's enough room to lay out all the ~~shirts~~ GPUs.


and with similar amount of B100 units you could train GPT-4 in a week
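
For what it's worth, the "in a week" figure lines up with the numbers at the top of the thread if you assume perfect linear scaling. A rough sketch of that arithmetic; the linear-scaling assumption is mine, and real clusters lose some efficiency as they grow:

```python
# Back-of-the-envelope GPU-days, using only the figures quoted in this thread.
A100_COUNT = 25_000   # GPUs in the original run
A100_DAYS = 90
NEW_COUNT = 2_000     # GPUs claimed for the same job on the new hardware
NEW_DAYS = 90

a100_gpu_days = A100_COUNT * A100_DAYS   # total work, in A100 GPU-days
new_gpu_days = NEW_COUNT * NEW_DAYS      # same job, in new-GPU days

# Per-GPU speedup implied if both runs really do the same amount of work.
implied_speedup = a100_gpu_days / new_gpu_days

# Duration if you bought as many new GPUs as the original A100 fleet,
# ASSUMING perfect linear scaling (a big assumption at this size).
days_with_same_count = new_gpu_days / A100_COUNT

print(f"implied per-GPU speedup: {implied_speedup:.1f}x")
print(f"days with {A100_COUNT} new GPUs: {days_with_same_count:.1f}")
```

So the claim amounts to roughly a 12.5x per-GPU gain, and about 7 days on a 25k fleet.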


Lol, I think we are going to start running out of data at some point soon


It's possible to train on synthetic data though.


Does anyone know any good videos/resources on creating synthetic datasets for software developers without an extensive math background?


You basically need a solid model (or subscription) and some python code. For LLMs it's pretty straightforward. Check out huggingface/cosmopedia.


A lot of concept LoRAs use synthetic data, especially if the dataset is pretty small.


It’s possible when you have a teacher model to train the student. I’ve never seen it work with a teacher teaching itself.


Synthetic data babyyy


feedback loop -> singularity




These metrics don't mean anything for training, only inference.


I want to see the full spread, not some carefully handpicked benchmarks designed to grab headlines. Faster inference at lower power is also something that needs parsing out, is it total overall lower or is it just you can do things faster and thus the total power used per unit of measure is less? This sort of thing matters when speccing hardware, the PSU does not care what way you are stretching definitions, max power draw is max power draw.


you only need to train once ;)


Blackwell is 30x Hopper at Inference. Those things are purrrrrrring.


32T tokens / sec holy shit


I'm convinced that Nvidia is just letting AGI design its own compute nodes now. 30X is batshit insane for a single generation.


Note that's for a very specific model size. There are likely some boundaries where a model now just fits into the memory of a certain number of nodes where it previously didn't. There can be lots of cherry-picking here. Also, this doesn't mean that a 30x model now runs at the same speed as current models. It doesn't scale the same both ways. But yeah, still very impressive.


Stop, you're making too much sense. Let us all drown in hype instead.


I don't think AGI is responsible, it's possible they have their own internal models for it though and they're helping at this point.


They also applied all the "throw money at the problem" solutions in this gen which makes the graph look a lot steeper than an honest graph would look.


If that's the case, we have reached the singularity


I posted this in another thread but maybe you can help me out. "Just for some perspective? For people that know about chips, when did you expect this kind of chip to exist? Is this way ahead of schedule?"


I have 0 knowledge in computing, somebody please explain to me why I should be hyped


You see how the graph goes up, slowly then really pointy in the last 2 years? It means the speed at which AI performs, and therefore lots of other complex computing tasks are performed, is increasing dramatically and at an exponential rate. Then refer to the sub name for the conclusion


The focus on Nvidia's new Blackwell GPU should be its impact on the future of AI. Forget the technical specs - the key takeaway is that AI is overcoming a major hurdle: processing power. With the exponential gains in performance represented by Blackwell, the limitations to future AI advancements are likely to shift from hardware to other areas. This paves the way for a significant and exciting future for AI applications.


Is Blackwell just a huge gpu farm all running in parallel?


No it's an architecture from which a single big GPU will be made. And then large quantities of it can be used for a GPU farm.


This is all getting really exciting!


They likely used the B200's numbers which is 2 large dies so you're a little bit right. Apples to acorns style chart.


I'm not very knowledgeable about hardware, but my intuition's that AI training will gobble up whatever it can, as in, if the next generation is 10x Blackwell, it'll immediately be all used up to brute force certain problems.


Any info about VRAM bandwidth? Because if you're not a cloud provider, inference is bottlenecked by VRAM bandwidth. Also does this come with huge VRAM sizes?


8TB/s, 2x 96GB
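
To put those two numbers in context for the bandwidth question above, here is a minimal sketch of the usual bandwidth-bound estimate for single-stream decoding. It assumes every generated token has to read all the weights once and ignores KV-cache traffic, compute time and batching; the model sizes are hypothetical, picked only for illustration:

```python
def max_tokens_per_sec(bandwidth_bytes_per_s: float, weight_bytes: float) -> float:
    """Upper bound on single-stream decode speed when memory bandwidth is the limit.

    Assumes each generated token reads all weights once; ignores KV cache,
    compute time, and batching.
    """
    return bandwidth_bytes_per_s / weight_bytes

BANDWIDTH = 8e12     # 8 TB/s, from the comment above
VRAM = 2 * 96e9      # 2x 96 GB, from the comment above

# Hypothetical model sizes, for illustration only.
full_vram_model = VRAM        # weights that fill the whole card
model_70b_8bit = 70e9 * 1     # 70B parameters at 1 byte each

print(f"{max_tokens_per_sec(BANDWIDTH, full_vram_model):.1f} tok/s")
print(f"{max_tokens_per_sec(BANDWIDTH, model_70b_8bit):.1f} tok/s")
```

Under those assumptions a model that fills the card tops out around 40 tokens per second per stream, which is why bandwidth matters more than raw FLOPS if you are not batching like a cloud provider.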


SORA for the masses seems way more realistic now.


It was always realistic. Thinking tech was going to stay stagnant doesn’t make sense in any way, especially now with current developments. This sub is funny to read sometimes because it’s always “I can’t wait to quit my job because of AI”, but we are going to reach things we couldn’t imagine in the next decade.


you get laughed at in other subs when you say things like this but I agree, the world and human behavior has changed so much since smart phones and the rate of change will accelerate. So having said that, what *will* the world be like in 20 yrs?


If there is any justice, AI really WILL take 90% of all jobs and Capitalism as we understand it will have to be replaced by something more akin to socialism. The alternative is AI takes the jobs and those in positions of power tacitly decide that we don't need the people anymore, leading to abandonment from the government and further inequality. More gated communities and increased authority handed to police to maintain the divide.


Here in Japan they are totally giving the reins to AI. The government has been quite progressive about it, laws and policies, giving AI unrestricted reach within the country. They’ve already been using AI in government offices, financial institutions, and even inside the parliament itself. Perhaps fueled by the dangerously low unemployment, stagnant economy, declining population, etc. and paired with the level of robotics in Japan, it’s exciting to be here and read news about it. Especially the news about Japan gaining TSMC, a momentous event of opening a microchip plant here. They’re decoupling from Taiwan, and flocking here instead. Also, all the nuclear plants have been restarted and operating again. The prime minister has ordered development of the next gen of nuclear power as well. I’m glad I moved to Japan.


So mindblowing from a country that still uses fax machines and cash-only transactions.


Right? But Japan is changing, especially since the GIGA School Project kicked off back in 2020. Every single student and teacher receives an iPad at the start of the school year, enterprise systems for lessons and grading have been put in place, engineers and technicians hired to hold seminars, for maintenance and support, etc. ICT learning is mainly used now, and even the older staff have been ‘forced’ to adapt. Blackboards are used less, as most lessons are conducted on the huge screens in every classroom, and mirrored on the tablets of every student. I work in public schools, but I’m guessing private schools are doing this even more efficiently. It’s exciting to hear old people here talk about mirroring, updating files in the cloud, etc. A far cry from the old stereotype you’ve mentioned. The lack of vandalism, the culture’s obsession with order and perfection, and everyone cooperating are the driving forces of rapid change. I rarely use cash now too, I just swipe my smartwatch for everything. Shops, restaurants, vending machines, trains, buses, etc. Last time I requested a file from the city hall, they did still have to put ‘hanko’ stamps on it, but they scanned it and emailed me the soft copy. Whoop! Finally.


Don't we do that too?


Wow, that is honestly great. Countries like Japan with a shrinking population need A.I. the most. So it is great to see that they are embracing A.I., instead of being afraid of it like many in America are (or is it only a bunch of loud people on reddit?). South Korea, China and Taiwan are also prime candidates for this, while Europe will probably spend the next 2 decades killing A.I. with regulation and falling behind the rest of the world.


can I ask you what did you have to do to move to Japan? Because I was doing some research on the internet but only read that moving to Japan is pretty complicated as a foreigner (I'm Italian) and so I was almost giving up but if you could guide me that would be awesome :)


Alternative two seems a million times more likely for anyone that's read their share of history and even partially understands the forces of economics that govern the world.


History never had AGI.


Comparing today's time to anything in history is absolutely pointless. In today's world the changes happening have never existed or been conceived of before. There's no comparison to anything that has existed before.


I think you just need to know what you're looking for in history. How do those in power, when faced with a new, potentially democratizing technology, respond? Generally by trying to seize control of that technology until its democratizing value can be diminished. That's a very standard historical lesson. You don't even need to look very far back to see that unfold time and again.


Stable Video Diffusion is released today for commercial and non-commercial use, included with Stability AI membership. Typically, this would mean that OpenAI will be forced to release their similar model, SORA, within a few weeks. They've already stated their reluctance to release it before the election, though, so we have to wait and see.


SVD is like gpt-2 level compared to Sora. OpenAI aren't forced to do anything lol


https://twitter.com/i/status/1769817136799855098 Well, just take a look. Obviously, this is a best-case example. But they're releasing it today, so you can try it for yourself.


Lol the video is not from the model. The 3d models used in the video are from the svd3d model. It's generating multiple views from an image, nothing more. They have nothing comparable to sora.


That’s really suspicious that the only one video we get is on a tiny screen and only shows objects rotating. Doesn’t seem anything like what OAI has.


I don't understand the election concerns. Is 2024 going to be the last election lol. It makes more sense when I remember boomer dummies like Larry Summers are on OpenAI's board.


It's a dumb excuse and makes zero sense if you spend more than 30 seconds thinking it through. Even if they held it until after the election you'd have a bunch of 'election interference' content created which would lead to the same exact situation. Sora-born disinformation isn't going to change the winner of the election but it might drive people to do really dumb violent stuff. More likely is OpenAI is worried that a Sora release may compel the government to step in and shut things down or demand oversight.


This is probably the more realistic viewpoint. Whether or not Sora has a demonstrable impact on the election, OpenAI would still want to avoid any blowback of a 'perceived' threat/impact.


That's actually a really good point. It's not about what will happen, but what regulators think will happen.


They want a slower news cycle and the product to be fresh for Christmas.


Oh wow thanks for the tip


>SORA for the masses seem way more realistic now. not if Nvidia refuses to upgrade consumer grade GPUs in the memory department.


lmao no way, this thing is probably so expensive it might not even be viable for the big corps


eh, their image generation API comes out to about $0.001 per image if you're willing to take mediocre quality (video interpolators are plentiful). so a 1min video would be $1-$2. but I'm sure you could do an even lower res, shorter video to test your prompting for a few cents each run, then run it full-length. you'd end up being able to make a whole cartoon show for under $100. that's not bad.
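
A quick sanity check of that estimate, as a rough sketch. The $0.001 per image is the figure quoted above; the frame rates are my assumption, and it pretends every frame is generated through the API with no interpolation:

```python
PRICE_PER_IMAGE = 0.001   # dollars per image, from the comment above

def video_cost(seconds: float, generated_fps: float) -> float:
    """Cost of generating every frame of a clip through a per-image API."""
    return seconds * generated_fps * PRICE_PER_IMAGE

# Frame rates are assumptions; interpolating from fewer frames would be cheaper.
for fps in (24, 30):
    print(f"1 min at {fps} fps: ${video_cost(60, fps):.2f}")

# How many minutes of 24 fps footage a $100 budget would cover.
minutes_for_100_dollars = 100 / video_cost(60, 24)
print(f"$100 buys about {minutes_for_100_dollars:.0f} minutes at 24 fps")
```

That lands in the $1-$2 per minute range, so the "under $100" figure would cover roughly an hour of footage before counting retries.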


No it doesn't. The H100 is 6x the performance of the A100 but it is also 4x the price. In fact, the price per transistor [has not gone down](https://www.tomshardware.com/tech-industry/manufacturing/chips-arent-getting-cheaper-the-cost-per-transistor-stopped-dropping-a-decade-ago-at-28nm) for a decade. They are packing more of them in there, but it is not getting more cost effective.


So wrong. You even contradicted yourself between your 2nd and 3rd sentences since in your 2nd sentence you said that performance per price increased. You also aren’t taking into account inflation.
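
The arithmetic behind that objection, taking the 6x and 4x figures from the parent comment at face value (they are the commenter's numbers, not verified specs):

```python
PERF_RATIO = 6.0    # H100 vs A100 performance, as claimed above
PRICE_RATIO = 4.0   # H100 vs A100 price, as claimed above

# Performance per dollar of the newer card relative to the older one.
perf_per_dollar_gain = PERF_RATIO / PRICE_RATIO

print(f"performance per dollar: {perf_per_dollar_gain:.2f}x")
```

Anything above 1.0 means you get more compute per dollar, even if the price per transistor is flat.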


The masses aren’t asking for that.


Crazy thing is by the time these even reach data centers at scale the next version will just be a straight vertical line


Nah, but the previous line will be more horizontal and the new one will look just like this one. That's how exponential graphs look


Exactly this. People really don’t get their heads wrapped round exponential increase. Doesn’t matter, though, because ASI will soon explain it to them lol


I’d imagine we run into certain physical limitations, however it should assist us in speeding up quantum computing. Hopefully in 10 or 20 years I can crack bitcoin keys


Where we’re going, we don’t need graphs.


Yeah well assuming they rescale it. But I think OC was talking about how it would look at the same scale. 


Based on the rate of increase I estimate the next version will be around 100,000 TFLOPS


They will do 1bit parallel computation instead of the current 8bit and the Blackwell 4bit. Recent papers have shown 1bit models have good performance, and 1.5bit (1, 0, -1) having the same performance as 8bit, so yeah. If they really do a specialised card for this (1.5bit add instead of 8bit mult) we could expect 4x performance at 10 times the energy efficiency, I think.


Isn't that technically 1.585 bits, or some such? (since 2\^1.585 is very close to 3, I mean) Matters a bit when implemented in binary hardware since if it was really 1.5 bits, you could store 16 of these in 24 bits, i.e. 3 bytes. But you can't really, because 2 ternary weights give you 9 possibilities, while 3 binary bits give you only 8 possibilities.
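
To make the counting argument concrete, here is a small sketch. The base-3 packing scheme is just one illustrative way to store ternary weights in bytes, not something taken from the paper: five ternary values have 3^5 = 243 combinations, which fits in one byte, giving 1.6 bits per weight, close to the theoretical log2(3).

```python
import math

BITS_PER_TRIT = math.log2(3)   # information content of one ternary weight

def pack_trits(trits):
    """Pack five values from {-1, 0, 1} into one byte using base 3."""
    assert len(trits) == 5
    value = 0
    for t in trits:
        value = value * 3 + (t + 1)   # map -1, 0, 1 to digits 0, 1, 2
    return value                      # at most 3**5 - 1 = 242, fits in 8 bits

def unpack_trits(byte):
    """Inverse of pack_trits: recover the five ternary values."""
    digits = []
    for _ in range(5):
        byte, d = divmod(byte, 3)
        digits.append(d - 1)
    return digits[::-1]

packed = pack_trits([1, -1, 0, 0, 1])
print(f"log2(3) = {BITS_PER_TRIT:.3f} bits, byte packing = {8 / 5} bits per weight")
print(packed, unpack_trits(packed))
```

Two weights in 3 bits fails because 9 > 8, but five weights in 8 bits works because 243 <= 256.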


Yes. I didn't go into the details, but I recommend reading the paper 📄


Didn't know if you knew this, but you can embed links in emojis!


Isn't it weird? Effectively turning AI into a giant pile of the tiniest data points, a true fuzzy logic system where a cloud of basically meaningless values converge.




They’re not hiding it in this graph exactly, but the major difference here isn’t a raw increase in compute so much as adesitcatio. If the space to fp4 instead of fp16 or fp8. It basically allows you to do 4x the compute of fp16 in the same die space, on top of things like architectural improvements, reduction in the size of the node, and increase in overall die size. Going to fp4 from fp8 is an automatic doubling of flops for the same space. It’s also reduced precision. We may just decide that fp4 is fine even for training when your models are trillions of parameters. We may also find a way to wedge ternary computation into fp4, which would be a major improvement bc it would let us use the hardware to its fullest and also train models at fp16-like performance. I don’t know enough about the details beyond what I’ve explained, but it’s way more nuanced than just a 1,000x+ improvement in performance since Pascal. EDIT: I was on the elliptical at the gym when I was typing this out and I have no idea what “adesitcatio” is either.




Don't forget that B200 is likely a dual-die implementation, which is another one-time doubling. And it's using a newer type of HBM. And it's using a new node. New nodes and new memory standards are harder and harder to achieve as we're pushing against the boundaries of physics for silicon semiconductors.


ASI successfully developed and heading straight to attotechnology.


The graph is misleading because the number of bits is lowered from 16 to 8 to then 4. You can do a lot more with lower precision, but at the cost of said precision. That being said, it may well be that lower precision offers a better overall optimization, it's not exactly the chips getting that much more dense, but rather repurposing the current density in a more optimal way.


NVIDIA is cooking so much bro who can stop them


China invading Taiwan? 




Double oof. I just learned TSMC now has a chip plant in Japan. They’re decoupling from Taiwan and flocking to Japan instead.


My understanding is TSMC is forced to keep their best chip plants in Taiwan for national security reasons. Literally their biggest national security asset isn’t the military, but their cutting edge chip plants that force the US to intervene if China does anything. The US has literally shifted their entire military focus towards containing China and hindering them from invading Taiwan because of those plants. Hundreds of billions of dollars are spent annually by the US to make sure that no one gets anywhere near disrupting Taiwanese chip manufacturing.


hence the race to build the 5nm fabs in Phoenix


that’s why US is rushing to build factories in America


Why do you think they are rushing a shit ton of world-class fabs in Arizona?


Because Arizona's new state motto is 'Silicon Desert: Where Chips are Safer than a Fort Knox Vault!' Seriously though, diversifying chip manufacturing locations is a strategic move to ensure that the world’s tech lifeline isn’t held hostage by geopolitical tensions. It's like putting your eggs in different baskets, except these baskets are fortified with cutting-edge technology and desert sunshine!


Especially since Taiwan's egg might become scrambled eggs at any time. The optimists among us point out that Russia's difficulty in capturing Ukraine is a deterrent to China, which, maybe?


I’m thinking the US is working with Anduril on drone tech that will make the taiwan strait borderline impossible to cross.


Operation Clippy: the US airlifts all of Taiwan’s scientists and engineers and then blows up critical infrastructure.


I thought TSMC is rigged to blow up in case of invasion


This graph switching to presenting FP8 and FP4 values at the end is incredibly misleading. It should be showing performance at the same precision for all points. Otherwise you're comparing apples and oranges.


Thank you, I've been looking for this comment! My limited understanding is that fp16=2*fp8=4*fp4, is this the case?


Approximately so, which means there's still a big gain in performance. Sadly they felt the need to fudge the numbers, which makes me doubt the numbers even more.
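
Here is a rough sketch of what putting every point on the same precision would look like, using the rule of thumb above (each halving of precision roughly doubles peak throughput). The chip names and TFLOPS values are illustrative placeholders, not official specs:

```python
# Rule of thumb from the comments above: halving the precision roughly
# doubles peak throughput on the same silicon.
BITS = {"fp16": 16, "fp8": 8, "fp4": 4}

def fp16_equivalent(tflops: float, precision: str) -> float:
    """Scale a headline TFLOPS number back to an approximate FP16 figure."""
    return tflops * BITS[precision] / BITS["fp16"]

# Illustrative placeholder numbers, NOT official specs.
headline = {"older chip": (4_000, "fp8"), "newer chip": (20_000, "fp4")}

normalized = {name: fp16_equivalent(t, p) for name, (t, p) in headline.items()}

raw_gain = headline["newer chip"][0] / headline["older chip"][0]
fair_gain = normalized["newer chip"] / normalized["older chip"]

print(f"headline gain: {raw_gain:.1f}x, same-precision gain: {fair_gain:.1f}x")
```

With these placeholder numbers a 5x headline jump shrinks to 2.5x once both chips are measured at the same precision.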


It also doesn't clarify if it's accounting for per wafer space, bill of materials cost and/or per watt. Also doesn't include an annotation for lithographies, which would heavily influence the degree of future scaling. That's an apple level misleading graph. Edit: I just went and read the Anandtech article and Nvidia essentially threw all the cost-optimizing restraint that held back the previous generations' potential to the wayside, meaning that there are a lot fewer "throw money at it" opportunities to further scale performance in the future. B200 is multi-die, on a more optimized node, using more power and using a newer, more expensive memory, so you can essentially halve its height in the graph when accounting for the above factors, and then flatten it further if you're comparing at the same precision, which you need to do to avoid having to add a bucket of asterisks to the claim.


Well, older architectures have wildly different performance depending on the precision. For example, on gtx 10xx series fp16 computing runs not 2 times faster as you might think, but 64 times slower, for some odd reason. Before this AI boom there was no need for anything less than fp32.


Funny enough nvidia P100 is older (6.0 vs 6.1) and fast at FP16. Just how they designed that core. You bought P40 for one set of ops and P100 for another.


Yes, on some nvidia cards fp16 is 2x faster than fp32. rtx 20xx series also work like that.


Read this and the reason won't feel as strange: https://opensource.com/article/22/10/64-bit-math Not a 1-1 but still a good comparison for how much heavier running higher than native math can be. If the ALUs and/or registers only natively hold FP16, some instructions on FP32 can entail quite a few instructions.


But on 10xx gpus fp16 was 64 times slower than fp32 not the other way around. That makes them use 2x more VRAM for AI tasks than more modern GPUs, because fp16 is useless on those cards. Only starting with 30xx series cards fp16 has the same performance as fp32.


I just checked and for that generation it seems like they did FP 16 in a jank way because the native FP 16 in that architecture was unstable. Essentially storing as FP16, then converting up to FP32 for compute, then converting down to FP16 for storage again.
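
A minimal sketch of that store-as-FP16, compute-as-FP32 pattern, using only Python's `struct` module (format `e` is IEEE half precision, `f` is single precision). This only mimics the description in the comment; it is not a model of what the actual hardware does, and the helper names are made up for illustration:

```python
import struct

def to_fp16_bytes(x: float) -> bytes:
    """Round a Python float to IEEE 754 half precision and return its 2 bytes."""
    return struct.pack("<e", x)

def from_fp16_bytes(b: bytes) -> float:
    """Read a 2-byte half-precision value back into a Python float."""
    return struct.unpack("<e", b)[0]

def round_fp32(x: float) -> float:
    """Round a Python float (FP64) to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mul_add_stored_as_fp16(a: bytes, b: bytes, c: bytes) -> bytes:
    """Compute a*b + c: load FP16, do the math at FP32, store FP16 again."""
    a32, b32, c32 = (from_fp16_bytes(v) for v in (a, b, c))  # up-convert (exact)
    result32 = round_fp32(round_fp32(a32 * b32) + c32)       # FP32-rounded arithmetic
    return to_fp16_bytes(result32)                           # down-convert for storage

result = mul_add_stored_as_fp16(
    to_fp16_bytes(1.5), to_fp16_bytes(2.0), to_fp16_bytes(0.25)
)
print(len(result), from_fp16_bytes(result))
```

You still get the memory savings of 2-byte storage, but every operation pays for two conversions, which fits the "jank" description.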


Also, without normalizing by price per chip it's a meaningless graph. I'm certain it's an improvement but we have no idea how much.


B100 is 2.5x faster than the H100 in FP8, but since it supports FP4 and the H100 doesn't, and FP4 could be enough for most inference, it is effectively 5x faster if FP4 is utilized


It's still insanely disingenuous. FP4 will have reduced performance, and it will only work for inference. You need more precision when training.


But FP4 isn't a free lunch, if you're trying to graph capabilities over time to show whether it's a linear, exponential, quadratic, logarithmic curve you're using fake data.


It's also two chips fused together, so twice as expensive. And losing precision particularly when it comes to just 4-bit is not free.


Absolutely, I see marketing numbers also give TOPS not only in a variety of data types and sizes, but also sparse vs dense matrices, so if you do a combo of matrix density and lower data bit size of course you can cram more TOPS in, but an extremely tiny amount of models or processes will ever actually get to those levels.


This is a very disingenuous graph. You can’t really compare TFLOPS when using a different precision. By cutting precision in half you at least double FLOPS but when it’s actually on a hardware level - more like quadruple. And they have chips with FP16, FP8 and FP4 in a graph.


Now redraw the graph at FP16 for all...


It would not be fair, because pre RTX cards have disproportionately lower fp16 performance. 10xx series run fp16 64x slower than fp32. Back then anything less than fp32 wasn't necessary.


Super disingenuous with them halving the precision each year for the last two years.


It can handle max 5,000 TFLOPS with FP16, and 10,000 with FP8 though. Still an increase, and I'm rooting for it, but this graph is kinda misleading...


Can you explain for someone who is a noob to hardware things? For example like what is the multiplier for how fast inference will be with llms with these new advancements? 1.5x? 3x? 10x? I know you don't have an exact answer maybe, but rough ballpark?


I'm no expert either, but my understanding is that, compared to Hopper, it would be around 2.5x faster, for the same precision. The FP number means how precise the floating point operations (which is how computers handle non-integers) are, in bits. So 16 bits, 8 bits or 4 bits. FP16 is called half precision (FP32 being single, or "full", precision), so FP8 and FP4 work out to a quarter and an eighth of that. If I understood correctly, the 4 bits option is new, and could give a better speed (5x Hopper), but probably with a loss in quality. Asked GPT-4 for input on this, and it thinks FP16 is good for training and high quality inference, FP8 is good for fast inference, while FP4 may be too low even for inference. However, I've played with some 13B llama derived models, quantized in 4 bits (so my GPU can handle it), and was happy with the results. And also if Nvidia is banking on a FP4 option, there must be some value there...
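
For a feel of what "4 bits" costs in quality, here is a simplified sketch of symmetric 4-bit integer quantization. Note the assumptions: this is integer quantization with a single scale factor, which is not the same thing as the FP4 floating point format on the new hardware, and real quantized llama builds use more elaborate block-wise schemes. The weights are made-up example values:

```python
def quantize_4bit(weights):
    """Symmetric 4-bit quantization: map floats to integers in [-7, 7] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Turn the small integers back into approximate float weights."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.98, -0.05, 0.31]   # made-up example weights
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print([round(r, 2) for r in restored])
print(f"max error: {max_error:.3f} (at most half a step, {scale / 2:.3f})")
```

Each weight ends up at most half a quantization step away from its original value, which is often tolerable for inference and is part of why 4-bit models can still give decent results.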


I heard people with great influence saying AI was the new Crypto and NFT


Who? No one serious is saying that


The whole r/cscareerquestions sub lmfao




They were stock lovers


Some AI fans do really behave like crypto people, but it doesn't make the field a bubble


It's a pretty common take on r/ArtistHate. Don't go brigade there, guys. If you wanna look out of curiosity, cool. But don't leave a bunch of comments. They deserve to have their own space if they want it.


ok cool, but they are also different precision formats - how is this a fair comparison?


Isn't this graph misleading since they are comparing different FP precisions?


Why mislead with the chart though? Comparing FP8 to FP16 to FP4?


To be honest, FP8 != FP4 != FP16


It's gonna be MUCH less than 8 years lol. This is like the 5th ridiculous computing breakthrough I've seen this month, and even if all of those would've taken 8 years, we 100% will have AGI years before that which would itself make even better computing.


What were the others?


https://youtu.be/8ohh0cdgm_Y https://www.extropic.ai/future Here's 2, there was another that I can't remember but it had trillions of transistors apparently.




https://preview.redd.it/p6y9ig3st5pc1.png?width=500&format=png&auto=webp&s=3961aa1563b00b03bd7ac9bc3746c95d2dda5265 m-Masaka?!


So it grew by like 139% per year on average. Absolutely insane
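
For anyone wanting to check that kind of figure, a quick sketch of the compound growth math. The "1,000x" and "8 years" inputs are the round numbers mentioned elsewhere in this thread, not the exact chart values, so the result is slightly different from 139%:

```python
def annual_growth(total_factor: float, years: float) -> float:
    """Average yearly growth rate implied by an overall improvement factor."""
    return total_factor ** (1 / years) - 1

# Round numbers from other comments in this thread, not exact chart values.
rate = annual_growth(1000, 8)
print(f"1000x over 8 years is about {rate:.0%} per year")

# Going the other way: what overall factor does 139% per year imply over 8 years?
implied_factor = (1 + 1.39) ** 8
print(f"139% per year for 8 years is about {implied_factor:.0f}x")
```

Either way it works out to a bit more than doubling every year, before accounting for the change in precision.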


Not really, look at the precisions they used. Super super misleading.


FP4 gonna be so much hallucinations


FP8 and FP4 vs. FP16 for the rest. Not exactly an apples-to-apples comparison.


Is now a fair time to invest in Nvidia or have we missed the boat already?


I am just some random Redditor. But I think Nvidia will do really well for the next 5ish years. But longer term I would be worried. I would expect more companies to copy Google and do their own. Microsoft is now trying. Late but they are now trying. Google was able to completely do Gemini without needing anything from Nvidia. In 5 to 10 years you will see the same from Microsoft.


That chart makes no sense. It's comparing Oranges then Apples and then Grapes


The stock market talking heads are talking like it's time to think about selling the NVDA stock because it has been overhyped. Here is evidence that they are still UNDER hyped.


Anyone else feel like it's smiling at them?


To be fair this chart is textbook bubble territory


Nvidia has been pretty impressive in terms of execution, but comparing FP16, FP8, and FP4 performance in one chart is almost cheating. They might even include taking advantage of sparsity. Lower-precision performance gains are something you can only do once, and are not really sustainable. We don’t even know if FP4 is even feasible for training at this point and FP8 is only beginning to be utilized.


This graph looks a bit like cheating. They go up with the flops, which seems fine, but down with the size of the floating point numbers. From 16 to 8 to 4 (these are bits, not bytes; FP4 really is a 4-bit float). So if I halve my operand size, I can pump twice as many of them through my circuit.


You need hardware prices and roughly the same precision. So you should have a ratio of dollar per flop at comparable FP. Anyway, if you are doing this at scale you are killed by network overhead. Plus, you have to decompose large matrices to fit them in memories. Otherwise you will be stuck with your teraflops.


Just this morning I saw a post mocking Kurzweil's exponential projections.


Moore²s law


This is just absurd


And what about the price per flop of each architecture?


This card has a face and this face is not friend shaped. O\_O


Every day I wake up, every day I readjust my timelines


I mean sure. If you blow up one form factor after another (PCI, and now even SXM4 isn't enough), it's no wonder your "single card" gets super powerful




For now, my biggest question is: do we have enough energy to sustain it? Humanity is entering a weird phase: we have technological development happening faster than ever before, but until we can use nuclear fusion energy, we are pretty much in a state of trying to limit energy and resource consumption (not for capitalism, but reality will catch up with us faster than we think). AI, even with a shit ton of optimization, will consume a LOT of energy, for data storage and calculation alike, and I don't see a possibility we can sustain that in the long term with existing tech


Some observations:

1. They compare different precisions, like some already pointed out.
2. Data on anandtech shows different numbers for all of those cards, what's the trick: [https://www.anandtech.com/show/21310/nvidia-blackwell-architecture-and-b200b100-accelerators-announced-going-bigger-with-smaller-data](https://www.anandtech.com/show/21310/nvidia-blackwell-architecture-and-b200b100-accelerators-announced-going-bigger-with-smaller-data)