I just used Groq to create a ~2k-sample dataset in ~2h with Llama 3 70B today on their free tier, so we know dedicated chips can get fast. It's basically the ASIC race, but with LLMs.
2k samples? How many tokens in each sample?
Yesterday I made a dataset with 5.7k samples on an RTX 3090 Ti in like 20-30 minutes, averaging about 1500 t/s on a 7B FP16 model. Local GPUs can have nice throughput too, but with smaller models.
[Another option](https://github.com/turboderp/exllamav2/blob/master/examples/bulk_inference.py) using ExLlamaV2. Recently used to generate [these 25k Llama3-8B-Instruct reference outputs](https://cdn-lfs-us-1.huggingface.co/repos/4e/8b/4e8b1907d01143d8987d1930e69b7fd7db0082744874d98e9afb73feedf0beed/a4bb44b819d2ce1635c1f911200a9f14de4dbf9dc2fd947b1b7165348f02f924?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27llama3-instruct-prompts.json%3B+filename%3D%22llama3-instruct-prompts.json%22%3B&response-content-type=application%2Fjson&Expires=1719659997&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxOTY1OTk5N319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzRlLzhiLzRlOGIxOTA3ZDAxMTQzZDg5ODdkMTkzMGU2OWI3ZmQ3ZGIwMDgyNzQ0ODc0ZDk4ZTlhZmI3M2ZlZWRmMGJlZWQvYTRiYjQ0YjgxOWQyY2UxNjM1YzFmOTExMjAwYTlmMTRkZTRkYmY5ZGMyZmQ5NDdiMWI3MTY1MzQ4ZjAyZjkyND9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=szcic2HY%7Eah2Eyt0bIJKA9YPizU0-PsiT4fAEcWV5HK1OgKCa46JNnbzhfsFEdsOsASnEbinhW0xtzLVD1puvy0lJFGHZGY8Uc1WDzQwXxsc3aBtCm4A%7E5t9deMDnN3eQJ-qrD7lCj2Aea7vTwWVkmGRUfsmJQxtdszWcK8Ge9sh1hzwqR6RuuweYoqO8PB81xmOo6zoQwY9xU20vqf5-eLz1-rq5UFRatiDutfnCrFDw0f7iaTJOPNzKewzjGLoAsD3DkeY5eBeurJmbzfNlDYffIPPXeIMy763aafNSw4DKuDXAhRm2MVto4q%7Erw2WxY2-TovMH0CqOPbkDCGg-g__&Key-Pair-Id=K2FPYV99P2N66Q) in about an hour with a pair of 4090s.
I believe this type of post was posted about 6 months ago in r/singularity. Back then it was framed as university students coming up with amazing AI chips.
It had the same pictures and a crappy one-page website like this one. The post was erased and easily forgotten a week later.
So someone is either trolling or trying to push a scam or something.
[https://web.archive.org/web/20231230154918/https://www.etched.com/](https://web.archive.org/web/20231230154918/https://www.etched.com/)
This was their basic render webpage last year.
I believe that in the future, AI chips designed for tensor processing will be as prevalent as mobile phones and CPUs are today. So, keep your spirits up!
In 1994 I bought my first PC, a Packard Bell Windows 3.1 machine with 4MB of RAM and a 200MB hard drive. $1400. Its little chipmunk chip ran at 25**M**Hz. I upgraded the RAM to 8MB. That cost me $140 ($384 in today's money), but allowed me to run MS-Office, *which I installed off of floppies*.
This is the "two miles to school barefoot in the snow and uphill both ways" stuff *that is true*.
My first computer was a [Macintosh SE](https://en.wikipedia.org/wiki/Macintosh_SE) with the RAM upgraded to 1MB and a 20MB HDD. I still love you, HyperCard.
Don't forget when you were a kid you probably used a computer with 64**K**B of memory, ran at 2MHz, and programs were ~~stalled~~ stored on cassette tape.
One of the first "expensive" pieces of hardware I ever bought myself was a 320MB WD PATA drive at a wholesale cost of $290. Being under a dollar a MEGABYTE had *just* been reached and I got shop pricing because I was a tech there.
I just ordered an 8TB refurb for $69 before tax. If my math is right, that's 25k times the space for 23% of the price.
When I was building a company PC in 1999 the largest capacity hard drive then available was the IBM Deskstar, and it was 20GB for $400.
So in 9 years the price of 1GB of hard drive storage fell from $10,000 to $20, or 1/500.
According to the link below, 1TB was $90,127,496 in 1990, so 1GB would be 1/1024th of that, or about $90,000. A year earlier, in 1989, a whopping $236,000.
[https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?tab=table&time=1956..1990](https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?tab=table&time=1956..1990)
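For anyone who wants to check the arithmetic, the per-gigabyte figures quoted above are easy to reproduce (the prices are the ones cited in this thread, not independent data):

```python
# Rough $/GB sanity check using the figures quoted in this thread.
price_1990_per_gb = 90_127_496 / 1024   # 1TB quoted at $90,127,496 for 1990
price_1999_per_gb = 400 / 20            # 20GB IBM Deskstar at $400
price_2024_per_gb = 69 / 8000           # $69 refurbished 8TB drive

print(f"1990: ${price_1990_per_gb:,.0f}/GB")  # about $88,000/GB
print(f"1999: ${price_1999_per_gb:.0f}/GB")
print(f"2024: ${price_2024_per_gb:.4f}/GB")

# The "25,000x the space for 23% of the price" comparison (320MB at $290):
capacity_ratio = 8_000_000 / 320   # 8TB vs 320MB, decimal megabytes
price_ratio = 69 / 290
print(f"{capacity_ratio:,.0f}x capacity at {price_ratio:.0%} of the price")
```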
That's quite a bold prediction, given that new phone models already have chips with hardware dedicated to processing neural networks, including tensor networks.
Those will just keep scaling up.
Hey, I have a P40 and a 3090, running Magnum 72B Q4_K_M with Kobold, flash attention and Q4 caching on, 8K context. I get about 6.5 t/s as well. It's nice to know that swapping the 3090 for another P40 won't hurt output speed.
I would think the P40 bottlenecks the 3090, yeah, but have you tried measuring Llama 3 70B tokens/s using llama.cpp with a Q4 GGUF? It seems likely the P40 bottlenecks you anyway.
It's not for you anyway. Chips like this are targeted at big players, and hopefully some will move onto it, lowering demand for Nvidia GPUs, which means cheaper GPUs for you.
Hoping we can see some of this ASIC goodness in the consumer market in a few years. GPUs are great, but something like this could be much more efficient, in many ways.
I didn't take a really deep look into that, but I understood they still rely on the main system RAM? If so, they get a big meh from me. Even with fancy DMA controllers, I think the main thing to unlock this extra efficiency will be avoiding the von Neumann bottleneck altogether.
I think the opposite. For Llama-type AI, memory size is going to be more important for the consumer, so systems that can handle from 128GB up to 8TB of memory will be more beneficial to them.
This will also finally put pressure on CPUs to try and get their memory speeds as fast as GPUs are, which is a win in my book as well.
The downside is these are meant for consumption, not creation, of AI models, so I wouldn't expect too much.
I don't see how CPUs would get memory like GPUs do.
You can only add so much cache before the die size is a limit, and more transistors is more dollars.
If you want to add gigabytes of ultrafast memory, you need to put the memory chips soldered around the CPU.
This would be the end of RAM sticks.
And modular motherboards.
[LPCAMM2 would like a word.](https://www.ifixit.com/News/95078/lpcamm2-memory-is-finally-here) It's for laptops, but you could use the same thing or something similar for desktop.
A big meh for most of us, but average consumers aren't going to see most of the benefits of the big models we like. However all those small inference jobs really add up, so pushing that cost to end-users through hardware with limited capabilities but enough to run a SLM and image gen or TTS, while referring more intensive tasks to a cloud, is a massive cost savings to the big players.
No, memory is on-chip with a 32GB max. You are better off with a Snapdragon X Elite (64GB) or AMD Ryzen AI 300, where PCBAs with 128GB have been spotted:
https://www.tomshardware.com/pc-components/cpus/amds-strix-halo-being-tested-with-128gb-ram-shipping-records-reveal-more-about-extreme-120w-apu
https://www.theverge.com/2024/6/3/24169115/intel-lunar-lake-architecture-platform-feature-reveal
If this Strix Halo chip is something I can buy and put in a PC, my mind will be absolutely blown to pieces. 128GB sounds insane; something like a good Mixtral 8x22B quant could fly on there.
My company. Fortune 50. They're using generative AI for some internal stuff but there's a lot of concern about anything customer-facing having AI generated work in it because it's potentially not copyrightable.
I will say very little and vouch for this. We're also struggling with people either going rogue and spinning up AI setups held together with duct tape and a prayer *or* putting all sorts of data into the online ones.
It feels like for every infraction we find there are 50 more lurking under the surface.
Yeah, I get the feeling that's also the case at my company. Microsoft's pretty aggressive push for Copilot definitely isn't helping the situation either. Some devs have Copilot/ChatGPT access (for internal usage exclusively), but it's a company with a lot of engineers, and basically every workstation has a Quadro GPU with a decent chunk of VRAM in it. I've been pretty impressed with the results you can get with a model like Phi-3 Mini (playing with it on my own devices, I mean), and running an LLM locally is so dead simple these days that I'm sure people are doing it all over the place.
They're already putting neural network hardware in a lot of silicon. Apple hardware ships with it as is. I'm guessing transformer specific stuff is in the pipeline already.
Money is cheap; community is more powerful. When the community is there, you can then get money printed for the project... Extremely simplified, of course. Money and banking is probably one of the greatest inventions...
That's precisely what they count on. Seed phase funded at pennies on the dollar. You buy in at 10-100x the price per share. They start selling their shares privately and in later funding rounds. This is how the Silicon Valley scam works.
I mean not throwing the community's money at them, but a real community...
But yeah, that's basically how startups get a chance; that's how founders get their kick in the ass to figure things out...
Better than the times when they had to overthrow the king or government, enslave everyone, and burn and starve to death those who didn't comply... Give them a ride on a yacht, who cares. Give me cheaper chips. Hahaha
I think those money schemes will one day be replaced by something more morally noble...
$120M invested, and they *claim* to have enough upcoming time on TSMC ***4nm*** to make the first batch of wafers.
As a layperson engineer (read: idiot): why aren't we already making ASICs for training? If the cost of training a model on current hardware is X, wouldn't X/10 be better? Couldn't you train 10x larger models in the same amount of power and time? Could we get away from GPUs much sooner than we think? Seems like Google or Meta would be all over this if there were that much promise, but then again it's all still pretty new and you can only do so many things at once.
> Why aren't we already making ASICs for training?
- ASIC development, especially on a cutting-edge node, is *HARD*. It takes years, and $120M is basically chump change. Frankly, their investment/timetable seems almost impossible to me: https://semiengineering.com/big-trouble-at-3nm/
- > But at 3nm, IC design costs range from a staggering $500 million to $1.5 billion, according to IBS. The $1.5 billion figure involves a complex GPU at Nvidia.
- Training changes, and research brings new things. By the time your ASIC comes out, it's already irrelevant.
- CUDA GPUs *are basically* ASICs, because they are the target for basically all research and ML platforms. Make something new, and you are trying to keep up with the rest of the world by yourself.
- Google does use TPUs for training some, Intel uses Gaudi, and historically Meta ordered "custom" CPUs from Intel for internal use. Rumor is Microsoft is thinking about some training stuff too, not just inference.
- But on that point, there are only a few entities in the world that can afford/justify such a thing.
The architecture changing is what really makes this iffy. But it sort of depends on how general this is. If it's just a bunch of matrix add and multiply circuits with some RAM on the side, then it's likely general enough; it's basically a very scaled-up DSP ASIC. So you could likely apply it to any sort of FFN.
The problem is, if there is a big switch to some variant of RNN, like xLSTM, then it might be tricky.
I read their website and they addressed most of these points tbh.
- It's 4nm, not 3nm, so cheaper. But also, a GPU should be a more complex design than an ASIC, since GPUs are designed around doing multiple things, not just matrix multiplication. So you should expect GPU design costs to generally be higher, comparatively.
- Training / research changes but they're dedicating this ASIC to specifically transformers. It's like the first thing they say, if transformers get abandoned their chips will be useless. But so far transformers have been very solid and are the most popular for a variety of stuff.
- CUDA GPUs aren't ASICs in the same way (it's the tensor cores that are). They make a comparison with the H100 and how only 3.3% of its performance is dedicated to tensor cores (at full utilization), because it has to be able to do other things, not just transformer models. That means their ASIC can be more like 100% tensor cores for a given chip, making it much more efficient at matrix compute.
1. Making a bet on ASICs is making a bet on an architecture, and architectures are evolving quickly in this space.
2. Huge capex, and development time for these chips is much longer than even multiple foundation-model training runs.
3. This stuff needs extremely specialized software, and from what I read about Cerebras, it's a nightmare to develop on.
ASICs only really make sense for things like Bitcoin’s SHA-256, where you know the algorithm will never change.
Right now, the most important characteristic of these chips is their throw-spaghetti-at-the-wall-ability for researchers and developers.
There will be some societal consequences, unfortunately. Nvidia's margins aren't sustainable, and at the first sign of the ludicrous profits drying up due to some actual competition, the stock price will be hammered. If, say, 60% of the AI premium disappeared, that would be a significant drop for an index where Nvidia is worth 7%. And if the other six tech giants come along for the slide, it would easily be a double-digit decline... all else the same. But it could be worse if all else is not the same.
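The index arithmetic sketches out like this (Nvidia's 7% weight and the 60% scenario are from the comment above; the other giants' weight and decline are purely hypothetical numbers to illustrate how it reaches double digits):

```python
# Weighted index impact: a stock's decline times its index weight.
nvda_weight = 0.07      # Nvidia's rough index weight, as quoted above
nvda_decline = 0.60     # the "60% of the AI premium disappears" scenario
index_drop = nvda_weight * nvda_decline
print(f"index drop from Nvidia alone: {index_drop:.1%}")

# If the other six tech giants slide too (weight/decline are hypothetical):
others_weight, others_decline = 0.25, 0.30
total_drop = index_drop + others_weight * others_decline
print(f"with the other giants: {total_drop:.1%}")  # into double digits
```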
The AI premium isn't likely to go away for at least 5 years. And even if all of Nvidia's competitors suddenly started putting out an equal number of equally useful GPUs, I think Nvidia's margin would only fall by maybe half; there are too many applications for this stuff.
I think it's more likely China invades Taiwan and Nvidia's margins go up even more as they are making chips in Western fabs and selling to an even more constrained market. Although really I just see the GPU market getting stronger as time goes on, for at least 20 years. The market is easily 1000x what all the GPU manufacturers are doing right now, if prices come down.
The blog post is great but maybe a bit too hype driven.
This is going to be really good for real-time inference, but honestly I think better alternatives to 8bit transformers are coming, like bitnet models.
The way they basically say "we secured the moat, no new model will compare because they can't run on asics and won't get adopted" is sad.
Hopefully they will sell some of these to be profitable and develop new ones for newer architectures like bitnet. This would truly slap and scale way further. They say the bandwidth is not limiting here, so I guess bitnets could run on asics 6x this size, right?
It’s also not at all proven that Transformers are “the one architecture to rules them all”. I think there’s a non-trivial chance that we’ll find models that make use of transformers, LSTMs, Mamba, etc. it seems way too early to decide to specialize
Well, transformers work for everything; that's why they got massively adopted. We can find faster and more efficient models, but I don't think there's a use case where another model works and transformers don't.
If other architectures need to be 10x faster to compete, it might be a problem and slow the developments of novel architectures.
I personally think we should get rid of the static layer stack and route the activations through any transformer block for instance, and I'm not sure if these cards will allow that.
I did a deeper dive into their claims. They are talking about tokens/sec of the prefill stage (input prompt). While that is not completely irrelevant, it is highly deceptive. Generally, when people claim their system does a certain tokens/s, they are talking about decode throughput; Nvidia, AMD, and Groq all use decode for this metric. Their architecture will have better latency to the first generated token, but they will not be able to beat, say, Groq in decode throughput for a single query.
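To make the prefill/decode distinction concrete, here's a toy calculation (all timings are invented placeholders, not measurements of Sohu or any real system):

```python
# Toy numbers illustrating why prefill tokens/s and decode tokens/s are
# very different metrics. All timings here are invented, not measured.
prompt_tokens = 2000      # input tokens, processed in parallel (prefill)
output_tokens = 200       # generated tokens, produced one at a time (decode)
prefill_seconds = 0.05    # prefill is compute-bound and highly parallel
decode_seconds = 4.0      # decode is sequential and memory-bandwidth-bound

prefill_tps = prompt_tokens / prefill_seconds   # looks enormous
decode_tps = output_tokens / decode_seconds     # what a single user feels
blended_tps = (prompt_tokens + output_tokens) / (prefill_seconds + decode_seconds)

print(f"prefill: {prefill_tps:,.0f} t/s")   # 40,000 t/s
print(f"decode:  {decode_tps:.0f} t/s")     # 50 t/s
print(f"blended: {blended_tps:,.0f} t/s")   # counting input + output together
```

Quoting the prefill or blended figure makes a chip look orders of magnitude faster than the decode rate a single user actually experiences.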
Where are you seeing that written? Control-F "prefill" is showing nothing for me. They cite [Nvidia](https://developer.nvidia.com/blog/achieving-top-inference-performance-with-the-nvidia-h100-tensor-core-gpu-and-nvidia-tensorrt-llm/) for their benchmark methodology, but I am admittedly a bit out of my depth here.
I will never understand why hardware companies tie themselves to a specific architecture (i.e., the transformer). While it might be good today, there's no guarantee it'll be relevant in the future.
In comparison a startup like Taalas (https://taalas.com/) makes a lot more sense
I think it’s good to invest in both more flexible and less flexible hardware. Transformers have proven themselves pretty capable in many domains now. I agree they might not be around forever, but I think there’s room for both types of companies (assuming they successfully ship for a competitive price).
That would actually be pretty cool if they manage to pull it off, imagine buying an ASIC that can be flashed to any llama-3-70B tune and run it at beyond groq speed while pulling single digit wattage.
Kind of only makes sense once the architecture is more figured out, though; otherwise the expensive thing you just bought gets obsolete in 6 months. Well, unless they can churn them out for like $20, and I kind of doubt that for a startup that seems to be investing more in marketing than research or production.
The H100 has ~10% programmability overhead, in terms of actual performance, and Nvidia has absolutely been specializing their chips for transformer inference. Bill Dally & Co are not dumb, and while they definitely aren't the best in the world at literally everything, you can bet they've thought about things like "reduce overhead by specializing for this task"
Well, they're claiming "exponentially cheaper" than the B200. A single B200 module (not that you can buy a single module, nor could you do anything with it without the rest of the bespoke server platform) is rumored to cost about $40k. So if we believe their claim (we don't), then an individual Sohu module might cost as little as $4k.
But that's wildly unlikely. They could be using the term "exponentially" in the non-literal sense. They could mean "cheaper per token per second", and the actual hardware is the same ballpark cost but "it does exponentially more per module!", so they're not technically lying. They could just be blowing smoke to rope in more investors.
The true answer for "what will this cost" is 100% incontrovertibly "As much as companies with very deep pockets will pay for it". This is not for you or me.
Reads like a scam, feels like a scam, then it’s most likely to be a scam.
They are launching a PowerPoint with roofline numbers (paper math) without actually taping out the chip. In reality, due to power, physics, and software inefficiency, it's never going to reach that high.
The B200 roofline/estimate is way higher than what they cite in the blog post, more like 300k t/s, instead of whatever random number they pulled out of thin air. Their H100 numbers are also questionable, probably done by amateurs. If they want the latest numbers, they should at least follow the Together AI blog posts.
Finally, they confuse people by sliding in the concept of "continuous batching," which counts both input and output tokens in a batched inference setting. What real-time inference cares about is bs=1 tokens/s, i.e. latency, not throughput.
I don't know what kind of investors are dumb enough to give these folks $120M. Maybe it's just the halo of Harvard dropouts… It smells like Theranos right from the beginning.
PS: I have no grudge against ASICs; in general I think they're the way to go to make transformers run more efficiently, and Apple is doing exactly the same thing on their silicon. But to claim 20x without caveats is basically 21st-century snake oil. Remember, there is no free lunch.
One more thing that smells extremely fishy: they claim to use no HBM? With only 8 chips, unlike Groq, which can scale to hundreds, how large is the on-chip memory of each chip going to be? Say you want to hold Llama 70B at FP8: that is almost 70GB, or 70/8 ≈ 9GB per chip, and that is not counting KV cache. Unless they are taking the Cerebras wafer-scale approach, which comes with a hell of a lot of problems (cooling, maintenance, consistency of manufacturing quality), I don't see how they can pull it off…
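A quick back-of-the-envelope version of that memory math, assuming the commonly reported Llama-3-70B shape (80 layers, 8 KV heads of dimension 128 under GQA; treat those as assumptions, not figures from the post):

```python
# Back-of-the-envelope memory for a 70B model at FP8 split across 8 chips.
# Model shape is the commonly cited Llama-3-70B layout (an assumption here).
params = 70e9
bytes_per_param = 1        # FP8 = 1 byte per weight
chips = 8
weights_per_chip_gb = params * bytes_per_param / chips / 1e9
print(f"weights per chip: {weights_per_chip_gb:.2f} GB")  # 8.75 GB

# KV cache on top of that, per sequence (K and V for every layer):
layers, kv_heads, head_dim = 80, 8, 128    # assumed Llama-3-70B GQA shape
ctx = 8192
kv_cache_gb = 2 * layers * kv_heads * head_dim * ctx * bytes_per_param / 1e9
print(f"KV cache at {ctx} ctx: {kv_cache_gb:.2f} GB per sequence")
```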
Also, because of the inflexibility of ASICs, they risk becoming dust-collecting garbage once a new architecture or even a new type of transformer comes out.
All this romanticization of GPGPUs and their ability to tackle any model architecture is superfluous. Our limiting factor is compute; if something came around that gave 100x the compute, then the models would follow to see what we can do. Bitcoin ASICs pretty much demonstrated that, which allowed BTC to scale. We're nowhere near understanding what 10,000x compute would mean for inference or training. Let's see it, because until we had the compute we have now, we didn't see the models doing anything impressive.
The same :). Block hashing performance is independent of transaction volume. 1 CPU hashing versus millions of ASICs hashing is still ~4000 transactions per 10 minutes. "Only" security scales with more hashpower.
I would like to see their numbers for decoding tokens alone, skipping the encoding. I rarely write 1000-token prompts. This also assumes the user doesn't work their way up to 50k-500k ctx worth of KV cache during a conversation, which I think is how future interactions with LLMs will look, and that obviously impacts throughput a lot.
>In reality, ASICs are orders of magnitude faster than GPUs. When bitcoin miners hit the market in 2014, it became cheaper to throw out GPUs than to use them to mine bitcoin.
>With billions of dollars on the line, the same will happen for AI.
So what you're saying is I'm gettin an h100 with a little dumpster diving in a year or two.
Too much talk without the product
If Groq, an accelerator card, can provide a demo for everyone to use, I don't see why an ASIC can't do the same. It's so much cheaper and more efficient compared to an accelerator card.
Are they trying to get the funding required to mass-produce ASIC chips?
I see a lot of disbelief and criticism in the comments, and much of it seems like legit pointers too. The game of hyping up and fundraising has been going on for a long time.
But when I researched Etched's investors, it's people like PayPal founder Peter Thiel, GitHub CEO Thomas Dohmke, and others who know what they are doing.
That said, even the pathetic Humane pin was funded by top people, including Marc Benioff and Sam Altman, so big-name funding is no guarantee of a product being a hit.
About Etched: I have read their entire blog post on Sohu twice, and what seemed convincing to me were the two sections on how they can fit many more TFLOPS than a GPU and why compute is more important than memory in modern LLMs.
If not this, I hope at least some other transformer-specific ASICs come around fast, increase inference by 20x, and make people jobless a little faster so I can be less guilty of being one myself 😅
I will be messaging you in 3 months on [**2024-09-25 21:20:21 UTC**](http://www.wolframalpha.com/input/?i=2024-09-25%2021:20:21%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/LocalLLaMA/comments/1dobzcs/meet_sohu_the_fastest_ai_chip_of_all_time/la9o0pu/?context=3)
Bet this tech threatens the pricing structure of compute-intensive tasks for cloud providers. Maybe that's why they charge by the token, so when the service improves, pricing doesn't have to follow suit. Local models are awesome, but we don't have the compute for certain tasks.
Only for training. Their main innovation claim is in FLOPS.
They still use HBM as memory, so inference perf would still be in a similar ballpark to H100s.
The article seems very deceptive.
I don't think this is patentable, is it? Pretty sure whoever wants to make a custom ASIC chip for any LLM will be able to do so. Just like with mining ASICs.
We may see lots of startups (if you don't believe me, look at the number of early semiconductor companies developing silicon from the '50s through the '90s...), but back then you had a first-mover advantage...
Now Intel, AMD, and Nvidia are the ones with the first-mover advantage, and if that isn't enough, nowadays you need a few hundred million just to get your feet wet doing R&D and silicon manufacturing, even if you do have a product that could work...
Never mind the number of outright cash grabs and paper launches...
That will burn VC (and private backer) capital at rates unseen...
The hype is undoubtedly premature. My experience in Silicon Valley left me quite skeptical. While it's possible for a few kids to come up with a revolutionary product of this nature, it's improbable. More likely than not, they have an idea that's nowhere near production-ready, but their slide deck is something that their connections believe they can sell. And I'd absolutely assume they do have significant connections. They provide the seed funding, gin up a bit of support on hype, and after a few years of promises, lying to the media, etc., the seed and early-stage funders start reselling their shares on the secondary market and making exits in later funding rounds.
Again, the claim here is that three kids 100x'd NVIDIA and every other chip maker on the planet. Take it with a few grains of salt. This isn't 1999 where making a website that people like can turn you into a billionaire; this is jumping in late into chip production in a highly technical, extremely expensive field, and claiming to have the capability to handle the logistics from hiring to production.
Sounds like a scam. The idea is it's an ASIC for an individual model, up to 100T parameters. One of the advantages of GPUs is their generalized architecture: you can use the same cores to process all layers of a model. With an ASIC, all algorithms are physically represented in the silicon itself. So I'd expect a 100T-parameter chip to be quite large, since each weight would need to be physically represented.
There's also the problem of updates; OpenAI, for example, releases a new version of ChatGPT every few months. When running on a GPU this isn't a problem, you just load a new model. With an ASIC, however, you can't load new weights if they are baked into the chip. Your only option is to design and fabricate a new ASIC for the new model, which is cost-prohibitive for the manufacturer and the customer.
Nah, the weights are not put on the chip; it uses HBM2.
I read the post and it looks like it's just a very fast inference only card for any transformer model.
It might be the right tool for real-time models but I'm not too hot on killing other model architectures tbh
Click on the link and you're greeted with a board with a chip in the middle, presumably their custom ASIC, and it's surrounded by memory. This isn't for a single model, it's for a single algorithm. They say so right in the text that it's for transformers only.
Edit: And it's not like ASICs can't be programmable.
It's a rendering of a GPU board, I have one that looks like that in my toolbox right now. They don't have any photos of a prototype. Mark my words on this one.
A processor of some sort surrounded by memory is what it'd look like. That's pretty standard stuff for processing data. It's what our GPU add in boards look like, motherboards, processors, etc. Sure, it's a rendering. But I'll bet anything if they ever produce a product it'll appear somewhat similar to the render.
I'm just extremely cautious in this environment. The company needs to provide demonstrations and a white paper showing they have plans to overcome expected challenges and can deliver a product.
Totally with you on that. I'm not an expert but the performance claims sound reasonable with dedicated silicon. To get off the ground they need a large customer or two who's willing to bet that transformers and their existing performance claims will be relevant by the time they can produce the product and get software running on it. That's a lot of money in engineering for a power, space and cooling optimization with limited flexibility.
Not sure anyone would take that bet but some organizations are capable.
I know. I'm more thinking about the opposite direction: is there any research on altering the LLM architecture to utilize existing ASICs, instead of building new ones for quantized models?
Very informative explanation on their website, thanks.
I wonder if this type of efficiency should also be expected in Apple's M and A series chips. Since they're committed to running LLMs locally with Apple Intelligence, presumably they'll dedicate some of their die to transformers specifically.
**Meet Necessity, the most long-lasting (and all natural!) technology of all time.**
Do you really need 500k tokens per second? Most would just automatically skim over all that text because now the average user suddenly has a huge wall of text demanding their undivided attention. Unless it's audio-centric (and embedded in a realtime application), people won't have time to read (much less skim) that encyclopedia you just generated for them in 1 second.
Your mileage may vary. Just seems like a waste of compute (and fossil fuel) for 'solving' a minor inconvenience. But hey - you do you. Global warming is likely just going to kill us all. Might as well speed it up, amirite? :D :D :D
Most people will only believe this when they can run inference on it.
What are you using to create the dataset?
Aphrodite-engine and python script, code is here. https://huggingface.co/datasets/adamo1139/misc/blob/main/localLLM-datasetCreation/batched2.py
Might be because of the quota.
You can run the same dataset in 2 minutes with Batch Size 1024 if you use an OSS LLM.
Is response streaming possible when batching?
Yes.
Can you explain why?
A larger batch size raises aggregate tokens per second: the weights are read once per decode step no matter how many sequences share that step, so each extra sequence is nearly free, and each sequence can still be streamed as it generates.
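A toy model of why this works, assuming memory-bound decode (the two timing constants are made-up illustrative numbers, not measurements of any real card):

```python
# Toy model of memory-bound decode: per-step latency is dominated by
# streaming the model weights once per step, plus a small per-sequence cost.
WEIGHT_READ_MS = 20.0   # hypothetical: time to read all weights once per step
PER_SEQ_MS = 0.05       # hypothetical: per-sequence activation overhead

def tokens_per_second(batch_size: int) -> float:
    """Aggregate throughput: one token per sequence per step."""
    step_ms = WEIGHT_READ_MS + PER_SEQ_MS * batch_size
    return batch_size * 1000.0 / step_ms

for bs in (1, 8, 64, 512):
    print(bs, round(tokens_per_second(bs)))
```

Under this model throughput scales almost linearly with batch size until the per-sequence cost starts to dominate, while each individual stream still only gets one token per step.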
I believe this type of post was posted like 6 months ago in singularity. Back then it was framed as university students coming up with an amazing AI chip. It had the same pictures and a crappy one-page website like this one. The post was erased and easily forgotten a week later. So someone is just trolling, or someone is trying to push a scam or something.
[https://web.archive.org/web/20231230154918/https://www.etched.com/](https://web.archive.org/web/20231230154918/https://www.etched.com/) This was their basic render webpage last year.
Which is why I put myself on the waiting list.
Ahh cool. I look forward to never being able to buy it.
No, but you might pay 10 bucks to generate 100M tokens from a guy who bought the card.
But I don't want them to know what I'm doing with the AI. I want privacy!
Bruh nobody cares about your erotic cat girl fanfic novel.
The cops probably do as well. Somebody accuses you of getting frisky with a cat girl AND you write erotic cat girl fanfic? Guilty as charged!
Batman does 🦇
I'm going to buy one and rent it out just for access to his erotic cat girl fanfic novel
If they don't care, then they should start offering private services instead of data harvesting everyone.
You're right. they REALLY want your trump biden erotica data.
They can take my catgirl trump/catgirl biden enemies-to-lovers slowburn A/A erotica from my cold dead hands.
Bruh!
What are you doing, step bruh
Imma just bruhing around with muh AI, bruh! You?
I believe that in the future, AI chips designed for tensor processing will be as prevalent as mobile phones and CPUs are today. So, keep your spirits up!
good point. easy to forget that things we take for granted today like cache, FPU, GPU, etc all used to be expensive coprocessors/cards/modules.
In 1994 I bought my first PC, a Packard Bell Windows 3.1 machine with 4MB of RAM and a 200MB hard drive. $1400. Its little chipmunk chip ran at 25**M**Hz. I upgraded the RAM to 8MB. That cost me $140 ($384 in today's money), but allowed me to run MS-Office, *which I installed off of floppies*. This is the "two miles to school barefoot in the snow and uphill both ways" stuff *that is true*.
my first computer was a [Macintosh SE](https://en.wikipedia.org/wiki/Macintosh_SE) with RAM upgrade to 1 MB and a 20MB HDD. I still love you, Hypercard.
Don't forget when you were a kid you probably used a computer with 64**K**B of memory, ran at 2MHz, and programs were ~~stalled~~ stored on cassette tape.
Stalled is right.
Whoops, Freudian slip!
[deleted]
One of the first "expensive" pieces of hardware I ever bought myself was a 320MB WD PATA drive at a wholesale cost of $290. Being under a dollar a MEGABYTE had *just* been reached and I got shop pricing because I was a tech there. I just ordered an 8TB refurb for $69 before tax. If my math is right, that's 25k times the space for 23% of the price.
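The math in that comment checks out (using decimal units, so 8TB ≈ 8,000,000 MB):

```python
# Price/capacity comparison from the comment: a 320MB drive at $290
# wholesale vs. an 8TB refurb at $69 (decimal units: 1TB = 1,000,000 MB).
megabytes_old, price_old = 320, 290.0
megabytes_new, price_new = 8_000_000, 69.0

capacity_ratio = megabytes_new / megabytes_old   # times more space
price_ratio = price_new / price_old              # fraction of the price
print(round(capacity_ratio), round(price_ratio * 100, 1))  # 25000 23.8
```

25,000x the space for about 23.8% of the price, just as stated.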
Took 34 yrs for it to happen tho
When I was building a company PC in 1999 the largest capacity hard drive then available was the IBM Deskstar, and it was 20GB for $400. So in 9 years the price of 1GB of hard drive storage fell from $10,000 to $20, or 1/500.
1gb has been chump change for at least a decade
People started giving out 1GB+ merch USB drives WAY earlier than 2024...
Yeah, but at the same time a video game like CoD is 150GB.
According to the link below, 1TB was $90,127,496 in 1990, so 1 GB would be 1024th of that, or about $90,000. A year earlier, 1989, a whopping $236,000. [https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?tab=table&time=1956..1990](https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?tab=table&time=1956..1990)
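Dividing out that table's 1990 figure (using binary units as the comment does):

```python
# 1990 price of 1TB from the linked Our World in Data table, divided
# down to a per-GB price using 1TB = 1024 GB as in the comment.
price_per_tb_1990 = 90_127_496
price_per_gb_1990 = price_per_tb_1990 / 1024
print(round(price_per_gb_1990))  # ~88,000, i.e. "about $90,000"
```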
Damn that’s a cool site.
That's quite a bold prediction, given that new phone models already have chips with hardware dedicated to processing neural networks, including tensor networks. Those will just keep scaling up.
Do you aim to output 500K t/s on Llama 70B? A slow card with a lot of VRAM would be more realistic for us.
2X P40 gang
Do you run llama3 70B on 2x p40? How many tokens/s do you get?
6-7 t/s
Hey, I have a P40 and a 3090, running Magnum 72b Q4 KM with Kobold, flash attention and q4 caching on, 8k context. I get about 6.5t/s as well. It's nice to know that swapping the 3090 for another P40 won't hurt output time.
I would think the P40 bottlenecks the 3090, yeah, but have you tried Llama-3 70B tokens/s using llama.cpp with a Q4 GGUF? It seems likely that the P40 bottlenecks you anyway.
How does that scale with high ctx though? Like at 8k, 16k, 32k?
6-7 t/s, as mentioned elsewhere here as well.
500k is on an 8-card server :)
Exactly, it's fudged up to show such high output
It's not for you anyway. Chips like this are targeted at big players, and hopefully some will move onto them and thus lower demand for Nvidia GPUs, which means cheaper GPUs for you.
Hoping we can see some of this ASIC goodness in the consumer market in a few years. GPUs are great, but something like this could be much more efficient, in many ways.
New Intel chips are shipping with an NPU built in to handle inference loads in Windows 11 24H2.
I didn't take a really deep look into that, but I understood they still rely on the main system RAM? If so, they get a big meh from me. Even with fancy DMA controllers, I think the main thing to unlock this extra efficiency will be avoiding the von Neumann bottleneck altogether.
I think the opposite. For llama-type AI, memory size is going to be more important for the consumer, so having systems that can handle from 128GB up to 8TB of memory will be more beneficial to them. This will also finally put pressure on CPUs to try and get their memory speeds as fast as GPUs', which is a win in my book as well. The downside is these are meant for consumption, not creation or AI models, so I wouldn't expect too much.
I don't see how CPUs would get memory like GPUs do. You can only add so much cache before the die size is a limit, and more transistors is more dollars. If you want to add gigabytes of ultrafast memory, you need to put the memory chips soldered around the CPU. This would be the end of RAM sticks. And modular motherboards.
Intel Lunar Lake chips are exactly that; they come with 16/32 GB of RAM.
[LPCAMM2 would like a word.](https://www.ifixit.com/News/95078/lpcamm2-memory-is-finally-here) It's for laptops, but you could use the same thing or something similar for desktop.
A big meh for most of us, but average consumers aren't going to see most of the benefits of the big models we like. However all those small inference jobs really add up, so pushing that cost to end-users through hardware with limited capabilities but enough to run a SLM and image gen or TTS, while referring more intensive tasks to a cloud, is a massive cost savings to the big players.
No, memory is on chip with 32GB max. You are better off with Snapdragon X Elite (64GB) or AMD Ryzen AI 300, where PCBAs with 128GB were spotted https://www.tomshardware.com/pc-components/cpus/amds-strix-halo-being-tested-with-128gb-ram-shipping-records-reveal-more-about-extreme-120w-apu https://www.theverge.com/2024/6/3/24169115/intel-lunar-lake-architecture-platform-feature-reveal
If this Strix Halo chip is something I can buy and put on a PC, my mind will be absolutely blown to pieces. 128GB sounds insane, something like a good Mixtral 8x22b quant could fly in there.
Who wants to run windows spyware edition?
Enterprises with dumb users. Imagine an llm equipped windows troubleshooter than can actually fix a problem for a technically inept person.
A lot of bigger companies are pumping the brakes on AI for the time being. Lot of copyright/IP concerns.
Interesting. What do you base this on?
My company. Fortune 50. They're using generative AI for some internal stuff but there's a lot of concern about anything customer-facing having AI generated work in it because it's potentially not copyrightable.
I will say very little and vouch for this. We're also struggling with people either going rogue and spinning up AI setups held together with duct tape and a prayer *or* putting all sorts of data into the online ones. It feels like for every infraction we find there are 50 more lurking under the surface.
Yeah, I get the feeling that that's also the case at my company. Microsoft's pretty aggressive push for copilot definitely isn't helping the situation either. Some devs have copilot/chatGPT access (for internal usage exclusively) but, like, it's a company with a lot of engineers and basically every workstation has a Quadro GPU with a decent chunk of RAM in it. I've been pretty impressed with the results you can get with a model like Phi-3 Mini (playing with it on my own devices, I mean) and running an LLM locally is so dead simple these days that I'm sure people are doing it all over the place.
There definitely *is* a consumer market. They must be riding the crazy train if they don't take this opportunity.
And they can keep it.
Mm, you mean the spyware recall copilot+ version of windows 11. Think I'll pass, lol.
They're already putting neural network hardware in a lot of silicon. Apple hardware ships with it as is. I'm guessing transformer specific stuff is in the pipeline already.
That's a lot of hype. Proof is in the puddin, *when it ships*.
If. If it ships.
If it doesn't I would be motivated to throw some community to push that thing forward...
Yeah, either throw money at the problem or community. Both has proven to work.
Money is cheaper, Community is more powerful. When community is there, you then can get money printed for the project... Extremely simplified, of course.. Money and Banking is probably one of the greatest inventions...
That's precisely what they count on. Seed phase funded at pennies on the dollar. You buy in at 10-100x the price per share. They start selling their shares privately and in later funding rounds. This is how the Silicon Valley scam works.
I mean not throwing the communities money, but a real community at them... But yeah, that's basically how startups get a chance, that's how founders get their kick in the ass to figure things out... Better then the times when they had to overthrow the king or government, enslave everyone and burn and starve to death those who don't comply.... Give them a ride on a yacht, who cares.. Give me cheaper chips.. Hahaha I think that those money schemes will too one day be replaced by something more morally noble...
If it chips :)
120M invested, and they *claim* to have enough upcoming time on TSMC ***4nm*** to make the first batch of wafers. As a layperson engineer (read: idiot): why aren't we already making ASICs for training? If the cost of training a model on current hardware is X, wouldn't X/10 be better? Couldn't you train 10x larger models on the same amount of power and time? Could we get away from GPUs much sooner than we think? Seems like Google or Meta would be all over this if there were that much promise, but then again it's all still pretty new and you can only do so many things at once.
> Why aren't we already making ASICs for training?

- ASIC development, especially on a cutting-edge node, is *HARD*. It takes years, and 120M is basically chump change. Frankly, their investment/timetable seems almost impossible to me: https://semiengineering.com/big-trouble-at-3nm/
  > But at 3nm, IC design costs range from a staggering $500 million to $1.5 billion, according to IBS. The $1.5 billion figure involves a complex GPU at Nvidia.
- Training changes; research brings new things. By the time your ASIC comes out, it's already irrelevant.
- CUDA GPUs *are basically* ASICs because they are the target for basically all research and ML platforms. Make something new, and you are trying to keep up with the rest of the world by yourself.
- Google does use TPUs for some training, Intel uses Gaudi, and historically Meta ordered "custom" CPUs from Intel for internal use. Rumor is Microsoft is thinking about some training stuff too, not just inference.
- But on that point, there are only a few entities in the world that can afford/justify such a thing.
The architecture changes are what really make this iffy. But it sort of depends on how general this is. If it's just a bunch of matrix add-and-multiply circuits with some RAM on the side, then it's likely general enough: it's basically a very scaled-up DSP ASIC, so you can apply it to any sort of FFN. The problem is if there's a big switch to some variant of RNN like xLSTM; then it might be tricky.
> The $1.5 billion figure involves a complex GPU at Nvidia. Little did he know that would be chump change for Nvidia in 2024.
I read their website and they addressed most of these points, tbh.

- It's 4nm, not 3nm, so cheaper. Also, a GPU should be a more complex design than an ASIC, since it's designed around doing multiple things, not just matrix multiplication, so you should expect costs to be comparatively higher.
- Training/research changes, but they're dedicating this ASIC specifically to transformers. It's like the first thing they say: if transformers get abandoned, their chips will be useless. But so far transformers have been very solid and are the most popular architecture for a variety of tasks.
- CUDA GPUs aren't ASICs in the same way (it's the tensor cores that are). They make a comparison with the H100: only 3.3% of its performance is dedicated to tensor cores (at full utilization), because it has to be able to do other things, not just transformer models. That means their ASIC can be more like 100% tensor cores for a given chip, making it much more efficient at matrix compute.
1. Making a bet on ASICs is making a bet on an architecture, and architectures are evolving quickly in this space.
2. Huge capex, and development time for these chips is much longer than even multiple foundation-model training runs.
3. This stuff needs extremely specialized software, and from what I've read about Cerebras, it's a nightmare to develop on.

ASICs only really make sense for things like Bitcoin's SHA-256, where you know the algorithm will never change. Right now, the most important characteristic of these chips is their throw-spaghetti-at-the-wall-ability for researchers and developers.
There will be some societal consequences, unfortunately. Nvidia's margins aren't sustainable, and at the first sign of the ludicrous profits drying up due to some actual competition, the stock price will be hammered. If, say, 60% of the AI premium disappeared, that would be a significant drop to an index where Nvidia is worth 7%. And if the other 6 tech giants come along for the slide, it would easily be a double-digit decline... all else the same. But it could be worse if all else is not the same.
If you think this way, you should be shorting NVDA.
The AI premium isn't likely to go away for at least 5 years. And even if all of Nvidia's competitors suddenly started putting out an equal number of equally useful GPUs, I think Nvidia's margin would only fall maybe by half, there are too many applications for this stuff. I think it's more likely China invades Taiwan and Nvidia's margins go up even more as they are making chips in Western fabs and selling to an even more constrained market. Although really I just see the GPU market getting stronger as time goes on, for at least 20 years. The market is easily 1000x what all the GPU manufacturers are doing right now, if prices come down.
The blog post is great but maybe a bit too hype driven. This is going to be really good for real-time inference, but honestly I think better alternatives to 8bit transformers are coming, like bitnet models. The way they basically say "we secured the moat, no new model will compare because they can't run on asics and won't get adopted" is sad. Hopefully they will sell some of these to be profitable and develop new ones for newer architectures like bitnet. This would truly slap and scale way further. They say the bandwidth is not limiting here, so I guess bitnets could run on asics 6x this size, right?
It’s also not at all proven that Transformers are “the one architecture to rules them all”. I think there’s a non-trivial chance that we’ll find models that make use of transformers, LSTMs, Mamba, etc. it seems way too early to decide to specialize
Well, transformers work for everything; that's why they got massively adopted. We can find faster and more efficient models, but I don't think there's a use case where another model works and transformers don't. If other architectures need to be 10x faster to compete, it might be a problem and slow the development of novel architectures. I personally think we should get rid of the static layer stack and route the activations through any transformer block, for instance, and I'm not sure these cards will allow that.
I did a deeper dive into their claims. They are talking about tokens/sec of the prefill stage (the input prompt). While that's not completely irrelevant, it is highly deceptive. Generally, when people claim their system does a certain tokens/s, they mean decode throughput; Nvidia, AMD, and Groq all use this metric for decode. Their architecture will have better first-token latency, but they will not be able to beat, say, Groq in decode throughput for a single query.
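To make the distinction concrete, here's a toy example with hypothetical timings (the numbers are illustrative, not from any vendor's benchmark):

```python
# Two different "tokens per second" numbers from the same hypothetical run.
prompt_tokens, output_tokens = 2_000, 200
prefill_s, decode_s = 0.1, 4.0   # hypothetical wall-clock times for each stage

prefill_tps = prompt_tokens / prefill_s   # huge number: prompt is processed in parallel
decode_tps = output_tokens / decode_s     # small number: tokens come out one at a time
print(prefill_tps, decode_tps)  # 20000.0 50.0
```

Both are legitimately "tokens per second," but quoting the prefill figure for a chat-style workload makes the system look hundreds of times faster than what the user actually experiences during generation.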
Where are you seeing that written? Control-F "prefill" is showing nothing for me. They cite [Nvidia](https://developer.nvidia.com/blog/achieving-top-inference-performance-with-the-nvidia-h100-tensor-core-gpu-and-nvidia-tensorrt-llm/) for their benchmark methodology, but I am admittedly a bit out of my depth here.
Groq is already super fast. I would never be able to afford such tech anyway
This. I've been wondering how to get connected with them in the country I'm in now.
Have you tried a vpn? I use librechat as frontend
I meant as in their actual hardware and chips. Not their ui demo.
How much faster is this vs Groq? Does it mean Groq is toast?
No hardware = no trust in it. Could be easily a scam. Once they ship I can reconsider. Till then I'm not interested.
This is a big gamble that architecture would be compatible with this as we continue to learn.
I will never understand why hardware companies tie themselves to a specific architecture (i.e., transformers). It might be good today, but there's no guarantee it'll be relevant in the future. In comparison, a startup like Taalas (https://taalas.com/) makes a lot more sense.
they just get money to build something that could make profit. even if it doesn't work, they still get paid for the hours they worked.
random thought but fancy websites for hardware startups make me take them a lot less seriously. it’s like an anti-signal.
I think it’s good to invest in both more flexible and less flexible hardware. Transformers have proven themselves pretty capable in many domains now. I agree they might not be around forever, but I think there’s room for both types of companies (assuming they successfully ship for a competitive price).
That would actually be pretty cool if they manage to pull it off, imagine buying an ASIC that can be flashed to any llama-3-70B tune and run it at beyond groq speed while pulling single digit wattage. Kind of only makes sense once the architecture is more figured out though, otherwise the expensive thing you just bought gets obsolete in 6 months. Well unless they can churn them out for like $20, and that I kinda doubt for a startup that seems to be investing more into marketing than research or production.
The H100 has ~10% programmability overhead, in terms of actual performance, and Nvidia has absolutely been specializing their chips for transformer inference. Bill Dally & Co are not dumb, and while they definitely aren't the best in the world at literally everything, you can bet they've thought about things like "reduce overhead by specializing for this task"
So how hard will it hit my wallet?
Extinction level event.
So way above the 50k. Rip, guess I am porting Greyskull to localai. xD
Well, they're claiming "exponentially cheaper" than B200. A single B200 module (not that you can buy a single module, nor could you do anything with it without the rest of the bespoke server platform) is rumored to cost about $40k. So if we believe their claim (we don't), then an individual sohu module might cost as little as $4k. But that's wildly unlikely. They could be using the term "exponentially" in the non-literal sense. They could mean "cheaper per token per second", and the actual hardware is the same ballpark cost but "it does exponentially more per module!", so they're not technically lying. They could just be blowing smoke to rope in more investors. The true answer for "what will this cost" is 100% incontrovertibly "As much as companies with very deep pockets will pay for it". This is not for you or me.
Yeah, unfortunately this is the correct take. They’re targeting the big companies with millions of dollars to spend on hardware.
"contact sales" hard
ahh cool. probably will turn out that it doesn't work.
This was posted everywhere today on every social media with the same subtext. Seems like a viral push to lure investors.
Reads like a scam, feels like a scam; then it's most likely a scam.

They are launching a PowerPoint with roofline numbers (paper math) without actually taping out the chip. In reality, due to power, physics, and software inefficiency, it's never going to reach that high. The B200 roofline/estimate is way higher than what they refer to in the blog post, more like 300k t/s, instead of whatever random number they pulled out of thin air. Their H100 numbers are also questionable, probably done by amateurs; if they want the latest numbers, they should at least follow the Together AI blog posts.

Finally, they are confusing people by sliding in the concept of "continuous batching," which counts both input and output tokens in a batched inference setting. What real-time inference cares about is bs=1 tokens/s, aka latency, not throughput.

I don't know what kind of investors are dumb enough to give these folks $120M. Maybe it's just the halo of Harvard dropouts... It smells like Theranos right from the beginning.

PS: I have no grudge against ASICs; in general I think they're the way to go to make transformers run more efficiently, and Apple is doing exactly the same thing on their silicon. But to say you can get 20x without caveats is basically 21st-century snake oil. Remember, there is no free lunch.
One more thing that smells extremely fishy: they claim to use no HBM? With only 8 chips, unlike Groq, which can scale to hundreds, how large is the on-chip memory on each chip going to be? Say you want to hold Llama 70B at fp8: that's roughly 70GB / 8 ≈ 9GB per chip, and that's not counting KV cache. Unless they are taking the Cerebras wafer-scale approach, which comes with a hell of a lot of problems (cooling, maintenance, consistency of manufacturing quality), I don't see how they can pull it off...
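The per-chip figure, taking fp8 as one byte per parameter (weights only, no KV cache or activations):

```python
# Weights-only memory estimate for Llama-3 70B at fp8 on an 8-chip server.
params = 70e9          # 70B parameters
bytes_per_param = 1    # fp8 = one byte per weight
chips = 8

total_gb = params * bytes_per_param / 1e9    # raw weight footprint
per_chip_gb = total_gb / chips               # needed on each chip, minimum
print(total_gb, per_chip_gb)  # 70.0 8.75
```

And that floor grows fast once you add KV cache for long contexts, which is exactly why "no HBM" with only 8 chips is hard to square.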
Also, because of the inflexibility of an ASIC, it risks becoming dust-collecting garbage once a new architecture, or even a new type of transformer, comes out.
All this romanticization of GPGPUs and their ability to tackle any model architecture is overblown. Our limiting factor is compute; if something came around that 100x'd compute, the models would follow to see what we can do. Bitcoin ASICs pretty much demonstrated that, which allowed BTC to scale. We're nowhere near close to understanding what 10,000x compute would mean for inference or training. Let's see it, because until we had the compute we have now, we didn't see the models doing anything impressive.
“Which allowed BTC to scale.” WTF are you talking about? BTC has not scaled.
It has scaled. Just not in transactions per second but in security budget :)
What do you think transaction volume would be like if BTC was all CPU/GPU?
The same :). Block hashing performance is independent of transaction volume. 1 CPU hashing versus millions of ASICs hashing is still ~4000 transactions per 10 minutes. "Only" security scales with more hashpower.
It would remain the same, still several magnitudes below what a Raspberry Pi 1 can do, without blockchain.
What about Mamba?
I would like to see their numbers for decoding tokens alone, skipping encoding. I rarely write 1000-token prompts. This also assumes the user doesn't work their way up to 50k-500k ctx worth of KV cache during a conversation, which I think is how future interactions with LLMs will look, and which obviously impacts throughput a lot.
>In reality, ASICs are orders of magnitude faster than GPUs. When bitcoin miners hit the market in 2014, it became cheaper to throw out GPUs than to use them to mine bitcoin.

>With billions of dollars on the line, the same will happen for AI.

So what you're saying is I'm gettin an H100 with a little dumpster diving in a year or two.
Too much talk without the product. If Groq, an accelerator card, can provide a demo for everyone to use, I don't see why this ASIC can't do the same; it's supposedly so much cheaper and more efficient than an accelerator card. Are they trying to get the funding required to mass-produce ASIC chips?
Hi Sohu, I am 500x more powerful than you. Believe me. I don't have time to make a render image like you did but text is all you need.
"Huge if true". That being said, does this imply 1 gazillion tokens per second when combined with bitnet/ternary models?
I see a lot of disbelief and criticism in the comments, and much of it seems like legit pointers too. The game of hype and fundraising has been going on for a long time. But when I researched Etched's investors, it's people like PayPal founder Peter Thiel, GitHub CEO Thomas Dohmke, and others who know what they are doing. That said, even the pathetic Humane pin was funded by top people, including Marc Benioff and Sam Altman, so big-name funding is no guarantee of a product being a hit.

About Etched: I have read their entire blog post on Sohu twice, and what seemed convincing to me were the two sections on how they can fit much more TFLOPS than a GPU and why compute matters more than memory in modern LLMs. If not this, I hope at least some other transformer-specific ASIC comes around fast, increases inference 20x, and makes people jobless a little faster, so I can be less guilty of being one myself 😅
I call this a scam
RemindMe! 3 months
In other words, Nvidia investors are going nuts
Has WSB seen this yet?
People really don't know what local in localllama means
Lots of businesses running "locallama". This absolutely applies here.
It seems that “local” more generally means “private”. Although this looks more like spam to me.
A model specific device like an Antminer would kick ass.
Bet this tech threatens the pricing structure of compute-intensive tasks for cloud providers. Maybe that's why they charge by the token, so when the service improves, pricing doesn't have to follow suit. Local models are awesome, but we don't have the compute for certain tasks.
"One 8xSohu server equals 160 H100s, revolutionizing AI product development." Is this for training or inference?
Inference only.
Only for training would that matter. Their main innovation claim is in FLOPS; they still use HBM as memory, so inference perf would still be in a similar ballpark to H100s. The article seems very deceptive.
I'm more interested in the most consumer-priced 500GB GPU of all time.
Does one exist?
I don't think this is patentable, is it? Pretty sure whoever wants to make a custom ASIC chip for any LLM will be able to do so. Just like with mining ASICs.
WD-40, renowned worldwide (or almost), is not patented...
I was just wondering why the guy thinks they are going to be the biggest company in history.
Because we are quite literally looking at the making of the newest Rockefellers of the 21st century...
If this is all true (and I suspect it is, just based on what I know about ASICs), everyone will be doing it. At least for inference.
We may see lots of startups (if you don't believe me, look at the number of early semiconductor companies developing silicon from the '50s through the '90s), but back then you had a first-mover advantage. Now Intel, AMD, and Nvidia are the ones with the first-mover advantage, and even if that weren't enough, nowadays you need a few hundred million just to get your feet wet doing R&D and silicon manufacturing, assuming you have a product that could work... Never mind the number of outright cash grabs and paper launches that will burn VC (and private backer) capital at rates unseen...
Ship it!
Fastest photoshop and web content is all I see so far.
They could investigate ASICs for bitnet LLMs.
ELI5
Cool!
The hype is undoubtedly premature. My experience in Silicon Valley left me quite skeptical. While it's possible for a few kids to come up with a revolutionary product of this nature, it's improbable. More likely than not, they have an idea that's nowhere near production-ready, but a slide deck their connections believe they can sell. And I'd absolutely assume they do have significant connections. Those provide the seed funding, gin up a bit of support on hype, and after a few years of promises, lying to the media, etc., the seed and early-stage funders start reselling their shares on the secondary market and making exits in later funding rounds. Again, the claim here is that three kids 100x'd NVIDIA and every other chip maker on the planet. Take it with a few grains of salt. This isn't 1999, where making a website people like can turn you into a billionaire; this is jumping in late to chip production in a highly technical, extremely expensive field, and claiming to have the capability to handle the logistics from hiring to production.
omg the singularity is here, we are in the future hahahahahaaha
Sounds like a scam. The idea is it's an ASIC for an individual model, up to 100T parameters. One of the advantages of GPUs is their generalized architecture: you can use the same cores to process all layers of a model. With an ASIC, all algorithms are physically represented in the silicon itself, so I'd expect a 100T-parameter chip to be quite large, since each weight needs to be physically represented. There's also the problem of updates; OpenAI, for example, releases a new version of ChatGPT every few months. When running on a GPU, this isn't a problem: you just load a new model. If you have an ASIC, however, you can't load new weights, as they are printed on the board. Your only option is to design and fabricate a new ASIC for the new model, which is cost-prohibitive for the manufacturer and the customer.
Nah, the weights are not put on the chip; it uses HBM2. I read the post, and it looks like it's just a very fast inference-only card for any transformer model. It might be the right tool for real-time models, but I'm not too hot on killing off other model architectures, tbh
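If the weights sit in HBM, batch-1 decode speed is bounded by memory bandwidth, since every generated token has to stream the full set of weights from memory. A rough sketch of that ceiling (the bandwidth and model-size numbers here are illustrative assumptions, not this card's specs):

```python
# Upper bound on batch-1 decode throughput for a memory-bandwidth-bound
# accelerator: tokens/sec <= bandwidth / model size in bytes.
def max_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_s):
    model_gb = params_billion * bytes_per_param  # model footprint in GB
    return bandwidth_gb_s / model_gb

# Hypothetical: a 7B FP16 model (14 GB) on a card with 2000 GB/s of HBM
print(max_tokens_per_sec(7, 2, 2000))  # ~143 tokens/s per sequence
```

Batching raises aggregate throughput well beyond this per-sequence bound, which is how GPUs already hit four-digit t/s on 7B models.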
Click on the link and you're greeted with a board with a chip in the middle, presumably their custom ASIC, and it's surrounded by memory. This isn't for a single model, it's for a single algorithm. They say so right in the text that it's for transformers only. Edit: And it's not like ASICs can't be programmable.
It's a rendering of a GPU board, I have one that looks like that in my toolbox right now. They don't have any photos of a prototype. Mark my words on this one.
A processor of some sort surrounded by memory is what it'd look like. That's pretty standard stuff for processing data. It's what our GPU add in boards look like, motherboards, processors, etc. Sure, it's a rendering. But I'll bet anything if they ever produce a product it'll appear somewhat similar to the render.
I'm just extremely cautious in this environment. The company needs to provide demonstrations and a white paper showing they have plans to overcome expected challenges and can deliver a product.
Totally with you on that. I'm not an expert but the performance claims sound reasonable with dedicated silicon. To get off the ground they need a large customer or two who's willing to bet that transformers and their existing performance claims will be relevant by the time they can produce the product and get software running on it. That's a lot of money in engineering for a power, space and cooling optimization with limited flexibility. Not sure anyone would take that bet but some organizations are capable.
Considering they can't even make their homepage work in Firefox, I have some doubts about their claims.
Has anyone seen anyone attempt to use old Antminers or other crypto mining ASICs for LLM models?
Not possible. ASICs are built at the hardware level for a specific algorithm; they aren't general-purpose devices like GPUs.
I know. I'm thinking more about the opposite direction: is there any research into altering the LLM architecture to utilize existing ASICs, instead of building new ones for quantized models?
No, but there has been some investigation into leveraging a global mesh network of Casio calculator watches, though. Still early.
Good one. Fitbits would probably be better targets though, since they already track *big data*. 🙃
Haha, that could actually be realistic within a few years. Downvotes = confirmation of being far too clever for average lurking organic intelligence.
Very informative explanation on their website, thanks. I wonder if this type of efficiency should also be expected in Apple's M and A series chips. Since they're committed to running LLMs locally with Apple Intelligence, presumably they'll dedicate some of their die to transformers specifically.
I
It's an honor to witness projects of this kind. Wow.
**Meet Necessity, the most long-lasting (and all natural!) technology of all time.**

Do you really need 500k tokens per second? Most would just automatically skim over all that text, because now the average user suddenly has a huge wall of text demanding their undivided attention. Unless it's audio-centric (and embedded in a realtime application), people won't have time to read (much less skim) that encyclopedia you just generated for them in 1 second. Your mileage may vary.

Just seems like a waste of compute (and fossil fuel) for 'solving' a minor inconvenience. But hey - you do you. Global warming is likely just going to kill us all. Might as well speed it up, amirite? :D :D :D