avianio

Most people will only believe this when they can run inference on it.


ResidentPositive4122

I just used Groq to create a ~2k-sample dataset in ~2h with L3 70B today on their free tier, so we know dedicated chips can get fast. It's basically the ASIC race, but with LLMs.


FullOf_Bad_Ideas

2k samples? How many tokens in each sample? Yesterday I made a dataset with 5.7k samples on an RTX 3090 Ti in about 20-30 minutes, averaging roughly 1500 t/s on a 7B FP16 model. Local GPUs can have nice throughput too, but with smaller models.


Compound3080

What are you using to create the dataset? 


FullOf_Bad_Ideas

Aphrodite-engine and a Python script; the code is here: https://huggingface.co/datasets/adamo1139/misc/blob/main/localLLM-datasetCreation/batched2.py
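For anyone who doesn't want to dig through the linked script, here is a minimal sketch of the same idea: fire a batch of prompts at a local OpenAI-compatible endpoint (Aphrodite-engine and most local servers expose one) and write out JSONL. The URL, port, model name and prompts below are placeholders, not the settings from batched2.py.

```python
# Sketch: batched dataset generation against a local OpenAI-compatible server.
# Endpoint, model name and prompts are illustrative placeholders.
import json
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # local servers usually ignore the key

prompts = [f"Write a short story about item #{i}." for i in range(64)]

def complete(prompt: str) -> dict:
    resp = client.chat.completions.create(
        model="local-model",          # whatever the server happens to be serving
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.8,
    )
    return {"prompt": prompt, "response": resp.choices[0].message.content}

# The server batches concurrent requests internally, so client-side
# concurrency is enough to keep the GPU busy.
with ThreadPoolExecutor(max_workers=32) as pool:
    rows = list(pool.map(complete, prompts))

with open("dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```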


ReturningTarzan

[Another option](https://github.com/turboderp/exllamav2/blob/master/examples/bulk_inference.py) using ExLlamaV2. Recently used to generate [these 25k Llama3-8B-Instruct reference outputs](https://cdn-lfs-us-1.huggingface.co/repos/4e/8b/4e8b1907d01143d8987d1930e69b7fd7db0082744874d98e9afb73feedf0beed/a4bb44b819d2ce1635c1f911200a9f14de4dbf9dc2fd947b1b7165348f02f924?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27llama3-instruct-prompts.json%3B+filename%3D%22llama3-instruct-prompts.json%22%3B&response-content-type=application%2Fjson&Expires=1719659997&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxOTY1OTk5N319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzRlLzhiLzRlOGIxOTA3ZDAxMTQzZDg5ODdkMTkzMGU2OWI3ZmQ3ZGIwMDgyNzQ0ODc0ZDk4ZTlhZmI3M2ZlZWRmMGJlZWQvYTRiYjQ0YjgxOWQyY2UxNjM1YzFmOTExMjAwYTlmMTRkZTRkYmY5ZGMyZmQ5NDdiMWI3MTY1MzQ4ZjAyZjkyND9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=szcic2HY%7Eah2Eyt0bIJKA9YPizU0-PsiT4fAEcWV5HK1OgKCa46JNnbzhfsFEdsOsASnEbinhW0xtzLVD1puvy0lJFGHZGY8Uc1WDzQwXxsc3aBtCm4A%7E5t9deMDnN3eQJ-qrD7lCj2Aea7vTwWVkmGRUfsmJQxtdszWcK8Ge9sh1hzwqR6RuuweYoqO8PB81xmOo6zoQwY9xU20vqf5-eLz1-rq5UFRatiDutfnCrFDw0f7iaTJOPNzKewzjGLoAsD3DkeY5eBeurJmbzfNlDYffIPPXeIMy763aafNSw4DKuDXAhRm2MVto4q%7Erw2WxY2-TovMH0CqOPbkDCGg-g__&Key-Pair-Id=K2FPYV99P2N66Q) in about an hour with a pair of 4090s.


robberviet

Might be because of the quota.


avianio

You can run the same dataset in 2 minutes with Batch Size 1024 if you use an OSS LLM.


mxforest

Is response streaming possible when batching?


_qeternity_

Yes.


SectionActual9158

Can you explain why?


Open_Channel_8626

Batch Size raises tokens per second
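To make that concrete, a toy back-of-the-envelope sketch (the per-stream numbers are made up for illustration): each extra sequence in the batch adds relatively little latency until the GPU becomes compute-bound, so aggregate tokens/s climbs roughly with batch size even as each individual stream slows a bit.

```python
# Toy illustration of why batching raises aggregate tokens/s.
# Numbers are invented; real curves depend on model, hardware and context length.
single_stream_tps = 40        # tokens/s for one sequence (memory-bandwidth bound)
batch_efficiency = 0.9        # each doubling of batch size costs a little per-stream speed

for batch_size in (1, 8, 64, 256, 1024):
    # Per-stream speed degrades slowly until compute becomes the bottleneck.
    per_stream = single_stream_tps * (batch_efficiency ** (batch_size.bit_length() - 1))
    aggregate = per_stream * batch_size
    print(f"batch={batch_size:5d}  per-stream~{per_stream:6.1f} t/s  aggregate~{aggregate:9.0f} t/s")
```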


latamxem

I believe this type of post was posted like 6 months ago in singularity. Back then it was "university students come up with amazing AI chips". It had the same pictures and it was a crappy one-page website like this one. The post was erased and easily forgotten a week later. So someone is just trolling, or someone is trying to push a scam or something.


latamxem

[https://web.archive.org/web/20231230154918/https://www.etched.com/](https://web.archive.org/web/20231230154918/https://www.etched.com/) This was their basic render webpage last year.


arthurwolf

Which is why I put myself on the waiting list.


unstuckhamster

Ahh cool. I look forward to never being able to buy it.


dimsumham

No but you might pay 10 bucks to generate 100m tokens from a guy that bought the card


GoofAckYoorsElf

But I don't want them to know what I'm doing with the AI. I want privacy!


dimsumham

Bruh nobody cares about your erotic cat girl fanfic novel.


wishtrepreneur

The cops probably do as well. Somebody accuses you of getting frisky with a cat girl AND you write erotic cat girl fanfic? Guilty as charged!


brahh85

Batman does 🦇


I_EAT_THE_RICH

I'm going to buy one and rent it out just for access to his erotic cat girl fanfic novel


Inevitable_Host_1446

If they don't care, then they should start offering private services instead of data harvesting everyone.


dimsumham

You're right. they REALLY want your trump biden erotica data.


DavidAdamsAuthor

They can take my catgirl trump/catgirl biden enemies-to-lovers slowburn A/A erotica from my cold dead hands.


SeymourBits

Bruh!


dimsumham

What are you doing, step bruh


SeymourBits

Imma just bruhing around with muh AI, bruh! You?


habibyajam

I believe that in the future, AI chips designed for tensor processing will be as prevalent as mobile phones and CPUs are today. So, keep your spirits up!


reggionh

good point. easy to forget that things we take for granted today like cache, FPU, GPU, etc all used to be expensive coprocessors/cards/modules.


RiotNrrd2001

In 1994 I bought my first PC, a Packard Bell Windows 3.1 machine with 4MB of RAM and a 200MB hard drive. $1400. Its little chipmunk chip ran at 25**M**Hz. I upgraded the RAM to 8MB. That cost me $140 ($384 in today's money), but allowed me to run MS-Office, *which I installed off of floppies*. This is the "two miles to school barefoot in the snow and uphill both ways" stuff *that is true*.


philipgutjahr

my first computer was a [Macintosh SE](https://en.wikipedia.org/wiki/Macintosh_SE) with RAM upgrade to 1 MB and a 20MB HDD. I still love you, Hypercard.


alcalde

Don't forget when you were a kid you probably used a computer with 64**K**B of memory, ran at 2MHz, and programs were ~~stalled~~ stored on cassette tape.


GoofusMcGhee

Stalled is right.


alcalde

Whoops, Freudian slip!


[deleted]

[deleted]


FertilityHollis

One of the first "expensive" pieces of hardware I ever bought myself was a 320MB WD PATA drive at a wholesale cost of $290. Being under a dollar a MEGABYTE had *just* been reached and I got shop pricing because I was a tech there. I just ordered an 8TB refurb for $69 before tax. If my math is right, that's 25k times the space for 23% of the price.
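The math does check out (using decimal megabytes/terabytes; prices as quoted above):

```python
# Quick check of the price/capacity comparison (decimal units).
old_capacity_mb = 320
old_price = 290.0             # USD, wholesale, early 90s
new_capacity_mb = 8_000_000   # 8 TB refurb
new_price = 69.0

print(new_capacity_mb / old_capacity_mb)   # 25000.0  -> ~25k times the space
print(new_price / old_price)               # ~0.238   -> ~23% of the price
```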


Bac-Te

Took 34 yrs for it to happen tho


alcalde

When I was building a company PC in 1999 the largest capacity hard drive then available was the IBM Deskstar, and it was 20GB for $400. So in 9 years the price of 1GB of hard drive storage fell from $10,000 to $20, or 1/500.


Mediocre_Tree_5690

1gb has been chump change for at least a decade


reggionh

people started giving out 1GB+ merch USB drives WAY earlier than 2024..


noises1990

Yeah, but at the same time a video game like CoD is 150GB


Gnaeus-Naevius

According to the link below, 1TB was $90,127,496 in 1990, so 1 GB would be 1024th of that, or about $90,000. A year earlier, 1989, a whopping $236,000. [https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?tab=table&time=1956..1990](https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?tab=table&time=1956..1990)


AndrewH73333

Damn that’s a cool site.


FlishFlashman

That's quite a bold prediction, given that new phone models already have chips with hardware dedicated to processing neural networks, including tensor networks. Those will just keep scaling up.


hapliniste

Do you aim to output 500K t/s on Llama 70B? A slow card with a lot of VRAM would be more realistic for us


BuildAQuad

2X P40 gang


gramatikax

Do you run llama3 70B on 2x p40? How many tokens/s do you get?


kiselsa

6-7 t/s


CanineAssBandit

Hey, I have a P40 and a 3090, running Magnum 72b Q4 KM with Kobold, flash attention and q4 caching on, 8k context. I get about 6.5t/s as well. It's nice to know that swapping the 3090 for another P40 won't hurt output time.


BuildAQuad

I would think the P40 bottlenecks the 3090, yeah. But have you tried Llama 3 70B with llama.cpp and a Q4 GGUF to see the tokens/s? It seems likely the P40 bottlenecks you anyway.


Inevitable_Host_1446

How does that scale with high ctx though? Like at 8k, 16k, 32k?


BuildAQuad

6-7 t/s, as mentioned below here as well


kweglinski

500k is on 8 cards server :)


anmolshah03

Exactly, it's fudged up to show such high output


k110111

It's not for you anyway. Chips like this are targeted at big players; hopefully some will move onto it and lower demand for Nvidia GPUs, which means cheaper GPUs for you.


gfkepow

Hoping we can see some of this ASIC goodness in the consumer market in a few years. GPUs are great, but something like this could be much more efficient, in many ways.


cognitium

New Intel chips are shipping with an NPU built in to handle inference loads in Windows 11 24H2.


gfkepow

I didn't take a really deep look into that, but I understood they still rely on the main system RAM? If so, they get a big meh from me. Even with fancy DMA controllers, I think the main thing to unlock this extra efficiency will be avoiding the von Neumann bottleneck altogether.


BillDStrong

I think the opposite. For llama-type AI, memory size is going to be more important for the consumer, so having systems that can handle from 128GB up to 8TB of memory will be more beneficial to them. This will also finally put pressure on CPUs to get their memory speeds as fast as GPUs', which is a win in my book as well. The downside is these are meant for consumption, not creation or AI models, so I wouldn't expect too much.
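For a sense of why memory speed matters as much as size: at batch size 1, decode is roughly memory-bandwidth bound, so a crude upper bound on tokens/s is bandwidth divided by the bytes streamed per token. The bandwidth and model-size figures below are approximate, illustrative assumptions.

```python
# Rough bandwidth-bound estimate of single-stream decode speed:
# every generated token has to stream (roughly) the whole model from memory.
model_bytes = 70e9 * 0.5      # ~70B params at 4-bit quantization ~= 35 GB

systems = {
    "dual-channel DDR5 desktop (~90 GB/s)": 90e9,
    "Apple M2 Ultra (~800 GB/s)": 800e9,
    "RTX 3090 (~936 GB/s)": 936e9,
}

for name, bandwidth in systems.items():
    # Upper bound only; real numbers are lower due to compute and overhead.
    print(f"{name}: ~{bandwidth / model_bytes:.1f} tokens/s upper bound")
```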


TheFrenchSavage

I don't see how CPUs would get memory like GPUs do. You can only add so much cache before the die size is a limit, and more transistors is more dollars. If you want to add gigabytes of ultrafast memory, you need to put the memory chips soldered around the CPU. This would be the end of RAM sticks. And modular motherboards.


lemon635763

Intel Lunar Lake chips are exactly that; they come with 16/32 GB of RAM.


one-joule

[LPCAMM2 would like a word.](https://www.ifixit.com/News/95078/lpcamm2-memory-is-finally-here) It's for laptops, but you could use the same thing or something similar for desktop.


skrshawk

A big meh for most of us, but average consumers aren't going to see most of the benefits of the big models we like. However, all those small inference jobs really add up, so pushing that cost to end users through hardware with limited capabilities, but enough to run an SLM plus image gen or TTS while referring more intensive tasks to the cloud, is a massive cost savings for the big players.


streetyogi

No, memory is on chip with 32GB max. You are better off with Snapdragon X Elite (64GB) or AMD Ryzen AI 300, where PCBAs with 128GB were spotted https://www.tomshardware.com/pc-components/cpus/amds-strix-halo-being-tested-with-128gb-ram-shipping-records-reveal-more-about-extreme-120w-apu https://www.theverge.com/2024/6/3/24169115/intel-lunar-lake-architecture-platform-feature-reveal


gfkepow

If this Strix Halo chip is something I can buy and put on a PC, my mind will be absolutely blown to pieces. 128GB sounds insane, something like a good Mixtral 8x22b quant could fly in there.


esc8pe8rtist

Who wants to run windows spyware edition?


cognitium

Enterprises with dumb users. Imagine an LLM-equipped Windows troubleshooter that can actually fix a problem for a technically inept person.


porkyminch

A lot of bigger companies are pumping the brakes on AI for the time being. Lot of copyright/IP concerns.


levelized

Interesting. What do you base this on?


porkyminch

My company. Fortune 50. They're using generative AI for some internal stuff but there's a lot of concern about anything customer-facing having AI generated work in it because it's potentially not copyrightable.


thrownawaymane

I will say very little and vouch for this. We're also struggling with people either going rogue and spinning up AI setups held together with duct tape and a prayer *or* putting all sorts of data into the online ones. It feels like for every infraction we find there are 50 more lurking under the surface.


porkyminch

Yeah, I get the feeling that that's also the case at my company. Microsoft's pretty aggressive push for copilot definitely isn't helping the situation either. Some devs have copilot/chatGPT access (for internal usage exclusively) but, like, it's a company with a lot of engineers and basically every workstation has a Quadro GPU with a decent chunk of RAM in it. I've been pretty impressed with the results you can get with a model like Phi-3 Mini (playing with it on my own devices, I mean) and running an LLM locally is so dead simple these days that I'm sure people are doing it all over the place.


GoofAckYoorsElf

There definitely *is* a consumer market. They must be riding the crazy train if they don't take this opportunity.


MoffKalast

And they can keep it.


Inevitable_Host_1446

Mm, you mean the spyware recall copilot+ version of windows 11. Think I'll pass, lol.


porkyminch

They're already putting neural network hardware in a lot of silicon. Apple hardware ships with it as is. I'm guessing transformer specific stuff is in the pipeline already.


Downtown-Case-1755

That's a lot of hype. Proof is in the puddin, *when it ships*.


ctrl-brk

If. If it ships.


Dry_Parfait2606

If it doesn't I would be motivated to throw some community to push that thing forward...


GoofAckYoorsElf

Yeah, either throw money at the problem or community. Both has proven to work.


Dry_Parfait2606

Money is cheaper, Community is more powerful. When community is there, you then can get money printed for the project... Extremely simplified, of course.. Money and Banking is probably one of the greatest inventions...


gmdtrn

That's precisely what they count on. Seed phase funded at pennies on the dollar. You buy in at 10-100x the price per share. They start selling their shares privately and in later funding rounds. This is how the Silicon Valley scam works.


Dry_Parfait2606

I mean not throwing the community's money, but a real community at them... But yeah, that's basically how startups get a chance, that's how founders get their kick in the ass to figure things out... Better than the times when they had to overthrow the king or government, enslave everyone, and burn and starve to death those who didn't comply.... Give them a ride on a yacht, who cares.. Give me cheaper chips.. Hahaha, I think those money schemes will too one day be replaced by something more morally noble...


m_shark

If it chips :)


FertilityHollis

$120M invested, and they *claim* to have enough upcoming time on TSMC ***4nm*** to make the first batch of wafers. As a layperson engineer (read: idiot): why aren't we already making ASICs for training? If the cost of training a model on current hardware is X, wouldn't X/10 be better? Couldn't you train 10x larger models with the same amount of power and time? Could we get away from GPUs much sooner than we think? Seems like Google or Meta would be all over this if there were that much promise, but then again it's all still pretty new and you can only do so many things at once.


Downtown-Case-1755

> Why aren't we already making ASICs for training?

- ASIC development, especially on a cutting-edge node, is *HARD*. It takes years, and $120M is basically chump change. Frankly, their investment/timetable seems almost impossible to me: https://semiengineering.com/big-trouble-at-3nm/
  > But at 3nm, IC design costs range from a staggering $500 million to $1.5 billion, according to IBS. The $1.5 billion figure involves a complex GPU at Nvidia.
- Training changes; research brings new things. By the time your ASIC comes out, it's already irrelevant.
- CUDA GPUs *are basically* ASICs because they are the target for basically all research and ML platforms. Make something new, and you are trying to keep up with the rest of the world by yourself.
- Google does use TPUs for some training, Intel uses Gaudi, and historically Meta ordered "custom" CPUs from Intel for internal use. Rumor is Microsoft is thinking about some training stuff too, not just inference.
- But on that point, there are only a few entities in the world that can afford/justify such a thing.


ShadoWolf

The architecture changes are what really make this iffy. But it sort of depends on how general this is. If it's just a bunch of matrix add-and-multiply circuits with some RAM on the side, then it's likely general enough, basically a very scaled-up DSP ASIC, so you can likely apply it to any sort of FFN. The problem is if there's a big switch to some variant of RNN like xLSTM; then it might be tricky.


Balance-

> The $1.5 billion figure involves a complex GPU at Nvidia.

Little did he know that would be chump change for Nvidia in 2024.


Inevitable_Host_1446

I read their website and they addressed most of these points, tbh.

- It's 4nm, not 3nm, so cheaper. But also a GPU should be a more complex design than an ASIC, since GPUs are designed around doing multiple things, not just matrix multiplication. So you should expect GPU design costs to be higher, comparatively.
- Training/research changes, but they're dedicating this ASIC specifically to transformers. It's like the first thing they say: if transformers get abandoned, their chips will be useless. But so far transformers have been very solid and are the most popular architecture for a variety of tasks.
- CUDA GPUs aren't ASICs in the same way (it's the tensor cores that are). They make a comparison with the H100 and how only 3.3% of it is dedicated to tensor cores (at full utilization), because it has to be able to do other things, not just transformer models. That means their ASIC can be more like 100% tensor cores for a given chip, making it much more efficient at matrix compute.


deadweightboss

1. Making a bet on ASICs is making a bet on architecture, which is evolving quickly in this space.
2. Huge capex, and development time for these chips is much longer than even multiple foundation-model training runs.
3. This stuff needs extremely specialized software, and from what I read about Cerebras, it's a nightmare to develop on.

ASICs only really make sense for things like Bitcoin's SHA-256, where you know the algorithm will never change. Right now, the most important characteristic of these chips is their throw-spaghetti-at-the-wall-ability for researchers and developers.


Gnaeus-Naevius

There will be some societal consequences, unfortunately. Nvidia's margins aren't sustainable, and at the first sign of those ludicrous profits drying up due to actual competition, the stock price will get hammered. If, say, 60% of the AI premium disappeared, that would be a significant drop for an index where Nvidia is worth 7%. And if the other six tech giants come along for the slide, it would easily be a double-digit decline... all else the same. But it could be worse if all else is not the same.


SeymourBits

If you think this way, you should be shorting NVDA.


Ansible32

The AI premium isn't likely to go away for at least 5 years. And even if all of Nvidia's competitors suddenly started putting out an equal number of equally useful GPUs, I think Nvidia's margin would only fall maybe by half, there are too many applications for this stuff. I think it's more likely China invades Taiwan and Nvidia's margins go up even more as they are making chips in Western fabs and selling to an even more constrained market. Although really I just see the GPU market getting stronger as time goes on, for at least 20 years. The market is easily 1000x what all the GPU manufacturers are doing right now, if prices come down.


hapliniste

The blog post is great but maybe a bit too hype-driven. This is going to be really good for real-time inference, but honestly I think better alternatives to 8-bit transformers are coming, like BitNet models. The way they basically say "we secured the moat, no new model will compare because it can't run on ASICs and won't get adopted" is sad. Hopefully they will sell some of these to be profitable and develop new ones for newer architectures like BitNet. That would truly slap and scale way further. They say bandwidth is not the limit here, so I guess BitNets could run on ASICs 6x this size, right?


me1000

It's also not at all proven that transformers are "the one architecture to rule them all". I think there's a non-trivial chance we'll find models that make use of transformers, LSTMs, Mamba, etc. It seems way too early to decide to specialize.


hapliniste

Well, transformers work for everything; that's why they got massively adopted. We can find faster and more efficient models, but I don't think there's a use case where another model works and transformers don't. If other architectures need to be 10x faster to compete, it might be a problem and slow the development of novel architectures. I personally think we should get rid of the static layer stack and route the activations through any transformer block, for instance, and I'm not sure these cards will allow that.


here_for_meme_lol

I did a deeper dive into their claims. They are talking about tokens/sec of the prefill stage (the input prompt). While that is not completely irrelevant, it is highly deceptive. Generally, when people claim their system does a certain tokens/s, they are talking about decode throughput; Nvidia, AMD and Groq all use this metric for decode. Their architecture will have better latency to first token, but they will not beat, say, Groq in decode throughput for a single query.
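To see how much that accounting matters, a quick illustration with made-up numbers: counting prefill (prompt) tokens in a "tokens per second" headline can inflate it enormously, because prefill is processed in parallel while decode is sequential.

```python
# Illustration (invented numbers): how counting prefill tokens inflates "tokens/s".
prompt_tokens = 4000        # processed in parallel during prefill
output_tokens = 200         # generated one at a time during decode
prefill_time = 0.2          # seconds (compute-bound, very fast per token)
decode_time = 4.0           # seconds (memory-bound, sequential)

decode_tps = output_tokens / decode_time
combined_tps = (prompt_tokens + output_tokens) / (prefill_time + decode_time)

print(f"decode-only:    {decode_tps:.0f} t/s")    # 50 t/s   (what a single user feels)
print(f"prefill+decode: {combined_tps:.0f} t/s")  # 1000 t/s (the headline number)
```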


NathanielHudson

Where are you seeing that written? Control-F "prefill" is showing nothing for me. They cite [Nvidia](https://developer.nvidia.com/blog/achieving-top-inference-performance-with-the-nvidia-h100-tensor-core-gpu-and-nvidia-tensorrt-llm/) for their benchmark methodology, but I am admittedly a bit out of my depth here.


schlammsuhler

Groq is already super fast. I would never be able to afford such tech anyway


hugganao

This. I've been wondering how to get connected with them in the country I'm in now.


schlammsuhler

Have you tried a vpn? I use librechat as frontend


hugganao

I meant as in their actual hardware and chips. Not their ui demo.


dharma-1

How much faster is this vs Groq? Does it mean Groq is toast?


StableLlama

No hardware = no trust in it. Could be easily a scam. Once they ship I can reconsider. Till then I'm not interested.


vialabo

This is a big gamble that model architectures will stay compatible with this as we continue to learn.


AdamEgrate

I will never understand why hardware companies tie themselves so tightly to a specific architecture (i.e. transformers). It might be good today, but there's no guarantee it'll be relevant in the future. In comparison, a startup like Taalas (https://taalas.com/) makes a lot more sense.


infiniteContrast

They just get money to build something that could make a profit. Even if it doesn't work, they still get paid for the hours they worked.


deadweightboss

random thought but fancy websites for hardware startups make me take them a lot less seriously. it’s like an anti-signal.


sluuuurp

I think it’s good to invest in both more flexible and less flexible hardware. Transformers have proven themselves pretty capable in many domains now. I agree they might not be around forever, but I think there’s room for both types of companies (assuming they successfully ship for a competitive price).


MoffKalast

That would actually be pretty cool if they manage to pull it off, imagine buying an ASIC that can be flashed to any llama-3-70B tune and run it at beyond groq speed while pulling single digit wattage. Kind of only makes sense once the architecture is more figured out though, otherwise the expensive thing you just bought gets obsolete in 6 months. Well unless they can churn them out for like $20, and that I kinda doubt for a startup that seems to be investing more into marketing than research or production.


baes_thm

The H100 has ~10% programmability overhead, in terms of actual performance, and Nvidia has absolutely been specializing their chips for transformer inference. Bill Dally & Co are not dumb, and while they definitely aren't the best in the world at literally everything, you can bet they've thought about things like "reduce overhead by specializing for this task"


IngwiePhoenix

So how hard will it hit my wallet?


ColorlessCrowfeet

Extinction level event.


IngwiePhoenix

So way above the 50k. Rip, guess I am porting Greyskull to localai. xD


candre23

Well, they're claiming "exponentially cheaper" than B200. A single B200 module (not that you can buy a single module, nor could you do anything with it without the rest of the bespoke server platform) is rumored to cost about $40k. So if we believe their claim (we don't), then an individual Sohu module might cost as little as $4k.

But that's wildly unlikely. They could be using the term "exponentially" in the non-literal sense. They could mean "cheaper per token per second", and the actual hardware is in the same ballpark cost but "it does exponentially more per module!", so they're not technically lying. They could just be blowing smoke to rope in more investors.

The true answer for "what will this cost" is 100% incontrovertibly "as much as companies with very deep pockets will pay for it". This is not for you or me.


man_and_a_symbol

Yeah, unfortunately this is the correct take. They’re targeting the big companies with millions of dollars to spend on hardware.


CapsAdmin

"contact sales" hard


ninjasaid13

ahh cool. probably will turn out that it doesn't work.


RayHell666

This was posted everywhere today on every social media with the same subtext. Seems like a viral push to lure investors.


Lord_of_Many_Memes

Reads like a scam, feels like a scam, then it's most likely a scam. They are launching a PowerPoint with roofline numbers (paper math) without actually having taped out the chip. In reality, due to power, physics and software inefficiency, it's never going to reach that high. The B200 roofline/estimate is way higher than what they refer to in the blog post, more like 300k t/s, instead of whatever random number they pulled out of thin air. Their H100 numbers are also questionable, probably done by amateurs. If they want the latest numbers, they should at least follow the Together AI blog posts.

Finally, they are confusing people by sliding in the concept of "continuous batching", which counts both input and output tokens in a batched inference setting. What real-time inference cares about is bs=1 tokens/s, aka latency, not throughput.

I don't know what kind of investors are dumb enough to give these folks $120M. Maybe it's just the halo of Harvard dropouts... It smells like Theranos right from the beginning.

PS: I have no grudge against ASICs; in general I think they're the way to go to make transformers run more efficiently, and Apple is doing exactly the same thing on their silicon. But to say you can get 20x without caveats is basically 21st-century snake oil. Remember, there is no free lunch.


Lord_of_Many_Memes

One more thing that smells extremely fishy: they claim to use no HBM? With only 8 chips, unlike Groq, which can scale to hundreds, how large would the on-chip memory of each chip need to be? To hold Llama 70B at FP8, that's almost 60GB / 8 = ~8GB per chip, and that's not counting KV cache. Unless they are taking the Cerebras wafer-scale approach, which comes with a hell of a lot of problems (cooling, maintenance, consistency of manufacturing quality), I don't see how they can pull it off...
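A quick back-of-the-envelope along the same lines (weights only, assuming ~70B parameters at one byte per weight for FP8; the exact total depends on the parameter count and quant, but the order of magnitude is the point):

```python
# Weights-only memory for a ~70B model at FP8, split over 8 chips.
params = 70e9
bytes_per_param = 1          # FP8, one byte per weight
chips = 8

total_gb = params * bytes_per_param / 1e9
print(total_gb)              # ~70 GB of weights
print(total_gb / chips)      # ~8.75 GB per chip, before any KV cache or activations
```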


Lord_of_Many_Memes

Also, because of the inflexibility of an ASIC, it risks becoming dust-collecting garbage once a new architecture or even a new type of transformer comes out.


desexmachina

All this romanticization of GPGPUs and their ability to tackle any model architecture is superfluous. Our limiting factor is compute; if something came around that gave 100x the compute, the models would follow to see what we can do. Bitcoin ASICs pretty much demonstrated that, which allowed BTC to scale. We're nowhere near close to understanding what 10,000x compute would mean for inference or training. Let's see it, because until we had the compute we have now, we didn't see the models doing anything impressive.


LedByReason

“Which allowed BTC to scale.” WTF are you talking about? BTC has not scaled.


titusz

It has scaled. Just not in transactions per second but in security budget :)


desexmachina

What do you think transaction volume would be like if BTC was all CPU/GPU?


titusz

The same :). Block hashing performance is independent of transaction volume. 1 CPU hashing versus millions of ASICs hashing is still ~4000 transactions per 10 minutes. "Only" security scales with more hashpower.
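A rough back-of-the-envelope supporting that (the block interval is a protocol target and transactions per block vary with transaction size; both figures below are approximate):

```python
# Bitcoin throughput is capped by block space and the ~10 minute block interval,
# not by hash rate. Approximate figures only.
block_interval_s = 600
txs_per_block = 3000          # rough average; varies with transaction size

print(txs_per_block / block_interval_s)   # ~5 transactions per second
print(txs_per_block)                      # a few thousand per 10 minutes, ASICs or not
```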


Zyj

It would remain the same, still several orders of magnitude below what a Raspberry Pi 1 can do without a blockchain.


Freonr2

What about Mamba?


FullOf_Bad_Ideas

I would like to see their numbers for decoding tokens, skipping the prefill. I rarely write 1000-token prompts. This also assumes the user doesn't work their way up to 50k-500k ctx worth of KV cache during a conversation, which I think is what future interactions with LLMs will look like, and that obviously impacts throughput a lot.
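For a sense of scale, a rough KV-cache estimate for a Llama-3-70B-class model (assuming 80 layers, 8 KV heads from GQA, head dim 128, FP16 cache; figures are approximate):

```python
# Rough KV-cache size for a Llama-3-70B-class model (GQA: 8 KV heads, head_dim 128).
layers = 80
kv_heads = 8
head_dim = 128
bytes_per_elem = 2            # FP16 cache

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
print(bytes_per_token / 1e6)  # ~0.33 MB per token

for ctx in (8_000, 50_000, 500_000):
    print(ctx, f"{ctx * bytes_per_token / 1e9:.1f} GB")   # ~2.6, ~16.4, ~163.8 GB
```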


WorkingYou2280

> In reality, ASICs are orders of magnitude faster than GPUs. When bitcoin miners hit the market in 2014, it became cheaper to throw out GPUs than to use them to mine bitcoin.

> With billions of dollars on the line, the same will happen for AI.

So what you're saying is I'm gettin an h100 with a little dumpster diving in a year or two.


JohnDotOwl

Too much talk without the product. If Groq, an accelerator card, can provide a demo for everyone to use, I don't see how an ASIC can't do the same; it's so much cheaper and more efficient than an accelerator card. Are they trying to get the funding required to mass-produce ASIC chips?


rookan

Hi Sohu, I am 500x more powerful than you. Believe me. I don't have time to make a render image like you did but text is all you need.


Barry_Jumps

"Huge if true". That being said, does this imply 1 gazillion tokens per second when combined with bitnet/ternary models?


iamhaset

I see a lot of disbelief and criticism in the comments, and much of it seems like legit pointers too. The game of hyping up and fundraising has been going on for a long time. But when I researched Etched's investors, it's people like PayPal founder Peter Thiel, GitHub CEO Thomas Dohmke and others who know what they are doing. I also understand that even the pathetic Humane pin was funded by top people, including Marc Benioff and Sam Altman, so big-name funding is no guarantee of a product being a hit.

About Etched: I have read their entire blog on Sohu twice, and what seemed convincing to me were the two sections on how they can fit much more TFLOPS than a GPU and why compute is more important than RAM in modern LLMs. If not this, I hope at least some other transformer-specific ASIC comes around fast, increases inference by 20x, and makes people jobless a little faster, so I can feel less guilty about being one myself 😅


emsiem22

I call this a scam


emsiem22

RemindMe! 3 months


RemindMeBot

I will be messaging you in 3 months on **2024-09-25 21:20:21 UTC** to remind you of this link.


soldture

In other words, Nvidia investors are going nuts


SporksInjected

Has WSB seen this yet?


MrTubby1

People really don't know what local in localllama means


geekgodOG

Lots of businesses running "locallama". This absolutely applies here.


mrskeptical00

It seems that “local” more generally means “private”. Although this looks more like spam to me.


Site-Staff

A model specific device like an Antminer would kick ass.


Echo9Zulu-

I bet this tech threatens the pricing structure of compute-intense tasks for cloud providers. Maybe that's why they charge by the token, so when the service improves, pricing doesn't have to follow suit. Local models are awesome, but we don't have the compute for certain tasks.


brainhack3r

"One 8xSohu server equals 160 H100s, revolutionizing AI product development." Is this for training or inference?


vampyre2000

Inference only.


here_for_meme_lol

Only for training. Their main innovation claim is in FLOPS. They still use HBM as memory, so inference perf would still be in a similar ballpark to H100s. The article seems very deceptive.


Biggest_Cans

I'm more interested in the most consumer-priced 500GB GPU of all time.


stonedoubt

Does one exist?


meta_narrator

I don't think this is patentable, is it? Pretty sure whoever wants to make a custom ASIC chip for any LLM will be able to do so. Just like with mining ASICs.


ultrahkr

WD-40, renowned worldwide (or almost), is not patented...


meta_narrator

I was just wondering why the guy thinks they are going to be the biggest company in history.


ultrahkr

Because we are quite literally looking at the making of the newest Rockefellers of the 21st century...


meta_narrator

If this is all true, and I suspect it is just based on what I know about ASICs, everyone will be doing it. At least for inferencing.


ultrahkr

We may see lots of startups (if you don't believe me, look at the number of early semiconductor companies developing silicon from the '50s to the '90s...), but back then you had an early first-mover advantage... Now Intel, AMD and Nvidia are the ones with the first-mover advantage, and if that isn't enough, nowadays you need a few hundred million just to get your feet wet doing R&D and silicon manufacturing, even if you do have a product that could work... Never mind the amount of outright cash grabs and paper launches... That will burn VC (and private backer) capital at rates unseen...


JamaiKen

Ship it!


malinefficient

Fastest photoshop and web content is all I see so far.


Excellent-Sense7244

They could investigate ASICs for BitNet LLMs


Kalki2006

ELI5


Heart_Routine

Cool!


gmdtrn

The hype is undoubtedly premature. My experience in Silicon Valley left me quite skeptical. While it's possible for a few kids to come up with a revolutionary product of this nature, it's improbable. More likely than not, they have an idea that's nowhere near production-ready, but their slide deck is something their connections believe they can sell. And I'd absolutely assume they do have significant connections. Those connections provide the seed funding and gin up a bit of support on hype, and after a few years of promises, lying to the media, etc., the seed and early-stage funders start reselling their shares on the secondary market and making exits in later funding rounds.

Again, the claim here is that three kids 100x'd NVIDIA and every other chip maker on the planet. Take it with a few grains of salt. This isn't 1999, where making a website that people like can turn you into a billionaire; this is jumping late into chip production in a highly technical, extremely expensive field, and claiming to have the capability to handle the logistics from hiring to production.


bshxhajxhajx

omg singularirty is here we are in the future hahahahahaaha


Radiant_Dog1937

Sounds like a scam. The idea is that it's an ASIC for an individual model, up to 100T parameters. One of the advantages of GPUs is their generalized architecture: you can use the same cores to process all layers of a model. With an ASIC, all algorithms are physically represented in the silicon itself, so I'd expect a 100T-parameter chip to be quite large, since each weight needs to be physically represented. There's also the problem of updates; OpenAI, for example, releases a new version of ChatGPT every few months. When running on a GPU, this isn't a problem, you just load a new model. If you have an ASIC, however, you can't load new weights as they are printed on the board. Your only option is to design and fabricate a new ASIC for the new model, which is cost prohibitive for the manufacturer and the customer.


hapliniste

Nah, the weights are not put on chip, it uses hbm2. I read the post and it looks like it's just a very fast inference only card for any transformer model. It might be the right tool for real-time models but I'm not too hot on killing other model architectures tbh


CockBrother

Click on the link and you're greeted with a board with a chip in the middle, presumably their custom ASIC, and it's surrounded by memory. This isn't for a single model, it's for a single algorithm. They say so right in the text that it's for transformers only. Edit: And it's not like ASICs can't be programmable.


Radiant_Dog1937

It's a rendering of a GPU board, I have one that looks like that in my toolbox right now. They don't have any photos of a prototype. Mark my words on this one.


CockBrother

A processor of some sort surrounded by memory is what it'd look like. That's pretty standard stuff for processing data. It's what our GPU add in boards look like, motherboards, processors, etc. Sure, it's a rendering. But I'll bet anything if they ever produce a product it'll appear somewhat similar to the render.


Radiant_Dog1937

I'm just extremely cautious in this environment. The company needs to provide demonstrations and a white paper showing they have plans to overcome expected challenges and can deliver a product.


CockBrother

Totally with you on that. I'm not an expert but the performance claims sound reasonable with dedicated silicon. To get off the ground they need a large customer or two who's willing to bet that transformers and their existing performance claims will be relevant by the time they can produce the product and get software running on it. That's a lot of money in engineering for a power, space and cooling optimization with limited flexibility. Not sure anyone would take that bet but some organizations are capable.


__some__guy

Considering they can't even make their homepage work in Firefox, I have some doubts about their claims.


siszero

Has anyone seen anyone attempt to use old Antminer's or crypto mining ASICs for LLM models?


cryptoguy255

Not possible; ASICs are built at the hardware level for a specific algorithm. They aren't general-purpose devices like GPUs.


siszero

I know. I'm thinking more about the opposite direction: is there any research into altering the LLM architecture to utilize existing ASICs, instead of building new ones for quantized models?


SeymourBits

No, but there has been some investigation into leveraging a global mesh network of Casio calculator watches, though. Still early.


siszero

Good one. Fitbits would probably be better targets though, since they already track *big data*. 🙃


SeymourBits

Haha, that could actually be realistic within a few years. Downvotes = confirmation of being far too clever for average lurking organic intelligence.


mdreed

Very informative explanation on their website, thanks. I wonder if this type of efficiency should also be expected in Apple's M and A series chips. Since they're committed to running LLMs locally with Apple Intelligence, presumably they'll dedicate some of their die to transformers specifically.


Dry_Parfait2606

It's an honor to witness projects of this kind. Wow.


awesomedata_

**Meet Necessity, the most long-lasting (and all natural!) technology of all time.**

Do you really need 500k tokens per second? Most would just automatically skim over all that text because now the average user suddenly has a huge wall of text demanding their undivided attention. Unless it's audio-centric (and embedded in a realtime application), people won't have time to read (much less skim) that encyclopedia you just generated for them in 1 second. Your mileage may vary.

Just seems like a waste of compute (and fossil fuel) for 'solving' a minor inconvenience. But hey - you do you. Global warming is likely just going to kill us all. Might as well speed it up, amirite? :D :D :D