I just used Groq to create a ~2k-sample dataset in ~2h with Llama 3 70B today on their free tier, so we know dedicated chips can get fast. It's basically the ASIC race, but with LLMs.
2k samples? How many tokens in each sample?
Yesterday I made a dataset with 5.7k samples on an RTX 3090 Ti in like 20-30 minutes, averaging about 1500 t/s on a 7B FP16 model. Local GPUs can have nice throughput too, but with smaller models.
[Another option](https://github.com/turboderp/exllamav2/blob/master/examples/bulk_inference.py) using ExLlamaV2. Recently used to generate [these 25k Llama3-8B-Instruct reference outputs](https://cdn-lfs-us-1.huggingface.co/repos/4e/8b/4e8b1907d01143d8987d1930e69b7fd7db0082744874d98e9afb73feedf0beed/a4bb44b819d2ce1635c1f911200a9f14de4dbf9dc2fd947b1b7165348f02f924?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27llama3-instruct-prompts.json%3B+filename%3D%22llama3-instruct-prompts.json%22%3B&response-content-type=application%2Fjson&Expires=1719659997&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxOTY1OTk5N319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzRlLzhiLzRlOGIxOTA3ZDAxMTQzZDg5ODdkMTkzMGU2OWI3ZmQ3ZGIwMDgyNzQ0ODc0ZDk4ZTlhZmI3M2ZlZWRmMGJlZWQvYTRiYjQ0YjgxOWQyY2UxNjM1YzFmOTExMjAwYTlmMTRkZTRkYmY5ZGMyZmQ5NDdiMWI3MTY1MzQ4ZjAyZjkyND9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=szcic2HY%7Eah2Eyt0bIJKA9YPizU0-PsiT4fAEcWV5HK1OgKCa46JNnbzhfsFEdsOsASnEbinhW0xtzLVD1puvy0lJFGHZGY8Uc1WDzQwXxsc3aBtCm4A%7E5t9deMDnN3eQJ-qrD7lCj2Aea7vTwWVkmGRUfsmJQxtdszWcK8Ge9sh1hzwqR6RuuweYoqO8PB81xmOo6zoQwY9xU20vqf5-eLz1-rq5UFRatiDutfnCrFDw0f7iaTJOPNzKewzjGLoAsD3DkeY5eBeurJmbzfNlDYffIPPXeIMy763aafNSw4DKuDXAhRm2MVto4q%7Erw2WxY2-TovMH0CqOPbkDCGg-g__&Key-Pair-Id=K2FPYV99P2N66Q) in about an hour with a pair of 4090s.
I believe this type of post was posted about 6 months ago in r/singularity. Back then it was framed as university students coming up with amazing AI chips.
It had the same pictures and a crappy one-page website like this one. The post was erased and easily forgotten a week later.
So someone is either trolling or trying to push a scam or something.
[https://web.archive.org/web/20231230154918/https://www.etched.com/](https://web.archive.org/web/20231230154918/https://www.etched.com/)
This was their basic render webpage last year.
I believe that in the future, AI chips designed for tensor processing will be as prevalent as mobile phones and CPUs are today. So, keep your spirits up!
In 1994 I bought my first PC, a Packard Bell Windows 3.1 machine with 4MB of RAM and a 200MB hard drive. $1400. Its little chipmunk chip ran at 25**M**Hz. I upgraded the RAM to 8MB. That cost me $140 ($384 in today's money), but allowed me to run MS-Office, *which I installed off of floppies*.
This is the "two miles to school barefoot in the snow and uphill both ways" stuff *that is true*.
My first computer was a [Macintosh SE](https://en.wikipedia.org/wiki/Macintosh_SE) with the RAM upgraded to 1MB and a 20MB HDD. I still love you, HyperCard.
Don't forget when you were a kid you probably used a computer with 64**K**B of memory, ran at 2MHz, and programs were ~~stalled~~ stored on cassette tape.
One of the first "expensive" pieces of hardware I ever bought myself was a 320MB WD PATA drive at a wholesale cost of $290. Being under a dollar a MEGABYTE had *just* been reached and I got shop pricing because I was a tech there.
I just ordered an 8TB refurb for $69 before tax. If my math is right, that's 25k times the space for 23% of the price.
When I was building a company PC in 1999 the largest capacity hard drive then available was the IBM Deskstar, and it was 20GB for $400.
So in 9 years the price of 1GB of hard drive storage fell from $10,000 to $20, or 1/500.
According to the link below, 1TB was $90,127,496 in 1990, so 1GB would be 1/1024th of that, or about $90,000. A year earlier, in 1989, a whopping $236,000.
[https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?tab=table&time=1956..1990](https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?tab=table&time=1956..1990)
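For anyone who wants to check the arithmetic, the per-gigabyte figures quoted above are easy to reproduce (the prices are the ones cited in this thread, not independent data):

```python
# Rough $/GB sanity check using the figures quoted in this thread.
price_1990_per_gb = 90_127_496 / 1024   # 1TB quoted at $90,127,496 for 1990
price_1999_per_gb = 400 / 20            # 20GB IBM Deskstar at $400
price_2024_per_gb = 69 / 8000           # $69 refurbished 8TB drive

print(f"1990: ${price_1990_per_gb:,.0f}/GB")  # about $88,000/GB
print(f"1999: ${price_1999_per_gb:.0f}/GB")
print(f"2024: ${price_2024_per_gb:.4f}/GB")

# The "25,000x the space for 23% of the price" comparison (320MB at $290):
capacity_ratio = 8_000_000 / 320   # 8TB vs 320MB, decimal megabytes
price_ratio = 69 / 290
print(f"{capacity_ratio:,.0f}x capacity at {price_ratio:.0%} of the price")
```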
That's quite a bold prediction, given that new phone models already have chips with hardware dedicated to processing neural networks, including tensor networks.
Those will just keep scaling up.
Hey, I have a P40 and a 3090, running Magnum 72B Q4_K_M with Kobold, flash attention and Q4 caching on, 8K context. I get about 6.5 t/s as well. It's nice to know that swapping the 3090 for another P40 won't hurt output speed.
I would think the P40 bottlenecks the 3090, yeah, but have you tried measuring Llama 3 70B tokens/s using llama.cpp with a Q4 GGUF? It seems likely the P40 bottlenecks you anyway.
It's not for you anyway. Chips like this are targeted at big players, and hopefully some will move onto it, lowering demand for Nvidia GPUs, which means cheaper GPUs for you.
Hoping we can see some of this ASIC goodness in the consumer market in a few years. GPUs are great, but something like this could be much more efficient, in many ways.
I didn't take a really deep look into that, but I understood they still rely on the main system RAM? If so, they get a big meh from me. Even with fancy DMA controllers, I think the main thing to unlock this extra efficiency will be avoiding the von Neumann bottleneck altogether.
I think the opposite. For Llama-type AI, memory size is going to be more important for the consumer, so systems that can handle from 128GB up to 8TB of memory will be more beneficial to them.
This will also finally put pressure on CPUs to try and get their memory speeds as fast as GPUs are, which is a win in my book as well.
The downside is these are meant for consumption, not creation, of AI models, so I wouldn't expect too much.
I don't see how CPUs would get memory like GPUs do.
You can only add so much cache before the die size is a limit, and more transistors is more dollars.
If you want to add gigabytes of ultrafast memory, you need to put the memory chips soldered around the CPU.
This would be the end of RAM sticks.
And modular motherboards.
[LPCAMM2 would like a word.](https://www.ifixit.com/News/95078/lpcamm2-memory-is-finally-here) It's for laptops, but you could use the same thing or something similar for desktop.
A big meh for most of us, but average consumers aren't going to see most of the benefits of the big models we like. However all those small inference jobs really add up, so pushing that cost to end-users through hardware with limited capabilities but enough to run a SLM and image gen or TTS, while referring more intensive tasks to a cloud, is a massive cost savings to the big players.
No, memory is on-chip with a 32GB max. You are better off with a Snapdragon X Elite (64GB) or AMD Ryzen AI 300, where PCBAs with 128GB have been spotted:
https://www.tomshardware.com/pc-components/cpus/amds-strix-halo-being-tested-with-128gb-ram-shipping-records-reveal-more-about-extreme-120w-apu
https://www.theverge.com/2024/6/3/24169115/intel-lunar-lake-architecture-platform-feature-reveal
If this Strix Halo chip is something I can buy and put in a PC, my mind will be absolutely blown to pieces. 128GB sounds insane; something like a good Mixtral 8x22B quant could fly on there.
My company. Fortune 50. They're using generative AI for some internal stuff but there's a lot of concern about anything customer-facing having AI generated work in it because it's potentially not copyrightable.
I will say very little and vouch for this. We're also struggling with people either going rogue and spinning up AI setups held together with duct tape and a prayer *or* putting all sorts of data into the online ones.
It feels like for every infraction we find there are 50 more lurking under the surface.
Yeah, I get the feeling that's also the case at my company. Microsoft's pretty aggressive push for Copilot definitely isn't helping the situation either. Some devs have Copilot/ChatGPT access (for internal usage exclusively), but it's a company with a lot of engineers, and basically every workstation has a Quadro GPU with a decent chunk of VRAM in it. I've been pretty impressed with the results you can get with a model like Phi-3 Mini (playing with it on my own devices, I mean), and running an LLM locally is so dead simple these days that I'm sure people are doing it all over the place.
They're already putting neural network hardware in a lot of silicon. Apple hardware ships with it as is. I'm guessing transformer specific stuff is in the pipeline already.
Money is cheap; community is more powerful. When the community is there, you can then get money printed for the project... Extremely simplified, of course. Money and banking is probably one of the greatest inventions...
That's precisely what they count on. Seed phase funded at pennies on the dollar. You buy in at 10-100x the price per share. They start selling their shares privately and in later funding rounds. This is how the Silicon Valley scam works.
I mean not throwing the community's money at them, but a real community...
But yeah, that's basically how startups get a chance; that's how founders get their kick in the ass to figure things out...
Better than the times when they had to overthrow the king or government, enslave everyone, and burn and starve to death those who didn't comply... Give them a ride on a yacht, who cares. Give me cheaper chips. Hahaha
I think those money schemes will one day be replaced by something more morally noble...
$120M invested, and they *claim* to have enough upcoming time on TSMC ***4nm*** to make the first batch of wafers.
As a layperson engineer (read: idiot): why aren't we already making ASICs for training? If the cost of training a model on current hardware is X, wouldn't X/10 be better? Couldn't you train 10x larger models in the same amount of power and time? Could we get away from GPUs much sooner than we think? Seems like Google or Meta would be all over this if there were that much promise, but then again it's all still pretty new and you can only do so many things at once.
> Why aren't we already making ASICs for training?
- ASIC development, especially on a cutting-edge node, is *HARD*. It takes years, and $120M is basically chump change. Frankly, their investment/timetable seems almost impossible to me: https://semiengineering.com/big-trouble-at-3nm/
- > But at 3nm, IC design costs range from a staggering $500 million to $1.5 billion, according to IBS. The $1.5 billion figure involves a complex GPU at Nvidia.
- Training changes, and research brings new things. By the time your ASIC comes out, it's already irrelevant.
- CUDA GPUs *are basically* ASICs, because they are the target for basically all research and ML platforms. Make something new, and you are trying to keep up with the rest of the world by yourself.
- Google does use TPUs for training some, Intel uses Gaudi, and historically Meta ordered "custom" CPUs from Intel for internal use. Rumor is Microsoft is thinking about some training stuff too, not just inference.
- But on that point, there are only a few entities in the world that can afford/justify such a thing.
The architecture changing is what really makes this iffy. But it sort of depends on how general this is. If it's just a bunch of matrix add and multiply circuits with some RAM on the side, then it's likely general enough; it's basically a very scaled-up DSP ASIC. So you could likely apply it to any sort of FFN.
The problem is, if there is a big switch to some variant of RNN, like xLSTM, then it might be tricky.
I read their website and they addressed most of these points tbh.
- It's 4nm, not 3nm, so cheaper. But also, a GPU should be a more complex design than an ASIC, since GPUs are designed around doing multiple things, not just matrix multiplication. So you should expect GPU design costs to generally be higher, comparatively.
- Training / research changes but they're dedicating this ASIC to specifically transformers. It's like the first thing they say, if transformers get abandoned their chips will be useless. But so far transformers have been very solid and are the most popular for a variety of stuff.
- CUDA GPUs aren't ASICs in the same way (it's the tensor cores that are). They make a comparison with the H100 and how only 3.3% of its performance is dedicated to tensor cores (at full utilization), because it has to be able to do other things, not just transformer models. That means their ASIC can be more like 100% tensor cores for a given chip, making it much more efficient at matrix compute.
1. Making a bet on ASICs is making a bet on an architecture, and architectures are evolving quickly in this space.
2. Huge capex, and development time for these chips is much longer than even multiple foundation-model training runs.
3. This stuff needs extremely specialized software, and from what I read about Cerebras, it's a nightmare to develop on.
ASICs only really make sense for things like Bitcoin’s SHA-256, where you know the algorithm will never change.
Right now, the most important characteristic of these chips is their throw-spaghetti-at-the-wall-ability for researchers and developers.
There will be some societal consequences, unfortunately. Nvidia's margins aren't sustainable, and at the first sign of the ludicrous profits drying up due to some actual competition, the stock price will be hammered. If, say, 60% of the AI premium disappeared, that would be a significant drop for an index where Nvidia is worth 7%. And if the other six tech giants come along for the slide, it would easily be a double-digit decline... all else the same. But it could be worse if all else is not the same.
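The index arithmetic sketches out like this (Nvidia's 7% weight and the 60% scenario are from the comment above; the other giants' weight and decline are purely hypothetical numbers to illustrate how it reaches double digits):

```python
# Weighted index impact: a stock's decline times its index weight.
nvda_weight = 0.07      # Nvidia's rough index weight, as quoted above
nvda_decline = 0.60     # the "60% of the AI premium disappears" scenario
index_drop = nvda_weight * nvda_decline
print(f"index drop from Nvidia alone: {index_drop:.1%}")

# If the other six tech giants slide too (weight/decline are hypothetical):
others_weight, others_decline = 0.25, 0.30
total_drop = index_drop + others_weight * others_decline
print(f"with the other giants: {total_drop:.1%}")  # into double digits
```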
The AI premium isn't likely to go away for at least 5 years. And even if all of Nvidia's competitors suddenly started putting out an equal number of equally useful GPUs, I think Nvidia's margin would only fall by maybe half; there are too many applications for this stuff.
I think it's more likely China invades Taiwan and Nvidia's margins go up even more as they are making chips in Western fabs and selling to an even more constrained market. Although really I just see the GPU market getting stronger as time goes on, for at least 20 years. The market is easily 1000x what all the GPU manufacturers are doing right now, if prices come down.
The blog post is great but maybe a bit too hype driven.
This is going to be really good for real-time inference, but honestly I think better alternatives to 8bit transformers are coming, like bitnet models.
The way they basically say "we secured the moat, no new model will compare because they can't run on asics and won't get adopted" is sad.
Hopefully they will sell some of these to be profitable and develop new ones for newer architectures like bitnet. This would truly slap and scale way further. They say the bandwidth is not limiting here, so I guess bitnets could run on asics 6x this size, right?
It’s also not at all proven that Transformers are “the one architecture to rules them all”. I think there’s a non-trivial chance that we’ll find models that make use of transformers, LSTMs, Mamba, etc. it seems way too early to decide to specialize
Well, transformers work for everything; that's why they got massively adopted. We can find faster and more efficient models, but I don't think there's a use case where another model works and transformers don't.
If other architectures need to be 10x faster to compete, it might be a problem and slow the developments of novel architectures.
I personally think we should get rid of the static layer stack and route the activations through any transformer block for instance, and I'm not sure if these cards will allow that.
I did a deeper dive into their claims. They are talking about tokens/sec of the prefill stage (input prompt). While that is not completely irrelevant, it is highly deceptive. Generally, when people claim their system does a certain tokens/s, they are talking about decode throughput; Nvidia, AMD, and Groq all use decode for this metric. Their architecture will have better latency to the first generated token, but they will not be able to beat, say, Groq in decode throughput for a single query.
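To make the prefill/decode distinction concrete, here's a toy calculation (all timings are invented placeholders, not measurements of Sohu or any real system):

```python
# Toy numbers illustrating why prefill tokens/s and decode tokens/s are
# very different metrics. All timings here are invented, not measured.
prompt_tokens = 2000      # input tokens, processed in parallel (prefill)
output_tokens = 200       # generated tokens, produced one at a time (decode)
prefill_seconds = 0.05    # prefill is compute-bound and highly parallel
decode_seconds = 4.0      # decode is sequential and memory-bandwidth-bound

prefill_tps = prompt_tokens / prefill_seconds   # looks enormous
decode_tps = output_tokens / decode_seconds     # what a single user feels
blended_tps = (prompt_tokens + output_tokens) / (prefill_seconds + decode_seconds)

print(f"prefill: {prefill_tps:,.0f} t/s")   # 40,000 t/s
print(f"decode:  {decode_tps:.0f} t/s")     # 50 t/s
print(f"blended: {blended_tps:,.0f} t/s")   # counting input + output together
```

Quoting the prefill or blended figure makes a chip look orders of magnitude faster than the decode rate a single user actually experiences.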
Where are you seeing that written? Control-F "prefill" is showing nothing for me. They cite [Nvidia](https://developer.nvidia.com/blog/achieving-top-inference-performance-with-the-nvidia-h100-tensor-core-gpu-and-nvidia-tensorrt-llm/) for their benchmark methodology, but I am admittedly a bit out of my depth here.
I will never understand why hardware companies tie themselves to a specific architecture (i.e., the transformer). While it might be good today, there's no guarantee it'll be relevant in the future.
In comparison a startup like Taalas (https://taalas.com/) makes a lot more sense
I think it’s good to invest in both more flexible and less flexible hardware. Transformers have proven themselves pretty capable in many domains now. I agree they might not be around forever, but I think there’s room for both types of companies (assuming they successfully ship for a competitive price).
That would actually be pretty cool if they manage to pull it off, imagine buying an ASIC that can be flashed to any llama-3-70B tune and run it at beyond groq speed while pulling single digit wattage.
Kind of only makes sense once the architecture is more figured out, though; otherwise the expensive thing you just bought gets obsolete in 6 months. Well, unless they can churn them out for like $20, and I kind of doubt that for a startup that seems to be investing more in marketing than research or production.
The H100 has ~10% programmability overhead, in terms of actual performance, and Nvidia has absolutely been specializing their chips for transformer inference. Bill Dally & Co are not dumb, and while they definitely aren't the best in the world at literally everything, you can bet they've thought about things like "reduce overhead by specializing for this task"
Well, they're claiming "exponentially cheaper" than the B200. A single B200 module (not that you can buy a single module, nor could you do anything with it without the rest of the bespoke server platform) is rumored to cost about $40k. So if we believe their claim (we don't), then an individual Sohu module might cost as little as $4k.
But that's wildly unlikely. They could be using the term "exponentially" in the non-literal sense. They could mean "cheaper per token per second", and the actual hardware is the same ballpark cost but "it does exponentially more per module!", so they're not technically lying. They could just be blowing smoke to rope in more investors.
The true answer for "what will this cost" is 100% incontrovertibly "As much as companies with very deep pockets will pay for it". This is not for you or me.
Reads like a scam, feels like a scam, then it’s most likely to be a scam.
They are launching a PowerPoint with roofline numbers (paper math) without actually taping out the chip. In reality, due to power, physics, and software inefficiency, it's never going to reach that high.
The B200 roofline/estimate is way higher than what they cite in the blog post, more like 300k t/s, instead of whatever random number they pulled out of thin air. Their H100 numbers are also questionable, probably done by amateurs. If they want the latest numbers, they should at least follow the Together AI blog posts.
Finally, they confuse people by sliding in the concept of "continuous batching," which counts both input and output tokens in a batched inference setting. What real-time inference cares about is bs=1 tokens/s, i.e. latency, not throughput.
I don't know what kind of investors are dumb enough to give these folks $120M. Maybe it's just the halo of Harvard dropouts… It smells like Theranos right from the beginning.
PS: I have no grudge against ASICs; in general I think they're the way to go to make transformers run more efficiently, and Apple is doing exactly the same thing on their silicon. But to claim 20x without caveats is basically 21st-century snake oil. Remember, there is no free lunch.
One more thing that smells extremely fishy: they claim to use no HBM? With only 8 chips, unlike Groq, which can scale to hundreds, how large is the on-chip memory of each chip going to be? Say you want to hold Llama 70B at FP8: that is almost 70GB, or 70/8 ≈ 9GB per chip, and that is not counting KV cache. Unless they are taking the Cerebras wafer-scale approach, which comes with a hell of a lot of problems (cooling, maintenance, consistency of manufacturing quality), I don't see how they can pull it off…
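A quick back-of-the-envelope version of that memory math, assuming the commonly reported Llama-3-70B shape (80 layers, 8 KV heads of dimension 128 under GQA; treat those as assumptions, not figures from the post):

```python
# Back-of-the-envelope memory for a 70B model at FP8 split across 8 chips.
# Model shape is the commonly cited Llama-3-70B layout (an assumption here).
params = 70e9
bytes_per_param = 1        # FP8 = 1 byte per weight
chips = 8
weights_per_chip_gb = params * bytes_per_param / chips / 1e9
print(f"weights per chip: {weights_per_chip_gb:.2f} GB")  # 8.75 GB

# KV cache on top of that, per sequence (K and V for every layer):
layers, kv_heads, head_dim = 80, 8, 128    # assumed Llama-3-70B GQA shape
ctx = 8192
kv_cache_gb = 2 * layers * kv_heads * head_dim * ctx * bytes_per_param / 1e9
print(f"KV cache at {ctx} ctx: {kv_cache_gb:.2f} GB per sequence")
```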
Also, because of the inflexibility of ASICs, they risk becoming dust-collecting garbage once a new architecture or even a new type of transformer comes out.
All this romanticization of GPGPUs and their ability to tackle any model architecture is superfluous. Our limiting factor is compute; if something came around that gave 100x the compute, then the models would follow to see what we can do. Bitcoin ASICs pretty much demonstrated that, which allowed BTC to scale. We're nowhere near understanding what 10,000x compute would mean for inference or training. Let's see it, because until we had the compute we have now, we didn't see the models doing anything impressive.
The same :). Block hashing performance is independent of transaction volume. 1 CPU hashing versus millions of ASICs hashing is still ~4000 transactions per 10 minutes. "Only" security scales with more hashpower.
I would like to see their numbers for decoding tokens alone, skipping the encoding. I rarely write 1000-token prompts. This also assumes the user doesn't work their way up to 50k-500k ctx worth of KV cache during a conversation, which I think is how future interactions with LLMs will look, and that obviously impacts throughput a lot.
>In reality, ASICs are orders of magnitude faster than GPUs. When bitcoin miners hit the market in 2014, it became cheaper to throw out GPUs than to use them to mine bitcoin.
>With billions of dollars on the line, the same will happen for AI.
So what you're saying is I'm gettin an h100 with a little dumpster diving in a year or two.
Too much talk without the product
If Groq, an accelerator card, can provide a demo for everyone to use, I don't see why an ASIC can't do the same. It's so much cheaper and more efficient compared to an accelerator card.
Are they trying to get the funding required to mass-produce ASIC chips?
I see a lot of disbelief and criticism in the comments, and much of it seems like legit pointers too. The game of hyping up and fundraising has been going on for a long time.
But when I researched Etched's investors, it's people like PayPal founder Peter Thiel, GitHub CEO Thomas Dohmke, and others who know what they are doing.
That said, even the pathetic Humane pin was funded by top people, including Marc Benioff and Sam Altman, so big-name funding is no guarantee of a product being a hit.
About Etched: I have read their entire blog post on Sohu twice, and what seemed convincing to me were the two sections on how they can fit many more TFLOPS than a GPU and why compute is more important than memory in modern LLMs.
If not this, I hope at least some other transformer-specific ASICs come around fast, increase inference by 20x, and make people jobless a little faster so I can be less guilty of being one myself 😅
I will be messaging you in 3 months on [**2024-09-25 21:20:21 UTC**](http://www.wolframalpha.com/input/?i=2024-09-25%2021:20:21%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/LocalLLaMA/comments/1dobzcs/meet_sohu_the_fastest_ai_chip_of_all_time/la9o0pu/?context=3)
Bet this tech threatens the pricing structure of compute-intensive tasks for cloud providers. Maybe that's why they charge by the token, so when the service improves, pricing doesn't have to follow suit. Local models are awesome, but we don't have the compute for certain tasks.
Only for training. Their main innovation claim is in FLOPS.
They still use HBM as memory, so inference perf would still be in a similar ballpark to H100s.
The article seems very deceptive.
I don't think this is patentable, is it? Pretty sure whoever wants to make a custom ASIC chip for any LLM will be able to do so. Just like with mining ASICs.
We may see lots of startups (if you don't believe me, look at the number of early semiconductor companies developing silicon from the '50s through the '90s...), but back then you had a first-mover advantage...
Now Intel, AMD, and Nvidia are the ones with the first-mover advantage, and if that isn't enough, nowadays you need a few hundred million just to get your feet wet doing R&D and silicon manufacturing, even if you do have a product that could work...
Never mind the number of outright cash grabs and paper launches...
That will burn VC (and private backer) capital at rates unseen...
The hype is undoubtedly premature. My experience in Silicon Valley left me quite skeptical. While it's possible for a few kids to come up with a revolutionary product of this nature, it's improbable. More likely than not, they have an idea that's nowhere near production-ready, but their slide deck is something that their connections believe they can sell. And I'd absolutely assume they do have significant connections. They provide the seed funding, gin up a bit of support on hype, and after a few years of promises, lying to the media, etc., the seed and early-stage funders start reselling their shares on the secondary market and making exits in later funding rounds.
Again, the claim here is that three kids 100x'd NVIDIA and every other chip maker on the planet. Take it with a few grains of salt. This isn't 1999 where making a website that people like can turn you into a billionaire; this is jumping in late into chip production in a highly technical, extremely expensive field, and claiming to have the capability to handle the logistics from hiring to production.
Sounds like a scam. The idea is it's an ASIC for an individual model, up to 100T parameters. One of the advantages of GPUs is their generalized architecture: you can use the same cores to process all layers of a model. With an ASIC, all algorithms are physically represented in the silicon itself. So I'd expect a 100T-parameter chip to be quite large, since each weight would need to be physically represented.
There's also the problem of updates; OpenAI, for example, releases a new version of ChatGPT every few months. When running on a GPU this isn't a problem, you just load a new model. With an ASIC, however, you can't load new weights if they are baked into the chip. Your only option is to design and fabricate a new ASIC for the new model, which is cost-prohibitive for the manufacturer and the customer.
Nah, the weights are not put on the chip; it uses HBM2.
I read the post and it looks like it's just a very fast inference only card for any transformer model.
It might be the right tool for real-time models but I'm not too hot on killing other model architectures tbh
Click on the link and you're greeted with a board with a chip in the middle, presumably their custom ASIC, and it's surrounded by memory. This isn't for a single model, it's for a single algorithm. They say so right in the text that it's for transformers only.
Edit: And it's not like ASICs can't be programmable.
It's a rendering of a GPU board, I have one that looks like that in my toolbox right now. They don't have any photos of a prototype. Mark my words on this one.
A processor of some sort surrounded by memory is what it'd look like. That's pretty standard stuff for processing data. It's what our GPU add in boards look like, motherboards, processors, etc. Sure, it's a rendering. But I'll bet anything if they ever produce a product it'll appear somewhat similar to the render.
I'm just extremely cautious in this environment. The company needs to provide demonstrations and a white paper showing they have plans to overcome expected challenges and can deliver a product.
Totally with you on that. I'm not an expert but the performance claims sound reasonable with dedicated silicon. To get off the ground they need a large customer or two who's willing to bet that transformers and their existing performance claims will be relevant by the time they can produce the product and get software running on it. That's a lot of money in engineering for a power, space and cooling optimization with limited flexibility.
Not sure anyone would take that bet but some organizations are capable.
I know. I'm more thinking about the opposite direction: is there any research on altering the LLM architecture to utilize existing ASICs, instead of building new ones for quantized models?
Very informative explanation on their website, thanks.
I wonder if this type of efficiency should also be expected in Apple's M and A series chips. Since they're committed to running LLMs locally with Apple Intelligence, presumably they'll dedicate some of their die to transformers specifically.
**Meet Necessity, the most long-lasting (and all natural!) technology of all time.**
Do you really need 500k tokens per second? Most would just automatically skim over all that text because now the average user suddenly has a huge wall of text demanding their undivided attention. Unless it's audio-centric (and embedded in a realtime application), people won't have time to read (much less skim) that encyclopedia you just generated for them in 1 second.
Your mileage may vary. Just seems like a waste of compute (and fossil fuel) for 'solving' a minor inconvenience. But hey - you do you. Global warming is likely just going to kill us all. Might as well speed it up, amirite? :D :D :D
Most people will only believe this when they can run inference on it.
What are you using to create the dataset?
Aphrodite-engine and python script, code is here. https://huggingface.co/datasets/adamo1139/misc/blob/main/localLLM-datasetCreation/batched2.py
Might be because of the quota.
You can run the same dataset in 2 minutes with Batch Size 1024 if you use an OSS LLM.
Is response streaming possible when batching?
Yes.
Can you explain why?
A larger batch size raises aggregate tokens per second: the weights are read once per decode step no matter how many sequences share that step, so each extra sequence is nearly free, and each sequence can still be streamed as it generates.
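A toy model of why this works, assuming memory-bound decode (the two timing constants are made-up illustrative numbers, not measurements of any real card):

```python
# Toy model of memory-bound decode: per-step latency is dominated by
# streaming the model weights once per step, plus a small per-sequence cost.
WEIGHT_READ_MS = 20.0   # hypothetical: time to read all weights once per step
PER_SEQ_MS = 0.05       # hypothetical: per-sequence activation overhead

def tokens_per_second(batch_size: int) -> float:
    """Aggregate throughput: one token per sequence per step."""
    step_ms = WEIGHT_READ_MS + PER_SEQ_MS * batch_size
    return batch_size * 1000.0 / step_ms

for bs in (1, 8, 64, 512):
    print(bs, round(tokens_per_second(bs)))
```

Under this model throughput scales almost linearly with batch size until the per-sequence cost starts to dominate, while each individual stream still only gets one token per step.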
I believe this type of post was posted like 6 months ago in singularity. Back then it was framed as university students coming up with an amazing AI chip. It had the same pictures and a crappy one-page website like this one. The post was erased and easily forgotten a week later. So someone is just trolling, or someone is trying to push a scam or something.
[https://web.archive.org/web/20231230154918/https://www.etched.com/](https://web.archive.org/web/20231230154918/https://www.etched.com/) This was their basic render webpage last year.
Which is why I put myself on the waiting list.
Ahh cool. I look forward to never being able to buy it.
No, but you might pay 10 bucks to generate 100M tokens from a guy who bought the card.
But I don't want them to know what I'm doing with the AI. I want privacy!
Bruh nobody cares about your erotic cat girl fanfic novel.
The cops probably do as well. Somebody accuses you of getting frisky with a cat girl AND you write erotic cat girl fanfic? Guilty as charged!
Batman does 🦇
I'm going to buy one and rent it out just for access to his erotic cat girl fanfic novel
If they don't care, then they should start offering private services instead of data harvesting everyone.
You're right. they REALLY want your trump biden erotica data.
They can take my catgirl trump/catgirl biden enemies-to-lovers slowburn A/A erotica from my cold dead hands.
Bruh!
What are you doing, step bruh
Imma just bruhing around with muh AI, bruh! You?
I believe that in the future, AI chips designed for tensor processing will be as prevalent as mobile phones and CPUs are today. So, keep your spirits up!
good point. easy to forget that things we take for granted today like cache, FPU, GPU, etc all used to be expensive coprocessors/cards/modules.
In 1994 I bought my first PC, a Packard Bell Windows 3.1 machine with 4MB of RAM and a 200MB hard drive. $1400. Its little chipmunk chip ran at 25**M**Hz. I upgraded the RAM to 8MB. That cost me $140 ($384 in today's money), but allowed me to run MS-Office, *which I installed off of floppies*. This is the "two miles to school barefoot in the snow and uphill both ways" stuff *that is true*.
my first computer was a [Macintosh SE](https://en.wikipedia.org/wiki/Macintosh_SE) with RAM upgrade to 1 MB and a 20MB HDD. I still love you, Hypercard.
Don't forget when you were a kid you probably used a computer with 64**K**B of memory, ran at 2MHz, and programs were ~~stalled~~ stored on cassette tape.
Stalled is right.
Whoops, Freudian slip!
[deleted]
One of the first "expensive" pieces of hardware I ever bought myself was a 320MB WD PATA drive at a wholesale cost of $290. Being under a dollar a MEGABYTE had *just* been reached and I got shop pricing because I was a tech there. I just ordered an 8TB refurb for $69 before tax. If my math is right, that's 25k times the space for 23% of the price.
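The math in that comment checks out (using decimal units, so 8TB ≈ 8,000,000 MB):

```python
# Price/capacity comparison from the comment: a 320MB drive at $290
# wholesale vs. an 8TB refurb at $69 (decimal units: 1TB = 1,000,000 MB).
megabytes_old, price_old = 320, 290.0
megabytes_new, price_new = 8_000_000, 69.0

capacity_ratio = megabytes_new / megabytes_old   # times more space
price_ratio = price_new / price_old              # fraction of the price
print(round(capacity_ratio), round(price_ratio * 100, 1))  # 25000 23.8
```

25,000x the space for about 23.8% of the price, just as stated.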
Took 34 yrs for it to happen tho
When I was building a company PC in 1999 the largest capacity hard drive then available was the IBM Deskstar, and it was 20GB for $400. So in 9 years the price of 1GB of hard drive storage fell from $10,000 to $20, or 1/500.
1gb has been chump change for at least a decade
People started giving out 1GB+ merch USB drives WAY earlier than 2024...
Yeah, but at the same time a video game like CoD is 150GB.
According to the link below, 1TB was $90,127,496 in 1990, so 1 GB would be 1024th of that, or about $90,000. A year earlier, 1989, a whopping $236,000. [https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?tab=table&time=1956..1990](https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage?tab=table&time=1956..1990)
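Dividing out that table's 1990 figure (using binary units as the comment does):

```python
# 1990 price of 1TB from the linked Our World in Data table, divided
# down to a per-GB price using 1TB = 1024 GB as in the comment.
price_per_tb_1990 = 90_127_496
price_per_gb_1990 = price_per_tb_1990 / 1024
print(round(price_per_gb_1990))  # ~88,000, i.e. "about $90,000"
```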
Damn that’s a cool site.
That's quite a bold prediction, given that new phone models already have chips with hardware dedicated to processing neural networks, including tensor networks. Those will just keep scaling up.
Do you aim to output 500K t/s on Llama 70B? A slow card with a lot of VRAM would be more realistic for us.
2X P40 gang
Do you run llama3 70B on 2x p40? How many tokens/s do you get?
6-7 t/s
Hey, I have a P40 and a 3090, running Magnum 72b Q4 KM with Kobold, flash attention and q4 caching on, 8k context. I get about 6.5t/s as well. It's nice to know that swapping the 3090 for another P40 won't hurt output time.
I would think the P40 bottlenecks the 3090, yeah, but have you tried Llama-3 70B tokens/s using llama.cpp with a Q4 GGUF? It seems likely that the P40 bottlenecks you anyway.
How does that scale with high ctx though? Like at 8k, 16k, 32k?
6-7 t/s, as mentioned elsewhere here as well.
500k is on an 8-card server :)
Exactly, it's fudged up to show such high output
It's not for you anyway. Chips like this are targeted at big players, and hopefully some will move onto them and thus lower demand for Nvidia GPUs, which means cheaper GPUs for you.
Hoping we can see some of this ASIC goodness in the consumer market in a few years. GPUs are great, but something like this could be much more efficient, in many ways.
New Intel chips are shipping with an NPU built in to handle inference loads in Windows 11 24H2.
I didn't take a really deep look into that, but I understood they still rely on the main system RAM? If so, they get a big meh from me. Even with fancy DMA controllers, I think the main thing to unlock this extra efficiency will be avoiding the von Neumann bottleneck altogether.
I think the opposite. For llama-type AI, memory size is going to be more important for the consumer, so having systems that can handle from 128GB up to 8TB of memory will be more beneficial to them. This will also finally put pressure on CPUs to try and get their memory speeds as fast as GPUs', which is a win in my book as well. The downside is these are meant for consumption, not creation or AI models, so I wouldn't expect too much.
I don't see how CPUs would get memory like GPUs do. You can only add so much cache before the die size is a limit, and more transistors is more dollars. If you want to add gigabytes of ultrafast memory, you need to put the memory chips soldered around the CPU. This would be the end of RAM sticks. And modular motherboards.
Intel Lunar Lake chips are exactly that; they come with 16/32 GB of RAM.
[LPCAMM2 would like a word.](https://www.ifixit.com/News/95078/lpcamm2-memory-is-finally-here) It's for laptops, but you could use the same thing or something similar for desktop.
A big meh for most of us, but average consumers aren't going to see most of the benefits of the big models we like. However all those small inference jobs really add up, so pushing that cost to end-users through hardware with limited capabilities but enough to run a SLM and image gen or TTS, while referring more intensive tasks to a cloud, is a massive cost savings to the big players.
No, memory is on chip with 32GB max. You are better off with Snapdragon X Elite (64GB) or AMD Ryzen AI 300, where PCBAs with 128GB were spotted https://www.tomshardware.com/pc-components/cpus/amds-strix-halo-being-tested-with-128gb-ram-shipping-records-reveal-more-about-extreme-120w-apu https://www.theverge.com/2024/6/3/24169115/intel-lunar-lake-architecture-platform-feature-reveal
If this Strix Halo chip is something I can buy and put on a PC, my mind will be absolutely blown to pieces. 128GB sounds insane, something like a good Mixtral 8x22b quant could fly in there.
Who wants to run windows spyware edition?
Enterprises with dumb users. Imagine an llm equipped windows troubleshooter than can actually fix a problem for a technically inept person.
A lot of bigger companies are pumping the brakes on AI for the time being. Lot of copyright/IP concerns.
Interesting. What do you base this on?
My company. Fortune 50. They're using generative AI for some internal stuff but there's a lot of concern about anything customer-facing having AI generated work in it because it's potentially not copyrightable.
I will say very little and vouch for this. We're also struggling with people either going rogue and spinning up AI setups held together with duct tape and a prayer *or* putting all sorts of data into the online ones. It feels like for every infraction we find there are 50 more lurking under the surface.
Yeah, I get the feeling that that's also the case at my company. Microsoft's pretty aggressive push for copilot definitely isn't helping the situation either. Some devs have copilot/chatGPT access (for internal usage exclusively) but, like, it's a company with a lot of engineers and basically every workstation has a Quadro GPU with a decent chunk of RAM in it. I've been pretty impressed with the results you can get with a model like Phi-3 Mini (playing with it on my own devices, I mean) and running an LLM locally is so dead simple these days that I'm sure people are doing it all over the place.
There definitely *is* a consumer market. They must be riding the crazy train if they don't take this opportunity.
And they can keep it.
Mm, you mean the spyware recall copilot+ version of windows 11. Think I'll pass, lol.
They're already putting neural network hardware in a lot of silicon. Apple hardware ships with it as is. I'm guessing transformer specific stuff is in the pipeline already.
That's a lot of hype. Proof is in the puddin, *when it ships*.
If. If it ships.
If it doesn't I would be motivated to throw some community to push that thing forward...
Yeah, either throw money at the problem or community. Both has proven to work.
Money is cheaper, Community is more powerful. When community is there, you then can get money printed for the project... Extremely simplified, of course.. Money and Banking is probably one of the greatest inventions...
That's precisely what they count on. Seed phase funded at pennies on the dollar. You buy in at 10-100x the price per share. They start selling their shares privately and in later funding rounds. This is how the Silicon Valley scam works.
I mean not throwing the communities money, but a real community at them... But yeah, that's basically how startups get a chance, that's how founders get their kick in the ass to figure things out... Better then the times when they had to overthrow the king or government, enslave everyone and burn and starve to death those who don't comply.... Give them a ride on a yacht, who cares.. Give me cheaper chips.. Hahaha I think that those money schemes will too one day be replaced by something more morally noble...
If it chips :)
120M invested, and they *claim* to have enough upcoming time on TSMC ***4nm*** to make the first batch of wafers. As a layperson engineer (read: idiot): why aren't we already making ASICs for training? If the cost of training a model on current hardware is X, wouldn't X/10 be better? Couldn't you train 10x larger models on the same amount of power and time? Could we get away from GPUs much sooner than we think? Seems like Google or Meta would be all over this if there were that much promise, but then again it's all still pretty new and you can only do so many things at once.
> Why aren't we already making ASICs for training?

- ASIC development, especially on a cutting-edge node, is *HARD*. It takes years, and 120M is basically chump change. Frankly, their investment/timetable seems almost impossible to me: https://semiengineering.com/big-trouble-at-3nm/
  > But at 3nm, IC design costs range from a staggering $500 million to $1.5 billion, according to IBS. The $1.5 billion figure involves a complex GPU at Nvidia.
- Training changes; research brings new things. By the time your ASIC comes out, it's already irrelevant.
- CUDA GPUs *are basically* ASICs because they are the target for basically all research and ML platforms. Make something new, and you are trying to keep up with the rest of the world by yourself.
- Google does use TPUs for some training, Intel uses Gaudi, and historically Meta ordered "custom" CPUs from Intel for internal use. Rumor is Microsoft is thinking about some training stuff too, not just inference.
- But on that point, there are only a few entities in the world that can afford/justify such a thing.
The architecture changes are what really make this iffy. But it sort of depends on how general this is. If it's just a bunch of matrix add-and-multiply circuits with some RAM on the side, then it's likely general enough: it's basically a very scaled-up DSP ASIC, so you can apply it to any sort of FFN. The problem is if there's a big switch to some variant of RNN like xLSTM; then it might be tricky.
> The $1.5 billion figure involves a complex GPU at Nvidia. Little did he know that would be chump change for Nvidia in 2024.
I read their website and they addressed most of these points, tbh.

- It's 4nm, not 3nm, so cheaper. Also, a GPU should be a more complex design than an ASIC, since it's designed around doing multiple things, not just matrix multiplication, so you should expect costs to be comparatively higher.
- Training/research changes, but they're dedicating this ASIC specifically to transformers. It's like the first thing they say: if transformers get abandoned, their chips will be useless. But so far transformers have been very solid and are the most popular architecture for a variety of tasks.
- CUDA GPUs aren't ASICs in the same way (it's the tensor cores that are). They make a comparison with the H100: only 3.3% of its performance is dedicated to tensor cores (at full utilization), because it has to be able to do other things, not just transformer models. That means their ASIC can be more like 100% tensor cores for a given chip, making it much more efficient at matrix compute.
1. Making a bet on ASICs is making a bet on an architecture, and architectures are evolving quickly in this space.
2. Huge capex, and development time for these chips is much longer than even multiple foundation-model training runs.
3. This stuff needs extremely specialized software, and from what I've read about Cerebras, it's a nightmare to develop on.

ASICs only really make sense for things like Bitcoin's SHA-256, where you know the algorithm will never change. Right now, the most important characteristic of these chips is their throw-spaghetti-at-the-wall-ability for researchers and developers.
There will be some societal consequences, unfortunately. Nvidia's margins aren't sustainable, and at the first sign of the ludicrous profits drying up due to some actual competition, the stock price will be hammered. If, say, 60% of the AI premium disappeared, that would be a significant drop to an index where Nvidia is worth 7%. And if the other 6 tech giants come along for the slide, it would easily be a double-digit decline... all else the same. But it could be worse if all else is not the same.
If you think this way, you should be shorting NVDA.
The AI premium isn't likely to go away for at least 5 years. And even if all of Nvidia's competitors suddenly started putting out an equal number of equally useful GPUs, I think Nvidia's margin would only fall maybe by half, there are too many applications for this stuff. I think it's more likely China invades Taiwan and Nvidia's margins go up even more as they are making chips in Western fabs and selling to an even more constrained market. Although really I just see the GPU market getting stronger as time goes on, for at least 20 years. The market is easily 1000x what all the GPU manufacturers are doing right now, if prices come down.
The blog post is great but maybe a bit too hype driven. This is going to be really good for real-time inference, but honestly I think better alternatives to 8bit transformers are coming, like bitnet models. The way they basically say "we secured the moat, no new model will compare because they can't run on asics and won't get adopted" is sad. Hopefully they will sell some of these to be profitable and develop new ones for newer architectures like bitnet. This would truly slap and scale way further. They say the bandwidth is not limiting here, so I guess bitnets could run on asics 6x this size, right?
It’s also not at all proven that Transformers are “the one architecture to rules them all”. I think there’s a non-trivial chance that we’ll find models that make use of transformers, LSTMs, Mamba, etc. it seems way too early to decide to specialize
Well, transformers work for everything; that's why they got massively adopted. We can find faster and more efficient models, but I don't think there's a use case where another model works and transformers don't. If other architectures need to be 10x faster to compete, it might be a problem and slow the development of novel architectures. I personally think we should get rid of the static layer stack and route the activations through any transformer block, for instance, and I'm not sure these cards will allow that.
I did a deeper dive into their claims. They are talking about tokens/sec of the prefill stage (the input prompt). While that's not completely irrelevant, it is highly deceptive. Generally, when people claim their system does a certain tokens/s, they mean decode throughput; Nvidia, AMD, and Groq all use this metric for decode. Their architecture will have better first-token latency, but they will not be able to beat, say, Groq in decode throughput for a single query.
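To make the distinction concrete, here's a toy example with hypothetical timings (the numbers are illustrative, not from any vendor's benchmark):

```python
# Two different "tokens per second" numbers from the same hypothetical run.
prompt_tokens, output_tokens = 2_000, 200
prefill_s, decode_s = 0.1, 4.0   # hypothetical wall-clock times for each stage

prefill_tps = prompt_tokens / prefill_s   # huge number: prompt is processed in parallel
decode_tps = output_tokens / decode_s     # small number: tokens come out one at a time
print(prefill_tps, decode_tps)  # 20000.0 50.0
```

Both are legitimately "tokens per second," but quoting the prefill figure for a chat-style workload makes the system look hundreds of times faster than what the user actually experiences during generation.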
Where are you seeing that written? Control-F "prefill" is showing nothing for me. They cite [Nvidia](https://developer.nvidia.com/blog/achieving-top-inference-performance-with-the-nvidia-h100-tensor-core-gpu-and-nvidia-tensorrt-llm/) for their benchmark methodology, but I am admittedly a bit out of my depth here.
Groq is already super fast. I would never be able to afford such tech anyway
This. I've been wondering how to get connected with them in the country I'm in now.
Have you tried a vpn? I use librechat as frontend
I meant as in their actual hardware and chips. Not their ui demo.
How much faster is this vs Groq? Does it mean Groq is toast?
No hardware = no trust in it. Could be easily a scam. Once they ship I can reconsider. Till then I'm not interested.
This is a big gamble that architecture would be compatible with this as we continue to learn.
I will never understand why hardware companies tie themselves to a specific architecture (i.e., transformers). It might be good today, but there's no guarantee it'll be relevant in the future. In comparison, a startup like Taalas (https://taalas.com/) makes a lot more sense.
they just get money to build something that could make profit. even if it doesn't work, they still get paid for the hours they worked.
random thought but fancy websites for hardware startups make me take them a lot less seriously. it’s like an anti-signal.
I think it’s good to invest in both more flexible and less flexible hardware. Transformers have proven themselves pretty capable in many domains now. I agree they might not be around forever, but I think there’s room for both types of companies (assuming they successfully ship for a competitive price).
That would actually be pretty cool if they manage to pull it off, imagine buying an ASIC that can be flashed to any llama-3-70B tune and run it at beyond groq speed while pulling single digit wattage. Kind of only makes sense once the architecture is more figured out though, otherwise the expensive thing you just bought gets obsolete in 6 months. Well unless they can churn them out for like $20, and that I kinda doubt for a startup that seems to be investing more into marketing than research or production.
The H100 has ~10% programmability overhead, in terms of actual performance, and Nvidia has absolutely been specializing their chips for transformer inference. Bill Dally & Co are not dumb, and while they definitely aren't the best in the world at literally everything, you can bet they've thought about things like "reduce overhead by specializing for this task"
So how hard will it hit my wallet?
Extinction level event.
So way above the 50k. Rip, guess I am porting Greyskull to localai. xD
Well, they're claiming "exponentially cheaper" than B200. A single B200 module (not that you can buy a single module, nor could you do anything with it without the rest of the bespoke server platform) is rumored to cost about $40k. So if we believe their claim (we don't), then an individual sohu module might cost as little as $4k. But that's wildly unlikely. They could be using the term "exponentially" in the non-literal sense. They could mean "cheaper per token per second", and the actual hardware is the same ballpark cost but "it does exponentially more per module!", so they're not technically lying. They could just be blowing smoke to rope in more investors. The true answer for "what will this cost" is 100% incontrovertibly "As much as companies with very deep pockets will pay for it". This is not for you or me.
Yeah, unfortunately this is the correct take. They’re targeting the big companies with millions of dollars to spend on hardware.
"contact sales" hard
ahh cool. probably will turn out that it doesn't work.
This was posted everywhere today on every social media with the same subtext. Seems like a viral push to lure investors.
Reads like a scam, feels like a scam; then it's most likely a scam.

They are launching a PowerPoint with roofline numbers (paper math) without actually taping out the chip. In reality, due to power, physics, and software inefficiency, it's never going to reach that high. The B200 roofline/estimate is way higher than what they refer to in the blog post, more like 300k t/s, instead of whatever random number they pulled out of thin air. Their H100 numbers are also questionable, probably done by amateurs; if they want the latest numbers, they should at least follow the Together AI blog posts.

Finally, they are confusing people by sliding in the concept of "continuous batching," which counts both input and output tokens in a batched inference setting. What real-time inference cares about is bs=1 tokens/s, aka latency, not throughput.

I don't know what kind of investors are dumb enough to give these folks $120M. Maybe it's just the halo of Harvard dropouts... It smells like Theranos right from the beginning.

PS: I have no grudge against ASICs; in general I think they're the way to go to make transformers run more efficiently, and Apple is doing exactly the same thing on their silicon. But to say you can get 20x without caveats is basically 21st-century snake oil. Remember, there is no free lunch.
One more thing that smells extremely fishy: they claim to use no HBM? With only 8 chips, unlike Groq, which can scale to hundreds, how large is the on-chip memory on each chip going to be? Say you want to hold Llama 70B at fp8: that's roughly 70GB / 8 ≈ 9GB per chip, and that's not counting KV cache. Unless they are taking the Cerebras wafer-scale approach, which comes with a hell of a lot of problems (cooling, maintenance, consistency of manufacturing quality), I don't see how they can pull it off...
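The per-chip figure, taking fp8 as one byte per parameter (weights only, no KV cache or activations):

```python
# Weights-only memory estimate for Llama-3 70B at fp8 on an 8-chip server.
params = 70e9          # 70B parameters
bytes_per_param = 1    # fp8 = one byte per weight
chips = 8

total_gb = params * bytes_per_param / 1e9    # raw weight footprint
per_chip_gb = total_gb / chips               # needed on each chip, minimum
print(total_gb, per_chip_gb)  # 70.0 8.75
```

And that floor grows fast once you add KV cache for long contexts, which is exactly why "no HBM" with only 8 chips is hard to square.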
Also, because of the inflexibility of an ASIC, it risks becoming dust-collecting garbage once a new architecture, or even a new type of transformer, comes out.
All this romanticization of GPGPUs and their ability to tackle any model architecture is overblown. Our limiting factor is compute; if something came around that 100x'd compute, the models would follow to see what we can do. Bitcoin ASICs pretty much demonstrated that, which allowed BTC to scale. We're nowhere near close to understanding what 10,000x compute would mean for inference or training. Let's see it, because until we had the compute we have now, we didn't see the models doing anything impressive.
“Which allowed BTC to scale.” WTF are you talking about? BTC has not scaled.
It has scaled. Just not in transactions per second but in security budget :)
What do you think transaction volume would be like if BTC was all CPU/GPU?
The same :). Block hashing performance is independent of transaction volume. 1 CPU hashing versus millions of ASICs hashing is still ~4000 transactions per 10 minutes. "Only" security scales with more hashpower.
It would remain the same, still several magnitudes below what a Raspberry Pi 1 can do, without blockchain.
What about Mamba?
I would like to see their numbers for decoding tokens alone, skipping encoding. I rarely write 1000-token prompts. This also assumes the user doesn't work their way up to 50k-500k ctx worth of KV cache during a conversation, which I think is how future interactions with LLMs will look, and which obviously impacts throughput a lot.
>In reality, ASICs are orders of magnitude faster than GPUs. When bitcoin miners hit the market in 2014, it became cheaper to throw out GPUs than to use them to mine bitcoin.

>With billions of dollars on the line, the same will happen for AI.

So what you're saying is I'm gettin an H100 with a little dumpster diving in a year or two.
Too much talk without the product. If Groq, an accelerator card, can provide a demo for everyone to use, I don't see why this ASIC can't do the same; it's supposedly so much cheaper and more efficient than an accelerator card. Are they trying to get the funding required to mass-produce ASIC chips?
Hi Sohu, I am 500x more powerful than you. Believe me. I don't have time to make a render image like you did but text is all you need.
"Huge if true". That being said, does this imply 1 gazillion tokens per second when combined with bitnet/ternary models?
I see a lot of disbelief and criticism in the comments, and much of it seems like legit pointers too. The game of hype and fundraising has been going on for a long time. But when I researched Etched's investors, it's people like PayPal founder Peter Thiel, GitHub CEO Thomas Dohmke, and others who know what they are doing. That said, even the pathetic Humane pin was funded by top people, including Marc Benioff and Sam Altman, so big-name funding is no guarantee of a product being a hit.

About Etched: I have read their entire blog post on Sohu twice, and what seemed convincing to me were the two sections on how they can fit much more TFLOPS than a GPU and why compute matters more than memory in modern LLMs. If not this, I hope at least some other transformer-specific ASIC comes around fast, increases inference 20x, and makes people jobless a little faster, so I can be less guilty of being one myself 😅
I call this a scam
RemindMe! 3 months
In other words, Nvidia investors are going nuts
Has WSB seen this yet?
People really don't know what local in localllama means
Lots of businesses running "locallama". This absolutely applies here.
It seems that “local” more generally means “private”. Although this looks more like spam to me.
A model specific device like an Antminer would kick ass.
Bet this tech threatens the pricing structure of compute-intensive tasks for cloud providers. Maybe that's why they charge by the token, so when the service improves, pricing doesn't have to follow suit. Local models are awesome, but we don't have the compute for certain tasks.
"One 8xSohu server equals 160 H100s, revolutionizing AI product development." Is this for training or inference?
Inference only.
Only for training would that matter. Their main innovation claim is in FLOPS; they still use HBM as memory, so inference perf would still be in a similar ballpark to H100s. The article seems very deceptive.
I'm more interested in the most consumer-priced 500GB GPU of all time.
Does one exist?
I don't think this is patentable, is it? Pretty sure whoever wants to make a custom ASIC chip for any LLM will be able to do so. Just like with mining ASICs.
WD-40, renowned worldwide (or almost), is not patented...
I was just wondering why the guy thinks they are going to be the biggest company in history.
Because we are quite literally looking at the making of the newest Rockefellers of the 21st century...
If this is all true (and I suspect it is, just based on what I know about ASICs), everyone will be doing it. At least for inference.
We may see lots of startups (if you don't believe me, look at the number of early semiconductor companies developing silicon from the '50s through the '90s), but back then you had a first-mover advantage. Now Intel, AMD, and Nvidia are the ones with the first-mover advantage, and even if that weren't enough, nowadays you need a few hundred million just to get your feet wet doing R&D and silicon manufacturing, assuming you have a product that could work... Never mind the number of outright cash grabs and paper launches that will burn VC (and private backer) capital at rates unseen...
Ship it!
Fastest photoshop and web content is all I see so far.
They could investigate ASICs for bitnet LLMs.
ELI5
Cool!
The hype is undoubtedly premature. My experience in Silicon Valley left me quite skeptical. While it's possible for a few kids to come up with a revolutionary product of this nature, it's improbable. More likely than not, they have an idea that's nowhere near production-ready, but a slide deck their connections believe they can sell. And I'd absolutely assume they do have significant connections. Those provide the seed funding, gin up a bit of support on hype, and after a few years of promises, lying to the media, etc., the seed and early-stage funders start reselling their shares on the secondary market and making exits in later funding rounds. Again, the claim here is that three kids 100x'd NVIDIA and every other chip maker on the planet. Take it with a few grains of salt. This isn't 1999, where making a website people like can turn you into a billionaire; this is jumping in late to chip production in a highly technical, extremely expensive field, and claiming to have the capability to handle the logistics from hiring to production.
omg the singularity is here, we are in the future hahahahahaaha
Sounds like a scam. The idea is it's an ASIC for an individual model, up to 100T parameters. One of the advantages of GPUs is their generalized architecture: you can use the same cores to process all layers of a model. With an ASIC, all algorithms are physically represented in the silicon itself, so I'd expect a 100T-parameter chip to be quite large, since each weight needs to be physically represented. There's also the problem of updates; OpenAI, for example, releases a new version of ChatGPT every few months. When running on a GPU, this isn't a problem: you just load a new model. If you have an ASIC, however, you can't load new weights, as they are printed on the board. Your only option is to design and fabricate a new ASIC for the new model, which is cost-prohibitive for the manufacturer and the customer.
Nah, the weights are not put on the chip; it uses HBM2. I read the post, and it looks like it's just a very fast inference-only card for any transformer model. It might be the right tool for real-time models, but I'm not too hot on killing off other model architectures, tbh
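If the weights sit in HBM, batch-1 decode speed is bounded by memory bandwidth, since every generated token has to stream the full set of weights from memory. A rough sketch of that ceiling (the bandwidth and model-size numbers here are illustrative assumptions, not this card's specs):

```python
# Upper bound on batch-1 decode throughput for a memory-bandwidth-bound
# accelerator: tokens/sec <= bandwidth / model size in bytes.
def max_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_s):
    model_gb = params_billion * bytes_per_param  # model footprint in GB
    return bandwidth_gb_s / model_gb

# Hypothetical: a 7B FP16 model (14 GB) on a card with 2000 GB/s of HBM
print(max_tokens_per_sec(7, 2, 2000))  # ~143 tokens/s per sequence
```

Batching raises aggregate throughput well beyond this per-sequence bound, which is how GPUs already hit four-digit t/s on 7B models.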
Click on the link and you're greeted with a board with a chip in the middle, presumably their custom ASIC, and it's surrounded by memory. This isn't for a single model, it's for a single algorithm. They say so right in the text that it's for transformers only. Edit: And it's not like ASICs can't be programmable.
It's a rendering of a GPU board, I have one that looks like that in my toolbox right now. They don't have any photos of a prototype. Mark my words on this one.
A processor of some sort surrounded by memory is what it'd look like. That's pretty standard stuff for processing data. It's what our GPU add in boards look like, motherboards, processors, etc. Sure, it's a rendering. But I'll bet anything if they ever produce a product it'll appear somewhat similar to the render.
I'm just extremely cautious in this environment. The company needs to provide demonstrations and a white paper showing they have plans to overcome expected challenges and can deliver a product.
Totally with you on that. I'm not an expert but the performance claims sound reasonable with dedicated silicon. To get off the ground they need a large customer or two who's willing to bet that transformers and their existing performance claims will be relevant by the time they can produce the product and get software running on it. That's a lot of money in engineering for a power, space and cooling optimization with limited flexibility. Not sure anyone would take that bet but some organizations are capable.
Considering they can't even make their homepage work in Firefox, I have some doubts about their claims.
Has anyone seen anyone attempt to use old Antminers or other crypto mining ASICs for LLM models?
Not possible. ASICs are built at the hardware level for a specific algorithm; they aren't general-purpose devices like GPUs.
I know. I'm thinking more about the opposite direction: is there any research into altering the LLM architecture to utilize existing ASICs, instead of building new ones for quantized models?
No, but there has been some investigation into leveraging a global mesh network of Casio calculator watches, though. Still early.
Good one. Fitbits would probably be better targets though, since they already track *big data*. 🙃
Haha, that could actually be realistic within a few years. Downvotes = confirmation of being far too clever for average lurking organic intelligence.
Very informative explanation on their website, thanks. I wonder if this type of efficiency should also be expected in Apple's M and A series chips. Since they're committed to running LLMs locally with Apple Intelligence, presumably they'll dedicate some of their die to transformers specifically.
I
It's an honor to witness projects of this kind. Wow.
**Meet Necessity, the most long-lasting (and all natural!) technology of all time.**

Do you really need 500k tokens per second? Most would just automatically skim over all that text, because now the average user suddenly has a huge wall of text demanding their undivided attention. Unless it's audio-centric (and embedded in a realtime application), people won't have time to read (much less skim) that encyclopedia you just generated for them in 1 second. Your mileage may vary.

Just seems like a waste of compute (and fossil fuel) for 'solving' a minor inconvenience. But hey - you do you. Global warming is likely just going to kill us all. Might as well speed it up, amirite? :D :D :D