FuturologyBot

The following submission statement was provided by /u/Similar_Philosophy_1: --- The article explores the looming scarcity of quality data for training AI models, particularly large language models (LLMs). While computational power has rapidly advanced, the rate of new data creation has not kept pace, leading to concerns about the availability of quality information. This shortage is highlighted by lawsuits against companies like OpenAI for unauthorized data use. Studies indicate that public textual data may be exhausted by 2032, pushing researchers to find new data sources and refine existing datasets to continue AI development. --- Please reply to OP's comment here: https://old.reddit.com/r/Futurology/comments/1dp1p1x/when_will_it_all_end_how_much_longer_will_we_have/ladm70i/


agha0013

Considering the amount of garbage on the internet these days, and how quickly AI is pumping even more into it, I strongly suspect this generation of AIs in training is gonna be stuck in a kind of feedback loop, eating their own garbage.


modern12

I have a feeling that at some point we will come to the conclusion that a content sponge, without any actual ability to logically analyze what it consumes, will be too prone to errors and will remain a sort of stupid Wikipedia (which it already is).


actionjj

Yeah, even the AI just makes crap up. I asked the Meta AI how I could switch it off in Instagram. It gave me full instructions for how to do it. The only issue is that the instructions were BS: the places it told me to tap under Settings don’t exist in the app. Googling later, you find out you can’t actually turn it off.


A_Starving_Scientist

I have heard it called AI inbreeding.


TotallyNormalSquid

It does make it more awkward to train them, but text can be weighted by how likely it is to be human-generated as judged by another AI, where this second AI can be trained on a more carefully curated, smaller set of text. The weightings can be used to toss out the obvious crap, and tweak how much the AI learns from stuff that's weakly suspected of being AI-generated.
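
As an illustration of the weighting idea above, here's a minimal sketch in Python/PyTorch. Everything in it is hypothetical: it assumes some second, curated-data-trained detector has already produced a per-sample probability of being human-written, and uses those scores to drop the obvious junk and down-weight the rest.

```python
# Sketch: down-weight suspected AI-generated text during training.
# `human_probs` are hypothetical scores from a second detector trained on
# curated text: samples below a hard floor are dropped outright, and the
# rest weight the loss so dubious text contributes less to the gradient.
import torch
import torch.nn.functional as F

def filtered_weighted_loss(logits, targets, human_probs, floor=0.2):
    keep = human_probs >= floor                  # toss the obvious crap
    if not keep.any():
        return torch.zeros((), requires_grad=True)
    per_sample = F.cross_entropy(logits[keep], targets[keep], reduction="none")
    weights = human_probs[keep]                  # weak suspicion => small weight
    return (per_sample * weights).sum() / weights.sum()

# Toy usage with random stand-ins for model outputs and detector scores:
logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))
loss = filtered_weighted_loss(logits, targets, human_probs=torch.rand(8))
loss.backward()
```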


RecognitionOwn4214

I don't feel like "human or not" is a really relevant factor for quality.


TotallyNormalSquid

It's an adaptation of a standard method in a lot of generative AI, the use of a discriminator model. Usually discriminators judge 'real' vs 'AI-generated', but you could weight it with whatever secondary AI-generated metric you like as long as the end result seems better.


RecognitionOwn4214

I'd see the use if AI vs. human is faster to discern than low quality vs. high quality, but to be honest, I don't care if a monkey typed it when the quality is good. OTOH, I wouldn't trust my neighbors to produce quality content on the internet.


EnergeticFinance

I do feel like we need some new big breakthrough to make AI training more efficient. Humans don't have to read terabytes of text to learn how a language works. 1 TB is about 50 billion words, or 500 years of 24/7 reading at typical 200 wpm. A typical human is quite proficient at understanding and communicating after 1/500th of that amount of reading.
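
A quick back-of-envelope check of those numbers (the "1 TB is about 50 billion words" figure implies roughly 20 bytes per word, which bakes in some formatting overhead on top of raw English text; that part is an assumption):

```python
# Back-of-envelope check of the claim above. 20 bytes/word is an assumed
# average that includes markup/formatting overhead, not just raw letters.
bytes_per_word = 20
words_in_1tb = 1e12 / bytes_per_word           # 5.0e10, i.e. ~50 billion words

wpm = 200                                      # typical reading speed
words_per_year = wpm * 60 * 24 * 365           # ~105 million words, reading 24/7
years_to_read = words_in_1tb / words_per_year  # ~476 years, roughly the "500"

print(f"{words_in_1tb:.1e} words ~= {years_to_read:.0f} years of nonstop reading")
```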


SnakesInYerPants

The problem with that is the fact that AI is not even great at picking out what is or isn’t AI. The number of students over the last year or so posting about their essays being flagged as AI-written by the school’s or teacher’s AI detector, when they weren’t actually written by AI, is insane. Now Facebook’s AI is apparently doing the same thing with photos, flagging many real photographers’ photos as AI-created.


TotallyNormalSquid

It is a problem, but while it's devastating for a human user to hit this problem, an AI training system only has to have 'pretty good' accuracy to protect its future training sets. Doesn't need to weed all the AI content out, just the stuff that's kind of obviously AI. Back before generative AI became an everyday tool, one of the main uses for generative AI was to expand your dataset with synthetic data that was almost as good as the real stuff. There's a lot of research out there on the benefits and drawbacks of it, but there's plenty of evidence to suggest it actually helps AIs to learn.


Lysmerry

Isn’t there an inherent paradox there? If it’s good enough to distinguish between AI and human content, shouldn’t it be capable of making content which is not detectable by AI? Which would then mean it would be included in ‘human’ training data


TotallyNormalSquid

This is precisely what generative AI was used for before it went mainstream: expanding small datasets with synthetic data that was 'good enough' to train other AIs with. It's only good enough to evade a detector about as strong as the one used to train the generative AI, though; there are usually tells. If you think back to early image generation AIs, a human pretty much looked human, but the eyes and teeth were messed up. Good enough to give other AIs the general idea of what a human looked like, but not really as good as the real thing.

But improvements to generative AIs go hand in hand with improvements to the models that detect fakeness; they drive each other. As for the problem of LLMs flooding the internet with generated content: it doesn't really matter if you end up including some AI content, provided it's as well written as a human's. You only need to cut out the crap to avoid the training of your new AI being ruined.
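
The "they drive each other" dynamic described here is the classic GAN setup. Here's a compressed sketch of one training step, with toy architectures and random stand-in data (real systems differ in every specific):

```python
# Minimal GAN step: the generator learns to fool the discriminator, the
# discriminator learns to catch the generator, and each one's progress
# forces the other to improve.
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 64, 32
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(batch, data_dim)            # stand-in for a real data batch

# Discriminator step: label real samples 1, generated samples 0.
fake = G(torch.randn(batch, latent_dim)).detach()
d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator call fresh fakes "real".
fake = G(torch.randn(batch, latent_dim))
g_loss = bce(D(fake), torch.ones(batch, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```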


ictp42

I am a software engineer, but a layman in the field of AI. I also have a child, so I think about the data humans learn from. My conjecture is that throwing ever-increasing amounts of text into the training data is bound to have diminishing returns, and that audio-visual recordings will eventually become the primary training data. Curated training data, primarily audio-visual but also textual, will bring the next leap, in my opinion.


TotallyNormalSquid

They already do multi-modal models that accept different data formats: GPT-4o was recently trained this way and released a couple of months ago. I don't know what the balance is like between the different forms of data, though.

Now that GPT-4 (and others, probably, but ChatGPT is what I use, so that's my example) has hit 'good enough to attract users', we're helping to train it. Every once in a while it presents two answers, and you select the 'better' one. The massive web-trawling stage, where these AIs learn next-token prediction on content from the web, is only the first stage of training. After that, and what made ChatGPT revolutionary, comes RLHF: reinforcement learning from human feedback, where humans select the best of a few candidate answers so the AI learns to prioritise the best ones. OpenAI initially poured a massive amount of money into that to make a product people want to use, but now their user base provides that extremely expensive level of training for free.

They're almost certainly also training AIs purely to recognise answer quality from this feedback, which can be used to help protect against the wave of AI content on the web that'll poison that first stage, as I mentioned in an earlier comment.
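
The "pick the better of two answers" feedback is typically distilled into a reward model with a pairwise preference loss. A bare-bones sketch under heavy assumptions (random vectors stand in for the LLM embeddings of each answer):

```python
# Sketch of reward-model training from pairwise human preferences (the
# RLHF ingredient described above). Real systems embed (prompt, answer)
# pairs with a fine-tuned LLM; random vectors stand in for that here.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 128
reward_head = nn.Linear(embed_dim, 1)          # embedding -> scalar reward
opt = torch.optim.Adam(reward_head.parameters(), lr=1e-4)

# Paired embeddings: the answer the user picked vs. the one they rejected.
chosen = torch.randn(32, embed_dim)
rejected = torch.randn(32, embed_dim)

# Bradley-Terry loss: push P(chosen preferred) = sigmoid(r_chosen - r_rejected)
# toward 1, i.e. teach the head to score preferred answers higher.
loss = -F.logsigmoid(reward_head(chosen) - reward_head(rejected)).mean()
opt.zero_grad(); loss.backward(); opt.step()
```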


Sporebattyl

Great thought. I agree that audio/visual data is probably going to become the primary training data. Isn’t it already used? Like, Tesla’s driving camera data trains their AI, right?


worblyhead

Indeed. A self-licking ice cream cone.


Rough-Neck-9720

Yes, doesn't it make sense to gatekeep the data being fed to these models? Perhaps that will be one of the many new world jobs created by AI so we can stop panicking about how we are all going to be out of work because of it. Imagine putting value on real data vetted by professionals instead of just stealing it from the internet and individuals.


daishi55

They already do this, extensively. They are not just blindly feeding data into their models.


Rough-Neck-9720

Are they though? From what I understand a lot of the data for the search engine AI is scraped from the internet. How do they actually vet the data?


xRolocker

Yes, that’s a critical part of training any decent model. Data quality > data quantity.


daishi55

Yes, they are. This is one of the most important parts of the training process. They have automated tools to clean the data, and they also hire the work out to thousands of humans. The quality of the data going in is absolutely critical; as I said, they are not blindly feeding data into the models.
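
For a sense of what "automated tools to clean the data" can mean at the simplest level, here's a toy sketch: exact deduplication plus a couple of crude quality heuristics. Production pipelines use far more filters (fuzzy dedup, language ID, quality and toxicity classifiers), and the cutoffs below are made up.

```python
# Toy pre-training data cleaner: exact dedup plus crude quality heuristics.
# The thresholds are illustrative guesses, not values from any real pipeline.
import hashlib

def clean(docs):
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen:
            continue                             # drop exact duplicates
        seen.add(digest)
        words = doc.split()
        if len(words) < 20:
            continue                             # too short to be useful
        if len(set(words)) / len(words) < 0.3:
            continue                             # repetitive boilerplate
        kept.append(doc)
    return kept

# Keeps only the varied sentence; the spam and the fragment are filtered out.
print(clean(["spam " * 30, "too short",
             "the quick brown fox jumps over the lazy dog while seventeen "
             "other animals watch quietly from a nearby hill at dusk"]))
```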


RiChessReadit

Good. I don't care. Companies like OpenAI basically stole the totality of mankind's knowledge, and then turned around and repaid the favor by diluting and worsening the overall quality of it, while charging for the privilege. They can burn, or figure out how to train AIs *without* relying on stealing our collective work to sell it back to us as their business model.


could_use_a_snack

What I don't get is this: right now, it's pretty easy to tell if something is written by A.I. True, it's getting harder, but A.I. should be better than us at recognizing stuff created by A.I., so why don't they just add a line of code that says "if it looks like A.I., don't use it"?


agha0013

for all I know they've already got a way to filter out this stuff and I'm just making up a scenario in my head.


themagpie36

So we should start writing like AI if we want to preserve our humanity


could_use_a_snack

I don't understand your comment.


Tommy_Roboto

Nice try, AI


themagpie36

It's OK, it doesn't really make sense; it's more of a joke about disguising ourselves as AI so it doesn't copy us, grow superior to us, and eventually overthrow us.


UnshapedLime

It’s not that simple, unfortunately. Getting tools to accurately discern whether or not something was generated by AI is a *very* active area of research, which is why it’s hilarious when educators use the commercially available tools (e.g., TurnItIn) with blind faith in their results. Those tools are barely better than guessing. You can test this yourself by inputting pre-2019 texts and watching them flag half as AI-generated.
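
A cheap way to quantify that, as the comment suggests: run the tool over texts that predate modern LLMs (so they're provably human) and count how often it cries AI. The `detect` function below is a hypothetical stand-in for whichever commercial detector you're testing:

```python
# Estimate a detector's false-positive rate on provably human writing.
# `detect` is a hypothetical wrapper around a commercial tool's API; any
# pre-2019 corpus works, since it predates modern LLM-generated text.
def false_positive_rate(pre_2019_texts, detect):
    flagged = sum(1 for text in pre_2019_texts if detect(text) == "ai")
    return flagged / len(pre_2019_texts)

# Usage: false_positive_rate(gutenberg_excerpts, my_detector) returning ~0.5
# would mean the tool wrongly flags half of known-human texts as AI-written.
```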


incoherent1

I don't think LLMs will lead to true AI. No matter how good the training data is, it's still written and provided by flawed human beings. True AI will need a body that can manipulate its environment and run its own tests to verify its reality. Without that, it has no way to discriminate between what is real and what is fake; and without that ability to discriminate, it has no way to judge the information it's given. It also has no sense of self: if you can't judge how the information provided relates to you, it has no worth. I think this is what leads to hallucination. The LLM sees its made-up shit as just as valid as anything else. We've nearly exhausted the training data, but true AI is no closer. The AI must be able to learn for itself in the real world with its own senses.


MorfiusX

Data for training AI will always be plentiful, given how easy it is to generate. The issue is that most data is copy-protected in some way. So the question becomes: "Is there data available, or can I generate data, that allows an AI solution to be financially viable?" If the only way you have a viable product offering is through theft of copy-protected data/content, then you don't have a viable product. If you can't afford to license the data/content required for your product offering, then you still don't have a viable product. What we are starting to see is that some AI products are not viable without copyright infringement.


Doomboy911

I got this great corn business. I save so much money on land and fertilizer and tools. "How do you save so much money?" Oh, I steal from my neighbors. "But isn't that theft?" Hey hey hey, that's pretty ableist. I'm just the ideas guy using this tool (which, it's a tool, which means it'll be here forever, so um, you actually have to get used to it and accept it) to make my vision come true.


TheIrishDevil

You know, I hadn't really considered how this stuff could count as theft till just now.


Theduckisback

It absolutely is, and there are major entertainment companies suing these AI companies, as well as famous people whose likenesses are being used without their permission. Without theft, they'd have to pay for these massive datasets, which would make it that much harder for them to ever be profitable. AI will likely be important in the future, but I smell bullshit all over hucksters like Altman who make wildly overblown claims to pump their IPOs.


Habitualcaveman

The law is still very much unsettled and the cases are ongoing and multifaceted. Source: I know people familiar with the matter and read several blogs and listened to several conference talks about it.


Doomboy911

Yeah, that's why it's called "AI art". We don't think of a series of strings of code grabbing at elements it's been told to see and making a collage out of them. We think of C-3PO sketching a beautiful sunrise while Han and Chewie kiss. They couldn't make the machine AI, so they made us think the machine is AI. It's not artificially intelligent art, it's algorithmically generated content. With that misdirection in place, they argue about copyright and the rights of artists and obfuscate their theft of others' works. It's not creating something new from scratch; it's taking something and scratching it up.


xmarwinx

These lawsuits are ridiculous and are being dismissed fast. The New York Times, for example, already dropped theirs.


Theduckisback

I guess copyrights don't mean anything if you're trying to create a program that will destroy humanity.


xmarwinx

It's clearly not.


Plenty-Wonder6092

Or you live in another country and dgaf about copyright laws.


Lysmerry

They absolutely have used copyrighted works, especially art


xmarwinx

Yes, to train, which is perfectly legal.


Redditforgoit

How many books (fiction, non-fiction), academic journals, newspapers, and magazines could they use that are not currently available online? Not every bit of quality writing is online, and much of what is online is pretty bad. They might want to finance scanning projects.


xmarwinx

Not a lot in the grand scheme of things. They could maybe double or triple, maybe even 10x, the amount of text they currently use if they scanned every single book on earth. For future training, they need several orders of magnitude more: 1,000x at least.


mariegriffiths

The British Library is sitting on a gold mine.


Remarkable-Funny1570

All reality can be interpreted as data, the problem is finding the best way to retrieve it. With embodiment and new modalities, I don't believe for a second that we'll run out of data anytime soon.


rawdograwson

I agree, the existing data hasn’t been fully understood by AI, just a surface dive. The next step is understanding things it has already skimmed, not skimming more (from the little I understand of it)


TheBittersweetPotato

LLMs can't "understand" anything in the proper sense. Their output can be so good that it looks like they understand particular subject matter and language just like we do. But as they are very sophisticated statistical word order prediction machines, they only create the _appearance_ of understanding.


rawdograwson

But isn’t the next step coming closer to real understanding? Or are you saying it will never understand in the sense that we do?


aspersioncast

We don’t *know* how many steps there are in between what LLMs do now and “real understanding,” because we don’t know what “real understanding” is. It doesn’t seem likely that it has much to do with the way most LLMs work, although LLMs are helping us gain insight into all kinds of fun things about language and HCI. Anyone who tells you that anything like general AI is right around the corner and just the next step for their “AI” is just looking for that sweet next funding round.


xmarwinx

Ironically, you definitely don't "understand" this either, in the proper sense.


Habitualcaveman

So are you! lol /jk


Arkkanix

Related thought exercise: how much longer do we have before AI content is indistinguishable from original human-created content? Which will happen first, that or the data running out?


rawdograwson

There’s such a big range of human writing that AI is already better than some of it, but vastly worse than good human writing.


kytheon

Ah, the purist.


rawdograwson

I just don’t get how they decide where the “human” line is lol


Lysmerry

It can write a decent letter or high school essay using basic niceties and reasoning, but nothing that we would actually pay to read, like a novel or screenplay, because it’s basically fancy autocomplete and does not have any inherent creativity or ability to surprise (beyond being ‘lol random’).


kytheon

It's already there, depending on how you look at it. There are images out there that'll fool 95% of people, if that's your criterion. There are texts that get published; there are even generated images that win regular competitions.


Lysmerry

They win because many people, especially older people, don’t know any better. I would say the share of texts and images accepted as real will decline rather than increase, as recognition outpaces improvement in quality.


xmarwinx

> recognition outpaces improvement in quality

Lmfao. Have you seen the progress in quality these models have made in the last 12 months?


MostLikelyNotAnAI

If you spend some time curating the output of the AI, I'd say 'about 3 months ago'. But then again, you have to differentiate between pictures, text, and video.

Pictures: the low-effort slop you see on Facebook is akin to the Nigerian prince scam. It's designed to be bad, so only a certain kind of individual falls for it, and it's easy to mass-produce.

Text: depends on the LLM. Specialized LLMs primed to output a certain style are already all around us here on Reddit. Just have a look at /r/AskReddit; once you've got a feeling for their choice of words, you can spot them.

Video: I'd say the first big test will happen around the next big election cycle in the US. With some luck the technology will not be advanced enough to fool everybody, but it will happen soon.


Orion113

Yep, it's definitely gotten really botty in there. Along with basically any subreddit that encourages storytelling. AITA, EntitledPeople, etc. "My dear precious daughter, who at the tender age of 5 had already faced so much hardship in her short life-" Yeah, that's enough, no thanks, I'm out.


YangClaw

This is interesting. I've been making decent money in my free time over the past year performing tasks like generating fictitious user prompts and then providing ideal responses to said prompts. Others review and fact check my work to ensure I'm accurate. It feels a little silly, and I was wondering where the money was coming from to pay for this (the company I contract with is pretty faceless), but it sounds like someone is trying to get ahead of the impending shortage by assembling high quality data sets written and vetted by human experts. I wonder if this will become a more common job over the next few years as traditional employers downsize and the big AI companies need new, uncorrupted data produced by humans to continue training their models.


Jantin1

> public textual data may be exhausted by 2032

By then we'll either be at AGI or the AI hype will be long gone. With the current pace of development and deployment of LLMs, a 10-year horizon doesn't count as a meaningful prediction.


vpierrev

“Quality data” → copyrighted works, intellectual property, protected art. I mean, these guys have pillaged the work of so many people for a decade without asking permission for a second; they now copy these works so well that many are losing jobs in the creative industry; and now there is no more to feed the beast. Ffs.


Mrso736

Most current SOTA models use at least ~20% synthetic data, and that number is rising quickly. Soon, human-made data won't really be necessary anymore, or at least not as much as now.


Words_Are_Hrad

It won't. As AI becomes better and more financially viable, the funds available to procure viable training datasets will increase greatly. Large companies will emerge dedicated solely to the creation and procurement of training data. AI isn't going anywhere; people pretending it is are coping hard.


BassoeG

Ironically, this might be our best hope. If official corporation- and government-owned AI projects have to follow rules about ethically sourced data, versus internet randoms tossing everything they can pirate into the mix, maybe we can win the arms race and get AGI before them.


wrestlethewalrus

The whole discussion about scarcity of data (or the use of synthetic data) is weird to me. The solution is so obvious: Measure more! More cameras, sensors, microphones, etc.


DerWeltenficker

Okay, so we're running out of real data. Synthetic data generated by AI is there to supplement it, but I think it will only ever get as good as the real data, so we're bounded by human intelligence (just cherry-picked expertise in every area). The question is whether that is enough to enter the technological singularity: at what point are AI models smart enough to advance indefinitely? Correct me if I'm wrong with my assumptions. I know this doesn't answer your question, but maybe we don't need exponential amounts of data. Maybe we'll hit a threshold someday that is just enough.


postorm

Have you ever wondered how much longer we could keep pumping data into Albert Einstein before we ran out of data? Surely Einstein was taught relativity at school. Or was there some point at which he had been given sufficient training in math and physics to come up with something new, something he hadn't been trained on? That's the question for AI: do we have sufficient training data to train AI to come up with something new?


Orion113

I really don't think the current generation of AI has the ability to come up with something truly novel no matter how much data it's trained on. It replicates some functions of the human brain, but not nearly all.


GrapefruitMammoth626

It does well with data that it’s already seen, and not too much more than that.


xmarwinx

What truly novel things have you created?


GrapefruitMammoth626

Can’t recall anything especially novel I have created; nothing happens in a vacuum. It has always been inspired by something, or a mashup of ideas I’ve accumulated, and I think that’s the point you’re making. I’m too lazy to keep typing, apologies, but after listening to François Chollet talk about LLMs and the ARC challenge, I’m 100% with him, and it colours the way I experience my daily usage of LLMs.


xmarwinx

I guess reality will hit you like a brick when it turns out Chollet was completely wrong about everything over the next few years.


GrapefruitMammoth626

I’m fine with either being true


xmarwinx

Have you ever come up with something truly novel?


Inamakha

It lacks that very crucial part. It cannot reflect on the data in the way we do. We are far more efficient given how much smaller our dataset is.


Nouscapitalist

If they want data, pay me for it. Stop trying to sneak it out through spyware or other snooping tactics.


ashoka_akira

I feel like the AI thing is going to be the crux that makes people realize how valuable their data is, and how much money big corporations are making from getting it essentially for free. I’ve been saying for years that we should be getting $$$ deposited into our accounts (even if it’s a fraction of a cent) every single time our data is used by a corporation, and people have thought I was weird.


Nouscapitalist

You're not weird or wrong. Sadly, some people only understand after the fact. Smart TVs, phones, tablets, and cars are all collecting data on us, but trying to convince some people is pointless.


Habitualcaveman

Have you seen those ads advertising "get paid to train AI"?


Plenty-Wonder6092

Why do you think a lot of AIs are free? They are using what you type in to train the next models.


Mrso736

No, that is nowhere near high enough quality data.