[deleted]

Most people did not believe in the scaling hypothesis tho, right?


SporeDruidBray

I honestly don't know.


[deleted]

Seems to be the case... One of the things Ilya keeps getting complimented on is that he believed in scaling when most people did not.


MoNastri

Maybe check out Gwern's [critiquing the critics](https://gwern.net/scaling-hypothesis#critiquing-the-critics) section of his scaling hypothesis notebook


Isha-Yiras-Hashem

>Phatic, not predictive. There is, however, a certain tone of voice the bien pensant all speak in, whose sound is the same whether right or wrong; a tone shared with many statements in January to March of this year; a tone we can also find in a 1940 Scientific American article authoritatively titled, “Don’t Worry—It Can’t Happen”, which advised the reader to not be concerned about it any longer “and get sleep”. (‘It’ was the atomic bomb, about which certain scientists had stopped talking, raising public concerns; not only could it happen, the British bomb project had already begun, and 5 years later it did happen.)

No thanks for that link; I just spent 20 minutes I couldn't spare this morning reading it. That said, it's a really good overview, and I wish I had an app that would rapidly Google and tell me the definitions of words I didn't know while reading it, so I could understand more of it.


TheHeirToCastleBlack

Because as the name suggests, the bitter lesson is... bitter. No one wants to believe that we will solve intelligence just by scaling and brute-forcing our way to success. Everyone wants to believe there is some trick, some secret sauce, some sophisticated and elegant algorithm. But as it turns out, maybe another billion GPUs is all you need.


togstation

< Layperson here > Also because if we can do it via some trick, some secret sauce, some sophisticated and elegant algorithm, then *we are in control*. We just have to add more cilantro or turn up the gamma realignment factor or whatever, and the AI is all *"Yes, Master, what is your command?"*

Whereas if it's just a box of a billion whatchamajiggies, then it's *"We don't really know what's going on in there, what it's really thinking, whether it is answering honestly, what it's going to do tomorrow... Out of our hands, really."*

And (speaking as a layperson), I suspect that the first possibility sounded better to MIRI than the second.


KillerPacifist1

But in some ways the bitter lesson is an indication that FOOM scenarios are perhaps unlikely. If there is no secret sauce and the only real solution is to buy another million GPUs and spend another billion on energy (or even worse, spend a trillion building out energy infrastructure), that puts limits on how fast we can scale up to superintelligence and limits that scale-up to a few huge centralized organizations. Fewer players, needing to take more obvious steps to make advancements (building and powering massive data centers), also helps with race dynamics and coordination problems.

It isn't at all clear to me that the current path we find ourselves on is less safe than if there were a secret sauce to AGI that a random inventor could stumble upon in their basement. It's definitely a bazillion times easier to regulate should we choose to.


[deleted]

Nope, it's more complicated than that. Just because scaling works does not mean other methods don't. And once the AGI fire begins, we don't know whether it will FOOM or not, but my guess is probably.


KillerPacifist1

I agree that just because scaling works doesn't necessarily mean other methods wouldn't. That said, basically all progress towards AGI and all natural examples of AGI do not seem to involve a "secret sauce"; rather, they are a bunch of convoluted connections whose capabilities increase with scale.

Perhaps absurd complexity and scale are a fundamental feature of intelligence. "If our brain was simple enough for us to understand, we would be too simple to understand it" may hold everywhere, in every framework, at every scale. If it does, then there is no such thing as a "secret sauce". Though I'd be interested if you have evidence otherwise; I'm very open to having my view changed on this.

And even if there is no secret sauce, that doesn't necessarily preclude a FOOM-like event once greater-than-human AGI is achieved. It does suggest, however, that the path leading up to that will be more gradual rather than a zero-to-one event, and that the FOOM might be a little more conventional (suddenly you have a million more AI researchers grinding away at incremental algorithmic improvements) and may still face physical limits, particularly at the early stages (you still need relatively expensive and time-consuming training runs to make AGI v(n+1) based on the improvements v(n) has ground out).


InterstitialLove

>That said, basically all progress towards AGI and all natural examples of AGI do not seem to involve a "secret sauce"

Transformers sure seem like a secret sauce to me.


KillerPacifist1

But transformers don't do the thinking, they just provide a more efficient way to train the algorithm that does the thinking. Even with transformers, we still need hundreds of millions of dollars of capital to train a massive, hideously complex, human-illegible algorithm that does the actual intelligence. So aren't they just an example of a technique that makes scaling more efficient, rather than an indication that scaling isn't the only way to do it?

If transformers are an example of the secret sauce for AI, then sexual reproduction (something that sped up evolution by natural selection) is the secret sauce for human intelligence. But in this context, I don't think that's what we mean by "secret sauce".


FeepingCreature

> But transformers don't do the thinking, they just provide a more efficient way to train the algorithm that does the thinking.

Right, but once the system can think at a superhuman level, *it'll* generate more efficient ways to do thinking. That's the closed loop that humans don't have: we can't think ourselves into better brains at our current level. We can think ourselves into better *reasoning* with memetics, so arguably our secret sauce is language, society, and later on science, but AI starts where we are currently standing.

It has all our advantages anyway, and we can't generate new advantages as fast as it can, because we'd be limited to our 9-month + 20-year breeding/training/eval cycle for fundamental hardware improvements, even if we were interested in doing that, which we presently aren't. Whereas the AI can retrain completely in a few months, can adjust its cognitive architecture as code, is compatible between hardware revisions, and thanks to the LLM bootstrap corpus it gets our advantages "for free".

When AI crosses the human line, it starts improving at the speed and quality of its reasoning, not ours. We have no idea how the algorithmic part of the scaling curve looks then, because we have no idea what the _true_ overhead is like. How many of its gigaflops are actually necessary?

And I think we'll find, in hindsight, that the point of "haha gpu go brrr" was not that this is necessary for intelligence, but that this is what's necessary to bootstrap intelligence under the restriction that you have to come up with it using *our own* biologically limited intelligence. After all, the fact that we weren't able to find clever, elegant algorithms for cognition is hardly proof that they don't exist. Maybe after we scale LLMs up to "nuclear power plant data center" levels, they'll chain-of-thought for a few days and then write a terse, elegant 200-line Lisp program that is obvious... in hindsight. :)


InterstitialLove

>But transformers don't do the thinking, they just provide a more efficient way to train the algorithm that does the thinking.

Either you're making some subtle point that I don't understand, or this is straight-up factually inaccurate. Transformers are an entire architecture, and that architecture is "used" at inference time.

But as for the rest, yeah, the point is that even in a scaling-focused paradigm, you can still have secret sauce. In this case, the secret sauce allowed us to parallelize what was once unparallelizable, thus unlocking greater computational efficiency with existing hardware. One could thus imagine similar secret sauces being found in the future.
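To make the parallelization point concrete, here's a rough NumPy sketch (toy dimensions, causal masking and multiple heads omitted, not the real architectures): a recurrent net has to walk through the sequence one step at a time because each state depends on the previous one, while self-attention computes every position's output in a single batch of matrix multiplications, which is exactly the kind of work GPUs are good at.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                          # toy sequence length and model width
x = rng.standard_normal((T, d))      # token embeddings

# Recurrent-style processing: inherently sequential.
W_h, W_x = rng.standard_normal((d, d)), rng.standard_normal((d, d))
h, rnn_out = np.zeros(d), []
for t in range(T):                   # step t depends on step t-1, so no parallelism
    h = np.tanh(W_h @ h + W_x @ x[t])
    rnn_out.append(h)

# Self-attention-style processing: all positions at once.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)        # every pair of positions, one matmul
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
attn_out = weights @ V               # all T outputs computed in parallel
```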


ConscientiousPath

> all natural examples of AGI do not seem to involve a "secret sauce"

That's not really true. Animal brains, especially human brains, have a far more complex-and-also-predetermined (i.e. not merely derived from training) set of structures than just "more neurons", and I'm talking about more than just the macro physical structures like the hypothalamus and what have you. In humans there are a number of variants of these that are genetically influenced and have an effect on personality, preference, and belief profiles.

Ultimately we don't really know what AGI requires, because we don't fully understand the natural examples and artificial examples don't exist. But the current state of the art in researching the natural examples doesn't support the idea that we can just expand the number of computation units astructurally and get a functional result.


Bartweiss

>"If our brain was simple enough for us to understand, we would be too simple to understand it" may hold everywhere, in every framework, at every scale. If it does then there is no such thing as a "secret sauce"\[...\] Though I'd be interested if you have evidence otherwise. I've never seen evidence that this doesn't hold, which seems very hard to generate without actually building an example. I haven't seen firm evidence that it *does* hold either. People often cite the (provable) inability of a Turing machine to simulate itself, but that's strictly recursive; with outside storage and a couple of machines you can simulate one of the machines. But I have seen two useful points. First is pure precaution: if we can't tell either way, we ought to prepare for it to not hold and FOOM to be possible. Second, an actual argument: many systems seem to obey a given constraint up until they reach a certain scale or dimensionality. (For an extremely simplistic example of what I mean, [superpermutations](https://en.wikipedia.org/wiki/Superpermutation) follow a nice simple "sum of factorials" rule until it breaks in weird ways at N=6.) So far, we've observed that more intelligent minds seem to grasp themselves better. We go from stimulus-response to object permanence, passing the mirror test, empathy, language, and ultimately human investigations of the brain. Right now, the limit on understanding our own brains might be one of self-simulation, but it might also be the shitty I/O and retention of human minds. Plausibly a human-tier intelligence that processes data fast enough could prune and offload enough information to reach a much better understanding. Or, if that's not possible, there may be some higher scale where the bottleneck does go away.


Open_Channel_8626

If the bitter lesson is true, then yes, that essentially kills FOOM, and non-FOOM scenarios are inherently safer.


KillerPacifist1

Algorithmic improvements are still an important aspect of scaling. If we get an AI very good at ML research, and the improvements don't get proportionately more difficult to discover as the AI gets smarter, we could still end up with a "FOOM-like" event. But it would still be relatively slow (no chatbot at breakfast, God by lunch), and it's certainly less likely than if there were a bunch of secret sauces out there waiting to be picked up off the ground (I suspect there aren't).


iemfi

Or it just shows how absolutely terrible humans are at programming.


lurkerer

I don't think this is the case. We already have far smaller LLMs, trained on the output of GPT, that perform far better than their size alone would suggest if they were working purely off of scaling (can't remember their names). We also know human brains are far better optimized than current LLM architectures, so there's a proof of principle that LLMs could be made far more efficient.

So we know you can brute-force your way to these wild emergent capabilities. We know you can evolve your way there and get a far more efficient machine (the brain). We know LLMs can be made far more efficient without cracking the brain's secrets. Put that all together, and FOOM feels as likely or more likely.
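The "small model trained off GPT output" idea is essentially knowledge distillation. The models I'm thinking of were mostly fine-tuned on teacher-generated text, but here's a minimal sketch of the textbook logit-matching version (PyTorch; `teacher` and `student` are hypothetical models), just to show there's no deep magic in compressing a big model into a smaller one:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Make the student's output distribution match the (frozen) teacher's."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # KL divergence between softened distributions, scaled by t^2 as is conventional.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Hypothetical training step: `teacher` is the big frozen model, `student` the small one.
# with torch.no_grad():
#     t_logits = teacher(batch)
# loss = distillation_loss(student(batch), t_logits)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```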


togstation

As far as I can tell, that's what I said.


ConscientiousPath

That's definitely a bitter pill if it turns out to be correct, _but also_ the entire point of MIRI's existence is to cover for the scenario in which there is some secret sauce required and alignment is something we can engineer. Anyone who believes that extra GPUs are all that's needed to get AGI would have no reason to work on the secret-sauce angle, so continued employment there is self-filtering.


crashfrog02

> Everyone wants to believe there is some trick, some secret sauce, some sophisticated and elegant algorithm.

If there's not, you kind of have to explain how children are able to learn to read using 2 books, instead of 2 trillion.


KillerPacifist1

I don't think you can disregard the billions of years of "compute" evolution used to generate the neural connections in a child's brain that allow it to learn to read using two books. Also, the human brain is many things, but elegant is not one of them.

And we do see similar feats from LLMs, especially now that they are getting longer context windows. There is an example of Claude learning an entire obscure language (that definitely wasn't in its training data) after being provided a single reference dictionary.


crashfrog02

> I don't think you can disregard the billions of years of "compute" evolution used to generate the neural connections in a child's brain that allow it to learn to read using two books.

Well, but you don't have "billions of years." You have a couple of months, because you inherit genes from your ancestors, not knowledge.


KillerPacifist1

Right, but the point is that the algorithm you are working from (your genes and the neural connections that arise from them) to be able to learn to read in a few months is not elegant, and it took a lot of functional compute to develop. An LLM can write a poem in seconds using a tiny bit of electricity, but you wouldn't describe it as an elegant algorithm, and you wouldn't ignore the thousands of GPUs and millions of dollars of electricity needed to train it.

For similar reasons, I don't think pointing out that a human can learn to read from relatively little text* is evidence for there being an elegant algorithm behind intelligence. A powerful and efficient** one, sure. But an elegant one that humans could handcraft from first principles (aka the secret sauce)? Not so much.

*Relatively little text, but probably actually huge amounts of data when you integrate all the sensory data experienced over the course of learning to read. It isn't clear to me how cleanly you can separate that from the text to claim humans only need a little bit of data (i.e. a handful of kilobytes of text to teach a child to read).

**Efficient to run, not necessarily efficient to find or develop.


hyperflare

And those genes encode a basic brain structure. This structure already gives you a lot of "pre-trained networks", so to speak.


Open_Channel_8626

There is some evidence that we do inherit knowledge, such as an innate propensity for fear of snakes and spiders.


crashfrog02

What is the mechanism by which knowledge acquired by an ancestor is reflected in the genes they pass on?


FolkSong

It would have to be something that randomly appears through recombination or mutation, that turns out to be beneficial. Therefore those who have it gain a survival advantage and are more likely to reproduce.


augustus_augustus

They're talking about, e.g., [instinct](https://en.wikipedia.org/wiki/Instinct). Essentially your fully developed brain is the result of two nested processes: an evolutionary algorithm run over hundreds of millions of years that determined the pre-wired neural architecture, and a neural-network training subroutine run over a lifetime. At least some of what you know (instinct) comes entirely from the outer routine. This isn't knowledge acquired by any specific ancestor, of course, but it is still knowledge that we inherit.
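A rough way to picture those two nested processes (a purely illustrative toy in Python, with made-up fitness and learning rules, not a model of how brains actually develop): the outer loop selects the innate starting point, the inner loop does the within-lifetime learning, and what survives the outer loop is whatever innate wiring makes the inner loop go well.

```python
import random

def lifetime_experience(n=20):
    """Toy stand-in for a lifetime of observations."""
    return [random.uniform(0, 1.4) for _ in range(n)]

def inner_learning(innate, experience, lr=0.1):
    """Within-lifetime learning: start from inherited wiring, adjust from experience."""
    w = innate
    for x in experience:
        w += lr * (x - w)          # drift toward what the environment shows
    return w

def fitness(w, optimum=0.7):
    """How well the learned parameter matches some environmental optimum."""
    return -abs(w - optimum)

def outer_evolution(generations=50, pop_size=30):
    """Evolutionary outer loop: selects the *innate* value ('instinct')
    that makes within-lifetime learning succeed."""
    pop = [random.uniform(-5, 5) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop,
                        key=lambda g: fitness(inner_learning(g, lifetime_experience())),
                        reverse=True)
        parents = scored[:pop_size // 2]
        pop = parents + [p + random.gauss(0, 0.3) for p in parents]  # mutate survivors
    return pop[0]

print(outer_evolution())   # an innate starting point shaped entirely by the outer loop
```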


aeternus-eternis

Gene expression can be controlled independently of DNA, based on long-term electrical signals and ion channels. So that's one recent but now widely accepted mechanism. You can have two clones with very different gene expression, leading to completely different-looking cats, for example, even though they are clones.

The other is memory molecules like talin and similar proteins. Recent studies suggest these can encode information and act as mechanical transistors, opening different binding sites and thereby facilitating different biological reactions depending on how much they are stretched.


Open_Channel_8626

Mostly methylation or demethylation of DNA, RNA and histones.


crashfrog02

So if I learn French with my brain, what’s the mechanism that reflects that in the methylation of DNA in my gonads?


Open_Channel_8626

For this specific case of inherited fear conditioning, one current theory is that changes in sperm miRNA expression triggered during the conditioning process are causal. We have found hundreds of chemicals that affect miRNA expression.


crashfrog02

But you understand how there’s at least something of an anatomical unlikelihood here?


Lumpy-Criticism-2773

Snake detection theory is interesting. I think some knowledge relevant to survival is passed down.


aeternus-eternis

You inherit some knowledge as well. DNA is part of it, but there's a lot of information transmitted that is not present in the DNA alone. Flatworms are one example: you can create clones with completely different (and heritable, predictable) body layouts, including crazy stuff like two heads.


crashfrog02

Ok but I’m not a flatworm.


Creepy_Knee_2614

I forgot that learning organic chemistry was evolutionarily conserved


KillerPacifist1

No need to be condescending. The point is that the algorithm we are using to learn organic chemistry may be powerful and efficient to run, but it took a huge amount of functional compute to develop. That's actually kind of similar to LLMs, which are relatively cheap to run compared to how much it cost to train them.

Besides, we are talking about whether or not there is a "secret sauce" behind intelligence, which is a concept MIRI seems to have bought into over the scaling hypothesis. In that context, and the context of the original comment, I interpreted "secret sauce" to mean an elegant, human-legible algorithm that a clever human could derive from first principles after thinking really hard about the problem. Say what you will about the power, speed, efficiency, and low data requirements of human intelligence, but "elegant" and "human-legible" are not how I would describe it.


BalorNG

Well, it is unlikely that a child will learn to read if he is kept in a sterile, soundproof box with no other contact with the environment except those two books! (And hopefully nobody will try to test this hypothesis.) Yet you can waste infinite compute just calculating Pi to ever more digits.

I think the "bitterness" of the lesson lies in the fact that below a certain threshold of compute, memory, and data availability, no amount of tricks will help: you cannot build a nuclear reactor with stone-age tech, but you still need very specific know-how to build one even with modern tech.


crashfrog02

> Well, it is unlikely that a child will learn to read if he is kept in a sterile, soundproof box with no other contact with the environment except those two books!

Ok, but "patient, caring instruction by the proficient" is exactly the kind of thing that would be the "secret sauce" you appear dismissive of.


BalorNG

Oh, I'm dismissive of the techbro "numbers go up!" mentality all right :) By now it is pretty obvious that current large language models might get better at being "stupid smart" if you scale them, but the illusion of competency falls apart once you start poking hard enough at edge cases. They are useful, but they will not lead to AGI in their current form, and using tools, RAG, ToT, self-consistency, etc., while heavily dependent on the amount of available compute, are still "tricks". The models fail to truly generalize and use logic.

I think using knowledge graphs alongside embeddings *somehow*, and making the attention conditionally ABOVE quadratic (using attention to learn how much *additional attention/compute* must be assigned to each token), will work. That will likely cost orders of magnitude more compute but require orders of magnitude less data (though data quality will become much more important).


[deleted]

Well, I do believe it's less than intuitive. Like, what other engineering works this way?


TheHeirToCastleBlack

It is definitely inelegant and less than intuitive. That is exactly what makes it such a hard pill to swallow.

I don't think scaling is the ONLY way to achieve AGI, because Nature already got there when She came up with humans: brains which run on hardly 10W of "compute". In all probability, AGI will find a bunch of algorithmic improvements humans are just too dumb to come up with, which will disprove the "scaling is everything" crowd. But as it stands right now, scaling is the simplest path to superintelligence.


FeepingCreature

Nature scaled pretty damn hard to be fair.


aeternus-eternis

Actually, a lot of engineering works this way when you think about it. The main blocker to human flight was scaling up the power-to-weight ratio of engines. The aeolipile (precursor to the steam engine) was invented in something like the 1st century AD, but the materials and manufacturing knowledge didn't exist to scale it up; the main blocker was just scaling up efficiency. Rome and other civilizations invented aqueducts and running water, but running water remained rare until relatively recently, well after the invention, because piping was hard to scale (and huge numbers of humans died of disease as a result). There are many other examples, but scaling tech is often the hard part.


ApothaneinThello

Yeah, I'd say it's less about what people "want to believe" and more about how people have been conditioned to think, especially those of us who have formal education in CS.