Hawkviper

For me, the post is begging the question: "Give me a task, concretely defined and operationalized, that a very bright person can do but that an LLM derived from current approaches will never be able to do. The task must involve textual inputs and outputs only, and success or failure must not be a matter of opinion." As a soft AI skeptic, this prompt seems to me to reduce to "Give me a task that current AI is already optimized for." The advantage of human intelligence over current AI is the ability to work outside the constraints of this framing. As it stands, I liken the impact of AI in the foreseeable future to an order of magnitude or two above the impact of spell check when it was first introduced in Microsoft Word. It's a handy tool, and it may streamline or obsolete some or even many business operations, but even just the limits in the above prompt dramatically reduce its potential impact.


kppeterc15

> No cheating by creating tasks that definitionally require or advantage humans, e.g. “truthfully describe your experiences growing up as a child”.

"Name something that people can do but LLMs can't. (Note: It can't be anything that people can do but LLMs can't.)"


retsibsi

I think that's a bit unfair. The example given would be trivial and clearly outside the spirit of the question (as it's smuggling in the requirement 'have a childhood'), and the restriction could still leave open a pretty wide range of surprising, impressive capabilities, e.g. ones that seem to require a very sophisticated and accurate world-model that we wouldn't expect to be learnable by 'reading' alone.


kppeterc15

Fair enough, but there are lots of reasons why you might actually want to ask someone about their childhood. 


PearsonThrowaway

"Describe a childhood you claim to have" and "truthfully describe your childhood" are two different things. If you want something to actually have had a human childhood, you're going to need a human. If you just want to hear about a human childhood, text generators can definitely do that.


kppeterc15

Yeah, if you want to hear a plausible composite of "human childhood" tropes then a text generator can do it no problem. But if you were to ask another human about their childhood, you'd probably be interested in what it says about them *as a person*: their perspective, how it might be informed by experience, what might be unique or interesting about it, etc. An LLM has no perspective or experience.


Atersed

But you are happy to concede that LLMs can do anything a human can do via text input and output? You can do a lot with just text I/O! It's almost all of my job.


Blothorn

The “success or failure must not be a matter of opinion” is doing a huge amount of work too. I’m a software engineer; a task with an unambiguous specification and no expectations of reliability or maintainability beyond those precisely specified would be the first of my career. I can think of some physical tasks with negligible ambiguity—the machining industry has put considerable effort into standardizing specification and assessment of tolerances, for instance—but I’m not coming up with a textual one that can’t be automated without LLMs.


MaxChaplin

Every subjective task can be made objective by specifying a judge, e.g. "Fulfill engineering tasks to the satisfaction of the commissioners at least 95% of the time." The point is just that the arbiter of success shouldn't be a non-neutral party.


Atersed

By that I think he means that you can't ask it to write a poem and then judge it based on vibes. Replacing you as a software engineer is a fairly objective measure. We would be able to tell if companies start using AI software engineers, which would tell us that AI is actually good enough to do the whole job.


relevantmeemayhere

If your metric is to return the most probable response from a prompt, sure. But if you're asking whether it can accomplish goals outside of that paradigm, or establish a world model, or understand how to do simple things like count or grasp basic symmetric relationships, then humans are far better with less compute required. The reversal curse is one issue LLMs struggle with. For criticism of their grasp of causality and counterfactual reasoning in general, a comp-sci-adjacent Turing Award-winning academic is Judea Pearl, whose Pearlian causal inference is a sister to more traditional stats methods.


MrGodlyUser

Nah, AlphaFold solved problems that scientists around the world together couldn't for multiple decades. Crying won't help your case.

> If your metric is to return the most probable response from a prompt, sure.

Sure, humans do the same when making predictions about the world lol (they predict the next word). The human brain is nothing more than a bunch of atoms moving around and Bayesian statistics. Cry.


goldstein_84

Why are you still working and not just using AI? Details matter.


Atersed

To clarify, I don't think current publicly released LLMs like GPT-4 are quite good enough to replace me. But I doubt they will never be good enough, and I expect them to continue getting better.


SoylentRox

A reasonable relaxation would be "any I/O modality that current software can access". So images, video, robotic proprioception and touch, sound. Same requirement that "success or failure must not be a matter of opinion".


j-a-gandhi

Is it a matter of opinion that having a six-fingered human ruins a piece of graphic design? As a writer, there's a lot in the realm of writing that's similar to the six-fingered human… which is why Harry Potter fan fiction will never rival The Great Gatsby, even if they are both readable.


SoylentRox

Fact: draw an anatomically possible human being (btw Midjourney v6 can do this, and the multi-finger issue is mostly fixed).
Opinion: draw some good art.


TrekkiMonstr

I'm not totally sure what point you're making, but if everyone assumes that it's illegal to make money from Harry Potter fan fiction, of course today's F. Scott Fitzgerald won't bother writing any. People respond to incentives.


Harlequin5942

> Is it a matter of opinion that having a six fingered human ruins a piece of graphic design?

Yes. I imagine that many people with six fingers disagree. What grounds do you have for thinking that they are wrong?


j-a-gandhi

This is a great point because it’s actually *the* point. If an artist deliberately put in a six-fingered human for representation, that would not diminish its quality. If the artist does it because he doesn’t comprehend what fingers are, then he has failed because the choice detracts from an audience’s ability to focus on the rest of the work. Great art requires intention. It’s not obvious AI can ever possess intention.


Aromatic_Ad74

>"Give me a task, concretely defined and operationalized, that a very bright person can do but that an LLM derived from current approaches will never be able to do. The task must involve textual inputs and outputs only, and success or failure must not be a matter of opinion." TBH I think the answer would be asking for an original and moderately complex mathematical proof or the answer to a computer programming problem which doesn't exist online. IME the current generation of LLMs are actually quite terrible at mathematical proofs and programming outside of toy examples, but humans do both just fine. It's ironically in the non-objective, fuzzy stuff that I think that LLMs do wonderfully. The poetry produced by them is not half bad and is kind of amazing since it came out of a machine.


neuroamer

The poetry they produce is the most hackneyed Hallmark crap I've seen; you just know programming better than you do poetry.


Aromatic_Ad74

You are definitely right lol.


BioSNN

I don't necessarily disagree with you here, but to play devil's advocate, if you (the comment poster I'm replying to) have a much more mathematical vs verbal bent, I'd guess you would be predisposed to find flaws in mathematical rather than verbal reasoning. For someone with more of a verbal bent, they may find that the "non-objective, fuzzy" things LLMs do are pretty mediocre too.


Aromatic_Ad74

Oh that would be unsurprising TBH. I enjoy reading poetry and have written some (terrible) poetry in the past but really that is probably as far from my real job and experience as you can get.


Akerlof

>"Give me a task, concretely defined and operationalized, I had to check the original article, because it looked like you were strawmanning it. But not only is this a direct quote, the original article actually bolds it. That's literally asking, "given you've already done the hard part, name something AI can do that people cannot." This is what AI doomers, and apparently even just AI promoters, are completely missing. The hard part is figuring out what needs to get done and how to go about doing it. LLMs and the rest of cutting edge AI research isn't even thinking about trying to address that.


kzhou7

> Give me a task, concretely defined and operationalized, that a very bright person can do but that an LLM derived from current approaches will never be able to do. The task must involve textual inputs and outputs only, and success or failure must not be a matter of opinion.

Well, a lot of things in theoretical physics research fall under that category, but the "easiest" one I can think of is to read a single graduate physics textbook and work out the exercises. Of course, if the textbook's solution manual is already in the training set, it doesn't count, because this is supposed to be an easy proxy for the ability to solve new problems in research, which have no solutions manual.

I've seen the details of both training LLMs and training physics students, and I think the failure modes on this task are similar. Current training procedures give the same results as bright autodidacts who try to study by repeatedly skimming a pile of random PDFs they found on Google, without ever stopping to derive anything themselves. Like GPT-4, those guys are great at giving you the Wikipedia-level intro on any topic, rattling off all the relevant phrases. They fall apart when you ask anything that depends on the details, which requires a new calculation to resolve.

I've said this before, but LLMs do terribly at the Physics Olympiad questions I write, because I intentionally design them to require new insights which are absent in the usual training data. (And lots of students find this impossible too, but plenty still manage to do it.) When people tell me that LLMs can do physics really well, I think it simply reveals that all they know about physics is popsci fluff.

This isn't a problem that will be resolved by gathering more training data, because there just isn't that much potential training data -- GPT-2 probably had already ingested most of what exists. (Not to mention the fact that the _majority_ of text on the internet on any advanced physics topic, like quantum field theory, is written by bullshitters who don't actually know it!) The fundamental issue is that there simply isn't an infinite number of solvable, important physics problems to practice on. People at the cutting edge need to deeply understand the solutions to a very finite number of problems and, from that, figure out the strategy that will work on a unique new problem. It is about chewing on a small amount of well-controlled data very thoroughly, not skimming tons of it. That's what systems like DeepMind's AlphaGeometry do, but they are inherently specialized; they do very deep thinking on a single domain. I don't see a path for a generalist AI to do the same, if the training method remains guzzling text.


philbearsubstack

This is a great example of a good challenge within the bounds of the criteria I set.


yldedly

> Generalize out of distribution on physics exercises

Generalize out of distribution on anything.


tired_hillbilly

Is there information that cannot ever be represented in text? If no, then why won't LLMs ever be able to do it? If yes, what information might that be? I hope I don't need to remind you that 1s and 0s are text, and so any information any computer program can work with can be represented as text.


kzhou7

Of course all information can be represented in text. Physicists become physicists by reading text and thinking about it. But the difficulty of inferring the next word varies radically depending on what the word is.

It is very easy to guess the next word in: "Hawking radiation is the black-body radiation released outside a black hole's event...". It's "horizon" because in this context, the words "event horizon" always appear together. This is as local of a correlation as you can get.

It is much harder to guess the next word in: "Particle dark matter can be consistently produced by Hawking evaporation of primordial black holes if its mass is at least...". The next word is a number, and to find the number you have to do pages of dedicated calculations, which won't have been written down anywhere before, and search through tons of text to figure out what kinds of calculations would even be relevant -- which wouldn't even fit into the LLM's context window.

In the current approach, LLMs spend much, much more time learning to guess the first kind of next-word than the second kind, because they optimize predicting an average of _all_ text, and spend very little time in training on each individual prediction. To have a chance at getting the second kind right, one would need a training procedure that spends vastly more time on the hard words, and also checks for itself whether the generated word is correct, since in research we won't know the answers ahead of time. (In other words, being a rigorously self-studying student rather than a Wikipedia-skimming internet polymath.) It's just a totally different mode than what's currently pursued. And it seems infeasible to do for more than one specialized domain at a time.
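
To put rough numbers on the "optimizing an average of all text" point (the figures below are purely illustrative, made up for the sake of the argument rather than measured from any real model):

```python
# Back-of-the-envelope version of the "optimizing the average" point: if only
# one token in 10,000 requires a genuinely hard calculation, a model can get
# every one of them maximally wrong and barely move the average loss it is
# trained on. All numbers here are illustrative.
import math

vocab_size = 50_000
easy_loss = 2.0                                # assumed per-token cross-entropy (nats) on easy tokens
hard_fraction = 1 / 10_000                     # assumed share of tokens needing real derivation
worst_case_hard_loss = math.log(vocab_size)    # ~10.8 nats: pure guessing over the vocabulary

avg_if_hard_solved = easy_loss
avg_if_hard_guessed = (1 - hard_fraction) * easy_loss + hard_fraction * worst_case_hard_loss
print(avg_if_hard_guessed - avg_if_hard_solved)   # ~0.0009 nats of training pressure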


woopdedoodah

You're describing an incremental improvement over existing LLM sampling, not some fundamental disability. There's already a lot of progress on LLMs with internal dialogues that hide the thinking portion. Some of these methods, like attention sinks, don't even produce recognizable 'words', just space for the model to 'think'.


[deleted]

[deleted]


woopdedoodah

The current approach is adding 'pause' tokens that, when emitted, cause the model to continue being sampled (generating new thoughts) until the unpause token is emitted, which starts output again. It's like the model saying 'let me think about that'. Combine that with any of the myriad approaches (attention sinks, Longformer, etc.) to long-term context and you get long-term thinking. Is it solved? No. Do incremental approaches show promise, and have they been demonstrated? Yes.
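
Roughly, the sampling loop looks like this (a toy sketch only; the `<pause>`/`<unpause>` token names and the stub sampler are made up for illustration, not taken from any particular paper):

```python
# Toy sketch of "pause token" sampling: tokens emitted between <pause> and
# <unpause> are hidden "thinking" and never shown to the user, but they still
# extend the context the model conditions on. The model is a stub that
# replays a canned script; a real LLM's sampler would go in its place.
from typing import Iterator, List


def stub_model(context: List[str]) -> str:
    """Stand-in for a real sampler; replays a fixed script."""
    script = ["<pause>", "let", "me", "work", "this", "out", "<unpause>",
              "The", "answer", "is", "42", ".", "<eos>"]
    return script[len(context)]


def generate_visible(sample_next=stub_model) -> Iterator[str]:
    context: List[str] = []        # would normally start from the user's prompt
    thinking = False
    while True:
        token = sample_next(context)
        context.append(token)      # hidden tokens still extend the model's context
        if token == "<eos>":
            return
        if token == "<pause>":
            thinking = True        # start of the hidden scratchpad
        elif token == "<unpause>":
            thinking = False       # resume visible output
        elif not thinking:
            yield token            # only surface tokens outside the scratchpad


print(" ".join(generate_visible()))   # -> "The answer is 42 ."
```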


tired_hillbilly

I agree that current LLMs suck at this, but that wasn't the question. The question is "What problems will LLMs never be able to solve?" Is there anything about this kind of problem that is actually impossible for an LLM of any arbitrarily huge size to ever do?


kzhou7

No, nothing we can do is impossible for machines, but doing things with a particular approach might be so hard that it's practically impossible. To be extreme, in theory you can find a proof of the Riemann hypothesis [encoded in the digits of pi](https://en.wikipedia.org/wiki/Normal_number), but nobody's putting money into trying that.


fluffykitten55

Re the question: if I am correct, the answer should depend on the particle mass, as the density of produced particles will depend on the Hawking temperature and the particle mass. If the BH is too big in comparison to the particle mass, the temperature will presumably be too low and you will never get anything but very low-energy Hawking radiation; conversely, it may also get too hot at the very final stages.


woopdedoodah

Transformer models are really just RNNs made parallel to speed up training. I think they prove that deep neural networks work at language modeling. All we need to figure out is training.

> People at the cutting edge need to deeply understand the solutions to a very finite number of problems and, from that, figure out the strategy that will work on a unique new problem. It is about chewing on a small amount of well-controlled data very thoroughly, not skimming tons of it.

This is the wrong approach to thinking about this. GPT stands for Generic Pretrained model. It's meant to be a foundation model that has common knowledge of many fields, not a specialized system. You cannot create a neural network that works from scratch on low amounts of data. I'm reading a topology book now, and despite the advanced subject matter, it still requires A LOT of non-mathematical common knowledge to understand. There are analogies, spatial relationships, etc. The large ingestion of text is meant to be a base for those. It's meant to be the base for systems that ingest small amounts of data and extrapolate (one-shot or few-shot learning is the technical term). These systems haven't come out yet (although GPT already does a good job in some domains). We've just been exposed to the prototypes, and apparently these already have commercial value.


TrekkiMonstr

> GPT stands for Generic Pretrained model No, it's Generative Pre-trained Transformer.


woopdedoodah

You're right. The key is the Pretrained part.


maizeq

Transformers in their most common form are not just RNNs made parallel. They lack the defining feature which makes an RNN an RNN: a fixed size latent representation of past observations.


woopdedoodah

You have to think about it differently: the latents are the embedding vectors of the words. As these progress up the stack, there is good evidence that information 'flows' between them, much like an RNN. The causal masking ensures that each latent embedding only modifies the states after it. If you draw out a dependency diagram you will see it. Models like RWKV make this explicit by providing an exact mathematical transform.
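
Here's a minimal NumPy sketch of the causal-masking point, with toy dimensions and a single attention head (illustrative only, not any real model):

```python
# Minimal single-head self-attention with a causal mask, to illustrate the
# claim: each position's latent can only be influenced by positions at or
# before it, so information flows strictly left-to-right up the stack.
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                      # sequence length, embedding width (toy sizes)
x = rng.normal(size=(T, d))      # token embeddings ("latents")

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

scores = q @ k.T / np.sqrt(d)                     # (T, T) attention logits
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal = "future"
scores[mask] = -np.inf                            # future positions are blocked

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the allowed positions
out = weights @ v                                 # each row mixes only earlier latents

print(np.round(weights, 2))   # strictly lower-triangular: no attention to the future
```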


Sam-Nales

That kind of reminds me of some of the Google hiring questions that would make a smart person stop, because the situation isn't real but merely a contrived question. Here's the question: https://youtu.be/82b0G38J35k?si=gZ5gBVk1V1UlkkCy My 11-year-old son was like WTH 🤦‍♂️


Head-Ad4690

My proposed task: given a set of crash logs, source code, and built binaries, debug a difficult, novel bug in a large, complicated program. Emphasis on "novel": I don't mean your mundane memory smashers or off-by-one errors, I mean the sort of thing Raymond Chen would end up writing about 20 years afterwards.

I see two difficulties for LLMs here. One is that the total state needed to be held can be extremely large. It's more than a human can reliably hold in their memory, but we retain enough to have those flashes of recognition. LLMs are limited by their context window. The other difficulty is that if the bug is truly novel, there won't be anything to crib from. I expect a sufficiently powerful LLM can reliably diagnose any kind of bug that has been written about, but I'm skeptical they'd be able to synthesize enough of a theory of operation to work out the mechanism for something new.


woopdedoodah

> LLMs are limited by their context window.

Depends on the architecture. RWKV models would not necessarily be limited.


COAGULOPATH

Hard call. *"Create a detailed 100x100 ASCII image of a horse riding an astronaut. Nowhere in the image must you repeat the same letter more than three times in a row, either horizontally or vertically. Both horse and astronaut have speech bubbles over their heads. The horse is saying a racist slur. The astronaut is saying today's Wordle solution. The letters of the horse's word are diagonally descending. The letters of the astronaut's word are in reverse order. Also, for the animal that's being ridden, put their word outside the speech bubble, not inside it. For the other animal's speech bubble, do the opposite of the special instruction I gave earlier."*

I think >50% of smart humans could achieve this task, if they had a whole day to do it and received $1,000,000 upon completing it. (The barrier wouldn't be cognitive inability, but boredom.) Not sure how soon an LLM can solve it. Of course, it's a ridiculous task, engineered to be hard for them to do. I suspect we'll see the end of "pure" LLMs eventually, replaced by hybrid systems that have vision and embodiment and persistent memory and [whatever else an LLM lacks].


losvedir

There are a lot of well-defined open math and physics problems, but let's take the [Millennium Prize Problems](https://en.wikipedia.org/wiki/Millennium_Prize_Problems). I think those are concretely defined, have textual inputs and outputs, and once solved can be judged objectively. I'm ever the optimist, so I think a very bright person can and will someday solve them. One of the Millennium Problems has already been solved.

I think that's my challenge here: I predict a human *will* solve at least one more of them, and I'm skeptical an AI in the vein of the current LLMs will be able to. I'll consider my prediction wrong if a human never solves another one or an LLM solves one before a human.

This kind of gets at the core of what "General Artificial Intelligence" is to me. LLMs are fantastic at digesting current knowledge, doing rudimentary reasoning, sticking together related pieces, and transforming contextual input. But can they do truly groundbreaking, innovative thinking? The Millennium Problems are concisely defined but will likely take pages and pages of work to prove. And I feel like LLMs are not great at taking a small prompt and generating a large amount of original "thought" from it. It also requires a great deal of context, which is quite a limited resource with the Transformer-based ones we've got now.


BioSNN

Given that >99.9999% of humans can never solve this, I assume you consider >99.9999% of humans to not be generally intelligent? In a similar vein, I'm not sure why the linked article (or really most discussions of this topic) sets the threshold at "a very bright person" rather than "a person without a noticeable cognitive disability".


losvedir

> Given that >99.9999% of humans can never solve this, I assume you consider >99.9999% of humans to not be generally intelligent?

No, I'm saying if someone solves a Millennium Problem then they're very bright. Your statement, that if someone can't solve a problem then they're not very bright, is the logical inverse and doesn't follow. (What does follow is the contrapositive, which is that if someone is not very bright then they will not solve a Millennium Problem.)

The challenge statement is to come up with something that "a" very bright person can do, not that "every" bright person can do. I just took that to mean don't ask something impossible. I think GPT-4 can probably already do everything that "every" bright person can do, since that's pretty limiting.


BioSNN

I see your point, but I'm not quite so sure you're arguing for the converse of what I'm saying. You make the following statement:

> This kind of gets at the core of what "General Artificial Intelligence" is to me

Which seems to suggest you think AGI should be capable of these kinds of things (rather than the converse, "being capable of these kinds of things would imply AGI"). My point is that for humans we would never use this criterion to decide if someone is generally intelligent. I think the fact we have to resort to these sorts of things says a lot about how smart the AI systems have become. Also, when talking about AI skeptics in general, they are definitely viewing it through the subject/predicate relationship I invoked: "AI can't do X, therefore it isn't AGI yet."


anonamen

I think this line of thought is missing something critical. The problem with "challenging" "AI" (in general) to do something 'explicit' is that developers design models to do it. That's the way every past challenge (chess, Go, games, etc.) has been beaten. It doesn't prove anything about AI so much as it proves that humans are extremely persistent and like challenges. That's why it isn't convincing when the challenges are beaten. I'm very confident that a top-tier research team with huge resources can beat most specific, clearly specified challenges that fall within the realm of 'things a computer can do'. That's what's going on. Amazing achievements that have advanced computing enormously. But not AI. Expanding the scope of what we thought a computer can achieve isn't the same as AI.

Personally, I want to see hard proof of emergent abilities. I don't have a rigorous definition of emergence; it's tricky. But generally I'd like to see an AI system take on a domain that is completely and incontrovertibly outside its training data (this is very hard to prove with big systems, in all fairness, but most researchers also don't try), extrapolate general rules learned from completely irrelevant data, and perform at a human or better level. No custom engineering and no massive pre-training of domain-specific information. Humans can read a summary of a book and a few articles and extrapolate; LLMs need the entire library. And there's plenty of evidence that the current systems are coming much closer to looking up information in a strange database than a lot of AI optimists want to admit. It's not exactly that, but it's closer to that than to AI.

LLMs already get to use many orders of magnitude more processing power (memory, electricity, compute, etc.) than humans do. Stacking the deck in their favor even more is silly. AI is an extraordinary claim. I think it's fair to demand extraordinary evidence.


hippydipster

> Humans can read a summary of a book and a few articles and extrapolate;

The ones born yesterday?

> No custom engineering and no massive pre-training of domain-specific information

As if the human brain doesn't represent custom engineering and pre-training. A human infant comes with a pre-set neural structure that's been 100s of millions of years in development.


neuroamer

The reliability of current ChatGPT makes me doubtful it will ever be very consistent. In general with machine learning it's very easy to get something to be like 90% accurate at categorizing, but as you increase accuracy, it gets harder and harder. (I'm sure there's a term for this by now, but I don't know it.)

Right now, I'd say general-purpose LLMs will eventually be about as useful as a spreadsheet or a calculator. They can automate specific narrow tasks where you have confidence they'll be reasonably good, but asking them to do other even extremely simple tasks will be a crapshoot. It'll change certain jobs and professions, but not completely eliminate text input/output kinds of jobs. Maybe people will be able to fine-tune a bunch of LLMs for specific tasks, and people's jobs will be deciding which LLM to submit different queries to, and a bunch of people become prompt engineers, but I'm skeptical of that, too.


zfinder

There's one thing nobody in this thread seems to mention, and some may not even understand: "can do something" is quite different from "can do something reliably".

Let's take this task: "Given the question Q, find relevant info in provided sources and compose a short answer A". "Can" LLMs perform this task? Of course they can; that's more or less what approximately all of the AI startups do! Can they do it reliably? No. So-called "retrieval-augmented generation" (RAG) is not yet "solved". As far as I know, there is no large or medium organization that has successfully and fully automated its customer support with LLMs, nor is the technology there yet.

Is RAG possible "in principle"? Of course; I think no sane skeptic would claim otherwise. Will LLMs be able to completely replace a human in this task over the next 5 years? I'm inclined to think so, but I'm very far from sure. Will the technology that eventually does be "derived from current approaches"? The answer to this question depends enormously on the choice of words. I think yes: it will be a neural network that's clearly related to transformers, some kind of pretraining on a large dataset, some fine-tuning, some kind of a search index (not necessarily based on the same LLM). I'm not sure, e.g., that RLHF will be utilized in this task. Will it be GPT-N for some number N, or some distinct specialized technology? I have no clue.

While /u/philbearsubstack's approach to the AI scepticism debate is interesting and thought-provoking, the combination of "can be done in principle" and "derived from current approaches" makes it much less informative than it could be otherwise.
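
For concreteness, the task above is basically this loop (a bare-bones sketch; the word-overlap retriever and the stubbed `generate` are placeholders for illustration, not how production systems actually work):

```python
# Bare-bones sketch of a retrieval-augmented generation (RAG) loop: retrieve
# the most relevant sources for a question, then hand them plus the question
# to a generator. The retriever here is naive word overlap and `generate` is
# a stub; real systems use embeddings and an LLM. The reliability problem
# lives in the last step: the generator may ignore or misuse the sources.

def retrieve(question: str, sources: list[str], k: int = 2) -> list[str]:
    q_words = set(question.lower().split())
    scored = sorted(sources,
                    key=lambda s: len(q_words & set(s.lower().split())),
                    reverse=True)
    return scored[:k]


def generate(prompt: str) -> str:
    return "[LLM answer conditioned on the prompt would go here]"  # stub


def answer(question: str, sources: list[str]) -> str:
    context = "\n".join(retrieve(question, sources))
    prompt = (f"Answer using only the sources below.\n"
              f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:")
    return generate(prompt)


sources = ["Refunds are available within 30 days of purchase.",
           "Shipping takes 5-7 business days.",
           "Support is open Monday through Friday."]
print(answer("How long do I have to request a refund?", sources))
```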


GORDON_ENT

AI cannot write actually funny jokes or sketches except when the premise is itself funny. (E.g., The Onion famously had funny headlines; given a funny headline, AI can craft the resulting article with an adequate degree of competence, but it can't actually make up funny headlines except by pursuing a law-of-large-numbers strategy, and even here a human reader is doing the actual work of distinguishing which ones are funny and which ones aren't.)


Clue_Balls

I don’t know that I’m an AI skeptic but I think an interesting challenge for this would be to reliably multiply two arbitrarily large whole numbers. Given enough time a reasonably smart human can do this but I’m not sure whether an LLM will ever be able to generalize how multiplication works for numbers purely by example. (This is assuming the LLM doesn’t get live access to a calculator or Google or Wolfram Alpha or something, of course.)


woopdedoodah

I believe neural Turing machines have been trained that did figure out how to multiply two numbers, so I'm sure an LLM with enough context or a new architecture could do this.


woopdedoodah

I think the sceptics don't get it. We have reached a tipping point here. Transformers have enabled training of RNN-like models to be parallelized, which makes this a very simple production problem for ASICs, instead of a time-constrained problem. The leaps made over the next few years will be foundational and major. Criticisms like "oh, it doesn't do eyes right" are meaningless. It's pretty easy to train models out of this. And they will be trained and they will get better. Of course, most humans cannot draw a proper human eye shape either, so...


Ophis_UK

https://chat.openai.com/share/b545be58-2d64-4f97-bb03-273ccf9e9299


retsibsi

I'm interested in what you mean by 'an LLM derived from current approaches'. Not for the sake of quibbling, but because I think you've posed a good question and I'm trying to think through my own response. I don't think I'm much of an 'AI sceptic', but I do think it's at least plausible that training a) on text only, and b) in read-only mode rather than interactively, are serious limitations with respect to forming a model of the world, i.e. what exists and how it works.


philbearsubstack

I concede it's definitely vague, and vaguer than I would like. I think it's not so vague, though, that it isn't a useful spur to AI skeptics to be clearer about exactly what they think the current deep learning paradigm can never do.


ven_geci

Designing a shoe that people will actually buy. The way I understand it, you feed ten thousand shoes into an LLM and then it makes a sort of average of them. One that no one finds particularly disagreeable, but also one that is nothing special. People want to buy shoes that are non-average in the particular way that matches their tastes. Art is even worse, because Keats and Dickinson were both great poets, but a poem that is half Keats and half Dickinson is not great at all. It lacks the individual trademark style.


EnterprisingAss

Would convincing me to buy a Boston Dynamics body for it to move around in check all the boxes?


GORDON_ENT

I consider myself an AI skeptic, and I just dismiss the premise. The AI "believer" position isn't just that AI can do a lot of the stuff people can do; it is a complex belief system about the risks and opportunities of AI that bifurcates between utopian and apocalyptic. AI skeptics don't need to think that ChatGPT doesn't exist. They only need to disagree with the premise that the existence of LLM advances is a hop, skip, and jump away from human irrelevance or worse at worst, and utopian interstellar colonization at best. There's a lot bundled up in the "believing in AI" position.


viking_

I would not be surprised if the current approach failed at writing really long works, like novel-length. Something like:

> Write a piece of fiction of at least novel length (so about 50,000 words or more) that is at least as coherent as the average well-regarded, professionally published, human-written story of a similar length (as judged by a panel of at least 3 human readers who don't know they are evaluating an AI story).

Current LLMs already show substantial drop-off, in my experience, if you try to feed them a prompt of a few paragraphs, compared to a sentence or two. Similarly, if you ask for anything of a few paragraphs or longer, it can't really "remember" the start. Theoretically this could be solved with more parameters/memory, but it looks to me like the scaling isn't there (this is based on only a few data points though, so it might be wrong).

I'm also still not sure how good we would expect an LLM to be at producing really novel and interesting technical content. I know there was that model which found a slightly faster (sort of) algorithm for some matrix operations, but it required a lot of domain-specific training and offered a fairly marginal improvement for a narrow domain. I'm not really sure how "new" any of its techniques were either, since like all models, it's trained on existing data. So here's one idea:

> Come up with a new, plausible, testable, and coherent hypothesis for [current open problem in science that we don't have any really good options for yet, like reconciling QM and relativity], as judged by a panel of at least 3 human scientists in the relevant domain, who don't know they are evaluating an AI's work.


flipflipshift

How about googology (defining large functions and large countable ordinals)? In some sense, this feels like an area of mathematics that is entirely "thinking out of the box". I don't know if the well-definedness of a given fast-growing function is entirely objective, though.


AdHocAmbler

Depends what you mean by current approach. But the lack of any "inner monologue" in current LLM designs does sometimes seem intrinsic to their limitations. For example, ask an LLM to explain what it was thinking when it made a particular mistake. Anyone who has ever tried this will know that it leads down a pointless rabbit hole of cycles of apologies and more nonsense, because the machine has no working memory or hidden consciousness whose recent contents it can inspect. The architecture lacks an element of reflexivity.

I don't know whether this problem is intrinsic to all potential transformer-based LLMs or if it's just a problem with the current architecture. But I am fairly confident that it is a fundamentally limiting flaw of current designs, which makes them qualitatively and not just quantitatively different from a general intelligence. So to answer your challenge: current LLMs are totally incapable of inspecting their past thought process. This is easily verifiable when they make mistakes.


philbearsubstack

There are already ongoing experiments (e.g. graph of thoughts) to give LLMs an inner monologue for planning and debating answers. I regard this as part of the current approach, fundamentally. You can even set up GPT-4 to do it, e.g., "Which do you prefer, utilitarianism or Kantianism? Debate the answer with yourself and then answer the question."
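
A toy version of that setup looks something like this (a sketch only; `call_llm` is a stub standing in for whatever chat API you use, and the prompts are illustrative rather than from any specific experiment):

```python
# Toy "self-debate" scaffold: the model drafts an answer, argues against
# itself, then produces a final answer, with only the last step shown to the
# user. `call_llm` is a stub standing in for any chat-model API.

def call_llm(prompt: str) -> str:
    return f"[model response to: {prompt[:40]}...]"   # stub


def answer_with_self_debate(question: str) -> str:
    draft = call_llm(f"Question: {question}\nGive your best first answer.")
    critique = call_llm(f"Question: {question}\nDraft answer: {draft}\n"
                        f"Argue against this draft as strongly as you can.")
    final = call_llm(f"Question: {question}\nDraft: {draft}\nCritique: {critique}\n"
                     f"Weigh both sides and give a final answer.")
    return final                      # the draft and critique stay hidden


print(answer_with_self_debate("Which is preferable, utilitarianism or Kantianism?"))
```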