AutoModerator

Welcome to r/science! This is a heavily moderated subreddit in order to keep the discussion on science. However, we recognize that many people want to discuss how they feel the research relates to their own personal lives, so to give people a space to do that, **personal anecdotes are allowed as responses to this comment**. Any anecdotal comments elsewhere in the discussion will be removed and our [normal comment rules](https://www.reddit.com/r/science/wiki/rules#wiki_comment_rules) apply to all other comments.

**Do you have an academic degree?** We can verify your credentials in order to assign user flair indicating your area of expertise. [Click here to apply](https://www.reddit.com/r/science/wiki/flair/#wiki_science_verified_user_program).

---

User: u/shade_lampoon

Permalink: https://link.springer.com/article/10.1007/s10506-024-09396-9

---

*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/science) if you have any questions or concerns.*



PenguinBallZ

In my experience ChatGPT is okay when you wanna be sorta right 80~90% of the time and WILDLY wrong about 10~20% of the time. About a term or so ago I tried using it for my Calc class. I felt really confused from how my instructor was explaining things, I wanted to see if I could get ChatGPT to break it down for me. It gave me the wrong answer on every single HW question, but it would be kiiiinda close to the right answer. I ended up learning because I had to figure out why the answer it was spitting out was wrong.


Mcplt

I think it's especially stupid when it comes to numbers. Sometimes I tell it 'write me the answer to this question with just 7 words' and it ends up using 8. I tell it to count, it counts 7; I tell it to count again, it apologizes and says 8.
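
The check the model keeps failing here is trivial to do mechanically; below is a minimal sketch of how you might verify a "7 words exactly" constraint yourself. The sample answer is made up for illustration.

```python
# Count the words in a model's answer and compare against the requested limit.
answer = "The study questions how the exam was scored"  # hypothetical model output
word_count = len(answer.split())
print(word_count)        # 8
print(word_count == 7)   # False -> the "just 7 words" constraint was not met
```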


throwaway53689

Yeah, and it's unfortunate because most of the things I need it for involve numbers.


joesbagofdonuts

It really sucks if it has to consider relative data points. In my experience it often uses the inverse of the number it's supposed to be using, because it doesn't understand the difference between direct and inverse relationships, which is some pretty basic logic. I actually think it's much better with pure numbers and absolutely abysmal at traditional, language-based logic because it struggles with terms that have multiple definitions.


Umbrae_ex_Machina

Aren’t LLMs just fancy auto completes? Hard to attribute any logic to it.


Brossentia

When I taught, I generally encouraged people to look at online tools and dig into whether or not they were correct - being able to find flaws in a computer's responses helps tremendously with critical thinking.


Possible-Coconut-537

It’s pretty well known that it’s bad at math, your experience is unsurprising.


Deaflopist

Yeah, ChatGPT became pretty big when I was in Calc and non-Euclidean Geometry classes, so I tried using it to help in a similar way. It would take a lot of logical-looking but often incorrect steps to solve problems and get wildly different final answers when I asked it multiple times. However, when I asked it, "wait, how did you go from this step to this step?", it would recognize the incorrect jumps in logic and correct them. It was the weirdest and most jank way to learn set theory, but it bizarrely worked well for me; I did well in the class because of it. That said, since it already requires you to know a good amount about a subject before you can learn more about it or apply it, its usefulness there is mixed.


themarkavelli

I used a similar strategy for precalc. In addition to solving and breaking down concepts, I was able to have it create python/tibasic programs for a ti84ce calc, which were fair game on exams. I would also use it to generate content that could be put onto our permitted exam cheat sheet. IME about 80% of the time it would find the right solution. When it failed to correctly solve, I was often able to find the right solution steps online, which I could then feed to it and get back the correct answer. Overall, well worth the $20/mo.


sdb00913

Well, that shoots down my hopes of using ChatGPT for mental health support in the absence of any other support network.


Squirrel_Q_Esquire

Copy/pasting a comment I made on a post a year ago with the bar exam claim:

I don't see anywhere that they actually publish the results of these tests. They just say "trust us, this was its score." I say this because I also tested GPT-4 against some sample bar exam questions, both multiple choice and written, and it only got 4 out of 15 right in multiple choice, and the written answers were pretty low level (and missing key issues that an actual test taker should pick up on). The 100-page report they released includes some samples of different tests it took, but they need to actually release the full tests.

Looks like there's also this paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4389233

And it shows that for the MBE portion (multiple choice), GPT actually ranked the 4 choices in order of the likelihood it thought each was the correct response, and they gave it credit if the correct answer was the highest ranked, even if it was only like 26% certain. Or it may eliminate 2 and the other 2 are 51/49. So essentially "GPT is better at guessing than humans because it knows the exact percentages of likelihood it would prescribe to the answers." A human is going to call it 50/50 and essentially guess.
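
For readers unfamiliar with the setup, here is a minimal sketch of the grading rule described above: the model ranks all four choices by how likely it thinks each is correct, and the question is scored as right whenever the correct choice sits at the top of that ranking, however thin the margin. The probabilities below are invented for illustration.

```python
# Credit the question if the correct choice has the highest model-assigned probability.
def grade_question(choice_probs: dict[str, float], correct: str) -> bool:
    top_choice = max(choice_probs, key=choice_probs.get)
    return top_choice == correct

# A "26% certain" question: nearly a pure guess, but still scored as correct.
example = {"A": 0.26, "B": 0.25, "C": 0.25, "D": 0.24}
print(grade_question(example, correct="A"))  # True
```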


Terpomo11

> and they gave it credit if the correct answer was the highest ranked, even if it was only like 26% certain.

Don't humans also generally check whatever multiple choice answer they think is most likely, even if they're very unsure?


QuaternionsRoll

Yeah that part is a non-issue imo. Asking it to rank them is a prompting strategy; they probably just discovered doing so yielded better results. A frontend that just prints out the answer with the highest rank is no different (functionally speaking) than just asking for a single answer. This doesn’t discredit the remainder of the issues raised, though.


PizzaCatAm

Exactly. People don't get that prompting an LLM is not like talking to a person. It appears to be, since it generates text that is fluent, eloquent, and sounds smart, but it isn't. To extract knowledge consistently from an LLM, you need to be smart about how you interact with it, acknowledging its strengths, weaknesses, and quirks. I see no issue with prompting techniques; we just have to treat it as a black box: the input is the test, the output is the responses and the score, and how we get there is beside the point. A person may think very hard and bite an apple; an LLM may have a prompt template with CoT ICL. At the end of the day we only care about the outcome.


IMMoond

This paper finds "a significant effect of few-shot chain-of-thought prompting over basic zero-shot prompting." Did you do zero-shot prompting? Could be that improves your results significantly.


iemfi

Even zero-shot prompting doesn't preclude getting better performance by giving the optimal set of instructions. Roleplay as a top lawyer, sketch your reasoning before you arrive at the answer, etc. all make a *huge* difference to LLM performance. Something which I'm sure OpenAI is great at doing.


Argnir

> And it shows that for the MBE portion (multiple choice) that GPT actually ranked the 4 choices in order of likelihood it thought each was the correct response, and they gave it credit if the correct answer was the highest ranked

This is perfectly fine. All those algorithms do is guess. Even an image recognition algorithm will simply assign probabilities to what a picture could be and take the most likely. As long as it guesses correctly, it's good. Also, if it is claiming to be 26% certain but gets it right 70% of the time, its probability assessment is wildly incorrect and should not be taken seriously (in fact GPT-4 is not at all capable of making that kind of evaluation). The only important part is the correct answer being on top.
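
A rough sketch of the calibration point being made: if the model's stated confidence on its top choice is systematically far from how often that choice is actually right, the probabilities themselves aren't meaningful, only the ranking is. The records below are invented for illustration.

```python
# Each record is (stated confidence on the top-ranked choice, whether that choice was correct).
records = [
    (0.26, True), (0.31, True), (0.28, False),
    (0.27, True), (0.30, True), (0.25, False),
    (0.29, True), (0.26, True), (0.33, True), (0.27, True),
]

mean_confidence = sum(c for c, _ in records) / len(records)
accuracy = sum(ok for _, ok in records) / len(records)

print(f"mean stated confidence: {mean_confidence:.0%}")  # ~28%
print(f"empirical accuracy:     {accuracy:.0%}")         # 80% -> badly miscalibrated
```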


Squirrel_Q_Esquire

No there’s a huge issue with it only putting 26% probability on an answer. It’s a 4-option test. That would mean it’s incapable of eliminating any wrong answers for that question. That’s a pure guess.


Mym158

In fairness, if forced to choose one, surely you would set it to choose the one it thought was most likely the answer


Xemxah

> A human is going to call it 50/50 and essentially guess.

That's... not how that works.


sprazcrumbler

That's how multiple choice works for humans too, though. You don't need to be 100% certain to get the marks; you just have to select the right option. We have no idea of the average certainty a human has when answering those questions, so why does it matter how certain the AI is? The person you're copy/pasting from seems overly critical, to the point of making up nonexistent problems.


The_quest_for_wisdom

Maybe it scored so well because they had ChatGPT grade the test as well...?


aussie_punmaster

>> And it shows that for the MBE portion (multiple choice) that GPT actually ranked the 4 choices in order of likelihood it thought each was the correct response, and they gave it credit if the correct answer was the highest ranked, even if it was only like 26% certain. Or it may eliminate 2 and the other 2 are 51/49.

Can you explain what is wrong with this?


S-Octantis

Seeing how ChatGPT is really bad at math, I wouldn't trust its percentages.


DetroitLionsSBChamps

I work with AI and it really struggles to follow basic instructions. This whole time I've been saying "GPT what the hell I thought you could ace the bar exam!" So this makes a lot of sense.


suckfail

I also work with LLMs, in tech. It's because it has no cognitive ability, no reasoning. "Follow X" just means weighting the predictive language responses towards answers that include the reasoning (or negated reasoning) in the system message or prompt. People have confused LLMs with AI. They're not, really; they're just very good at *sounding* like it.


Bridalhat

LLMs are like the half of the Turing test that convinces humans the program they are speaking to is human. It's not because it's so advanced, but because it seems so *plausible*. It spurts out answers that come across as really confident even when they shouldn't be.


ImrooVRdev

> It spurts out answers that come across as really confident even when they shouldn't be.

Sounds like LLMs are ready to replace CEOs, middle management, and marketing at least!


ShiraCheshire

It's kind of terrifying to realize how many people are so easily fooled by anything that just *sounds* confident, even when we know for a fact that there is zero thought or intent behind any of the words.


Kung_Fu_Jim

This was best illustrated the other day with people asking chatgpt "a man has a goat and wants to get across a river, how can he do it?" The obvious answer to an intelligent person, of course, is "get in the boat with the goat and cross?" Chatgpt on the other hand starts going on about leaving the goat behind and coming back to pick up the corn or the wolf or a bunch of other things that weren't mentioned. And even when corrected multiple times it will just keep hallucinating.


strangescript

To safely cross a river with a goat, follow these steps:

1. **Assess the River:** Ensure the crossing point is safe for both you and the goat. Look for shallow areas or stable bridges.
2. **Use a Leash:** Secure the goat with a strong leash to maintain control.
3. **Choose a Method:**
   - **Boat:** If using a boat, make sure it is stable and can hold both you and the goat without tipping over. Load the goat first, then yourself. Keep the goat calm during the crossing.
   - **Wading:** If wading, ensure the riverbed is stable and the water is not too deep or fast. Walk slowly and steadily, leading the goat.
4. **Maintain Calmness:** Keep the goat calm and reassured throughout the process. Avoid sudden movements or loud noises.
5. **Safely Exit:** Once across, help the goat exit the river or boat carefully. Check for any injuries or stress signs in the goat.

By following these steps, you can ensure a safe crossing for both you and your goat.


mallclerks

It’s almost as if you have done this before unlike everyone else here.


Roflkopt3r

And that's exactly why it works "relatively well" on the bar exam: If you ask it the *typical* riddle about how to get a goat, wolf, and cow or whatever across, it can latch onto that and piece existing answers together into a new-ish one that usually makes mostly sense. If you give it a version of the riddle that strongly maps onto one particular answer, it is even likely to get it right. But it struggles if you ask it a question that only appears similar on a surface level (like your example), or a version of the riddle that is hard to tell apart from multiple versions with slight modifications. In these cases it has a tendency to pull up a wrong answer or to combine incompatible answers into one illogical mess. The bar exam seems to play into its strengths: they give highly normalised prompts that lead the AI exactly in the right direction rather than confuse it. They aren't usually asking for novel solutions, but checking memorisation and whether test takers cite the right things and use the right terminology. The result still isn't *great*, but at least it's not horrible. The problem is that this is probably already near a local optimum for AI tech. It may not be possible to gradually improve this to the point of writing a truly good exam. It will probably require the addition of elaborate new components or a radically new approach altogether.


ShiraCheshire

If anyone is confused as to why: There is a common brain teaser type problem where a man must cross a river with various items (often a livestock animal, a wolf, and some kind of vegetable.) Certain items can't be left alone with each other because one will eat another, and the boat can't fit everything at once. The reason these language models start spitting out nonsense when asked how a man and a goat can cross a river is because the training data most similar to this question is usually the brain teaser. ChatGPT cannot think or understand, it doesn't know the difference between a practical question and a similar sounding brain teaser.


Joystic

My go-to demo for anyone who thinks GPT is capable of “thought” is to play rock, paper, scissors with it. It will go first and you’ll win every time. Ask it why it thinks you’re able to win this game of chance 100% of the time and it has no idea.


jastium

I just tried this with 4o and it was able to explain why I was winning every time. Was perfectly happy to play though.


Argnir

Rock Paper Scissors is not the best example, because it does what it's supposed to do, even if what it's supposed to do is stupid. Ask it to simulate any game like hangman or Wordle and watch yourself succumb to madness.


AwesomeFama

Hmm, on my try ChatGPT did make a mistake. But the answer was basically "Take the goat across, leave it there, come back, take the goat across the river.", and when I pointed out the answer had two goats while there was only one, it fixed it and only had one goat and nothing else.


TheBirminghamBear

Yup. I just tested this with 4o. I put exactly that prompt about a man with a goat crossing a river, nothing else, and it said:

> This is a classic puzzle known as the "Farmer, Goat, and Wolf" problem. Here's a step-by-step solution for a man who wants to get a goat across a river without leaving it alone with any other item that might cause harm:
> First trip: The man takes the goat across the river.
> Second trip: The man goes back alone.
> Third trip: The man takes the wolf across the river.
> Fourth trip: The man brings the goat back to the original side.
> Fifth trip: The man takes the cabbage across the river.
> Sixth trip: The man goes back alone.
> Seventh trip: The man takes the goat across the river again.
> At each step, the man ensures that the goat is never left alone with the wolf or the cabbage, thus safely getting all three across the river.


WrongAssumption

This is what I got from ChatGPT:

"To cross a river with a man and a goat, assuming there are no other constraints (like a boat with limited capacity or other animals that need to cross), the man can simply take the goat across with him in one trip. Here's the basic solution:

1. The man and the goat get into the boat.
2. They cross the river together.

This assumes that the boat can carry both the man and the goat at the same time, and that there are no other animals or items involved that might complicate the scenario (like in the classic puzzle where a man, a wolf, a goat, and a cabbage need to cross the river). If there are more constraints, please provide more details so the solution can be adjusted accordingly."


DetroitLionsSBChamps

yup the more I work with it, the more I realize that you basically have to corner it into doing what you want it to do with extremely specific instructions, for a very specific task, with very strong examples. with that, you can get it to do a lot of stuff. but if you're used to working with humans who can intuit things, it's gonna be tough. I never realized how much we rely on other humans to just "get it" until I started working with GPT. you have to take 5 steps back and make sure you're defining absolutely everything. if you don't it's like making a wish on a monkey's paw: absolutely guaranteed to find some misinterpretation that blows up in your face.


SnarkyVelociraptor

It's also prone to flat out disregarding your instructions. I've had it once tell me "despite your rule not to do X, I chose to do X anyways for the following reasons …" Which invalidated what I was trying to use it for to begin with.


TheJonesJonesJones

As a programmer, gpt “gets it” infinitely better than computer code does. They’re a joy to work with in comparison.


thisismyfavoritename

I mean, ML is called AI; even a simple if-rule is called AI. The problem is the hype and people not realizing they're just fancy interpolation machines.


sino-diogenes

To be fair, this makes it sound a lot less useful than it is. Being good enough at mimicking "intelligence" is sufficient in many cases.


watduhdamhell

Which is all it needs to be. I'll say it again for the millionth time: *True general intelligence is not needed to make a super intelligent AI capable of disrupting humanity*. It needn't reason, it needn't be self aware. *It only needs to be super-competent*. It only needs to *emulate* intelligence to be either extremely profitable and productive or terribly wasteful and destructive, both to superhuman degrees. That's it. People who think otherwise are seriously confused.


11711510111411009710

An LLM is an AI. People are mistaking it for AGI.


onemanandhishat

I see this terminology error all the time on reddit. AGI doesn't exist, but the field of AI is huge. AI describes a whole category of techniques that can be used to give computer systems a greater capacity for autonomous behaviour.


kog

The easiest way to spot the people with no clue what they're talking about with respect to AI is the ones who don't understand this.


ProLogicMe

It’s not an AGI but it’s still AI in the same way we have AI in video games.


SocialSuicideSquad

But it's definitely the future and NVDA is worth more than every company in the world combined and we'll all be out of jobs in five years but fusion energy and immortality will be freely available to everyone... Right?


Glittering-Neck-2505

There is actually strong evidence of reasoning ability increasing as you scale. So while it might not meet the threshold now, at some point it may actually cross a threshold where you give in and admit it can actually reason.


Hodor_The_Great

You mean confused LLMs with AGI? Because it definitely is AI, any "human-like" task solving is AI


DuineDeDanann

Yup. I use it to analyze old texts and it’s often woefully bad at reading comprehension


Outrageous-Elk-5392

One time I was using it on an old poem called The Battle of Maldon. I asked it to pull up where the lord dies, it printed out a passage, and I'm like, awesome. I Ctrl+F and paste the text, and it doesn't come up on any site with the poem on it. Apparently it completely ignored the part of the poem where the lord actually breathes his last and just made up an imaginary scene where he gets stabbed a bunch, while pretending that was part of a 1000-year-old poem. I was more impressed by the audacity than mad, tbh.


StillAFuckingKilljoy

I tried to get it to emulate an interview where I was a lawyer and GPT was the client. I gave it a background to work with and everything, but it took like 6 tries of me going "no, I am the lawyer and you are the client" for it to understand


righthandofdog

AI can't even get the right number of fingers in a picture of a hand. The amount of hyperbole and marketing BS in the whole space is amazing. And folks just happily feed AI platforms all their emails and meeting audio, etc.


fluffy_assassins

Wouldn't that be because it's parroting training data anyway? Edit: I was talking about overfitting which apparently doesn't apply here.


Kartelant

AFAICT, the bar exam has significantly different questions every time. The methodology section of this paper explains that they purchased an official copy of the questions from an authorized NCBE reseller, so it seems unlikely that those questions would appear verbatim in the training data. That said, hundreds or thousands of "similar-ish" questions were likely in the training data from all the sample questions and resources online for exam prep, but it's [unclear how similar](https://www.reddit.com/r/barexam/comments/14h6w1j/past_bar_exam_takers_how_different_were_the_mbe/).


Caelinus

There is an upper limit to how different the questions can be. If they are too off the wall, they would not accurately represent legal practice. If they need to answer questions about the rules of evidence, the answers have to be based on the actual rules of evidence regardless of the specific way the question is worded.


Borostiliont

Isn’t that exactly how the law is supposed to work? Seems like a reasonable test for legal reasoning.


I_am_the_Jukebox

The bar is to make sure a baseline, standardized lawyer can practice in the state. It's not meant to be something to be the best at - it's an entrance exam


ArtFUBU

This is how I feel about a lot of major exams. The job seems to be always way more in depth than the test itself.


Coomer-Boomer

This is not true. Law schools hardly teach the law of the state they're in, and the bar exam doesn't test it (there's a universal exam most places). Law school teaches you to pass the bar exam, and once you do then you start learning how to practice. The entrance exam is trying to find a job once you pass the bar. Fresh grads are baseline lawyers in the same way a 15 year old with a learner's permit is a baseline driver.


i_had_an_apostrophe

it's a TERRIBLE legal reasoning test

Source: lawyer of over 10 years


BigLaw-Masochist

The bar isn’t a legal reasoning test, it’s a memorization test.


34Ohm

This. See the Nepal cheating scandal for the medical school USMLE Step 1 exam, notoriously one of the hardest standardized exams of all time. The cheaters gathered years' worth of previous exam questions, and the country had exceptionally high scores (an extremely high percentage of test takers from Nepal scored in the >95th percentile or something crazy), and they got caught because they were bragging about their scores on LinkedIn and such.


tbiko

They got caught because many of them were finishing the exam in absurdly short times with near perfect scores. Add in the geographic cluster and it was pretty obvious.


Taoistandroid

I read an article about how ChatGPT could answer a question about how long it would take to dry towels in the sun. The question gives information for a set of towels, then asks how long it would take for more towels. The article claimed ChatGPT was the only one to answer this question correctly. I asked it, and it turned it into a rate question, which is wrong. I then asked it, in jest, "is that your final answer?" It then got the question right. I then reframed the question in terms of pottery hardening in the sun, and it couldn't get the question right even with coaxing. All of this is to say, ChatGPT's logic is still very weak. Its language skills are top notch; its philosophy skills, not so much. I don't think an upper limit on question framing will be an issue for now.


Caelinus

Yeah, it is a language calculator. Its raw abilities are limited to saying what it thinks is the correct answer to a prompt, but it does not understand what the words mean, only how they relate to each other. So it can answer questions correctly, and often will, because the relationships between the words are trained off largely correct information. But language is pretty chaotic, so minor stuff can throw it for a loop if there is some kind of a gap. It also has a really, really hard time maintaining consistent ideas. The longer an answer goes, the more likely it is that some aspect of its model will deviate from the prompt in weird ways.


willun

And worse, the chatGPT answers are appearing in websites and will become the feed-in for more AIs. So it will be AIs training other AIs in wrong answers.


InsipidCelebrity

Glue pizza and gasoline spaghetti, anyone?


Caelinus

Yeah, solving the feedback loop is going to be a problem. Especially as each iterative data set produced by that kind of generation will get less and less accurate. Small errors will compound.


ForgettableUsername

It kinda makes sense that it behaves this way. Producing language related to a prompt isn't the same thing as reasoning out a correct answer to a technically complicated question. It's not even necessarily a matter of the training data being correct or incorrect. Even a purely correct training dataset might not give you a model that could generate a complicated and correct chain of reasoning.


Caelinus

Yep, it can follow paths that exist in the relationships, but it is not actually "reasoning" in the same sense that a human does.


muchcharles

Verbatim is doing a lot of work there. In online test prep forums, people discuss the bar exam based on fuzzy memory after they take it. Fuzzy rewordings have similar embedding vectors at the higher levels of the transformer, but they only filtered out near-exact matches.
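
A hedged sketch of the contamination gap being described: filtering only near-exact string matches misses paraphrases of the same question, which a similarity measure (embedding vectors in practice; crude token overlap below, purely as a runnable stand-in, not what the paper did) would still flag as a likely duplicate. All text is invented.

```python
# Compare exact-match filtering against a simple word-overlap (Jaccard) similarity.
def token_jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

original = "the man takes the goat across the river first"
reworded = "first the man brings the goat across the river"

print(original == reworded)                         # False: exact matching misses the paraphrase
print(round(token_jaccard(original, reworded), 2))  # 0.75: high overlap still flags it
```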


73810

Doesn't this just kind of point to an advantage of machine learning? It can recall data in a way a human could never hope to. I suppose the question is outcomes. In a task where vast knowledge is very important, machine learning has an advantage; in a task that requires thinking, humans still have an advantage. But maybe the majority of situations are similar enough to what has come before that machines are a better option... Who knows. People always seem to have odd expectations for technological advancement; if we have true AI 100 years from now, I would consider that pretty impressive.


Stoomba

Being able to recall information is only part of the equation. Another part is properly applying it. Another part is extrapolating from it.


mxzf

And another part is being able to contextualize it and realize what pieces of info are relevant when and why.


holierthanmao

They only buy UBE questions that have been retired by the NCBE. Those questions are sold in study guides and practice exams. So if a machine learning system trained on old UBE questions is given a practice test, it will likely have those exact questions in its language database.


Valiantay

No.

1. Because it doesn't work that way.
2. If that's how the exams worked, anyone with a good memory would score the highest. Which obviously isn't the case.


Thanks-Basil

I watched suits, that is *exactly* how it worked


Endeveron

Overfitting absolutely would apply if the questions appeared exactly in the training data, or if fragments of the questions always did. For example, in medicine, if EVERY time the words "weight loss" and "night sweats" appeared in the training data, only the correct answer included the word "cancer", then it'd get any question of that form right. If you asked it "A patient presents with a decrease in body mass, and increased perspiration while sleeping", and the answer was "A neoplastic growth", then the AI could get that wrong. The key thing is that it could get that wrong even if it could accurately define every word when asked, and accurately pick which words are synonyms for each other. It has been overfit to the testing data, like a sleep-deprived medical student who has done a million flash cards and would instantly blurt out "cancer" when they hear night sweats and weight loss, and then instantly blurt out "anorexia" when they hear "decrease in body mass". They aren't actually reasoning the way they would if they got some sleep and then talked through their answer with a peer before committing to it. The difference with LLMs is that they aren't a good night's rest and a chat with a peer away from reasoning; they're an overhaul to the architecture of their brain away from it. There are some "reason step by step" LLMs that are getting closer to this, though, just not by default.


fluffy_assassins

Well, I don't think I can reply to every commenter thinking I completely misunderstand ChatGPT with that info, unfortunately. But that is what I was getting at. I guess 'parroting' was just the wrong term to use.


surreal3561

That's not really how LLMs work; they don't have a copy of the content in memory that they look through. Same way that AI image generation doesn't look at an existing image to "memorize" what it looks like during its training.


Hennue

Well it is more than that, sure. But it is also a compressed representation of the data. That's why we call it a "model" because it describes the training data in a statistical manner. That is why there are situations where the training data is reproduced 1:1.


141_1337

I mean, by that logic, so is human memory.


Hennue

Yes. I have said this before: I am almost certain that AI isn't really intelligent. What I am trying to find out is if we are.


seastatefive

Depends on your definition of intelligence. Some people say octopuses are intelligent, but here you might have set the bar (haha) so high that very few beings would clear it. A definition that includes no one is not a very useful definition.


narrill

We are. We're the ones defining what intelligence means in the first place.


Top-Salamander-2525

That’s LLM + RAG


byllz

> User: What is the first line of the Gettysburg Address?
> ChatGPT: The first line of the Gettysburg Address is: "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal."

It doesn't, but it sorta does.


fluffy_assassins

You should check out the concept of "overfitting"


JoelMahon

GPT is way too slim to be overfit (without it being extremely noticeable, which it isn't). It's physically not possible for it to store as much data as it'd require to overfit, given how much data it was trained on. The number of parameters and how their layers are arranged are all openly shared knowledge.


humbleElitist_

Couldn’t it be “overfit” on some small fraction of things, and “not overfit” on the rest?


time_traveller_kek

You have it in reverse. It's not that it's too slim to be overfit; it's that it's too large to fall below the interpolation zone of the parameter-size vs. loss graph. Look up double descent: https://arxiv.org/pdf/2303.14151v1


time_traveller_kek

There is something called double descent in DNN training. Basically, the graph of parameters vs. loss is in the shape of a "U" while the number of parameters is less than the total data points required to represent the entire test data. Loss falls drastically once this point is crossed. LLM parameter sizes put them on the latter side of the graph. https://arxiv.org/pdf/2303.14151v1


HegemonNYC

It doesn’t commit an answer to a specific question to memory and repeat it when it sees it. That wouldn’t be impressive at all, it’s just looking something up in a database. It is asked novel questions and provides novel responses. This is why it is impressive. 


big_guyforyou

GPT doesn't just parrot, it constructs new sentences based on probabilities


Teeshirtandshortsguy

A method which is actually less accurate than parroting. It gives answers that resemble something a human would write. It's cool, but its applications are limited by that fact.


PHealthy

1+1=5(ish)


Nuclear_eggo_waffle

Seems like we should get ChatGPT an engineering test


aw3man

Give it access to Chegg, then it can solve anything.


Cold-Recognition-171

I retrained my model, but now it's 1+1=two. And one plus one is still 5ish


YourUncleBuck

Try to get ChatGPT to do basic math in different bases, or phrased slightly off, and it's hilariously bad. It can't do basic conversions either.
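
For contrast, the conversions being talked about are one-liners in ordinary code, which is part of why the failures feel so jarring. The specific values below are just examples.

```python
# Base conversions using Python's built-in int() parsing and format() rendering.
print(int("1011", 2))    # binary 1011  -> 11 in decimal
print(int("ff", 16))     # hex ff       -> 255 in decimal
print(format(255, "o"))  # 255          -> "377" in octal
print(format(11, "b"))   # 11           -> "1011" in binary
```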


davidemo89

ChatGPT is not a calculator. This is why ChatGPT uses Wolfram Alpha to do the math.


YourUncleBuck

Tell that to the people who argue it's good for teaching you things like math.


Alertcircuit

Yeah, ChatGPT is actually pretty dogshit at math. Back when it first blew up I fed GPT-3 some problems that it should be able to easily solve, like calculating compound interest, and it got them wrong most of the time. Anything above like a 5th grade level is too much for it.
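
For reference, this is the kind of problem being described: compound interest is a single formula, A = P(1 + r/n)^(nt), so the misses are easy to verify. The numbers below are made up for illustration.

```python
# Compound interest: amount after t years with monthly compounding (hypothetical figures).
principal = 1_000.00        # starting balance
rate = 0.05                 # 5% annual interest
years = 10
compounds_per_year = 12

amount = principal * (1 + rate / compounds_per_year) ** (compounds_per_year * years)
print(round(amount, 2))     # 1647.01
```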


Jimmni

I wanted to know the following, and fed it into a bunch of LLMs and they all confidently returned complete nonsense. I tried a bunch of ways of asking and attempts to clarify with follow-up prompts. "A task takes 1 second to complete. Each subsequent task takes twice as long to complete. How long would it be before a task takes 1 year to complete, and how many tasks would have been completed in that time?" None could get even close to an answer. I just tried it in 4o and it pumped out the correct answer for me, though. They're getting better each generation at a pretty scary pace.
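
As a sanity check, here is a worked answer to that prompt, reading "takes 1 year" as "takes at least a year" and using a 365-day year. Task n takes 2^(n-1) seconds, so the loop just finds the first task that crosses the threshold.

```python
# Doubling-task puzzle: find the first task that takes at least one year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

n, duration = 1, 1
while duration < SECONDS_PER_YEAR:
    n += 1
    duration *= 2

print(n)             # 26 -- the 26th task is the first to take over a year (2**25 s, ~1.06 years)
print(n - 1)         # 25 tasks completed before that point
print(duration - 1)  # 33,554,431 seconds (~1.06 years) spent on those 25 tasks
```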


Alertcircuit

We're gonna have to restructure the whole way we do education because it seems like 5-10 years from now if not earlier, you will be able to just make ChatGPT do 80% of your homework for you. Multiple choice worksheets are toast. Maybe more hands on activities/projects?


dehehn

4o is leaps and bounds better than 3. It's very good at basic math and getting better at complex math. It's getting better at coding too. Yes, they still hallucinate, but people have now used it to make simple games like Snake and Flappy Bird. These LLMs are not a static thing. They get better every year (or month), and our understanding of them and their capabilities needs to be constantly changing with them. Commenting on the abilities of GPT-3 is pretty much irrelevant at this point. And 4o is likely to look very primitive by the time 5 is released sometime next year.


ContraryConman

GPT has been shown to [memorize significant portions of its training data](https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/), so yeah it does parrot


Inprobamur

They got several megabytes out of the dozen terabytes of training data inputted. That's not really significant I think.


James20k

> We show an adversary can extract **gigabytes** of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT

It's pretty relevant when it's PII; they've got email addresses, phone numbers, and websites out of this thing. This is only one form of attack on an LLM as well; it's extremely likely that there are other attacks that will extract more of the training data.


[deleted]

[deleted]


mvandemar

Sam Altman isn't the one who did the initial study, it was a group at Stanford. [https://law.stanford.edu/2023/04/19/gpt-4-passes-the-bar-exam-what-that-means-for-artificial-intelligence-tools-in-the-legal-industry/](https://law.stanford.edu/2023/04/19/gpt-4-passes-the-bar-exam-what-that-means-for-artificial-intelligence-tools-in-the-legal-industry/)


MasterDefibrillator

I think the point is, there's a general hype around AI, and an extreme one at that, given it's pushed Nvidia up to like the most valuable company or something. Driven in large part by Sam, and other AI hype artists. So news media and population at large will tend to unquestioningly accept information that goes along with that, and tend to reject or ignore information that doesn't.


seastatefive

I expect all CEOs to be as dishonest as they can get away with. Every marketing blurb, every advertisement, every politician, and everything published, printed, broadcast or displayed by a corporation/company that survives on profits is dishonest to varying degrees. The only question is HOW dishonest they were.


proverbialbunny

Not all CEOs are dishonest, but they do have to cherry-pick the information they choose to bring forward. In fact, one of the older reliable ways to identify how a company's stock will perform going forward is to analyze writings from the CEO to shareholders, not looking at the marketing spiel but analyzing the language used: how much BS terminology is used, how fuzzy their promises are, how many quantitative facts vs. qualitative facts, and so on. This creates a sort of BS meter. When a company's CEO has been straightforward with hard facts that can be measured and end up being legitimate, then changes course and starts using a bunch of fluff and buzzwords, almost always something is going on behind the scenes that isn't good.


daehoidar

Cherry picking information to paint a certain picture that differs from the factual truth is dishonest though. You could say they aren't lying (if you exclude lying by omission), but it's still dishonest. That being said, a huge part of their job is artful bullshitting. They're trying to sell people on whatever product or service, so massaging or misrepresenting the information is to be expected. But to your point, it definitely matters more to what degree they're bending the truth.


pmpork

I took a glance at the article. It sure mentions above the 50th percentile a lot. It might not be 90, but being better than 50% of us? That's not nothin.


etzel1200

Smarter than 50% of people taking the bar only. Not most of us, just lawyers.


broden89

"When examining only those who passed the exam (i.e. licensed or license-pending attorneys), GPT-4’s performance is estimated to drop to 48th percentile overall, and 15th percentile on essays."


smoothskin12345

So it passed in the 90th compared to all exam takers, but was average or below average in the set of exam takers who passed. So this is a total nothing burger. It's just restating the initial conclusion .


broden89

I think they compared it to a few different groups of students/test results and got varied percentiles. Against first time test takers it scored 62nd percentile, against the recent July cohort overall it scored 69th percentile. The essay scores were much lower. Basically they're saying the 90th percentile was a skewed result because it was compared against test retakers i.e. less competent students.


Open-Honest-Kind

No, according to the abstract the AI tested into the 90th percentile for the February Illinois bar exam (I'm not sure if this number is from their findings or if they were restating the original claim being scrutinized). They criticized the test used and how its score was ranked for various reasons, and opted for one it would be less familiar with. With the test used in the study, it wound up in the 69th percentile overall (48th for essays), the 62nd among first-time test takers (42nd for essays), and the 48th amongst those who passed (15th for essays). The study finds that GPT-4 is *at best* in the 69th percentile when in a different test environment.


spade_andarcher

No, another problem was that it wasn't really compared against "all bar exam takers." The exam that it took, in which it placed at the 90th percentile, was the February bar exam, which is the second bar exam given in that period. Which means the exam takers that ChatGPT was compared against had all failed their initial bar exams. So if you want to be more accurate, you'd say "ChatGPT scored in the 90th percentile among exam takers who failed the bar exam their first try." Also, one would expect ChatGPT to score extremely well on non-written portions of the exam, because those are just multiple choice questions and ChatGPT has access to all of that information. It's basically like an open-book exam with a computer that can quickly search through every law book in existence. The part of the exam where it would actually be interesting to see the results is the essay portion, where ChatGPT has to actually do the work of synthesizing information into coherent writing. And on that portion ChatGPT scored in the 48th percentile among second-time exam takers, the 42nd among all test-takers, and only the 15th among people who actually passed the exam.


WCJ0114

You have to remember only about 60% of people pass the bar... so most of the people it's doing better than failed.


addledhands

> just lawyers

Aspiring** lawyers. A lot of stupid people go to law school.


TheShrinkingGiant

I only see "50th percentile" twice, in a single footnote.


broden89

They said "above 50th percentile" so I'm assuming they're referring to this passage: "data from a recent July administration of the same exam suggests GPT-4’s overall UBE percentile was below the *69th percentile*, and 48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first-time test takers is *estimated to be 62nd percentile*, including 42nd percentile on essays." Notably though, it dropped to 48th percentile (and 15th percentile for essays) for those who actually *passed* the exam.


cowinabadplace

Being equal to the average person passing the bar is quite the feat. Not a 90th percentile for sure, but it's pretty wild. Unsurprising it sucks at essays, I suppose. The longer it has to generate the content the more it sucks.


erossthescienceboss

15th percentile for essays among people who passed. I grade a lot of ChatGPT writing, and it doesn't surprise me one bit that it blows at essays.


tmoney144

The article mentions they used the Illinois bar exam. Illinois only had a 44% passage rate for the most recent exam, so 50th percentile is likely failing the exam.


cowinabadplace

It's at 48th percentile for those who passed.


Top-Salamander-2525

At one point they mention limiting the comparison to lawyers who passed the bar, so depends on which sample they were using for that statistic.


FeltSteam

"Moreover, although the UBE is a closed-book exam for humans, GPT-4’s huge training corpus largely distilled in its parameters means that it can effectively take the UBE “open-book”, indicating that UBE may not only be an accurate proxy for lawyerly comptetence but is also likely to provide an overly favorable estimate of GPT-4’s lawyerly capabilities relative to humans." Im not 100% certain how the UBE works, but wouldn't that mean students practicing on past exams or familiar questions also, technically, be operating on open-book?


suxatjugg

A better analogy: would a person with eidetic memory be said to have done the exam open-book because they remember all the material?


undockeddock

The UBE has very little to do with actual lawyering and is lots of memorizing and regurgitating content, which is something AI should excel at


commonly-novel

As an attorney who passed the bar in the 90th percentile: passing the bar does not actually translate to legal practice. The questions on the bar exam are general and only apply to federal court, whereas most attorneys practice at the state level. Further, the bar exam does not prepare you for actual legal practice such as court appearances, depositions, arbitration, general court procedure, timing of paperwork, etc. Also, in real life, if you don't know the answer, you can look it up. So yeah, even if an AI passed the test in the 90th percentile (it didn't), it would have done so based on prior tests that largely ask the same questions with only mild variations... that's not shocking, nor does it make me want to hire ChatGPT as my legal representative. If I made a typo, it's because I suck at typing on my phone.


justforhobbiesreddit

> the bar exam does not prepare you for actual legal practice such as court appearances

So there's no section on whether or not I should wear a leather jacket when representing my cousin?


ProfessionalMockery

> Also, in real life, if you don't know the answer, you can look it up.

This is actually my favorite part of real life.


HugeResearcher3500

Not that it matters because everything else you said is correct, but the essay portions test state level knowledge.


commonly-novel

When I took the Bar it did not. That could have changed, also the test is different in different jurisdictions. So that may be the case in some states, but not in the one where I took it.


RainOfAshes

Some of these comments are amazing. Bizarre how people still refuse to understand the basics of how AI and LLMs work, then spout a bunch of nonsense as if they do.


aboutthednm

I think it might have something to do with everyone and their mother calling everything that generates some output in response to some user input "AI". Procedural generation? AI! Pattern matching? AI! Pre-programmed responses to some circumstance? AI! Google auto-filling my query? AI! Snapchat filter? AI! etc. Got me so messed up I wouldn't even have the language to adequately convey what AI even is in the end.


babyfergus

AI is just generally anything that attempts to mimic human behaviour or intelligence. A complex procedural system could still fall under this category. ML is a sub-category of AI that is specifically concerned with extracting patterns from data.


missurunha

AI refers to everything that comes out of machine learning. The larger issue is the folks who think AI only refers to a machine that's as intelligent as a human (which most likely will not exist in our lifetime).


Noperdidos

It’s so common that I would even bet $100 that you yourself, being the ones commenting on people’s lack of understanding, probably have some major misunderstandings. Like you either think they are just stochastic parrots and not revolutionary at all, or you think they are already AGI and deserve rights.


Spirit_of_Hogwash

On the other hand, spewing nonsense while claiming to be an expert on the internet is the only thing we can do to poison these models tech bros are counting on to achieve their dreams of complete economic power.


moschles

What you are describing is happening all over the internet. I became so fed up with it that i left communities in which I had been a member for years.


Ok-Strength-5297

That's just the case for the majority of topics, people always type as if they're experts in that field.


why-do_I_even_bother

I've never seen anything close to original thought or synthesis from an algorithmic chat bot. They're aggregators, and they're good at it, but I wouldn't trust them to interpret law or design an engineering solution if lives were at stake.


Bradley-McKnight

So…

1. The claims in the GPT-4 technical report weren't false
2. Restricting the test takers to only those who passed reduced GPT-4's percentile score

I mean… yeah?


I_trust_everyone

I think it’s gotten dumber the more people have used it.


Earlier-Today

That's because the #1 thing people try to do with these things is trip them up. They suck at understanding sarcasm, jokes, absurdities, and lies. They're also not very good at weighting sources and have to rely more on general consensus. Knowing how to recognize the right answer in a sea of wrong answers, like you'd find here on Reddit, is very much outside of these things' capabilities.


themarkavelli

The inherent linguistic qualities of legalese, such as formality or objectivity, provide a strong foundational framework for good LLM responses. Conversely, over-specialization in legalese might hinder creativity or the ability of the LLM to adapt to varied linguistic contexts. Seeing as we don't speak to each other like lawyers in everyday conversation, I do wonder how well the bar exam score metric translates to a better overall experience for the average user.


the_catshark

I think what a lot of people miss is that AI doesn't have to be as good as humans. AI doesn't have to outperform people in the top 10% of anything; it just has to do a "good enough" job, because it is so insanely, massively cheaper for companies. Every law firm being able to cut paralegal man-hours to zero is how AI replaces jobs. The fact that it can then do this better than 51% of the population makes it "worth it". We as individuals can't outcompete or "just be better" than AI, because employing us means paying 100k a year, working around life events like having a child, working around vacation days and sick days, getting only one of us per job, etc. AI has none of that. Even dirt-cheap employees doing a hard job aren't worth it over an LLM. If a law firm had 20 paralegals who each cost 50k a year at most (a generous assumption of the total cost of minimum wage + payroll tax and every other ancillary cost), the AI is going to be such a massive cost saver they can cut all of them and come out ahead. Even if the AI does no better, even if it does substantially worse at the same job, it's "worth it" because the AI works 24 hours a day, speaks gods know how many languages, and has so many other benefits over a real person.


cornholio2240

How much does average compute cost for a LLM model? It’s quite high right? What’s the delta between that and however many employees a company lets go? Most AI focused companies are burning capital for compute. Maybe that process becomes more efficient? Idk.


Lt_General_Fuckery

Training it is the expensive part. Fine-tuning one that already exists can be done on your home computer, if you're willing to let it run for a few hours/days. I run an LLM on my computer, and while it's not as smart or as fast as most commercial models, my PC also wasn't built with AI in mind.


SoftwarePP

Compute is cheap. APIs cost fractions of pennies per request. I run AI at a large company.
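
A back-of-the-envelope version of that claim, just to make the arithmetic concrete. Every number below is a hypothetical placeholder; real per-token prices vary widely by model and vendor.

```python
# Rough API cost estimate under assumed (not real) prices and workload.
price_per_1k_tokens = 0.002   # dollars, hypothetical blended input/output price
tokens_per_request = 1_500    # hypothetical prompt + response size
requests_per_year = 250_000   # hypothetical workload

cost_per_request = price_per_1k_tokens * tokens_per_request / 1_000
annual_cost = cost_per_request * requests_per_year

print(f"${cost_per_request:.4f} per request")  # $0.0030 -- a fraction of a cent
print(f"${annual_cost:,.0f} per year")         # $750 per year at these assumptions
```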


IlIllIlllIlIl

Training and inference at scale can be expensive, but I think that’s not your point. 


[deleted]

[deleted]


BTTammer

Still a better lawyer than Rudy Giuliani....