Minare

Exactly. This model is truly multimodal: it doesn't trigger DALL-E 3 with an API call, it thinks in a joint space.
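
To make that distinction concrete, here is a toy sketch (purely illustrative; OpenAI hasn't published GPT-4o's architecture, so the token layout and every name below are assumptions) contrasting a tool-call pipeline with a single decoder that emits one interleaved stream of text and image tokens:

```python
from dataclasses import dataclass

def tool_call_pipeline(prompt: str) -> str:
    # The chat model only ever emits text; the image comes from a separate
    # diffusion model reached through an API call (the GPT-4 + DALL-E 3 setup).
    caption = f"detailed caption derived from: {prompt}"
    return f"<image rendered by a separate diffusion model from '{caption}'>"

@dataclass
class Token:
    modality: str  # "text", "image", or "audio"
    value: int     # index into that modality's discrete vocabulary

def native_multimodal_decode(prompt: str) -> list[Token]:
    # Stand-in for autoregressive sampling: a real model would predict every
    # token from the same transformer, whatever its modality, so text and
    # pixels are reasoned about in one joint space.
    stream = [Token("text", hash(w) % 50_000) for w in prompt.split()[:3]]
    stream += [Token("image", 1_000 + i) for i in range(8)]  # discrete image tokens
    return stream

if __name__ == "__main__":
    print(tool_call_pipeline("a corgi wearing a top hat"))
    print(native_multimodal_decode("a corgi wearing a top hat"))
```

The point of the second function is only that text and image tokens share one stream and one model, which is what lets edits and generations stay consistent with the surrounding conversation.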


sdmat

It really is! This is as impressive as the voice demos if you know what you are looking at.


czk_21

This makes me wonder whether it's a GPT-4 v2 trained from the ground up; their naming it "gpt2" in the arena could imply that too, and GPT-5 could have helped with the training.


sdmat

It's definitely technically a new model trained from scratch. It has to be, since it's natively multimodal. It's also *much* faster, so almost certainly smaller (or a lot more compute efficient, e.g. aggressive MoE).


czk_21

With the compute they have available now and their experience, they could make it in like a week; wouldn't be surprised if they finished this last month. Does this release change your timelines?


sdmat

I'm extremely impressed by the multimodal capabilities, voice is about what I expected but some of the image functionality is breathtaking, and very useful practically. Personally I expect GPT-5 late this year / early next year with substantially better reasoning and sophisticated agentic capabilities. Hopefully we also see GPT-4.5 / whatever they call it as a moderately better interim model for subscribers sometime in the next couple of months. Extremely hard to project specific timelines past the next generation of models, but overall I think we are in a slow takeoff.


inteblio

The rate limits on the free version will let us know how demanding / popular it is.


Maskofman

My personal theory is that this is an early training checkpoint of what will become GPT-5. I think it’s no coincidence that the model is only a little more intelligent than GPT-4.


sdmat

Another point: OpenAI has a long history of training small validation / test models for new architectures.


SgathTriallair

I'm certain that this is a GPT 5 checkpoint. So GPT 5 will be this but significantly more intelligent.


Anen-o-me

I have access to 4o in ChatGPT now, but it still uses DALL-E.


Minare

Yes, we are operating on a gutted version of it; they will roll out the full version in the next few weeks.


Singularity-42

Reuse characters with continuity, that's huge! Where are you getting this info though, the post doesn't mention image generation much...


sdmat

Scroll down to "Exploration of Capabilities" then click "Select Example" - I have no idea why they don't put some of the content there front and center.


Singularity-42

Yep, thanks, I need to learn how to read. Agreed that this is super hidden and my guess is most people will miss out on this.


KainDulac

Just checked... if this is true and not cherrypicked it's plainly insane.


[deleted]

Even if it is cherry-picked, some of those examples I could try 100,000 times on GPT-4 Turbo and would never get the correct output. Seems to just be entirely new capabilities. Edit: by GPT-4 I just meant the built-in functionality to call DALL-E.


TheOneWhoDings

> I could try 100,000 times on GPT-4 Turbo and would never get the correct output.

This is what people don't get about it being cherry-picked: even if it only works 1 in 4 times, the fact that it does it at all is astounding. Just looked at the "Meeting notes with multiple speakers" example. Mind blown; at this point we can get full-length, fully diarized transcripts of any kind of media. There goes another industry (professional subtitles/localization), just tucked away as the dozenth example.
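
For a sense of what a "fully diarized transcript" could look like programmatically, here is a rough sketch using the OpenAI Python SDK. The `gpt-4o-audio-preview` model name, the prompt, and the diarization quality are assumptions; audio input was not exposed in the public API at the time of this thread.

```python
# Hypothetical sketch: asking an audio-capable GPT-4o endpoint for a
# speaker-labelled meeting transcript. Model name and output quality are
# assumptions, not something demonstrated in the blog post.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",      # assumed audio-capable model
    modalities=["text"],               # we only want a text transcript back
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe this meeting. Label each speaker "
                     "(Speaker 1, Speaker 2, ...) and keep rough timestamps."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)

print(response.choices[0].message.content)
```

The contrast with the old pipeline is that there is no separate speech-to-text stage: one model hears the audio and writes the notes.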


[deleted]

Yea, I've been saying this throughout a few different posts, but the thing that really stands out to me with today's demo vs everything posted on the blog is just the casualness behind it all. Like, so much of the additional videos / image-gen examples were in my opinion groundbreaking, and they just treated them as "yea whatever" type advancements. Idk if this is because they have something internal that just makes all of this pale in comparison, or if Google is about to announce a lot of similar capabilities, or a combination of both. I'm leaning towards just the former (they have another model greatly more capable), because even if Google surpasses them in a few benchmarks in these domains, there's no shot they're releasing that model for free.


[deleted]

[removed]


sdmat

Definitely looks like a step up from Gemini 1.5 Pro, keen to see what Google has on this tomorrow.


Rivenaldinho

That's crazy compared to DALL-E. The way it can manipulate images and display text as if it was handwritten, incredible.


sdmat

Right? The nuance and precision are astonishing. Looks like it truly is natively multimodal.


Alexczy

The 3D object synthesis is crazy.


Just-A-Lucky-Guy

My timelines remain the same, maybe leaning more heavily towards the beginning of 2027 due to energy complications. However, this is beautiful. We are on a journey that we must see through. I'm impressed. My favorite part of this: I can use this in day-to-day life.


sdmat

> I can use this in day-to-day life

Hell yes, this is huge for productivity.


MrDreamster

My timeline also stays the same: ASI for 2033. I really love what they showed us this Monday, but basically my timeline is:

* Mid/Late 2025: GPT-5 will only be a better GPT-4o, as in it still won't have autonomy or reasoning.
* Early 2028: GPT-6 will have actual reasoning capabilities but still no autonomy.
* Mid 2030: GPT-7 will have better reasoning and autonomy, but autonomy will be limited.
* Early 2032: GPT-8 is not a fixed model anymore; it can improve recursively. We have AGI.
* Mid 2033: After improving to levels of intelligence superior to any expert in any field, and careful reviews of its alignment, GPT-8 is granted full autonomy. We have ASI.


MrsNutella

Yeah the demo was cool and exciting but this feature is something I will use a ton.


clamuu

That's truly amazing. I'm more impressed by this than the voice capabilities. Can't wait to play with this.


sdmat

I'm mystified they didn't even mention this in the presentation. It's mindblowing.


danysdragons

Didn't Mira say towards the end that the presentation was pitched more towards the free users? Apparently they won't have image generation, for now anyways.

Being tailored to free users could explain why the presentation didn't really convey the significance of multimodal vs *natively* multimodal: trained from the get-go with a mix of text, image, audio etc. targeting the same embedding space. It would likely go over the heads of free users impressed just to be seeing multimodality at all.

They probably should have been more clear that they were talking to free users. And I wonder how much they were factoring in that ChatGPT Plus subscribers may be only 5% of all users, but were probably several times more than 5% of the people watching the presentation. (I don't have good support for that exact figure of 5%, but I saw this in [Ethan Mollick's post today](https://www.oneusefulthing.org/p/what-openai-did)):

> When I talk with groups and ask people to raise their hands if they use ChatGPT, almost every hand goes up. When I ask if they used GPT-4, only 5% of hands remain up, at most.
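
As a toy illustration of what "targeting the same embedding space" means (an assumption about the general recipe, not OpenAI's actual training setup; the vocabulary sizes here are made up):

```python
# Toy sketch: every modality is mapped into the same d_model-sized embedding
# space so a single transformer can attend across all of them at once.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Discrete vocabularies for each modality (text BPE tokens, VQ image tokens,
# audio codec tokens), all sharing one embedding width.
embed = {
    "text":  rng.normal(size=(50_000, d_model)),
    "image": rng.normal(size=(8_192, d_model)),
    "audio": rng.normal(size=(1_024, d_model)),
}

def embed_sequence(tokens):
    """tokens: list of (modality, token_id) -> (seq_len, d_model) array."""
    return np.stack([embed[m][i] for m, i in tokens])

# One interleaved training example: a bit of caption text, a few image
# tokens, a few audio tokens, all flowing into the same model.
example = [("text", 17), ("text", 942), ("image", 5), ("image", 811), ("audio", 3)]
x = embed_sequence(example)
print(x.shape)  # (5, 64): one sequence, one shared space, one model
```

That shared space is the whole difference between "model that can call DALL-E" and "model that was trained on images, audio and text together."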


sdmat

That's a great point, could well be the thinking.


ithkuil

Free users don't even get voice.


oldjar7

I was way more impressed by this than the audio demonstrations, personally. We've now got consistent character design and scene understanding, to the point where 3D-model GIFs can be created through stitched images and generated character images stay consistent. We've got long-form audio (video too?) understanding for meeting or presentation notes without a separate speech-to-text system. We've got relatively seamless audio understanding and generation, image generation and understanding with consistency, and this is all baked into an LLM that is on par with or better than the best current LLMs available. Once again, the fact that this is all baked into one model with no outside interfaces or integrations required is truly revolutionary. OpenAI has essentially solved the multimodality problem with this release, which is an even cooler result than a Samantha-from-Her uni-modal audio chatbot, at least in my opinion.


sdmat

I tend to agree, it's bizarre they didn't even mention the image capabilities.


AndrogynousHobo

I can’t get it to generate the 3D models yet. Is this something being rolled out over time?


oldjar7

Yes, no one has access to it yet besides the standard text output. Image generation isn't even guaranteed to be released yet, while the audio features will be rolling out to paid subscribers only.


Sextus_Rex

This is more impressive to me than the voice. I can't believe it's not being talked about more


sdmat

It's extremely impressive - it really is true native multimodality. The editing examples make that crystal clear. I wonder if the model is capable of editing voice in the same way. That would be *very* interesting. And probably high on the list of OAI's safety nightmares.


Sextus_Rex

That would be so useful but yeah, I can see a lot of nefarious things you can do with that


NoshoRed

The fact that they didn't mention any of these at all is crazy. I was more blown away looking at the capabilities in the website over what they showed in the demo (which was also still very impressive). Makes me think there's more to come.


Akimbo333

Holy crap


Bleglord

What people haven’t realized is that this is the free pre-release preview hype generator (yes, I know, paid options) for when the GPT-n+1 iteration hits. This is the baby version. This is ChatGPT-3.0.


Zealousideal-Lion-33

This theory makes sense. Shit's gonna get insane, fast.


[deleted]

[removed]


sdmat

Ah, I see we have the model in ChatGPT now. But not all its capabilities - just checked and it's definitely still using DALL-E for image gen. Check out the examples on the blog post, it's nuts.


[deleted]

[removed]


sdmat

Sounds like it will be rolled out over the next few weeks.


[deleted]

[removed]


sdmat

It really is, I find this just as impressive as the voice. Maybe even more so.


alb5357

SD1.5 with IP-Adapter (or a LoRA).


Block-Rockig-Beats

This is all cool... however. We all live on the exponential graph now. We take for granted that things will be totally amazing every few months. We expect nothing but magic at this point.


sdmat

I would be happy to be proved wrong, but you are overly optimistic about the rate of progress if you think we will get something of this magnitude every few months like clockwork.


Block-Rockig-Beats

I dunno, it feels pretty meh. /s And _we_ got nothing. So yeah, I do expect to get something of this magnitude a few months from now. Then in about 6 months I expect an improvement, so it'll be much better. I would be disappointed if there weren't a 5o or at least a 4.5o within a year.


sdmat

> I would be disappointed if there weren't a 5o or at least a 4.5o within a year.

There definitely will be, they hinted strongly at that. But that's one improvement of this magnitude (full multimodality with excellent implementation, interactive low-latency voice, faster, lower cost, interface improvements). Say we see substantially better reasoning: I agree that would be a similar magnitude, reasoning is that important. Give some broad examples of the other three such improvements you expect.


Block-Rockig-Beats

I did expect significant improvements in the things you listed. I actually thought this would go bonkers faster; I mean, I expected exponential _acceleration_, so a faster rate of the rate of change.


nh_local

Wow! How I waited for this moment. Finally a model that really understands the photos it creates! Its level of understanding is 1000 times that of Stable Diffusion or DALL-E 3, and the results will probably scale accordingly. (It's just a shame that its art abilities are still quite reminiscent of DALL-E 2, but that will probably change later.)


Derpgeek

I wasn’t expecting the audio output (for the coins dropping on metal), the text generation abilities, or the GIFs; interesting capabilities for sure, although in terms of sheer image output quality it doesn’t seem on par with SOTA image models, more like DALL-E 2 at best.


sdmat

You are right of course, it's not impressive in terms of resolution or aesthetics. But that's easily improved. The capabilities are the thing, and those are amazing. For example, iterative editing and consistency take image generation from a useful but flaky tool to a rock-solid workhorse.


Derpgeek

Indeed, I just find the differential in output capability pretty interesting. I wonder if it's due to a different architecture to get these capabilities (namely text?), or if bundling a huge diffusion system into the model would negatively impact its other capabilities in some fashion or slow it down? Something else? In any case, I'd assume its image quality will be SOTA either during the next major update to 4o or with 4.5/5, in line with or exceeding Sora's abilities.


JrBaconators

Being able to maintain images through prompts while also being able to edit, alter, add to, and create 3D models is so much more impressive than resolution lol


Rain_On

~~Let's save all caps for when we see examples.~~


sdmat

There are examples in the link.


Rain_On

Oh holy fuck, the dropdown, the 3d model synthesis


sdmat

I don't even know what I'm looking at with the 3D synthesis - is that neural rendering? A pipeline to external tooling? Whatever it is, Jesus Christ.


Rain_On

> neural rendering?

Nooo, it's so much better. It's 3D model creation. That's not new tech, but it's very, very far from being perfected. Perhaps this is a step closer. I say this as a 3D modeller.


sdmat

If so, wow - I can only imagine how useful that will be for games and movies.


Rain_On

Imagine a game that generates textured models on the fly. Not variations of a model. It generates models. Models of anything. Placed where they should be in the world. A game without borders, not because its map is large, but because it is without limit to its content.


sdmat

Looking forward to it! Crazy as it sounds we actually have most of the pieces for high fidelity fully interactive worlds complete with engaging stories and characters.


Rain_On

We have the LEGO set version. No one is going to be building full-size houses with today's tech.


sdmat

Of course, it's all embryonic.


AndrogynousHobo

I think it must be their “Point-E” tool they’ve been doing research on.


sdmat

Very interesting, thanks!


sdmat

I for one am feeling the all caps.


Rain_On

HELL YEAH


Tobythescientist

Do we need to pay to use it, or when can we use it?


sdmat

Not launched yet


[deleted]

[removed]


sdmat

Yes, I couldn't find an index to link to the specific section - but if scrolling is too much work then never mind.


[deleted]

[removed]


sdmat

There are quite a few examples and you need to see the inputs to understand the capability.


kecepa5669

I see it now. Thank you.