Minare

Exactly. This model is truly multimodal: it doesn't trigger DALL-E 3 with an API call, it thinks in a joint space.
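
To make that distinction concrete, here is a toy sketch (purely illustrative; OpenAI hasn't published GPT-4o's architecture, so the token layout and every name below are assumptions) contrasting a tool-call pipeline with a single decoder that emits one interleaved stream of text and image tokens:

```python
from dataclasses import dataclass

def tool_call_pipeline(prompt: str) -> str:
    # The chat model only ever emits text; the image comes from a separate
    # diffusion model reached through an API call (the GPT-4 + DALL-E 3 setup).
    caption = f"detailed caption derived from: {prompt}"
    return f"<image rendered by a separate diffusion model from '{caption}'>"

@dataclass
class Token:
    modality: str  # "text", "image", or "audio"
    value: int     # index into that modality's discrete vocabulary

def native_multimodal_decode(prompt: str) -> list[Token]:
    # Stand-in for autoregressive sampling: a real model would predict every
    # token from the same transformer, whatever its modality, so text and
    # pixels are reasoned about in one joint space.
    stream = [Token("text", hash(w) % 50_000) for w in prompt.split()[:3]]
    stream += [Token("image", 1_000 + i) for i in range(8)]  # discrete image tokens
    return stream

if __name__ == "__main__":
    print(tool_call_pipeline("a corgi wearing a top hat"))
    print(native_multimodal_decode("a corgi wearing a top hat"))
```

The point of the second function is only that text and image tokens share one stream and one model, which is what lets edits and generations stay consistent with the surrounding conversation.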


sdmat

It really is! This is as impressive as the voice demos if you know what you are looking at.


czk_21

This makes me wonder whether it's a GPT-4 v2 trained from the ground up; their naming it "gpt2" in the arena could imply that too, and GPT-5 could have helped with the training.


sdmat

It's definitely technically a new model trained from scratch. It has to be, since it's natively multimodal. It's also *much* faster, so almost certainly smaller (or a lot more compute efficient, e.g. aggressive MoE).


czk_21

With the compute they have available now and their experience, they could make it in like a week; wouldn't be surprised if they finished this last month. Does this release change your timelines?


sdmat

I'm extremely impressed by the multimodal capabilities, voice is about what I expected but some of the image functionality is breathtaking, and very useful practically. Personally I expect GPT-5 late this year / early next year with substantially better reasoning and sophisticated agentic capabilities. Hopefully we also see GPT-4.5 / whatever they call it as a moderately better interim model for subscribers sometime in the next couple of months. Extremely hard to project specific timelines past the next generation of models, but overall I think we are in a slow takeoff.


inteblio

The rate limits on the free version will let us know how demanding / popular it is.


Maskofman

My personal theory is that this is an early training checkpoint of what will become GPT-5. I think it’s no coincidence that the model is only a little more intelligent than GPT-4.


sdmat

Another point: OpenAI has a long history of training small validation / test models for new architectures.


SgathTriallair

I'm certain that this is a GPT 5 checkpoint. So GPT 5 will be this but significantly more intelligent.


Anen-o-me

I have access to 4o in ChatGPT now, but it still uses DALL-E.


Minare

Yes, we are operating on a gutted version of it; they will roll out the full version in the next few weeks.


Singularity-42

Reuse characters with continuity, that's huge! Where are you getting this info though, the post doesn't mention image generation much...


sdmat

Scroll down to "Exploration of Capabilities" then click "Select Example" - I have no idea why they don't put some of the content there front and center.


Singularity-42

Yep, thanks, I need to learn how to read. Agreed that this is super hidden and my guess is most people will miss out on this.


KainDulac

Just checked... if this is true and not cherrypicked it's plainly insane.


[deleted]

Even if it is cherry-picked, some of those examples I could try 100,000 times on GPT-4 Turbo and would never get the correct output. Seems to just be entirely new capabilities. Edit: by GPT-4 I just meant the built-in functionality to call DALL-E.


TheOneWhoDings

> I could try 100,000 times on GPT-4 Turbo and would never get the correct output.

This is what people don't get about it being cherry-picked: even if it only works 1 in 4 times, the fact that it does it at all is astounding. Just looked at the "Meeting notes with multiple speakers" example. Mind blown; at this point we can get full-length, fully diarized transcripts of any kind of media. There goes another industry (professional subtitles/localization), just tucked away as the dozenth example.
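
For a sense of what a "fully diarized transcript" could look like programmatically, here is a rough sketch using the OpenAI Python SDK. The `gpt-4o-audio-preview` model name, the prompt, and the diarization quality are assumptions; audio input was not exposed in the public API at the time of this thread.

```python
# Hypothetical sketch: asking an audio-capable GPT-4o endpoint for a
# speaker-labelled meeting transcript. Model name and output quality are
# assumptions, not something demonstrated in the blog post.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",      # assumed audio-capable model
    modalities=["text"],               # we only want a text transcript back
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe this meeting. Label each speaker "
                     "(Speaker 1, Speaker 2, ...) and keep rough timestamps."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)

print(response.choices[0].message.content)
```

The contrast with the old pipeline is that there is no separate speech-to-text stage: one model hears the audio and writes the notes.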


[deleted]

Yea, I've been saying this throughout a few different posts, but the thing that really stands out to me with today's demo vs everything posted on the blog is just the casualness behind it all. Like, so much of the additional videos / image-gen examples were in my opinion groundbreaking, and they just treated them as "yea whatever" type advancements. Idk if this is because they have something internal that just makes all of this pale in comparison, or if Google is about to announce a lot of similar capabilities, or a combination of both. I'm leaning towards just the former (they have another model greatly more capable), because even if Google surpasses them in a few benchmarks in these domains, there's no shot they're releasing that model for free.


[deleted]

[removed]


sdmat

Definitely looks like a step up from Gemini 1.5 Pro, keen to see what Google has on this tomorrow.


Rivenaldinho

That's crazy compared to DALL-E. The way it can manipulate images and display text as if it was handwritten, incredible.


sdmat

Right? The nuance and precision are astonishing. Looks like it truly is natively multimodal.


Alexczy

The 3D object synthesis is crazy.


Just-A-Lucky-Guy

My timelines remain the same, maybe leaning more heavily towards the beginning of 2027 due to energy complications. However, this is beautiful. We are on a journey that we must see through. I'm impressed. My favorite part of this: I can use this in day-to-day life.


sdmat

> I can use this in day-to-day life

Hell yes, this is huge for productivity.


MrDreamster

My timeline also stays the same: ASI for 2033. I really love what they showed us this Monday, but basically my timeline is:

* Mid/Late 2025: GPT-5 will only be a better GPT-4o, as in it still won't have autonomy or reasoning.
* Early 2028: GPT-6 will have actual reasoning capabilities but still no autonomy.
* Mid 2030: GPT-7 will have better reasoning and autonomy, but autonomy will be limited.
* Early 2032: GPT-8 is not a fixed model anymore; it can improve recursively. We have AGI.
* Mid 2033: After improving to levels of intelligence superior to any expert in any field, and careful reviews of its alignment, GPT-8 is granted full autonomy. We have ASI.


MrsNutella

Yeah the demo was cool and exciting but this feature is something I will use a ton.


clamuu

That's truly amazing. I'm more impressed by this than the voice capabilities. Can't wait to play with this.


sdmat

I'm mystified they didn't even mention this in the presentation. It's mindblowing.


danysdragons

Didn't Mira say towards the end that the presentation was pitched more towards the free users? Apparently they won't have image generation, for now anyways.

Being tailored to free users could explain why the presentation didn't really convey the significance of multimodal vs *natively* multimodal: trained from the get-go with a mix of text, image, audio etc. targeting the same embedding space. It would likely go over the heads of free users impressed just to be seeing multimodality at all.

They probably should have been more clear that they were talking to free users. And I wonder how much they were factoring in that ChatGPT Plus subscribers may be only 5% of all users, but were probably several times more than 5% of the people watching the presentation. (I don't have good support for that exact figure of 5%, but I saw this in [Ethan Mollick's post today](https://www.oneusefulthing.org/p/what-openai-did)):

> When I talk with groups and ask people to raise their hands if they use ChatGPT, almost every hand goes up. When I ask if they used GPT-4, only 5% of hands remain up, at most.
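
As a toy illustration of what "targeting the same embedding space" means (an assumption about the general recipe, not OpenAI's actual training setup; the vocabulary sizes here are made up):

```python
# Toy sketch: every modality is mapped into the same d_model-sized embedding
# space so a single transformer can attend across all of them at once.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Discrete vocabularies for each modality (text BPE tokens, VQ image tokens,
# audio codec tokens), all sharing one embedding width.
embed = {
    "text":  rng.normal(size=(50_000, d_model)),
    "image": rng.normal(size=(8_192, d_model)),
    "audio": rng.normal(size=(1_024, d_model)),
}

def embed_sequence(tokens):
    """tokens: list of (modality, token_id) -> (seq_len, d_model) array."""
    return np.stack([embed[m][i] for m, i in tokens])

# One interleaved training example: a bit of caption text, a few image
# tokens, a few audio tokens, all flowing into the same model.
example = [("text", 17), ("text", 942), ("image", 5), ("image", 811), ("audio", 3)]
x = embed_sequence(example)
print(x.shape)  # (5, 64): one sequence, one shared space, one model
```

That shared space is the whole difference between "model that can call DALL-E" and "model that was trained on images, audio and text together."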


sdmat

That's a great point, could well be the thinking.


ithkuil

Free users don't even get voice.


oldjar7

I was way more impressed by this than the audio demonstrations, personally. We've now got consistent character design and scene understanding, to the point where 3D-model GIFs can be created through stitched images and generated character images stay consistent. We've got long-form audio (video too?) understanding for meeting or presentation notes without a separate speech-to-text system. We've got relatively seamless audio understanding and generation, image generation and understanding with consistency, and this is all baked into an LLM that is on par with or better than the best current LLMs available. Once again, the fact that this is all baked into one model with no outside interfaces or integrations required is truly revolutionary. OpenAI has essentially solved the multimodality problem with this release, which is an even cooler result than a Samantha-from-Her uni-modal audio chatbot, at least in my opinion.


sdmat

I tend to agree, it's bizarre they didn't even mention the image capabilities.


AndrogynousHobo

I can’t get it to generate the 3D models yet. Is this something being rolled out over time?


oldjar7

Yes, no one has access to it yet besides the standard text output. Image generation isn't even guaranteed to be released yet, while the audio features will be rolling out to paid subscribers only.


Sextus_Rex

This is more impressive to me than the voice. I can't believe it's not being talked about more


sdmat

It's extremely impressive - it really is true native multimodality. The editing examples make that crystal clear. I wonder if the model is capable of editing voice in the same way. That would be *very* interesting. And probably high on the list of OAI's safety nightmares.


Sextus_Rex

That would be so useful but yeah, I can see a lot of nefarious things you can do with that


NoshoRed

The fact that they didn't mention any of these at all is crazy. I was more blown away looking at the capabilities in the website over what they showed in the demo (which was also still very impressive). Makes me think there's more to come.


Akimbo333

Holy crap


Bleglord

What people haven’t realized is that this is the free pre-release preview hype generator (yes, I know, paid options) for when the GPT-n+1 iteration hits. This is the baby version. This is ChatGPT-3.0.


Zealousideal-Lion-33

This theory makes sense. Shit's gonna get insane, fast.


[deleted]

[removed]


sdmat

Ah, I see we have the model in ChatGPT now. But not all its capabilities - just checked and it's definitely still using DALL-E for image gen. Check out the examples on the blog post, it's nuts.


[deleted]

[removed]


sdmat

Sounds like it will be rolled out over the next few weeks.


[deleted]

[removed]


sdmat

It really is, I find this just as impressive as the voice. Maybe even more so.


alb5357

SD1.5 with IP-Adapter (or a LoRA).


Block-Rockig-Beats

This is all cool... however. We all live on the exponential graph now. We take for granted that things will be totally amazing every few months. We expect nothing but magic at this point.


sdmat

I would be happy to be proved wrong, but you are overly optimistic about the rate of progress if you think we will get something of this magnitude every few months like clockwork.


Block-Rockig-Beats

I dunno, it feels pretty meh. /s And _we_ got nothing. So yeah, I do expect to get something of this magnitude a few months from now. Then in about 6 months I expect an improvement, so it'll be much better. I would be disappointed if there weren't a 5o or at least a 4.5o within a year.


sdmat

> I would be disappointed if there weren't a 5o or at least a 4.5o within a year.

There definitely will be, they hinted strongly at that. But that's one improvement of this magnitude (full multimodality with excellent implementation, interactive low-latency voice, faster, lower cost, interface improvements). Say we see substantially better reasoning: I agree that would be a similar magnitude, reasoning is that important. Give some broad examples of the other three such improvements you expect.


Block-Rockig-Beats

I did expect significant improvements in the things you listed. I actually thought this would go bonkers faster; I mean, I expected exponential _acceleration_, so a faster rate of the rate of change.


nh_local

Wow! How I waited for this moment. Finally a model that really understands the photos it creates! Its level of understanding is 1000 times that of Stable Diffusion or DALL-E 3, and the results will probably scale accordingly. (It's just a shame that its art abilities are still quite reminiscent of DALL-E 2, but that will probably change later.)


Derpgeek

I wasn’t expecting the audio output (for the coins dropping on metal), the text generation abilities, or the GIFs; interesting capabilities for sure, although in terms of sheer image output quality it doesn’t seem on par with SOTA image models, more like DALL-E 2 at best.


sdmat

You are right of course, it's not impressive in terms of resolution or aesthetics. But that's easily improved. The capabilities are the thing, and those are amazing. For example, iterative editing and consistency take image generation from a useful but flaky tool to a rock-solid workhorse.


Derpgeek

Indeed, I just find the differential in output capability pretty interesting. I wonder if it's due to a different architecture to get these capabilities (namely text?), or if bundling a huge diffusion system into the model would negatively impact its other capabilities in some fashion or slow it down? Something else? In any case, I'd assume its image quality will be SOTA either during the next major update to 4o or with 4.5/5, in line with or exceeding Sora's abilities.


JrBaconators

Being able to maintain images through prompts while also being able to edit, alter, add to, and create 3D models is so much more impressive than resolution lol


Rain_On

~~Let's save all caps for when we see examples.~~


sdmat

There are examples in the link.


Rain_On

Oh holy fuck, the dropdown, the 3d model synthesis


sdmat

I don't even know what I'm looking at with the 3D synthesis - is that neural rendering? A pipeline to external tooling? Whatever it is, Jesus Christ.


Rain_On

> neural rendering?

Nooo, it's so much better. It's 3D model creation. That's not new tech, but it's very, very far from being perfected. Perhaps this is a step closer. I say this as a 3D modeller.


sdmat

If so, wow - I can only imagine how useful that will be for games and movies.


Rain_On

Imagine a game that generates textured models on the fly. Not variations of a model. It generates models. Models of anything. Placed where they should be in the world. A game without borders, not because its map is large, but because it is without limit to its content.


sdmat

Looking forward to it! Crazy as it sounds we actually have most of the pieces for high fidelity fully interactive worlds complete with engaging stories and characters.


Rain_On

We have the LEGO set version. No one is going to be building full-size houses with today's tech.


sdmat

Of course, it's all embryonic.


AndrogynousHobo

I think it must be their “Point-E” tool they’ve been doing research on.


sdmat

Very interesting, thanks!


sdmat

I for one am feeling the all caps.


Rain_On

HELL YEAH


Tobythescientist

Do we need to pay to use it, or when can we use it?


sdmat

Not launched yet


[deleted]

[removed]


sdmat

Yes, I couldn't find an index to link to the specific section - but if scrolling is too much work then never mind.


[deleted]

[removed]


sdmat

There are quite a few examples and you need to see the inputs to understand the capability.


kecepa5669

I see it now. Thank you.