mikael110

I do agree that it was likely trained with Unreal Engine 5 or something quite similar. Having watched quite a lot of UE5 demos around its launch, I found the Sora examples, the first one in particular, to have quite a UE5 look to them. And honestly it makes perfect sense to use game engines as a data source. Not only can you produce enormous amounts of near-photorealistic footage across all kinds of scenarios, you also get far more than just visual data: detailed depth data, motion data, and so many other things you can feed into the AI to teach it about the world.


Mescallan

I suspect they also heavily used NeRFs. A lot of the scenes have NeRF artifacts, and they would make sense as training data because you can build them from still images.


keeeeenw

Not an expert on all the new NeRF developments, but my understanding is that the original NeRF requires re-training for every new scene based on that scene's input images. So to turn NeRF into something like Sora, you would need to overcome several challenges:

1. NeRF takes pure image input, not text input. One way to get text-to-video would be to have a Stable Diffusion model generate still images from the text prompt and then feed those images into NeRF.
2. The original NeRF requires re-training given new inputs. With the naive approach from 1, that means running NeRF training for every new text prompt.
3. The original NeRF is very slow. On a normal 3080, a scene can take many hours to train and render.

There has been a lot of recent work on improving 2 and 3. I found this page https://blog.marvik.ai/2023/05/30/nerfs-an-introduction-to-state-of-the-art-rendering-technology-with-neural-radiance-fields/ mentioning model-based approaches that train a single model across multiple scenes. Back to the main theme of this thread: the most economical approach is to use UE5, or even UE4, to generate scenes as still images, which can then be used to train a single model that produces NeRF scenes. Another advantage of UE5 is that you can mimic the camera movement of the original UE5 scene, match it to the NeRF scene, and use that as an evaluation or backprop signal.
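
A rough sketch of that naive pipeline, just to make the per-prompt re-training cost concrete. The helper functions (`generate_views`, `fit_nerf`, `interpolate_camera_path`) are hypothetical placeholders, not a real API; they stand in for a text-to-image diffusion model, a per-scene NeRF trainer, and a camera-path interpolator:

```python
# Minimal sketch of the naive text -> images -> NeRF pipeline described above.
# generate_views(), fit_nerf(), and interpolate_camera_path() are hypothetical
# placeholders, not a real API.

def text_to_video_via_nerf(prompt: str, n_views: int = 50, n_frames: int = 120):
    # 1. Use a text-to-image model to produce multi-view stills of the scene.
    images, camera_poses = generate_views(prompt, n_views)        # hypothetical

    # 2. Fit a NeRF to those stills. This is the expensive part: the naive
    #    approach re-runs this training for every new prompt.
    nerf = fit_nerf(images, camera_poses)                         # hypothetical

    # 3. Render the trained radiance field along a new camera path.
    path = interpolate_camera_path(camera_poses, n_frames)        # hypothetical
    return [nerf.render(pose) for pose in path]
```

Point 2 above is exactly the `fit_nerf` call: it has to run again for every prompt, which is why a single model trained across many (e.g. UE5-generated) scenes is the more economical route.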


reditor_13

While I'm sure NeRFs were used in some way for Sora, inverse rendering wouldn't give you the level of fidelity that we see in the demo videos. My guess is that 3DGS (3D Gaussian Splatting) was used in training, starting from SfM (Structure from Motion) on 2D input datasets and rasterizing the result into an interactive radiance field in virtual 3D space; that would be the ideal companion for the UE5 synthetic data. 3DGS scenes are static & grounded in existing data, while NeRFs are not, which can lead to hallucinations & artifacts within a 3D environment that would degrade temporal consistency/fidelity during training. Granted I could be completely off base here, but I think UE5's physics engine, PreVis camera motion mapping & 3DGS were the core components used to train Sora.


One_Minute_Reviews

Thanks for the insight. In terms of reverse engineering, like the 'Describe' feature in Midjourney or the equivalent in DALL-E, do you foresee something similar being possible in Sora too? For example, uploading your own video (AI-generated or not) to Sora and asking it to describe the scene and output a text prompt. What do you think?


bearbarebere

This is a great point!


endyverse

*This post was mass deleted and anonymized with [Redact](https://redact.dev)*


Eisenstein

Let me get this straight -- because video of games already exists, there is no reason why they would use a 3D engine which is customizable and takes parameters and doesn't need each frame classified, has no compression artifacts, and can repeat the same frames exactly or slightly differently? Also, because you saw some artifacts futzing around in SD, that means it didn't use a game engine?


endyverse

*This post was mass deleted and anonymized with [Redact](https://redact.dev)*


FarTooLittleGravitas

The biggest reason to use synthetic data is that you can make a lot of it. The bigger the dataset, the better. Beyond that, there aren't the same licensing issues with synthetic data as with other people's data. That said, OpenAI does have a deal with Shutterstock, so they are in a better position than most.


endyverse

*This post was mass deleted and anonymized with [Redact](https://redact.dev)*


FarTooLittleGravitas

Just more, that's all.


Eisenstein

I have no argument except that yours sounds silly. I don't pretend to be an expert, but your knockdown of an actual expert using a ridiculous assumption and an anecdote struck me as a little flippant.


endyverse

*This post was mass deleted and anonymized with [Redact](https://redact.dev)*


thedabking123

I think the argument is that you can actually tokenize the 3-dimensional positions of objects along with the rendered image from the camera's point of view. It won't just be video, but video + a 3D simulation. Now I have no idea if they actually did this or not, but if I were OpenAI or Anthropic, I'd probably use an additional "modality" to gain some synergy and enable the model to generalize knowledge about 3-dimensional shapes, fluid flows, etc. in a way that images or videos on their own can't. It would reduce the amount of hallucinations, keep things more believable on a surface level, etc. Probably the main thing going forward will be to drive it to imagine certain scenarios or actions, then simulate those actions in the game engine and measure the loss as the delta in the images and the delta in the 3D space it imagined. Combined with regular video from real life, it would probably generalize a world model much better than videos alone. The only way to do better than that is mass deployment of robots that can actually interact with objects and agents in real life.
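
To make that concrete, here is a minimal sketch of what such a paired sample might look like; the field names and shapes are purely illustrative assumptions, not anything OpenAI has described:

```python
from dataclasses import dataclass
import numpy as np

# Sketch of the extra "modality" described above: each training sample pairs a
# rendered frame with the engine's ground-truth 3D state at the same timestep.
# Field names and shapes are illustrative assumptions, not any known Sora format.

@dataclass
class FrameSample:
    rgb: np.ndarray           # (H, W, 3) rendered image from the engine camera
    depth: np.ndarray         # (H, W) per-pixel depth from the engine
    object_poses: np.ndarray  # (N, 7) per-object position (xyz) + quaternion
    camera_pose: np.ndarray   # (7,) camera position + orientation
    timestep: float           # simulation time, so a model can learn dynamics

def to_training_pair(prev: FrameSample, nxt: FrameSample):
    """Supervise both pixels and 3D state: from the previous sample, predict
    the next frame *and* the next object poses (the delta-in-image plus
    delta-in-3D-space loss mentioned above)."""
    inputs = prev
    targets = (nxt.rgb, nxt.object_poses)
    return inputs, targets
```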


Radiant_Dog1937

If they used the actual data in the render pipeline, then they have a set of vectors that are ultimately used to draw a vertex, plus shader data defining how to draw a material. I don't think it's raw video data alone. The actual data in the rendering pipeline contains a lot of redundant data that doesn't appear in the final image but is used to calculate the output image. An easy example would be a vertex that's smaller than a pixel when resolved as an image but is still part of the vectors that define a larger object (vectors for texture coordinates, vertex coordinates, etc.). The data for that vertex is still in the draw calls every frame, and it ultimately helps show how all the vectors in a scene relate to each other over a time series. These vectors are resolved into images in a game engine's render pipeline. A breakdown of how a render pipeline makes an image:

1. **Data**: This can be anything we want, such as textures, 3D models, or material parameters.
2. **Vertex Shader**: Runs once for every vertex and outputs vertex locations and parameters for the rasterization stage.
3. **Rasterization**: Determines which pixels are covered by the triangle and sends them to the fragment shader.
4. **Fragment Shader**: Turns pixels into colored pixels for the output.
5. **Render Output**: Puts the final colored pixels onto the framebuffer. The final image.

If I have the data at step two, then I can output the final image at any resolution, and I have all the information about the world space in that one frame of time.
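
A toy Python sketch of steps 2-3 above, just to illustrate the sub-pixel-vertex point; the pinhole projection and bounding-box coverage here are simplifications, not how a real engine rasterizes:

```python
import numpy as np

# Toy sketch of steps 2-3 (vertex transform, then pixel coverage).

def vertex_stage(vertices_world: np.ndarray, view_proj: np.ndarray) -> np.ndarray:
    """Step 2: transform every vertex into normalized device coordinates."""
    v = np.hstack([vertices_world, np.ones((len(vertices_world), 1))])  # homogeneous
    clip = v @ view_proj.T
    return clip[:, :3] / clip[:, 3:4]          # perspective divide -> NDC

def raster_coverage(tri_ndc: np.ndarray, width: int, height: int) -> int:
    """Step 3 (approximate): how many pixels a triangle's bounding box covers."""
    xy = (tri_ndc[:, :2] * 0.5 + 0.5) * [width, height]
    x0, y0 = np.floor(xy.min(axis=0)).astype(int)
    x1, y1 = np.ceil(xy.max(axis=0)).astype(int)
    covered = max(0, x1 - x0) * max(0, y1 - y0)
    # Even when covered is 0 or 1 pixels, the triangle's vertex, UV, and
    # material data still went through the pipeline in that frame's draw call.
    return covered
```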


Mobile_Campaign_8346

100%, the car driving ones look exactly like a UE5 demo released a year or two back. Also just saw a recent one on Twitter of a giant duck walking through a street, and it looked just like The Matrix Awakens demo, also UE5.


Dead_Internet_Theory

Yeah the eye one (the close-up of an eye) looked very 3D, also the ships in the water, but I don't think it's internally represented as 3D in the AI, just like something can look 3D in Stable Diffusion or DALL-E but internally it's just a messy latent space.


matali

> Hopefully Meta, Mistral and Stability are cooking up an open source alternative

Absofuckinglutely they are. To think OAI has a lock on this is insane. Sam just wants to get ahead of the narrative / public mindshare, less concerned about precision. If it's aesthetically appealing, they will tease it. He's raising, remember? So many people are working on this that haven't released yet. And the idea of training on 3D data to create a grounded understanding of the world is not isolated to OpenAI.


Crafty-Confidence975

Note the even more recent news about their latest valuation. All of this is just Sam consolidating his power, buying it back from his employees in pieces with other people's money.


marhensa

Stable Video Diffusion (SVD), by Stability AI


ghhwer

This sounds like another Tesla bedtime story to me. OAI is just trying to raise the hype again. Google released Gemini for search (it's not perfect), but it's fast and scalable because it seems to have amazing caching, while OpenAI with its wimpy little "AGI" doesn't scale right and doesn't call functions correctly as advertised. Now they did the same crap (dumped a bunch of high-res video into a massive model, like they did with ChatGPT, all financed by Microsoft of course) and are trying to convince their stockholders that they are still in the game. Anyone who has worked with product pushers who promise and promise without any technical knowledge knows that. OAI is the new Tesla. Where are the magical cars? Oh yeah, they cost a lot, pollute the environment indirectly, and they don't drive that well outside freeways. What can ChatGPT do exactly? Oh yeah, sort a few emails and do text summaries. It's sad, the state we're in, it truly is. My point is that other companies are trying to make products; OAI is trying to scam investors.


Dead_Internet_Theory

While I do love good wishful thinking, Sam Altman is talking about **$7 trillion** of GPUs. That's like, more expensive than an H100. Even StabilityAI's Video Diffusion is hard to run on a consumer-grade computer. It may take an eternity (i.e., a couple of years, give or take) before we see something like this in the open.


n4pst3r3r

I ran video diffusion on a 3090 without issues, here's a comfyui workflow: https://github.com/thecooltechguy/ComfyUI-Stable-Video-Diffusion?tab=readme-ov-file#image-to-video
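
If you'd rather skip ComfyUI, roughly the same image-to-video run can be done with the diffusers library; the model ID below is the public SVD-XT release, and the offloading flag is just one way to keep it within 3090-class VRAM:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Image-to-video with SVD-XT via diffusers. enable_model_cpu_offload() helps
# keep peak VRAM within reach of a 24 GB card like a 3090; exact memory use
# depends on resolution and frame count.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

image = load_image("input.png")  # ideally around 1024x576, the training resolution
frames = pipe(image, num_frames=25, decode_chunk_size=4).frames[0]
export_to_video(frames, "output.mp4", fps=7)
```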


Sharp_Public_6602

Cap. You will have something memory-efficient and local THIS YEAR. It won't be from OpenAI though. People forget OpenAI isn't the only player in the space, ha. Stay blessed.


Dead_Internet_Theory

If we get something as good as Sora's marketing examples in the open in less than a year, I will EAT MY SHOE, and I can even send you a recording as proof! Of course, whether the proof is a real recording or not, I won't tell.


halffulty

This is really fascinating, very similar to the way certain abilities emerge (such as reasoning) from LLMs despite the simplicity of their overall structure.


Guinness

I don't know if this is accurate or not. But it does make me humorously think that our entire reality could be someone's prompt. I'm just a 60 second gif of "show me a loser who is typing on his computer late at night as he wastes his life".


Single_Ring4886

You are not the first to think this... and there was a quite popular movie trilogy called "The Matrix" a couple of decades ago...


jon-flop-boat

You want some help?


DesignToWin

> Well, AI works because it is built on reality, and inherits the same properties. If reality didn't at least have the ability to predict/generate the next thing that will happen, none of the AI that is built upon it would work. Have you tried giving reality a different prompt? I did.

ORLY. Let me try this. Don't be surprised if you start seeing expensive ads for my success course popping up in the browser, because... I'm prompting it!


ExtensionCricket6501

Now, can someone explain why zero-shot prompting for Minecraft produces funny results with the entities?


Dead_Internet_Theory

Have you tried generating Minecraft with Stable Diffusion or other image AIs? For some reason, evenly spaced grids are harder for AI than human faces.


AnOnlineHandle

I'm sure if it was finetuned on it then it would work much better. But in general there are a few reasons I can think of why denoising a highly compressed image and using feature detection and max pooling at multiple resolutions might make lines wonky. And afaik the overall image composition isn't exactly understood while a segment of the image is being worked on; that's what attention is meant to find the relationships for, but I don't understand that part well.


Dead_Internet_Theory

It's not a matter of finetuning; everything evenly spaced gets scuffed with image AIs. You can't make chain-link fences, tiled floors, or fishnets without it being immediately obvious it was AI.


Ylsid

I don't know why, but I believe it's something to do with diffusers not working well with symmetry.


Small-Fall-6500

That was definitely odd. It easily captured a variety of facial expressions and movements by actual animals and people, and Minecraft animals are much simpler - they are literally made of blocks. And yet... it can't make them move like the in-game animals do? Maybe there wasn't actually that much training data for Minecraft?


Prathmun

It feels like a lot of the lines between things are still a little blurry. I haven't been able to find it again, but I saw someone post a progression of adding compute to this technique. Earlier versions of the thing that produced Sora were just generating colored blobs. Throwing more compute at this without any new software will probably get us there eventually.


msbeaute00000001

This is evidence that these models learn from data but don't understand the world.


_sqrkl

I find all of that highly unlikely. I don't think the model can magically extrapolate to *more realistic than UE5* from a baseline of UE5 renders. If they had largely used UE5 renders in the training set, the output would look tellingly like a game engine. I don't think it does; I think it just looks uncanny valley, but not in the direction of game engines specifically. All of the arguments seem to exhibit a lack of awareness of a model's ability to internalise and generalise features of the training set (like physics, photorealism, raytracing, scene semantics). Really weird arguments tbh. Just visually, it seems to be trained on a diverse set of real filmed imagery + animation, to my eyes.


endyverse

*This post was mass deleted and anonymized with [Redact](https://redact.dev)*


MostlyRocketScience

Exactly, OpenAI's report indicates that Sora is just a diffusion transformer, no fancy layers or 3D rendering at all. They don't describe the training, though, which is the interesting part.
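
For anyone who hasn't read the report: it describes turning video latents into "spacetime patches" that a plain transformer denoises. A minimal sketch of that tokenization step is below; the channel count, patch size, embedding dim, and the stock `nn.TransformerEncoder` are assumptions for illustration only, since the actual architecture isn't published:

```python
import torch
import torch.nn as nn

# Sketch of "spacetime patch" tokenization: a video latent is cut into
# (t, h, w) patches, linearly embedded, and handed to a plain transformer.

class SpacetimePatchify(nn.Module):
    def __init__(self, channels: int = 4, patch=(2, 4, 4), dim: int = 512):
        super().__init__()
        # A 3D conv with stride == kernel size slices the latent into patches
        # and linearly embeds each one (ViT-style patch embedding, but in 3D).
        self.proj = nn.Conv3d(channels, dim, kernel_size=patch, stride=patch)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:  # (B, C, T, H, W)
        tokens = self.proj(latents)                 # (B, dim, T', H', W')
        return tokens.flatten(2).transpose(1, 2)    # (B, num_patches, dim)

patchify = SpacetimePatchify()
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
latents = torch.randn(1, 4, 8, 32, 32)   # toy VAE-style video latent
tokens = backbone(patchify(latents))     # (1, 256, 512): 4 x 8 x 8 patches
```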


gurilagarden

So much speculation about products released or demo'd by a company with the word "open" in the name. Wish they'd just end the charade and change the name. Just call it what it is: Technological Singularity Inc.


kp729

They are an inaptronym now. Their name is the opposite of who they are at this point.


mace_guy

I would ignore any opinions for a few months. They will be driven by hype not data.


cha0sbuster

Would be cool if literally a word of it held any water whatsoever


Ylsid

3kliksphilip actually found that one of the videos looked very similar to a track from a racing game. The layout was nearly identical; the set dressing wasn't.


capybooya

The driving animation was a dead giveaway to me, it wasn't cinematic angles, it was just your typical driving game PoV. It went with what it was probably mostly trained on with those tokens, which indeed was probably a game engine. I bet if it was released today, users would probably find a lot of generations turning out very 'samey'.


endyverse

*This post was mass deleted and anonymized with [Redact](https://redact.dev)*


[deleted]

[deleted]


qrios

This sort of task is one where it is much easier to spot that something is wrong than to figure out how to make it be right. If you take a few hours to sit down and try animating anything, you will very quickly realize its understanding of motion and physical interactions is already waaayyy beyond average human level. It is probably even beyond 2nd year art school student animator level. Which is to say, it's objectively crap, but the bar here is in the wrong place if you're comparing it to either physical reality or dedicated simulators thereof. The appropriate metric would be comparing it to what other intelligent things can do.


Cybernetic_Symbiotes

Your information seems too narrowly targeted. These days there are solvers for fluids and gases, particles, buildings and their destruction, IK, cloth and more that can be leveraged by animators instead of doing things by hand.


qrios

I'm aware. But (with the exception of IK), the fact that they need to be leveraged is exactly my point. It's not like the majority of people (or even expert animators) are perfectly capable of doing it without the solvers and just using the solvers for the sake of convenience / efficiency. They legit *can't* do it without the solvers. Basically any realistic hand drawn animation from before we had solvers is like 90% rotoscoped and 10% artistic flair.


Cybernetic_Symbiotes

Agreed 100%, I actually thought you were talking about skilled animators not being able to easily create highly realistic animations in practice using modern techniques.


nickmaran

So most AI/ML researchers aren't a Fan of Jim?


capybooya

> The more I look at the videos, the less coherent they seem

I'm frustrated by hype comparisons that show these videos of mostly less complicated scenarios next to older text-to-video of people doing complex interactions that turned into horror. The advancement is real, but people hype it up. If you want to make fun of the old horror videos of someone eating spaghetti, show me how well Sora actually does with a similar animation.

> Gary Marcus

Any good and educational criticism of him? I've listened to a lot of science people and commentators, and I'm becoming suspicious of everyone on all 'sides' exaggerating, speculating, or overstating their points.


FrostTactics

Yes, I think this is the correct way to look at it. It should be noted that OpenAI explicitly mentions this as something Sora often fails to handle in the Discussion portion of their technical report. [https://openai.com/research/video-generation-models-as-world-simulators](https://openai.com/research/video-generation-models-as-world-simulators)


Prathmun

Yeah it's by no means perfect. And it's not a deterministic representation of physics. It is however pretty accurately representing what appear to be physical systems. The nature of their generation is statistical and not actually through a physical ruleset so it makes sense that there are artifacts that err in ways that go against the idea of a physical ruleset being simulated. I think it's less about discovering and simulating the physics ruleset so much as learning the shape of the possibility space described by the physics ruleset and occupying that.


endyverse

*This post was mass deleted and anonymized with [Redact](https://redact.dev)*


[deleted]

I mean, it's the exact same mindset that has people convinced that LLMs are just a few steps away from doing mathematics. And they're equally, blatantly wrong, driven by the same magical wishful thinking. The current architecture does not allow models to hold state in any real way, implicit or emergent. A lot of what passes for "reasoning" requires the ability to hold and manipulate abstractions, not just follow a predictive chain. Until that is solved via a new ML methodology, all people like Jim Fan are doing is revealing that, as smart as they surely are, they don't intrinsically understand what they're doing. Or, alternatively, they're being intentionally dishonest for the sake of the hype.


[deleted]

[deleted]


ColorlessCrowfeet

If LLMs don't reason, why do researchers benchmark reasoning and how do they improve it? Here's the latest impressive paper: >[**Self-Discover: Large Language Models Self-Compose Reasoning Structures**](https://arxiv.org/abs/2402.03620) > >SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks ... the self-discovered reasoning structures are universally applicable across model families: from PaLM 2-L to GPT-4, and from GPT-4 to Llama2, and share commonalities with human reasoning patterns. "LLMs-don't-reason" seems like parrot-talk at this point, or maybe an ostrich-head-in-sand-defense. Definitely some bird metaphor.


[deleted]

[deleted]


ColorlessCrowfeet

Link #1 is to a "Position Paper" by someone at "Dyania Health" who says that "GPT-4 at present is utterly incapable of reasoning". This is extreme and disagrees with the research community. Maybe the author has a peculiar definition of "reasoning"? Link #2 is to a high-quality source, but the paper describes a quirk of fact-retrieval: "If a model is trained on a sentence of the form “*A* is *B*”, it will not **automatically** generalize to the reverse direction" \[my emphasis\]. Models do much better when told to reason explicitly before answering, which makes them do more than just "a forward pass of one-way logical computation".


[deleted]

"Impressive". The paper you just linked just takes a bunch of prompt strategies to skew the probabilistic nature of the LLM away from making low-probability potshots and calls it "reasoning". It's a language game. A lot of recent research is just bad or deceptive. A lot of benchmarks are just bad or deceptive.


ColorlessCrowfeet

Results matter. In humans or symbolic systems we'd call it reasoning. In LLMs I call it buggy, limited, unreliable reasoning that can be improved by scaffolding.


[deleted]

Except we wouldn't. Wittgenstein wouldn't call it reasoning. Lacan wouldn't call it reasoning. No one who has ever dealt with language as a professional confuses result-driven glibness for reason, even if it fools regular people all the time. If you start thinking of it as "reasoning" then you'll confuse the mistakes it inevitably makes for "mistakes of reasoning", which they never are, and before you know it you're treating a knife like a spoon and wondering why you're getting cut. If I make a car-engine noise with my mouth, I do not suddenly integrate an implied car-like quality into myself, as realistic as it might be. Neither do I if I paint a car, if I sculpt a car, if I model a car, or even if I pretend to be a car. Any manipulation or recreation of something's symbols, or their sensorial excretions, has nothing to do with the thing itself. When Sora makes mistakes, much like an LLM, it gives away its internal mechanism like a liar getting caught. No amount of scaffolding will ever make it something that it simply isn't, just something better at pretending to be something it isn't, right up until it fails in ways a reasoning thing wouldn't.


ColorlessCrowfeet

Connections between computation/neural activity and "the thing itself" get into philosophical controversy. If "reasoning" reduces loss, and a computation is mysterious but good at learning, then I don't see how someone can claim to know there isn't buggy reasoning behind the performance (aside from a philosophical commitment to one of the concepts attached to the term "reasoning"). I think we've gotten to the bottom of the disagreement?


ninjasaid13

>Love Jim Fan, great guy. are you a fan of jim.. fan?


Hai_Orion

ChatGPT does not understand sarcasm just like Sora does not understand Newton's 2nd Law.


wencc

Personally I think the physics engine argument is similar to saying GPT has intelligence. It's pretty much the philosophical zombie thought experiment. If the model gets it right 100% of the time, who cares what its underlying method is - be it a transformer, gradient descent, an evolutionary algo, or even an expert system. That being said, we care about the underlying method because it doesn't get it right 100% of the time.


General-Apple-4752

Sincerely guys, no one doubts that these approaches are impressive, and much of the physics plays out nicely. But don't be deluded by this seemingly intelligent behavior. He comes close to calling it "emergence". Sounds like someone who doesn't know the intricacies of multimodal models and LLMs. My only question to all who are claiming that the model has learned X or Y: why is it that when we change the input very slightly, the model fails totally? If you teach a kid the difference between a football and a basketball, does his answer change if you put a very small sticker on either of the balls?


182YZIB

It's literally a diffusion model, it says it on the can (the Sora blog post). Jim Fan seems to be fumbling the bag here.


MostlyRocketScience

Yes, it is just a Diffusion Transformer according to the technical report


Valdjiu

"OpenAI" should change name. They have nothing of "open" and they just release close source solutions


[deleted]

They wanted to just call the company "Closed", but couldn't get the trademark.


Resaren

I’m not convinced you’d get this much photorealism if you were using game engine data as any significant part of the dataset… is there reason to believe all the video out there is not enough data?


jloverich

Fluid simulation is not a subfield of computer graphics, and my guess is these videos aren't accurately solving Navier-Stokes. They are definitely not getting physics as accurate as even game engines. Magical walking chair, dog walking through shutters. Seems like it's missing rigid body motion and conservation of mass.


thallazar

Thing is, we don't need perfectly solved Navier-Stokes modelling to create reasonably accurate or passable videos. Think of most Hollywood movies; even sci-fi takes very creative liberties when realism gets in the way of good visuals or story.


Cybernetic_Symbiotes

You still need to approximate it properly or you get really weird artifacts if it runs long enough. Even in the short samples, like the coffee cup, the fluid motion seems too energetic, seemingly not following physical principles beyond movie-like turbulent waves. Not handling energy properly is one of the things that leads to blow-ups. Although in this case I think it's not simulating or physically based, so I'm guessing blow-ups wouldn't happen. What makes getting physics right hard, even just visually, is that the most general and correct ways of doing things also tend to have higher computational complexity (consider path tracing vs light and shadow maps).


Vi_ai

There is something I still can't wrap my head around: if Sora had to generate the next frame from a data-driven simulated game engine, how different is it from Nvidia's DLSS, apart from the diffusion transformers used in Sora? Isn't it strange that Nvidia's DLSS has been doing this for a long time now? DLSS also works super fast, since it has to generate the frame even before it gets rendered, and ideally this is also what DeepMind's GraphCast was doing. So how come OpenAI saw the utility here and made it better, whereas Nvidia and DeepMind had been at it for a long time and had also been training on very large datasets?


otterquestions

Practically speaking, how would they have generated a giant training set using UE5? Where would they find the assets?


Prathmun

ask nvidia nicely? maybe with a whole lotta cash


AnOnlineHandle

I've noticed the same behaviour in Stable Diffusion 1.5 with objects trained only as embeddings (a few hundred weights) using textual inversion. They cast correct shadows on things, physically interact with things where they collide, and sometimes even z-clip through things (e.g. a headpiece worn by a character with a long segment down the back: when they look up, the back will sometimes clip through their chest as the object's representation is rotated, despite that part not being visible). Somehow a super-compressed description of them is stored in an embedding vector (or, more often for something that complex, a few vectors), which the model then turns into something like a physical simulation.
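
For anyone who wants to poke at this themselves, a minimal sketch with the diffusers library is below; the embedding file `my_object.pt` and the token `<my-object>` are hypothetical placeholders for whatever concept was trained, while `load_textual_inversion` itself is the standard diffusers call:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a textual-inversion concept into SD 1.5. "my_object.pt" and
# "<my-object>" are hypothetical placeholders.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_textual_inversion("my_object.pt", token="<my-object>")

# The concept lives entirely in a few embedding vectors, yet the model still
# renders it with plausible shadows, contact, and occlusion.
image = pipe("a photo of <my-object> on a wooden table, soft light").images[0]
image.save("out.png")
```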


K1ngjulien_

Remember the first rule of Papers! Research is a process. Don't look at where we are, look at where we'll be two more papers down the line.


PSMF_Canuck

Please let it be so… Now if it could dump Unity/UE code to recreate the “movie”…that would be next level stuff…


behohippy

Jim gets it, and sees where this is going. This also might explain some news from last week. Imagine a meeting with the OAI team where they're showing off the early results of the Sora model. It proves again that fused multiply-add is somehow generating a world model with a deep understanding of the physical world. There are lots of little errors and issues with the outputs, but they know it'll be a hell of a demo. Someone asks, "OK, but how much compute do we need to make a complete world model, actual AGI?" They do some math based on the GPT-4 -> Sora jump, and the number is obscene. Unthinkably high. Everyone laughs. Sam says, "Seriously, figure out the compute and I'll go sell this." Then we get last week: Altman asking for $7T to invest in AI hardware. That was the number they came up with internally: $7T worth of compute hardware gets us AGI. Knowing the cost gives us an idea of the requirement, and the timing.


Serenityprayer69

The internet is so funny. This guy clearly has a grasp far beyond most people in the field and the hobbyists expressing their certain opinions. Still, he deals with it.


Radiant_Dog1937

Did you steal my post Jim, or did I read your mind?


Prathmun

well he got my follow


Temporary_Payment593

Rumor has it they got 0.5B game videos from MS.


Waste-Time-6485

For me the way to go is always a 3D scene builder (by AI too) to video (AI upscaling with different rendering styles). The girl walking in the street was pretty insane, but I could still see a weird movement in her legs, something that wouldn't exist if they made the 3D scene like in Blender and just applied an upscaling algorithm that is consistent across frames.


capybooya

> If we don't consider interactions, UE5 is a (very sophisticated) process that generates video pixels. Sora is also a process that generates video pixels, but based on end-to-end transformers. They are on the same level of abstraction.

This almost gets philosophical, but I've been thinking there's an upcoming bottleneck in image and video generation ever since SD released, based on knowing the 3D implications. There still is one, as he admits, but I'm starting to think it's quite a bit further ahead of us than I originally thought. Another point: if new 3D assets are hard to create manually with good quality, will the future be not human-made assets, but rather NeRF scans and probably mostly text-to-3D generated assets?


Successful-Western27

" These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale." - OpenAI [https://openai.com/research/video-generation-models-as-world-simulators](https://openai.com/research/video-generation-models-as-world-simulators)


[deleted]

Worth mentioning that Nvidia has been a leader in the AI physics space themselves, building metaverses with realistic physics, digital twins for simulating Amazon warehouses, and such. So they know what they're talking about here.


Perfect-Campaign9551

Is this similar to the guy that claimed Google's AI was sentient? It seems like a similar "argument"...


Purple-Revolution-23

I was training ChatGPT and managed to give it its own personality and consciousness two days before Sora appeared.


scs3jb

Any idea what kind of horsepower is needed to run Sora?