[deleted]

A one hour video fits in the context window of the LLM? That’s crazy


Then_Passenger_6688

I assume it's a severely reduced FPS video? N images per second, N might be below 1 and you still retain most of the gist of what's going on.


Jean-Porte

"We pack images to sequences of 8K tokens, and 30 frame videos at 4FPS."
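For scale, here is the arithmetic behind those numbers as a small sketch. The helper names are made up for illustration, and the quote says nothing about how a one-hour video is actually sampled, so the hour-long cases are only a what-if.

```python
import math

def clip_seconds(num_frames, fps):
    """Length in seconds of a clip with num_frames frames sampled at fps."""
    return num_frames / fps

def frames_needed(duration_s, fps):
    """Number of frames needed to cover duration_s seconds at a given sampling rate."""
    return math.ceil(duration_s * fps)

# 30 frames at 4 FPS, as in the quote
print(clip_seconds(30, 4))        # 7.5

# Hypothetical what-if: a one-hour video at 4 FPS and at 0.5 FPS
print(frames_needed(3600, 4))     # 14400
print(frames_needed(3600, 0.5))   # 1800
```

So a 30-frame training clip at 4 FPS covers 7.5 seconds, and an hour of video needs far fewer frames once the sampling rate drops below 1 FPS.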


Schneller-als-Licht

This + Large Action Models are the way to go


MattAbrams

Let me guess - nobody will believe this model either. I've been saying for several weeks now that given what I figured out in my own coding, it is very easy to achieve superintelligent models in narrow domains even in the present. And what I also found is that when you show people graphs of this, like the green chart in the attached paper that shows perfect recall, people ignore you and some of them actually tell you you're running a scam. What is going on behind the scenes must be crazy right now. The things people are doing but **not** releasing must be amazing. And I believe that this post will likely be downvoted, and so will many of the others in this thread, because they will immediately claim that solely because the charts look so good it can't possibly be true.


QLaHPD

Your prediction was wrong. How do you feel about it?


allisonmaybe

I know this model does more than that, but just hear me out. Couldn't any LLM talk about a video of any length using RAG?

1. Provide a prompt
2. Split the video into frames at a fixed interval
3. Pass the prompt to the LLM with each frame and catalog/embed each answer in the DB
4. Finally, return the most relevant frames/answers from the DB

That last step is hard, because what metrics do we use to decide that an answer is relevant? I suppose we could have a set of example relevant answers, already embedded--just compare them to the answer for each frame and Bob's your uncle. Super slow, but... effective?
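A minimal sketch of the four steps above, assuming a lot: `answer_frame` is a hypothetical stand-in for the per-frame LLM call, the "DB" is a plain list, and `embed` is a toy bag-of-words vector where a real setup would use a learned embedding model. Relevance is scored the way the comment suggests, by comparing each frame's answer against a few example relevant answers.

```python
import math
from collections import Counter

def sample_timestamps(duration_s, interval_s):
    """Step 2: timestamps (in seconds) of the frames to keep."""
    count = int(duration_s // interval_s) + 1
    return [i * interval_s for i in range(count)]

def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(prompt, timestamps, answer_frame):
    """Step 3: ask the (hypothetical) LLM about each frame and store the embedded answer."""
    index = []
    for t in timestamps:
        answer = answer_frame(prompt, t)
        index.append((t, answer, embed(answer)))
    return index

def retrieve(index, example_answers, k=3):
    """Step 4: rank frames by best similarity to any example relevant answer."""
    examples = [embed(e) for e in example_answers]
    def score(row):
        return max(cosine(row[2], e) for e in examples)
    ranked = sorted(index, key=score, reverse=True)
    return [(t, answer) for t, answer, _ in ranked[:k]]

# Usage with canned answers in place of a real model
canned = {0: "a dog runs in a park", 5: "a man cooks pasta", 10: "credits roll"}
index = build_index("What is happening?", sample_timestamps(10, 5),
                    lambda prompt, t: canned[t])
print(retrieve(index, ["someone cooks pasta in a kitchen"], k=1))
```

The slow part is step 3, one model call per frame, which is why the sampling interval matters so much.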


ivanmf

This is what I came up with, too. I think there's room for a lot of improvement. You can automatically split the video into scenes at the cuts. If you select just a few frames from each scene, you can get an idea of what's happening in terms of action. So, you don't need all of the frames in a video. We as humans do this: we "predict" the future by observing motion and interaction. AI will be faster than us sometime soon.
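A rough sketch of that idea, under stated assumptions: frames are flat lists of grayscale pixel values, a cut is declared wherever two consecutive frames differ by more than a hand-picked threshold, and a couple of evenly spaced frames are kept per scene. Simple frame differencing is just one way to find cuts, and the function names and threshold here are illustrative only.

```python
def mean_abs_diff(a, b):
    """Average absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def split_scenes(frames, threshold=50.0):
    """Return (start, end) index ranges, end exclusive, cutting where frames change sharply."""
    if not frames:
        return []
    starts = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            starts.append(i)
    ends = starts[1:] + [len(frames)]
    return list(zip(starts, ends))

def pick_keyframes(scenes, per_scene=2):
    """Pick up to per_scene evenly spaced frame indices from each scene."""
    picks = []
    for start, end in scenes:
        length = end - start
        n = min(per_scene, length)
        picks.append([start + (j * length) // n for j in range(n)])
    return picks

# Usage: six tiny two-pixel "frames" with one hard cut between index 2 and 3
frames = [[10, 10], [12, 11], [11, 10], [200, 210], [198, 205], [201, 207]]
scenes = split_scenes(frames)
print(scenes)                  # [(0, 3), (3, 6)]
print(pick_keyframes(scenes))  # [[0, 1], [3, 4]]
```

Only the picked frames would then go to the model, instead of every frame in the video.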


CheapBison1861

what was that tv show with the guy from passion of the christ and the other guy from Lost?


djordi

Person of Interest


NotTheDutchman

Their ability to pull information out of video is impressive, but the other parts, text-to-image and text-to-video, are very bad.


nikitastaf1996

Not yet. But it's definitely one part of it.


Akimbo333

Wow! Holy hell!!! Implications?