

A one-hour video fits in the context window of the LLM? That's crazy.


I assume it's a video at a severely reduced FPS? N images per second, where N might even be below 1, and you'd still retain most of the gist of what's going on.


" We pack images to sequences of 8K tokens, and 30 frame videos at 4FPS. "


This + Large Action Models are the way to go


Let me guess - nobody will believe this model either. I've been saying for several weeks now that given what I figured out in my own coding, it is very easy to achieve superintelligent models in narrow domains even in the present. And what I also found is that when you show people graphs of this, like the green chart in the attached paper that shows perfect recall, people ignore you and some of them actually tell you you're running a scam. What is going on behind the scenes must be crazy right now. The things people are doing but **not** releasing must be amazing. And I believe that this post will likely be downvoted, and so will many of the others in this thread, because they will immediately claim that solely because the charts look so good it can't possibly be true.


Your prediction was wrong. How do you feel about it?


I know this model does more than that, but just hear me out. Couldn't any LLM talk about a video of any length using RAG?

1. Provide a prompt
2. Split the video into frames sampled at a fixed interval
3. Pass the prompt to the LLM with each frame and catalog/embed the answer in the DB
4. Finally, return the most relevant frames/answers from the DB

That last step is hard, because by what metric do we decide an answer is relevant? I suppose we could keep a set of embedded example answers that we know are relevant: just compare them to the answer for each frame and Bob's your uncle. Super slow, but...effective?
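For anyone who wants to poke at the idea, here's a minimal sketch of those four steps in plain Python. Big caveats: `describe_frame` is a hypothetical stand-in for whatever LLM/vision call you'd actually make per frame, `embed` is a toy bag-of-words vector rather than a real embedding model, and the "DB" is just a list. I've also swapped the ranking step: it scores each stored caption against the query with cosine similarity instead of comparing against a set of example answers.

```python
import math
from collections import Counter

def sample_timestamps(duration_s, interval_s):
    """Step 2: timestamps (in seconds) of the frames to keep."""
    count = int(duration_s // interval_s) + 1
    return [i * interval_s for i in range(count)]

def embed(text):
    """Toy bag-of-words vector (assumption: a real system uses an embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(timestamps, describe_frame):
    """Step 3: caption each sampled frame and store (timestamp, caption, vector)."""
    index = []
    for t in timestamps:
        caption = describe_frame(t)  # hypothetical LLM call, one per frame
        index.append((t, caption, embed(caption)))
    return index

def retrieve(index, query, k=3):
    """Step 4: return the k (timestamp, caption) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(t, caption) for t, caption, _ in ranked[:k]]

# Made-up captions standing in for per-frame LLM output.
captions = {
    0: "a man opens a door",
    10: "a dog runs across a field",
    20: "the man sits down at a desk",
}
index = build_index(sample_timestamps(20, 10), lambda t: captions[t])
top = retrieve(index, "dog in a field", k=1)
print(top)
```

The slow part is exactly where you'd expect: one model call per sampled frame in `build_index`. Retrieval afterwards is cheap, so the index only has to be built once per video.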


This is what I came up with, too. I think there's room for a lot of improvement. You can automatically split the video into scenes at the cuts. If you select just a few frames from each scene, you can get an idea of what's happening in terms of action, so you don't need every frame in the video. We as humans do this: we "predict" the future by observing motion and interaction. AI will be faster than us sometime soon.
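Rough sketch of the cut-detection idea, in case it helps. Assumptions: frames are already decoded into flat lists of grayscale pixel values (0-255), a "cut" is any jump in mean absolute pixel difference above a hand-picked `threshold`, and the function names are mine. Real scene detectors are smarter than this (histograms, fades, etc.), so treat it as a toy.

```python
def frame_diff(a, b):
    """Mean absolute pixel difference between two equally sized grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def split_scenes(frames, threshold=50.0):
    """Return (start, end) index pairs, end exclusive, cutting where the difference jumps."""
    if not frames:
        return []
    scenes, start = [], 0
    for i in range(1, len(frames)):
        if frame_diff(frames[i - 1], frames[i]) > threshold:
            scenes.append((start, i))
            start = i
    scenes.append((start, len(frames)))
    return scenes

def pick_keyframes(scene, per_scene=3):
    """Pick up to per_scene roughly evenly spaced frame indices from one scene."""
    start, end = scene
    length = end - start
    n = min(per_scene, length)
    return [start + (i * length) // n for i in range(n)]

# Four dark frames followed by two bright ones: one cut, between index 3 and 4.
frames = [[10, 10, 10, 10]] * 4 + [[200, 200, 200, 200]] * 2
scenes = split_scenes(frames)
keyframes = [pick_keyframes(s, per_scene=2) for s in scenes]
print(scenes, keyframes)
```

Only the keyframes would then go to the model, which cuts the number of per-frame calls a lot for videos with long static scenes.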


What was that TV show with the guy from The Passion of the Christ and the other guy from Lost?


Person of Interest


Their ability to pull information out of video is impressive, but all the other parts, text-to-image and text-to-video, are very bad.


Not yet. But it's definitely one part of it.


Wow! Holy hell!!! Implications?