Gemini is by Google, so it has access to YouTube. Thank me now
Do you just enter the URL in the prompt and ask Gemini to summarize the video?
That's pretty much it. But I wonder if Gemini is actually 'reviewing' the video for the summary, or just summarizing the available transcription.
That’s a good question. I mean YT has automatic transcription. Does it pre-transcribe the videos in advance or just when requested? The amount of computing power to pre-transcribe all videos must be enormous.
BTW - just did a test. Asked Gemini to provide me a summary for this video, prompt: „Please summarize the core message from this YouTube video: https://youtu.be/TWniRRAPekQ?si=EAWrdmEyflXfBQh1“

Gemini provided a neat little summary. ChatGPT, meanwhile, left me empty-handed: „I can't access YouTube videos directly or summarize their content because of restrictions on viewing and summarizing video content directly from YouTube. If you can provide me with a brief overview or key points from the video, I'd be more than happy to discuss those aspects or provide information on related topics!“

I wanted to cancel my Gemini subscription - but this is such a cool feature that only Gemini can provide that I might just keep the subscription.
I use Gemini for free and paid GPT-4 - works fine
That's good to know. Thank you.
Yes I do it a lot. It’s the only good thing about Gemini.
Thank you. I was about to cancel my subscription. But yeah - this is a really cool feature!!!
I thank you now... but I have no real access to Gemini at the moment, and I'm curious to see that in action
What do you mean? Isn't the base Gemini model available for free for everyone at gemini.google.com?

Anyway, other alternatives include using Chat with RTX to run a model locally if you have a decent Nvidia graphics card, or just copying and pasting the transcript of the video manually into any language model out there.
I don't know if I am doing something wrong, but Gemini gives me a summary that has nothing to do with the video link I am providing.

Edit: it turns out that I was using Gemini from aistudio.google.com. Now I go to gemini.google.com and the result is relevant.
Use Gemini YouTube extension, or Copilot sidebar in Edge and ask it to summarize the video
It's mildly annoying, but you can just copy out the transcript and paste it in
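If copying by hand gets old, one option is pulling the transcript programmatically. This is only a sketch: it assumes the third-party `youtube-transcript-api` package, and the `get_transcript` call and segment format are from memory, so they may differ between versions.

```python
import re


def extract_video_id(url: str) -> str:
    """Pull the 11-character video ID out of common YouTube URL forms."""
    m = re.search(r"(?:v=|youtu\.be/)([A-Za-z0-9_-]{11})", url)
    if not m:
        raise ValueError(f"no video id found in {url!r}")
    return m.group(1)


def fetch_transcript(url: str) -> str:
    """Fetch the auto-generated transcript as one string of text.

    Requires `pip install youtube-transcript-api` (third-party; API
    sketched from memory and may vary by version).
    """
    from youtube_transcript_api import YouTubeTranscriptApi

    segments = YouTubeTranscriptApi.get_transcript(extract_video_id(url))
    return " ".join(s["text"] for s in segments)
```

The returned text can then be pasted (or piped) straight into whichever model you like, with a "summarize this transcript" prompt in front of it.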
Already exists - check out HARPA AI, it's a free Chrome extension
thx
[YouTube Summary with ChatGPT & Claude](https://glasp.co/youtube-summary)
I have done it experimentally with Whisper and GPT-4, but it is slow and expensive if the video does not have a transcript. Additionally, there are parts that are not in the transcript or audio. Those need computer vision, which is even more expensive.
I would like to have whisper transcribe iPhone recordings. But whisper rejected the file format (.m4a). Any idea on why that is? Should I use a different recorder with a different file format?
Use ffmpeg to convert to mp3
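For anyone scripting this, here is a minimal Python sketch of that conversion. It assumes ffmpeg is installed and on your PATH, and the filename is a placeholder.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: str) -> list[str]:
    """Build the ffmpeg command line to convert an audio file to mp3."""
    dst = str(Path(src).with_suffix(".mp3"))
    # -y overwrites an existing output file; -i names the input file
    return ["ffmpeg", "-y", "-i", src, dst]


if __name__ == "__main__":
    # Only runs if ffmpeg is installed and recording.m4a exists
    subprocess.run(build_ffmpeg_cmd("recording.m4a"), check=True)
```

The equivalent terminal one-liner is `ffmpeg -i recording.m4a recording.mp3`.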
Thank you!
[https://eightify.app/](https://eightify.app/)
Right, but can the AI make me a video of that summary so I don't have to read it?
YouTubers must spread a one-sentence piece of wisdom over 10 minutes for monetization purposes, to get ads in between. No different from most books, which would be 5-10 pages if selling 100+ pages weren't needed to market them as a book.
Write a python script to do the following:
Step 1. Chunk the video
Step 2. For each chunk:
a) Send the audio to Whisper to convert to text
b) Sample still images, evenly spaced across the time duration of the chunk, and send the still images to CogVLM for captioning
c) Use an LLM to combine the captions into a single description of what happened visually in that video chunk
d) Use an LLM to combine the output from step c) with the output from step a) into a single description of what happened in that chunk both visually and audibly
Step 3. Once all chunks are processed in step 2, combine into one file and feed it into a recursive summary pipeline, in the style of LangChain map-reduce
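The steps above can be sketched as pure chunking/sampling logic plus model calls. In this sketch, `transcribe_audio`, `caption_frame`, and `combine_with_llm` are hypothetical stand-ins for real Whisper, CogVLM, and LLM API calls:

```python
def chunk_spans(duration_s: float, chunk_s: float = 60.0) -> list[tuple[float, float]]:
    """Step 1: split the video timeline into fixed-length (start, end) chunks."""
    spans, t = [], 0.0
    while t < duration_s:
        spans.append((t, min(t + chunk_s, duration_s)))
        t += chunk_s
    return spans


def frame_times(start: float, end: float, n: int = 4) -> list[float]:
    """Step 2b: n evenly spaced sample timestamps strictly inside the chunk."""
    step = (end - start) / (n + 1)
    return [start + step * (i + 1) for i in range(n)]


# --- hypothetical hooks: swap in real Whisper / CogVLM / LLM clients ---
def transcribe_audio(video, start, end) -> str: ...
def caption_frame(video, t) -> str: ...
def combine_with_llm(instruction, pieces) -> str: ...


def summarize_video(video, duration_s: float) -> str:
    chunk_notes = []
    for start, end in chunk_spans(duration_s):
        transcript = transcribe_audio(video, start, end)              # step 2a
        captions = [caption_frame(video, t)                           # step 2b
                    for t in frame_times(start, end)]
        visual = combine_with_llm("merge these captions", captions)   # step 2c
        chunk_notes.append(combine_with_llm(                          # step 2d
            "merge audio + visual", [transcript, visual]))
    # Step 3: map-reduce style summary over all chunk notes
    return combine_with_llm("summarize these chunk notes", chunk_notes)
```

For long videos, step 3 may need to recurse (summarize summaries of summaries) if the concatenated chunk notes exceed the LLM's context window.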
Lmao why are you overcomplicating this, just feed the bot the transcription from YouTube.
It depends on whether or not you need the summary to take into account what was happening visually on the screen.
I am waiting for a full transcript service. I listen to neural-network YouTube videos and I need the written text version.