
Creating an AI-Powered Video Summarizer in Just 20 Minutes


With the sheer volume of new content uploaded to YouTube, keeping up with my watchlist can feel overwhelming. I keep adding videos to my "watch later" list, which only grows longer while hardly anything actually gets watched. Fortunately, advanced language models like ChatGPT and LLaMA offer a practical way to tackle this problem.

A video summarizer can distill hours of footage into concise summaries, allowing us to grasp the essence of a video without watching it in its entirety. After developing my own web application, I frequently use it to assess whether a particular video, especially tutorials or talk shows, is worth my time based on its summary.

There are several methods to build a video summarizer utilizing powerful language models:

  • One approach is to create or use ChatGPT Plugins that can interface directly with YouTube in real time. However, access to these plugins is limited to a select group of commercial developers, making this option less accessible for many, including myself.
  • Another method involves downloading the video transcript and using it directly as a prompt for the language model. The main drawback of this technique is the model's context limit: a prompt cannot exceed 4096 tokens, which typically covers only about seven minutes of a typical talk show.
  • A more effective strategy is to implement in-context learning techniques: the transcript is split into chunks and converted into embedding vectors, and the summarization prompt is answered against those vectors. This method can yield precise summaries regardless of the video's length.

If you're keen on creating your own in-context learning application, my earlier article on developing a Chatbot for document interaction serves as a great foundation. By making minor adjustments, we can apply the same principles to design our own video summarizer. This article will walk you through the development process step-by-step, enabling you to create your own video summarizer.

1. Block Diagram

In this Video Summarizer project, we use llama-index as the foundation and develop a Streamlit web application that lets users input a video URL and view the screenshot, transcript, and summary content. With the LlamaIndex toolkit, we can sidestep the complexities of raw API calls to OpenAI, as its internal chunking and task management take care of prompt-size limits and embedding usage for us.

You might wonder why I opted for multiple queries rather than a single one when requesting a summary from the LLM. The reasoning lies in the in-context learning process. When a document is indexed, it is segmented into smaller pieces, or nodes, based on its size; these segments are then transformed into embeddings and stored in a vector store.

When a user query arrives, the model searches the vector store for the most relevant segments and generates its response from those alone. For instance, if you ask "Summarize the article" against a lengthy document, such as a 20-minute video transcript, the model may only summarize the last five minutes, because that chunk happens to be deemed most relevant to the summary request.


By crafting multiple queries, we can prompt the LLM to provide a more exhaustive summary encompassing the entire document. I will delve deeper into structuring these queries later in this article.
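To make the pattern concrete, here is a minimal sketch of the multi-query idea. It assumes an index object like the one built in Chapter 6 and a hypothetical sections list holding the boundary texts of each five-minute window:

# Sketch only: "index" and "sections" are assumptions (see Chapter 6 for the real code)
section_summaries = []
for start_text, end_text in sections:
    prompt = (f'Summarize this article from "{start_text}" to "{end_text}", '
              'limited to 100 words.')
    section_summaries.append(str(index.query(prompt)))

# One final roll-up query over the concatenated section summaries
overall = index.query("Summarize this article of a video, the article is: "
                      + "\n".join(section_summaries))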

In the subsequent chapters, I will cover the foundational knowledge and typical usage of all modules involved in this project. If you're eager to start coding the Video Summarizer application right away without diving into these technical details, feel free to skip to Chapter 6.

2. YouTube Video Transcript

The initial step in summarizing a YouTube video is to obtain its transcript. The open-source Python library, youtube-transcript-api, is perfectly suited for this task.

After installing the module:

!pip install youtube-transcript-api

You can easily download a JSON-format transcript using the following code:

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import JSONFormatter

srt = YouTubeTranscriptApi.get_transcript("{video_id}", languages=['en'])
formatter = JSONFormatter()
json_formatted = formatter.format_transcript(srt)
print(json_formatted)

The get_transcript() method requires an 11-character video ID, which can be found in the URL of any YouTube video following v=: https://www.youtube.com/watch?v=hJP5GqnTrNo. If the video offers subtitles in languages other than English, you can include them in the languages parameter as a list.
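For example, a quick way to slice the ID out of a watch URL, using the same split the full app performs later:

# Take the 11 characters that follow "v=" in the URL
url = "https://www.youtube.com/watch?v=hJP5GqnTrNo"
video_id = url.split("v=")[1][:11]
print(video_id)  # hJP5GqnTrNo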

This library also provides formatter methods for generating transcript data in a specified format. In this case, we only require the JSON format for the subsequent steps.

Executing the above code will yield a transcript that appears as follows:

[
  {"text": "So anyone who's been paying attention\nfor the last few months", "start": 4.543, "duration": 3.878},
  {"text": "has been seeing headlines like this,", "start": 8.463, "duration": 2.086},
  {"text": "especially in education.", "start": 10.59, "duration": 2.086},
  {"text": "The thesis has been:", "start": 12.717, "duration": 1.919},
  ...
]
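Each entry carries its caption text plus a start time and duration in seconds, which is exactly what we need later to cut the transcript into five-minute sections. A quick sanity check, assuming the json_formatted string from the previous snippet:

import json

transcript = json.loads(json_formatted)
# Approximate the video length from the last caption entry
video_length_s = transcript[-1]["start"] + transcript[-1]["duration"]
print(f"{len(transcript)} caption entries, ~{video_length_s:.0f} seconds of video")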

3. OpenAI API Key

LlamaIndex is designed to be compatible with various LLMs, defaulting to OpenAI's GPT model for embedding and generation tasks. Thus, we need to provide our OpenAI API Key when implementing the Video Summarizer based on OpenAI's models.

To insert our key, simply set it through an environment variable:

import os
os.environ["OPENAI_API_KEY"] = '{your_api_key}'
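Hard-coding the key is fine for a local demo, but a slightly safer pattern is to rely on the key already exported in your shell and only prompt for it when it is missing; a minimal sketch:

import os
import getpass

# Ask for the key interactively only if the shell hasn't exported it
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")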

4. LlamaIndex

LlamaIndex is a Python library that facilitates interaction between users' private data and large language models (LLMs). It offers several features beneficial to developers: connectors to various data sources, management of prompt limitations, index construction over your data, insertion of retrieved data into prompts, text segmentation, and an interface for querying the index. With LlamaIndex, developers can use their existing data with LLMs without converting it by hand, while the library manages the LLM interaction and keeps performance reasonable.

For complete documentation, visit the official site.

Usage

Here are the general steps for utilizing LlamaIndex:

Install the package

!pip install llama-index

Step 1 — Load the document files

from llama_index import download_loader

# Fetch the SimpleDirectoryReader loader from LlamaHub
SimpleDirectoryReader = download_loader("SimpleDirectoryReader")

loader = SimpleDirectoryReader('./data', recursive=True, exclude_hidden=True)
documents = loader.load_data()

SimpleDirectoryReader is one of the file loaders in LlamaIndex's toolkit. It supports loading multiple files from a specified folder, in this case, the sub-folder './data/'. This loader function can parse various file types like .pdf, .jpg, .png, .docx, etc., so manual text conversion is unnecessary. In our application, we will load a text file (.json) containing the video transcript data.
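Once load_data() returns, each file becomes a document whose raw text is accessible through get_text(), the same accessor the sidebar code uses in Chapter 6. A quick check of what the loader picked up:

# Inspect the loaded documents
print(f"{len(documents)} document(s) loaded")
print(documents[0].get_text()[:200])  # first 200 characters of the transcript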

Step 2 — Construct index

from llama_index import LLMPredictor, GPTSimpleVectorIndex, ServiceContext
from langchain.chat_models import ChatOpenAI

# define LLM
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", max_tokens=500))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

index = GPTSimpleVectorIndex.from_documents(
    documents, service_context=service_context
)

When invoking this method, LlamaIndex calls the LLM you have defined to build the index. In this demo, the gpt-3.5-turbo chat model handles text generation, while the vectors come from OpenAI's default embedding model (text-embedding-ada-002 at the time of writing).

Step 3 — Query the index

With the index constructed, querying is straightforward and requires no contextual data.

response = index.query("Summarize the video transcript")
print(response)
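Because embedding is the expensive step, it is worth persisting the index between runs. The full app in Chapter 6 calls save_to_disk(); load_from_disk() is its counterpart in the same llama-index version, sketched here:

# Persist the index so repeated runs skip re-embedding
index.save_to_disk('index.json')

# Later, restore it instead of rebuilding from the documents
index = GPTSimpleVectorIndex.load_from_disk('index.json')
response = index.query("Summarize the video transcript")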

5. Web Development

Similar to previous projects featured in my articles, we will use the convenient Streamlit library to build the Video Summarizer application.

Streamlit is an open-source Python library that simplifies the creation of interactive web applications. Its primary use is for data scientists and machine learning engineers to share their work easily. Developers can build applications with minimal coding, and deploying them to the web is accomplished with a single command.

Streamlit offers various widgets for creating interactive applications, including buttons, text boxes, sliders, and charts. For comprehensive widget usage, refer to the official documentation: https://docs.streamlit.io/library/api-reference

A typical Streamlit code snippet for a web application might look like this:

!pip install streamlit

import streamlit as st

st.write("""
# My First App
Hello world!
""")

To serve the app, simply run the following command:

!python -m streamlit run demo.py

If it runs successfully, the URL for users to access will be displayed:

You can now view your Streamlit app in your browser.

  Network URL: http://xxx.xxx.xxx.xxx:8501
  External URL: http://xxx.xxx.xxx.xxx:8501

6. Complete Video Summarizer Application

The complete application implements the full workflow for summarizing a YouTube video, and the user experience is straightforward.

Step 1 — The user inputs the URL of a YouTube video.

In this step, we create a text input widget using Streamlit's st.text_input() method to capture the user's video URL input.

Step 2 — The application downloads the screenshot and transcript file of the video and displays them in the sidebar.

Upon successfully retrieving the video ID from the URL, we create a sidebar area that displays a screenshot of the video page (saved as ./youtube.png via the html2image library) and the transcript text (saved as ./data/transcript.json and loaded through LlamaIndex's SimpleDirectoryReader()). A Streamlit progress bar shows the estimated time remaining, which is especially useful for longer videos.

Step 3 — The application generates the summary of the entire video, with detailed descriptions for every 5 minutes of the video.

To ensure that the language model captures critical information from the entire video, we loop over the transcript and request a summary for each five-minute section. Querying section by section keeps every prompt comfortably within the model's 4096-token limit while still covering the whole video; note that the five-minute boundary is a rough estimate, since it is aligned to caption entries. An st.expander() widget holds the five-minute section summaries, while an st.success() widget displays the final summary compiled from them.

The prompt I use for summarizing the five-minute segment is:

> Summarize this article from “{start_text}” to “{end_text}”, limited to 100 words, starting with “This section of the video.”

The start_text and end_text are the text values of the transcript entries that open and close each five-minute window, located by their start timestamps in the transcript JSON.

Here’s the complete demo code for your reference:

!python -m pip install openai streamlit llama-index langchain youtube-transcript-api html2image

import os
os.environ["OPENAI_API_KEY"] = '{your_api_key}'

import json
import datetime

import streamlit as st
from llama_index import download_loader
from llama_index import GPTSimpleVectorIndex
from llama_index import LLMPredictor, ServiceContext
from langchain.chat_models import ChatOpenAI
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import JSONFormatter
from html2image import Html2Image

doc_path = './data/'
transcript_file = './data/transcript.json'
index_file = 'index.json'
youtube_img = 'youtube.png'
youtube_link = ''

if 'video_id' not in st.session_state:
    st.session_state.video_id = ''

def send_click():
    # Extract the 11-character video ID that follows "v=" in the URL
    st.session_state.video_id = youtube_link.split("v=")[1][:11]

index = None
st.title("Yeyu's Video Summarizer")
sidebar_placeholder = st.sidebar.container()
youtube_link = st.text_input("Youtube link:")
st.button("Summarize!", on_click=send_click)

if st.session_state.video_id != '':
    progress_bar = st.progress(5, text="Summarizing...")

    # Download the English transcript and save it as JSON for the loader
    srt = YouTubeTranscriptApi.get_transcript(st.session_state.video_id, languages=['en'])
    formatter = JSONFormatter()
    json_formatted = formatter.format_transcript(srt)
    with open(transcript_file, 'w') as f:
        f.write(json_formatted)

    # Capture a screenshot of the video page for the sidebar
    hti = Html2Image()
    hti.screenshot(url=f"https://www.youtube.com/watch?v={st.session_state.video_id}", save_as=youtube_img)

    # Load the transcript file from ./data/ as a document
    SimpleDirectoryReader = download_loader("SimpleDirectoryReader")
    loader = SimpleDirectoryReader(doc_path, recursive=True, exclude_hidden=True)
    documents = loader.load_data()

    sidebar_placeholder.header('Current Processing Video')
    sidebar_placeholder.image(youtube_img)
    sidebar_placeholder.write(documents[0].get_text()[:10000] + '...')

    # Define the LLM and build a vector index over the transcript
    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", max_tokens=500))
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
    index.save_to_disk(index_file)

    section_texts = ''
    section_start_s = 0
    with open(transcript_file, 'r') as f:
        transcript = json.load(f)

    start_text = transcript[0]["text"]
    progress_steps = int(transcript[-1]["start"] / 300 + 2)
    progress_period = int(100 / progress_steps)
    progress_timeleft = str(datetime.timedelta(seconds=20 * progress_steps))
    percent_complete = 5
    progress_bar.progress(percent_complete, text=f"Summarizing...{progress_timeleft} left")

    section_response = ''
    for d in transcript:
        # Accumulate caption text until we cross the next five-minute boundary
        if d["start"] <= (section_start_s + 300) and transcript.index(d) != len(transcript) - 1:
            section_texts += ' ' + d["text"]
        else:
            end_text = d["text"]
            prompt = f'Summarize this article from "{start_text}" to "{end_text}", limited to 100 words, starting with "This section of video"'
            response = index.query(prompt)
            start_time = str(datetime.timedelta(seconds=section_start_s))
            end_time = str(datetime.timedelta(seconds=int(d['start'])))
            section_start_s += 300
            start_text = d["text"]
            section_texts = ''
            section_response += f"{start_time} - {end_time}:\n\r{response}\n\r"
            percent_complete += progress_period
            progress_steps -= 1
            progress_timeleft = str(datetime.timedelta(seconds=20 * progress_steps))
            progress_bar.progress(percent_complete, text=f"Summarizing...{progress_timeleft} left")

    # Compile the final summary from the per-section summaries
    prompt = 'Summarize this article of a video, start with "This Video", the article is: ' + section_response
    response = index.query(prompt)

    progress_bar.progress(100, text="Completed!")
    st.subheader("Summary:")
    st.success(response, icon="✅")
    with st.expander("Section Details: "):
        st.write(section_response)

    st.session_state.video_id = ''
    st.stop()

Save the code to a Python file named “demo.py” (the pip install line at the top is a shell command, so run it separately), create a ./data/ folder, and run the command:

!python -m streamlit run demo.py

Your Video Summarizer is now ready, capable of performing its task with both simplicity and effectiveness.

Note: It's advisable to start testing with a short video, as longer videos can incur substantial costs on OpenAI API usage. Additionally, ensure the video allows for a transcript before proceeding.
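To check transcript availability up front, youtube-transcript-api exposes list_transcripts(); a small sketch:

from youtube_transcript_api import YouTubeTranscriptApi

try:
    # Lists every transcript (manual and auto-generated) the video offers
    transcripts = YouTubeTranscriptApi.list_transcripts("hJP5GqnTrNo")
    print([t.language_code for t in transcripts])
except Exception as e:
    print("No transcript available:", e)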

Thank you for reading.


