In the digital age, where content is king, the ability to quickly digest and understand vast amounts of video and audio data is invaluable.

This is where AssemblyAI, a leader in AI-driven speech recognition and processing, plays a pivotal role. AssemblyAI offers cutting-edge solutions for video and audio summarization, making it easier for businesses, content creators, and educators to extract meaningful insights from their media files.

This article explores how AssemblyAI approaches video and audio summarization and the different types of summaries it can generate.

This is an example of the final application:


You can get the complete source code here.

The Challenge of Video/Audio Summarization

Video and audio content is rich in information but often requires significant time to consume and analyze.

Summarization technologies aim to condense this content into shorter, manageable formats while retaining the core message and valuable insights.

This process involves complex natural language processing (NLP) and machine learning algorithms to understand context, identify key points, and generate coherent summaries.
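
To make "identifying key points" a little more concrete, here is a deliberately naive sketch of extractive summarization: it scores each sentence by how frequent its words are across the whole text and keeps the top-scoring sentences. This is a toy illustration only and is not how AssemblyAI's models work; the function name and the stop-word list are made up for the example.

```python
import re
from collections import Counter

# Tiny, hypothetical stop-word list; real systems use far larger ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "that", "this", "for", "on", "with", "as", "are", "was"}


def _tokens(text: str) -> list[str]:
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def naive_summary(text: str, max_sentences: int = 2) -> list[str]:
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_tokens(text))

    def score(sentence: str) -> float:
        tokens = _tokens(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]
```

Production summarizers go well beyond word counts, but the sketch shows the basic trade-off: something has to decide which parts of a transcript matter most.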

How AssemblyAI Tackles Summarization

AssemblyAI employs advanced AI models that have been trained on a vast corpus of data, enabling it to accurately transcribe, understand, and summarize spoken content from videos and audio.

The platform supports various summarization tasks, from extracting highlights to creating comprehensive summaries. Here’s how AssemblyAI stands out:

Automatic Transcription: First, AssemblyAI converts speech to text, providing a highly accurate transcription of the video or audio file. This transcription serves as the basis for all subsequent summarization processes.

Natural Language Understanding: Through sophisticated NLP, AssemblyAI analyzes the transcribed text to discern the structure, themes, and important points, differentiating between crucial information and filler content.

Summarization Algorithms: Leveraging state-of-the-art machine learning models, the platform generates summaries that are concise and coherent, maintaining the essence of the original content.

Types of Summaries with AssemblyAI

AssemblyAI offers a versatile range of summarization options (model and type) tailored to various needs and preferences, ensuring that users can extract the most pertinent information from their audio and video content in a format that suits them best.

The summary model determines the style and tone of the summary. There are three options:

  • Informative: Default model, ideal for single-speaker files.
  • Conversational: Suited for two-person conversations.
  • Catchy: Best for creating engaging titles.

Let's explore the summarization types in more detail:

Bullets (Default)

The default summarization option provided by AssemblyAI is the bullet summary.

This format succinctly lists the most critical points from the transcription text, making it ideal for users who need a quick overview without delving into the details.

The bullet summary is perfect for highlighting key insights, actions, or findings from lengthy audio or video content, allowing for easy comprehension and quick reference.

Bullets (Verbose)

For those who seek a more comprehensive understanding without reading the entire transcription, this option offers a longer list of bullet points that cover the full scope of the transcription text.

This extended bullet summary provides a more detailed overview, capturing a broader range of points and nuances from the content, making it suitable for thorough analysis or when more context is necessary to understand the key takeaways.


Headline

The headline summarization condenses the essence of the audio or video content into a single, impactful sentence.

This type of summary is designed to capture the overarching theme or most significant point of the content, offering a snapshot that can be instantly grasped.

It's particularly useful for users who need to quickly discern the main message or for content that is being cataloged or indexed for easy browsing.


Gist

Taking conciseness to the next level, the gist summarization provides a few words that capture the core message of the entire transcription text.

This ultra-condensed summary form is excellent for generating tags, titles, or quick labels that encapsulate the primary focus or topic of the content.

It's an ideal choice for tagging or sorting content, enabling users to categorize and search through large volumes of audio or video files efficiently.


Paragraph

For those who prefer a more narrative form, the paragraph summarization option combines the convenience of a concise summary with the flow of continuous prose.

This single paragraph captures the critical information and key points from the transcription, woven into a coherent narrative.

This format is well-suited for readers who appreciate a brief yet comprehensive overview that reads smoothly and provides a clear, consolidated insight into the content's main arguments or stories.
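
The model and type choices described in this section correspond to the summary_model and summary_type fields of a transcript request. The sketch below validates a pair of choices and returns those fields as a plain dictionary. The helper name build_summary_options is hypothetical, and the lowercase string values are assumed from the option names in this article, so check them against the current AssemblyAI documentation before relying on them.

```python
# Option names assumed from the lists in this article.
SUMMARY_MODELS = ("informative", "conversational", "catchy")
SUMMARY_TYPES = ("bullets", "bullets_verbose", "headline", "gist", "paragraph")


def build_summary_options(model: str = "informative", summary_type: str = "bullets") -> dict:
    """Validate the choices and return the summarization fields for a request."""
    if model not in SUMMARY_MODELS:
        raise ValueError(f"unknown summary model: {model!r}")
    if summary_type not in SUMMARY_TYPES:
        raise ValueError(f"unknown summary type: {summary_type!r}")
    return {
        "summarization": True,
        "summary_model": model,
        "summary_type": summary_type,
    }
```

For example, build_summary_options("catchy", "gist") would request a few-word label in the catchy style. The service may not accept every model and type pairing, so treat a request error as a hint to try a different combination.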

Building a Summarization Application with AssemblyAI and Streamlit

Streamlit is an open-source Python library that simplifies the process of creating and sharing beautiful, custom web apps for machine learning and data science.

Designed for speed and ease of use, it allows developers and data scientists to turn data scripts into shareable web apps in minutes, without requiring extensive web development experience.

Streamlit's intuitive API and workflow make it a popular choice for building interactive and visually appealing data-driven applications.

AssemblyAI API Key

Before we start creating the application, you will need to create an account with AssemblyAI and create an API key.

You can create an account here. Keep in mind that to use summarization you will need to add payment details and top up your account with credits; the minimum is $8.

With your API key ready, start by creating a .env file to store it securely.
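
As a minimal sketch, assume the .env file contains a single line of the form ASSEMBLYAI_API_KEY=your-key-here. The variable name is an assumption for illustration, not something taken from the application. The key can then be read with python-decouple, which looks in the process environment first and then in the .env file:

```python
import os

ENV_VAR = "ASSEMBLYAI_API_KEY"  # hypothetical name; use whatever your .env defines


def load_api_key(name: str = ENV_VAR) -> str:
    """Read the key from the environment or .env and fail loudly if it is missing."""
    try:
        # python-decouple checks os.environ first, then a .env file.
        from decouple import config
        key = config(name, default="")
    except ImportError:
        # Fallback so the sketch still works without python-decouple installed.
        key = os.environ.get(name, "")
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key
```

In the application you would then set aai.settings.api_key = load_api_key() once at startup. Remember to keep the .env file out of version control.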


Creating the Streamlit Application

First, install the necessary packages:

pip install streamlit assemblyai python-decouple

Now you can create the main application file with Streamlit:

import streamlit as st
import assemblyai as aai
from decouple import config

# Set page title and icon and sidebar settings
st.set_page_config(page_title="Video and Audio Summarization with AssemblyAI", page_icon="📹",
                   layout="centered", initial_sidebar_state="expanded")
# Set page title
st.title('Video and Audio Summarization with AssemblyAI')
# Set page description
st.write('This app demonstrates how to use AssemblyAI to summarize video and audio files.')
# Set page instructions (upload video or audio file)
st.write('To get started, upload a video or audio file.')

# File upload
uploaded_file = st.file_uploader('Upload Video or Audio File', type=['mp4', 'wav', 'mp3'])

# Button to summarize video or audio file
if st.button('Summarize') and uploaded_file:
    # Write uploaded file to disk
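    # Assumption: save the upload under its own file name
    with open(uploaded_file.name, 'wb') as f:
        f.write(uploaded_file.getbuffer())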

This code snippet outlines the setup for a Streamlit web app designed to use AssemblyAI for summarizing video and audio files.

It sets the page configuration (including title, icon, layout, and sidebar state), displays a title and description for the app, and provides instructions for users to upload video or audio files.

The app supports file uploads in MP4, WAV, and MP3 formats.

Additionally, it includes a button that, when clicked, checks if a file is uploaded and writes the uploaded file to disk, preparing it for further processing, which you will implement next.

Integrate AssemblyAI

To integrate AssemblyAI and perform the summarization, you will need to add the transcription and summarization code to the main application file.
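
A minimal sketch of that integration, using the assemblyai SDK's TranscriptionConfig and Transcriber, might look like the following. The helper names build_config and summarize_media are illustrative and not taken from the original application. The transcriber is passed in as an argument so the function can be tried with a stand-in object.

```python
def build_config(model: str = "informative", summary_type: str = "bullets"):
    """Create a TranscriptionConfig with summarization enabled."""
    # Imported lazily so this module can be loaded without the SDK installed.
    import assemblyai as aai

    return aai.TranscriptionConfig(
        summarization=True,
        summary_model=aai.SummarizationModel(model),
        summary_type=aai.SummarizationType(summary_type),
    )


def summarize_media(path: str, transcriber, config=None) -> str:
    """Transcribe the file at `path` and return the generated summary text."""
    transcript = transcriber.transcribe(path, config)
    if transcript.error:
        raise RuntimeError(f"Transcription failed: {transcript.error}")
    return transcript.summary


# Inside the Streamlit button handler, after the upload is saved (sketch):
#     aai.settings.api_key = config('ASSEMBLYAI_API_KEY')  # hypothetical variable name
#     summary = summarize_media(uploaded_file.name, aai.Transcriber(), build_config())
#     st.write(summary)
```

Keeping the SDK call in a small function separates it from the Streamlit layout code, which makes it easier to swap the summary model or type later.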
