Video Captioning and Translating with Python and Streamlit

Video captioning is crucial for making content accessible and understandable across different languages.

By combining transcription, translation, and subtitle (SRT) file generation, you can let viewers follow a video in the language they prefer.

In this guide, I'll show you how to create a robust video captioning and translating tool using Python and Streamlit.

We'll go through the process step by step, write the code, and explain each part of the implementation.

The result will be a functional web app where users can upload a video and automatically get captions in multiple languages.

Introduction to Video Captioning and Translating

Adding captions to videos helps make them more accessible for people who are deaf or hard of hearing.

It also helps non-native speakers understand the content better. Translating these captions into multiple languages can make your video content reach a global audience.

We’ll use several powerful libraries:

Streamlit: For creating the web interface.
MoviePy: To handle video and audio extraction.
Faster Whisper: For speech-to-text transcription on the CPU.
Translate: To handle language translations.

Setting Up the Environment

Before you start coding, make sure your environment is properly set up. Here are the steps you need to follow to get everything ready.

Install the Required Libraries

You'll need to install several Python libraries. These include streamlit, moviepy, faster-whisper, and translate. You can install these using pip.

pip install streamlit moviepy faster-whisper translate

With these libraries installed, you can move on to the next steps.

Building the Video Captioning and Translating Tool

In this section, we'll build the video captioning and translating tool step by step. We'll break down the code into segments for better understanding.

Importing the Libraries

First, import the necessary libraries:

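import streamlit as st
import datetime
from faster_whisper import WhisperModel
from translate import Translator

# MoviePy 2.x removed the moviepy.editor module, so try both import paths
try:
    from moviepy import VideoFileClip  # MoviePy 2.x
except ImportError:
    from moviepy.editor import VideoFileClip  # MoviePy 1.x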

These libraries work together to enable the extraction, transcription, and translation of video content, ultimately generating captions in various languages.

Extracting Audio from Video

To process the video for captioning, we first need to extract the audio. MoviePy is an excellent tool for this.

Here's how you can do it:

# Extract audio from video with MoviePy
def extract_audio(video_path, audio_path):
    # Load the video file
    video = VideoFileClip(video_path)
    try:
        # Extract the audio track (None if the video has no audio)
        audio = video.audio
        if audio is None:
            raise ValueError("The video has no audio track")
        # Save the audio to the output path
        audio.write_audiofile(audio_path)
    finally:
        # Close the video, which also releases its audio reader
        video.close()

The extract_audio function performs the following steps:

Loads a video file from the specified video_path.
Extracts the audio track from the video, and raises an error if the video has none.
Saves the extracted audio to the specified audio_path.
Closes the video file, and with it the audio reader, to release resources.
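
Since extract_audio takes file paths, a video uploaded through the web app would first have to be written to disk. Here is a minimal sketch of that step using only the standard library. The helper name save_upload and the choice of a .wav audio file are my assumptions for illustration, not part of the original code.

```python
import tempfile
from pathlib import Path


def save_upload(data: bytes, filename: str) -> tuple[str, str]:
    """Write uploaded bytes to a temp directory and return (video_path, audio_path)."""
    # A fresh directory avoids name clashes between uploads
    workdir = Path(tempfile.mkdtemp(prefix="captions_"))
    # Keep only the final name component so the upload cannot escape workdir
    video_path = workdir / Path(filename).name
    video_path.write_bytes(data)
    # Hypothetical convention: the audio file sits next to the video as .wav
    audio_path = video_path.with_suffix(".wav")
    return str(video_path), str(audio_path)
```

The two returned paths could then be passed straight to extract_audio(video_path, audio_path).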

Transcribing Audio to Text

The next step involves transcribing the extracted audio to text. For this, we use the Whisper model:

# Set up the Whisper model
model_size = "medium.en"
model = WhisperModel(model_size, device="cpu", compute_type="int8")


# Transcribe an audio file
def transcribe_from_video(audio_path):
    # transcribe() returns a lazy generator of segments plus run metadata
    segments, _ = model.transcribe(audio_path)
    # Consume the generator so the transcription runs now, and return a list
    return list(segments)

This code does the following:

Sets up the Whisper model with a medium-sized English model, configured to run on the CPU with int8 computation type.
Defines a function transcribe_from_video that transcribes an audio file specified by audio_path using the initialized Whisper model.
Consumes the segment generator returned by Faster Whisper, since the transcription only runs as the segments are iterated.
Returns the list of transcription segments from the audio file.
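
Each segment that Faster Whisper yields carries a start time, an end time (both in seconds), and the recognized text. To show how those fields can be read without loading a model, here is a small sketch that works on any iterable of segment-like objects. FakeSegment and segments_to_rows are hypothetical names I made up for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeSegment:
    """Hypothetical stand-in exposing the start, end and text fields of a segment."""
    start: float
    end: float
    text: str


def segments_to_rows(segments: Iterable) -> list[dict]:
    """Collect the timing and text of each segment into plain dictionaries."""
    rows = []
    for segment in segments:
        rows.append({
            "start": segment.start,
            "end": segment.end,
            # Segment text often begins with a space, so trim it
            "text": segment.text.strip(),
        })
    return rows
```

Plain dictionaries like these are easy to display in the app or to pass on to the translation and subtitle steps.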

Function to Format Time for SRT

The SubRip Subtitle (SRT) format uses a specific timestamp format. We need a utility function to convert seconds into this format:
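
SRT timestamps take the form HH:MM:SS,mmm, with a comma before the milliseconds, and each cue consists of a sequence number, a time range, the text, and a blank line. Here is a minimal sketch of such a conversion. The names format_time and srt_entry are my assumptions, and the sketch uses plain integer arithmetic rather than the datetime module imported earlier, so it may differ from the original implementation.

```python
def format_time(seconds: float) -> str:
    """Convert a number of seconds to an SRT timestamp (HH:MM:SS,mmm)."""
    # Work in whole milliseconds to avoid floating-point drift
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def srt_entry(index: int, start: float, end: float, text: str) -> str:
    """Build one numbered SRT cue: index, time range, text, blank line."""
    return f"{index}\n{format_time(start)} --> {format_time(end)}\n{text}\n\n"
```

For example, format_time(3661.5) gives 01:01:01,500, and joining the srt_entry strings for all segments produces the content of an .srt file.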

About the Author

Nuno Bispo Netherlands
