In this article, we're diving into the fascinating intersection of Speechmatics' advanced speech recognition technology and the dynamic world of YouTube content.

Our goal is to build a Streamlit app that will be a real help for content creators, marketers, and analysts.


Speechmatics: Unleashing the Power of Speech Recognition

Have you ever wondered how the spoken words in a video can be transformed into actionable insights?

Enter Speechmatics, a state-of-the-art technology that's redefining the boundaries of speech recognition.

This tool doesn't just transcribe; it's an intelligent system capable of understanding diverse dialects and accents, distinguishing between speakers, and grasping the context behind words.

It's like having a super-powered assistant that listens, interprets, and analyzes audio content for you!


If you prefer, you can also check out the video version of this blog post:


You can get the complete source code here:


Why Content Analysis on YouTube Matters

For content creators and marketers on YouTube, Speechmatics provides a tool to unravel the intricacies of their video content.

By accurately transcribing speech, regardless of dialect or accent, and offering features like summarization and speaker recognition, it unlocks a wealth of insights.

This can include the effectiveness of the dialogue, the resonance of the content with diverse audiences, and even the subtleties of how different speakers contribute to the overall narrative.

With Speechmatics, the focus shifts from traditional analytics to a more nuanced understanding of content. It aids in identifying not just how many people are watching, but what they are connecting with on a deeper level.

This understanding is invaluable for anyone looking to refine their content strategy, target their messaging more effectively, and ultimately create YouTube videos that truly resonate with their audience.


Merging Technologies: Streamlit and Speechmatics

Imagine the synergy of combining Speechmatics' advanced speech recognition with the streamlined functionality of Streamlit.

This integration forms the backbone of our project: a Streamlit-based application specifically designed for YouTube content analysis.

This application provides an intuitive interface where users can input YouTube videos for analysis through Speechmatics, delivering comprehensive insights from accurate transcriptions to auto-generated chapters.

It's a fusion of technology and creativity, offering rich, detailed content insights in a user-friendly format, perfect for tech enthusiasts, content creators, or anyone intrigued by the digital landscape's potential.


Understanding Speechmatics' Features

Speechmatics is a powerhouse in the field of speech recognition and analysis, offering a suite of features that are particularly valuable for analyzing YouTube content.

Let's explore these features and understand how they can transform the way we interact with and comprehend YouTube videos.

Transcription with "Global English" ASR

At the heart of Speechmatics' capabilities is its transcription service, powered by the "Global English" Automatic Speech Recognition (ASR) model.

This model is adept at transcribing English spoken in a wide range of accents and dialects, making it an invaluable tool for a platform as diverse as YouTube.

This feature ensures that content from around the world is accessible and comprehensible to a global audience, breaking down linguistic barriers and broadening reach.

Summarization

Beyond transcription, Speechmatics offers a summarization feature that condenses spoken content into concise summaries.

This is particularly useful for viewers who want to quickly grasp the essence of lengthy YouTube videos.

For content creators and marketers, this summarization tool provides a quick way to analyze video content, identify key themes and messages, and strategize accordingly.

Auto Chapters

Another innovative feature of Speechmatics is Auto Chapters, which automatically segments a video into chapters based on the spoken content.

This segmentation provides a clear, structured overview of the video, enhancing the viewing experience by allowing users to navigate easily to sections of interest.

Speaker Labels

In videos with several speakers, the speaker label feature from Speechmatics stands out.

It identifies and labels different speakers throughout the video, making it easier for viewers to follow the conversation and understand the dynamics between speakers.

This feature is particularly useful for content analysis, as it allows for an in-depth examination of speaker contributions, and interactions within the video.


Introduction to Streamlit

In web app development, especially for data-heavy projects, Streamlit has become an important and genuinely useful tool.

This section introduces Streamlit and explores its unique advantages in the context of building and deploying web applications rapidly.

What is Streamlit?

Streamlit is an open-source Python library designed to turn data scripts into beautiful, interactive web applications with minimal hassle.

It's specifically tailored for data scientists and engineers who want to showcase their data analysis, machine learning models, or any data-driven insights without getting bogged down in the complexities of web development.

User-Friendly and Intuitive

One of the most appealing aspects of Streamlit is its user-friendly nature.

It requires minimal coding to create impressive and interactive applications. Streamlit supports various Python libraries and frameworks, making it versatile for a wide range of data visualization and analysis tasks.

The simplicity of turning a Python script into a web app by adding a few Streamlit commands is a huge draw for professionals who may not have extensive web development experience.
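
For example, a minimal sketch (purely illustrative, not part of our project) shows how little code a working interactive page needs:

import streamlit as st

# A tiny illustrative app: a title, a slider, and a computed result
st.title('Hello Streamlit')
number = st.slider('Pick a number', 0, 100, 25)
st.write(f'The square of {number} is {number ** 2}')

Saving this as a script and launching it with the streamlit command serves it as a local web app in the browser.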

Advantages for Rapid Prototyping and Deployment

Streamlit shines when it comes to rapid prototyping and deployment of applications.

It allows developers and data scientists to quickly create functional prototypes of their ideas and share them for feedback. This rapid iteration cycle is invaluable in today's fast-paced development environments where agility and speed are key.

The library also simplifies the deployment process, removing many of the traditional barriers to taking a web application from concept to live product.

With features like easy sharing of apps and the ability to integrate with major data processing and visualization tools, Streamlit enables professionals to focus on the data and its story, rather than the intricacies of web development.


Setting Up the Project

To embark on creating a Streamlit application that leverages Speechmatics for YouTube content analysis, you first need to set up your project environment.

This involves preparing your Python environment and configuring necessary APIs. Let's walk through these steps.

Environment Setup

Python Environment and Library Installation:

  • Python Installation: Ensure you have Python installed on your system. If not, download and install it from the official Python website. The recommended version is 3.11.
  • Virtual Environment: Create a virtual environment for your project. This keeps your project dependencies separate from other Python projects. In your command line, navigate to your project folder and run:
python -m venv venv

Activate the virtual environment:

    • On Windows: venv\Scripts\activate
    • On MacOS/Linux: source venv/bin/activate
  • Install Streamlit and Other Libraries: With your virtual environment activated, install Streamlit, the Speechmatics SDK, and the other packages the project needs.

You can run:

pip install streamlit
pip install speechmatics-python
pip install python-decouple
pip install yt-dlp
pip install langchain
pip install openai
pip install chromadb
pip install tiktoken
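
If you prefer a single step, you can also list these same packages (one per line, exactly as named above) in a requirements.txt file and install them all at once:

pip install -r requirements.txt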

API Configuration

Configuring Speechmatics API:

  • Sign Up for Speechmatics: If you haven't already, create an account with Speechmatics to get your API credentials. Visit their official website and sign up.
  • Retrieve API Key: Once you have an account, locate your API key in the Speechmatics dashboard. This key is essential for authenticating your requests.

Configuring OpenAI API:

  • OpenAI Account: Similar to Speechmatics, you need an account with OpenAI. Go to OpenAI's website and sign up.
  • API Key Access: After setting up your OpenAI account, navigate to the API section to create a new API key.

Handling API Keys and Authentication:

  • Secure Storage: Store your API keys securely. Avoid hardcoding them directly into your code. You can use environment variables or a configuration file that is not included in your version control.
  • Environment Variables: Create a .env file to hold the environment variables for your API keys; you will load them with python-decouple, as sketched below:
SPEECHMATICS_API_KEY=<YOUR SPEECHMATICS API KEY>
OPENAI_API_KEY=<YOUR OPENAI API KEY>
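
With python-decouple installed, these values can be read in your code without hardcoding them. A minimal sketch:

from decouple import config

# Reads the keys from the .env file (or from real environment variables)
SPEECHMATICS_API_KEY = config("SPEECHMATICS_API_KEY")
OPENAI_API_KEY = config("OPENAI_API_KEY")

This is the same pattern the speechmatics_client.py and question.py files use later in this article.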

Developing the Streamlit Application

Starting with the UI

To create a Streamlit interface for inputting YouTube URLs, you can use the st.text_input function, which allows users to input text.

Here's a simple code snippet that demonstrates how to set up this feature in your Streamlit app. You can create a main.py file:

import streamlit as st

# Set page title
st.set_page_config(page_title="YouTube Content Analysis using Speechmatics", layout='wide')

# Title for your app
st.title('YouTube Content Analysis using Speechmatics')

# Input field for YouTube URL
youtube_url = st.text_input('Enter YouTube URL', placeholder='Paste YouTube URL here...')

# Check if the URL is entered and display it
if youtube_url:
    pass

Here's a short breakdown of the code:

  • A browser title is set with st.set_page_config
  • A page title is set with st.title
  • An input field is created with st.text_input to collect the YouTube URL

Integrating YouTube

Now you can integrate with YouTube to download the audio from the video, since Speechmatics expects an audio file. This can be achieved with the yt-dlp library.

Let's add that code to our existing main.py:

... previous imports ...
import yt_dlp  # new import

... existing code ...
if youtube_url:
    with st.status('Processing YouTube URL...', expanded=True):
        st.write('Downloading YouTube video...')
        # Get the video ID
        video_id = yt_dlp.YoutubeDL().extract_info(url=youtube_url, download=False)['id']
        filename = f'{video_id}.mp3'
        # Set the options
        ydl_opts = {
            'format': 'bestaudio/best',
            'outtmpl': filename
        }
        # Download the video
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            ydl.download([youtube_url])

Here's a short breakdown of the code:

  • A progress message is set with st.status and st.write
  • The options for yt-dlp are set with a dictionary, with the format and filename
  • The audio file is downloaded with ydl.download
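
One caveat worth noting: with the options above, yt-dlp saves the best available audio stream under the .mp3 filename without actually transcoding it. Speechmatics accepts most common audio formats, so this generally still works. If you do want a real MP3, a hedged alternative sketch (it requires FFmpeg to be installed) could look like this:

# Alternative options (sketch): let yt-dlp pick the extension, then convert to MP3 with FFmpeg
ydl_opts = {
    'format': 'bestaudio/best',
    'outtmpl': f'{video_id}.%(ext)s',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
    }],
}
# After the download, the converted file is available as f'{video_id}.mp3'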

Implementing Speechmatics Features

With the UI defined and the YouTube logic to download the audio file from the video, you can now write the code to integrate with the Speechmatics API.

For this, you will create a new file called speechmatics_client.py, to keep the logic organized and not overload the main file:

from decouple import config
from speechmatics.models import ConnectionSettings
from speechmatics.batch_client import BatchClient
from httpx import HTTPStatusError
import streamlit as st

# API key and language
AUTH_TOKEN = config("SPEECHMATICS_API_KEY")
LANGUAGE = "en"

# API Settings
settings = ConnectionSettings(
    url="<https://asr.api.speechmatics.com/v2>",
    auth_token=AUTH_TOKEN,
)

# Define transcription parameters
conf = {
    "type": "transcription",
    "transcription_config": {
        "language": LANGUAGE,
        "diarization": "speaker",
        "operating_point": "enhanced"
    },
    "summarization_config": {
        "content_type": "informative",
        "summary_length": "detailed",
        "summary_type": "paragraphs"
      },
    "auto_chapters_config": {}
}

First, you start with the imports and then the necessary API configuration:

  • Defining the language and API key
  • Setting the API endpoint and authentication
  • Defining the transcription parameters, which include speaker diarization, summarization, auto chapters, and the enhanced operating point (which provides higher accuracy)

Then you can proceed to define the main function that calls the API:

# Transcribe the audio file
@st.cache_data(show_spinner=False)
def transcribe(file_path):
    # Open the client using a context manager
    with BatchClient(settings) as client:
        try:
            # Submit the job
            job_id = client.submit_job(
                audio=file_path,
                transcription_config=conf,
            )
            print(f"job {job_id} submitted successfully, waiting for transcript")
            # Wait for the job to complete
            transcription = client.wait_for_completion(job_id, transcription_format="json")
            summary = transcription["summary"]["content"]
            chapters = transcription["chapters"]
            transcript = transcription["results"]
            # Return the transcript, summary and chapters
            return transcript, summary, chapters
        except HTTPStatusError:
            print("Invalid API key")

Let's take a closer look at the code:

  • A BatchClient is created using the previously defined settings
  • A job is submitted to the API with client.submit_job
  • Then the code waits for the job to complete with client.wait_for_completion. This means the code blocks while the API processes the file, which can take anywhere from a couple of seconds to several minutes. Speechmatics also allows for notifications as an alternative to waiting.
  • When the job is finished, the summary, chapters and transcript are extracted from the response
  • The @st.cache_data decorator enables the response from the API call to be cached in Streamlit, which prevents repeated API calls when asking questions with the ChatBot later on.

Finally, here are a couple of helper functions you can add to the file to format the API results for display in the UI later on:

# Format the chapters
def format_chapters(chapters):
    chapter_list = []
    for chapter in chapters:
        start_time = chapter['start_time']
        end_time = chapter['end_time']
        title = chapter['title']
        chapter_summary = chapter['summary']
        start_time_minutes_seconds = f"{start_time // 60:02.0f}:{start_time % 60:02.0f}"
        end_time_minutes_seconds = f"{end_time // 60:02.0f}:{end_time % 60:02.0f}"
        chapter_list.append({
            "start": start_time_minutes_seconds,
            "end": end_time_minutes_seconds,
            "title": title,
            "summary": chapter_summary
        })
    return chapter_list

# Format the transcript
def format_transcript(transcript):
    transcript_list = []
    current_speaker = None
    content = ''
    for alternatives in transcript:
        alternative = alternatives['alternatives'][0]
        # Get the speaker
        speaker = alternative['speaker']
        if current_speaker is None:
            current_speaker = speaker
        # Get the text and type and end of sentence
        text = alternative['content']
        type_text = alternatives['type']
        end_of_sentence = False
        if 'is_eos' in alternatives:
            end_of_sentence = alternatives['is_eos']
        # If the speaker is the same, append the text
        if current_speaker == speaker:
            if type_text != 'punctuation':
                content += ' '
            content += text
            if end_of_sentence:
                # If the sentence ends, append the text to the transcript list
                content += '\n\n'
        else:
            # If the speaker is different, append the text to the transcript list
            transcript_list.append({
                "speaker": current_speaker.replace('S', 'SPEAKER '),
                "text": content
            })
            # Start accumulating content for the new speaker
            content = text
            current_speaker = speaker
    # Append the last speaker
    transcript_list.append({
        "speaker": current_speaker.replace('S', 'SPEAKER '),
        "text": content
    })
    return transcript_list

Let's look at both functions in more detail, starting with format_chapters:

  • It extracts the chapter data from the chapter list returned from the API
  • Then it formats the times from seconds into minutes and seconds
  • Finally, it appends a formatted dictionary to a new chapter list

The format_transcript function is designed to:

  • Iterate over the list of transcribed words returned from the API
  • It keeps track of the current speaker
  • It accumulates the text and keeps track of the end of a sentence
  • When there is a change of speaker, the text content and speaker indication are placed in a dictionary and appended to a new transcription list
  • For the last sentence of a speaker (or when only one speaker exists), the last content text is appended
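
To make this concrete, here is a small hypothetical sample of the results structure (heavily simplified; the real Speechmatics response includes more fields such as timings and confidences) and roughly what format_transcript produces from it:

# Hypothetical, simplified sample of the API "results" list
sample_results = [
    {"type": "word", "alternatives": [{"content": "Hello", "speaker": "S1"}]},
    {"type": "word", "alternatives": [{"content": "there", "speaker": "S1"}]},
    {"type": "punctuation", "is_eos": True, "alternatives": [{"content": ".", "speaker": "S1"}]},
    {"type": "word", "alternatives": [{"content": "Hi", "speaker": "S2"}]},
    {"type": "word", "alternatives": [{"content": "everyone", "speaker": "S2"}]},
]

print(format_transcript(sample_results))
# Roughly: [{'speaker': 'SPEAKER 1', 'text': ' Hello there.\n\n'},
#           {'speaker': 'SPEAKER 2', 'text': 'Hi everyone'}]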

Asking questions with LangChain and OpenAI

Before we display the results of the Speechmatics API in the UI, you can first prepare the necessary logic to be able to ask a ChatBot questions about the video content.

This ChatBot is powered by OpenAI's GPT models and uses LangChain as the framework for interfacing with OpenAI and providing the necessary Retrieval-augmented generation (RAG) functionalities.

LangChain is a versatile language model library that offers tools for integrating language models with external knowledge sources and databases, enabling the creation of sophisticated language-based applications. This library is particularly useful for developers and researchers working on advanced natural language processing and AI projects.

You will now create a new file called question.py to keep the LangChain logic separated and organized from the other files:

from decouple import config
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.chat_models import ChatOpenAI
from langchain_community.document_loaders import TextLoader, MergedDataLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores.chroma import Chroma

OPENAI_API_KEY = config("OPENAI_API_KEY")

With this, you have set up the necessary imports.

Now you can create the function to load the transcript, summary, and chapters into LangChain documents:

def _load_documents(transcript, summary, chapters):
    # Temporary save text to file
    with open("transcript.txt", "w") as f:
        f.write(transcript)
    with open("summary.txt", "w") as f:
        f.write(summary)
    with open("chapters.txt", "w") as f:
        f.write(chapters)
    # Create the loaders
    loader_transcript = TextLoader("transcript.txt")
    loader_summary = TextLoader("summary.txt")
    loader_chapters = TextLoader("chapters.txt")
    # Merge the loaders
    loader_all = MergedDataLoader(loaders=[loader_transcript, loader_summary, loader_chapters])
    # Load the documents
    documents = loader_all.load()
    return documents

Let's describe this code:

  • The three strings are saved to temporary files for easy loading (an alternative without temp files is sketched after this list)
  • Three TextLoaders are created and merged with MergedDataLoader
  • Finally, the documents are loaded with loader_all.load()
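
As a side note, the temporary files are mainly a convenience for TextLoader. An alternative sketch (not used in the rest of this article) builds the LangChain documents directly in memory:

from langchain.schema import Document

# Alternative sketch: build the documents in memory instead of writing temp files
def _load_documents(transcript, summary, chapters):
    return [
        Document(page_content=transcript, metadata={"source": "transcript"}),
        Document(page_content=summary, metadata={"source": "summary"}),
        Document(page_content=chapters, metadata={"source": "chapters"}),
    ]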

Now you will create the function that splits these documents:

def _split_documents(documents):
    splitter = CharacterTextSplitter(chunk_size=4000, chunk_overlap=0)
    texts = splitter.split_documents(documents)
    return texts

Splitting the documents is necessary to prepare chunks that the embedding model can process. The code creates a CharacterTextSplitter and splits the documents into text chunks with splitter.split_documents.

Next, you define the retriever:

def _create_retriever(texts):
    db = Chroma.from_documents(texts, OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY))
    retriever = db.as_retriever()
    return retriever

The retriever searches the documents to prepare the necessary context for the OpenAI GPT model when a question is asked. The code simply creates a Chroma database with Chroma.from_documents and a matching retriever with db.as_retriever().

Next, you will create the function that prepares all the data:

def _process_data(transcript, summary, chapters):
    documents = _load_documents(transcript, summary, chapters)
    texts = _split_documents(documents)
    retriever = _create_retriever(texts)
    return retriever

Finally, you create the two main functions that start the chat and handle answering the questions:

def start_chat(transcript, summary, chapters):
    print("Starting chat...")
    retriever = _process_data(transcript, summary, chapters)
    memory = ConversationBufferMemory(memory_key="chat_history", input_key='question', output_key='answer',
                                      return_messages=True)
    model = ChatOpenAI(model="gpt-3.5-turbo-0613", openai_api_key=OPENAI_API_KEY, temperature=0)
    qa_chain = ConversationalRetrievalChain.from_llm(model, retriever=retriever,
                                                     return_source_documents=True, memory=memory)
    return qa_chain

def ask_question(question, qa_chain):
    result = qa_chain(question)
    return result

Let's break down the code:

  • In the start_chat function, the retriever is obtained from the call to the _process_data function
  • Then the chat memory and AI model are set up; it uses the gpt-3.5-turbo-0613 model, which provides a good cost/benefit ratio
  • The chat chain is created with ConversationalRetrievalChain, which enables the enriched chat functionality with access to the transcript, summary, and chapter context; the ask_question function simply passes a question to this chain, as sketched below
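
Outside of Streamlit, using these two functions together looks roughly like this (a quick sketch, assuming you already have the transcript, summary, and chapters as plain strings):

# Sketch: build the chain once, then ask as many questions as you like
qa_chain = start_chat(transcript_text, summary_text, chapters_text)
result = ask_question("What is this video about?", qa_chain)
print(result["answer"])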

Displaying Results

Now you are ready to display the results of the Speechmatics API and also of the RAG ChatBot on the UI.

For that, you continue with adding code to the main.py file:

# Check if the URL is entered and display it
if youtube_url:
        ... previous code...
        # Call Speechmatics API
        st.write('Calling Speechmatics API...')
        transcript, summary, chapters = transcribe(filename)

    # Display the results
    st.divider()
    st.subheader('Transcript')
    transcript_list = format_transcript(transcript)
    for transcript in transcript_list:
        st.markdown(f"**{transcript['speaker']}:**")
        st.markdown(f"{transcript['text']}")
    st.divider()
    st.subheader('Summary')
    st.write(summary)
    st.divider()
    st.subheader('Chapters')
    chapter_list = format_chapters(chapters)
    for chapter in chapter_list:
        st.write(f"{chapter['start']} - {chapter['end']}: {chapter['title']}")
        st.caption(chapter['summary'])

Let's break down the new code:

  • Inside the previously existing block (the one that downloads from YouTube), after the download it calls the transcribe function defined earlier to send the file to the Speechmatics API for processing
  • The returned values are then formatted and displayed in the UI with st.markdown and st.write, separated by dividers (st.divider) to keep the content organized; the imports these snippets rely on are noted right after this list
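
Note that these snippets assume the top of main.py also imports the helpers from the two modules created earlier, along these lines (using the file and function names from this article):

from speechmatics_client import transcribe, format_transcript, format_chapters
from question import start_chat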

Finally, you can now write the code to interact with the ChatBot in the main.py file:

    # Prepare chat
    st.divider()
    st.subheader('Chat Bot')
    with st.spinner('Preparing chat...'):
        # Prepare the chat context
        transcript_text = ''
        for transcript in transcript_list:
            transcript_text += f"{transcript['speaker']} - {transcript['text']}\\n"
        summary_text = summary
        chapters_text = ''
        for chapter in chapter_list:
            chapters_text += f"{chapter['start']} - {chapter['end']}: {chapter['title']}\\n"
        # Start the chat
        qa_chain = start_chat(transcript_text, summary_text, chapters_text)
    # Ask a question
    question = st.text_input('Enter question', placeholder='Ask a question...')
    if question:
        with st.spinner('Thinking...'):
            # Process the question
            result = qa_chain(question)
        # Display the answer
        st.caption('Answer:')
        st.markdown(result['answer'])

Let's understand how this code works:

  • First, the transcript and chapter lists are converted to plain text strings (the summary is already a string)
  • These are passed as inputs to the start_chat function, which builds the RAG pipeline and prepares the ChatBot
  • Then an input (st.text_input) is prepared for the question, which is passed to the qa_chain that calls the OpenAI GPT model and returns an answer; with that in place, you can run the app as shown below
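
With everything in place, you can start the application from your project folder with Streamlit's command-line runner:

streamlit run main.py

Streamlit will open the app in your browser, where you can paste a YouTube URL and watch the transcript, summary, chapters, and ChatBot appear.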

Comparisons

Of course, YouTube itself offers some of these features, and Google Gemini can also perform question-answering on YouTube videos.

Let's see how your application compares to YouTube itself in terms of generated captions, and to Google Gemini in terms of question-answering.

As an example, consider this YouTube video: Torvalds Speaks: Rust's Impact on the Linux Kernel.

Comparison with YouTube captions

We start the comparison with the transcript generated by your newly created application. This is a sample of that transcript from the Speechmatics API:

Speechmatics Transcript

You can get the full Speechmatics transcription text here:

This is a sample of the YouTube-generated transcript:

YouTube Transcript

You can get the full YouTube transcription text here:

Comparing the two transcripts, several deviations in accuracy can be identified in the YouTube transcript. Here are some key points:

  • Misspellings and Incorrect Words:
    • The YouTube transcript incorrectly spells "kernel" as "konel" and "colonel" in several places.
    • It also spells "Rust" as "rust," which should be capitalized as it's a proper noun (name of a programming language).
  • Misinterpretation of Phrases:
    • The phrase β€œI have no idea how you would translate this into Japanese” is correctly captured in the Speechmatics transcript, while the YouTube transcript breaks it into smaller, disconnected segments.
    • In the YouTube transcript, the phrase β€œRust still is at the point where we don't” is incorrectly broken, changing the meaning.
  • Timestamps vs. Speaker Labels:
    • The Speechmatics transcript uses speaker labels for clarity, while the YouTube transcript uses timestamps. Timestamps are useful for locating parts of the video, but they don't provide clarity on who is speaking.

Comparison with Google Gemini generated chapters

Let's check the chapter creation of Google Gemini and your application.

These are your application's chapters:

Speechmatics Chapters

And asking Google Gemini to generate chapters, this is the result:

Gemini Chapters

Speechmatics' chapter functionality provides more meaningful chapters, with better descriptions, and also includes chapter summaries.

Comparison to Google Gemini for question-answering

Now let's compare the ChatBot capabilities with Google Gemini.

For example, asking "What is this video about?".

Let's see your ChatBot's reply:

Our ChatBot answer to questions

And this is Google Gemini's reply:

Gemini answers to questions

As you can see, the answers are very similar, but our ChatBot gives some additional information.


Conclusion

This project showcases the fusion of Speechmatics' advanced speech recognition technology with Streamlit's user-friendly interface, providing a significant advancement in digital content analysis.

With its "Global English" ASR, Speechmatics adeptly handles various accents and dialects, turning spoken words into valuable insights. When integrated with Streamlit, this technology becomes even more powerful, offering an accessible and interactive platform.

This application is not just about transcribing YouTube content; it's about summarizing it, organizing it into chapters, and identifying different speakers, thereby enriching understanding and engagement.

Yet, this is merely the starting point. The true potential of this project lies in its adaptability and room for growth. It serves as a foundational platform, inviting users to explore, modify, and enhance it to meet their specific needs and creative visions. The project lets experienced developers and curious novices alike engage with it, offering a playground of technological innovation.


References

To further your understanding and exploration of this project, the following resources are invaluable:

Speechmatics Official Documentation: For a comprehensive guide to Speechmatics' API and its features, visit their official documentation. It offers detailed information on setup, usage, and advanced features.

Speechmatics Developer Portal: The Speechmatics Developer Portal is a great resource for developers, providing API references, development guides, and best practices.

Speechmatics Blog: The Speechmatics Blog offers insights, case studies, and the latest updates on speech recognition technology.