In this article, we will build an API with FastAPI that converts audio files into text (transcription), condenses the text into a shorter version (summarization), and labels the content (tagging).

We'll use FasterWhisper for transcription, and MistralAI for summarization and tagging, all running locally on the CPU.

This system takes advantage of modern machine learning models to process audio data quickly and efficiently, entirely on local hardware.

Understanding the Components

FastAPI is a modern, high-performance web framework used for building APIs with Python 3.7 and above. It's based on standard Python type hints and is designed to be user-friendly and easy to learn. FastAPI provides high performance and automatically generates interactive API documentation using Swagger and ReDoc. It enables developers to quickly build robust, production-ready APIs with minimal code, taking advantage of asynchronous capabilities and automatic data validation provided by Pydantic.
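
To get a feel for the framework before we build the real endpoints, here's a minimal, self-contained sketch; the greet route and Greeting model are purely illustrative and not part of our project:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Greeting(BaseModel):
    message: str


@app.get("/greet/{name}", response_model=Greeting)
async def greet(name: str):
    # The path parameter is parsed and validated from the type hint,
    # and the response is validated against the Greeting model by Pydantic
    return Greeting(message=f"Hello, {name}!")

Running it with uvicorn (for example, uvicorn demo:app --reload, if the file is saved as demo.py) serves the API and exposes the auto-generated Swagger documentation at /docs.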

FasterWhisper is a reimplementation of OpenAI's Whisper model built on the CTranslate2 inference engine, designed for fast and accurate automatic speech recognition (ASR). Whisper is a cutting-edge model that transcribes audio into text, handling multiple languages, various accents, and different speaking styles. FasterWhisper delivers significantly faster inference with comparable accuracy, making it well suited to real-time transcription applications.

MistralAI Instruct, built on the Mistral-7B model, is a powerful language model that generates detailed and informative responses to user queries. It can be used for a variety of natural language processing tasks, such as summarizing, tagging, and generating descriptions. It is adept at understanding and processing complex texts, making it an ideal choice for applications that require nuanced, context-aware text generation, particularly condensing large amounts of text into concise summaries and tags. We will use TheBloke's quantized GGUF build, which runs on the CPU.


Setting Up the FastAPI Application

Let's now set up our FastAPI application, which consists of three main components: main.py, mistral_wrapper.py, and transcribe_wrapper.py.
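
The project layout will look like this (the GGUF model file is downloaded separately, as described later in the article):

main.py
mistral_wrapper.py
transcribe_wrapper.py
mistral-7b-instruct-v0.2.Q4_K_M.gguf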

Transcribing Audio with FasterWhisper

Create a transcribe_wrapper.py file to handle the transcription of audio files using FasterWhisper:

from faster_whisper import WhisperModel


# Transcribe the audio
def transcribe_audio_from_file(audio_file):
    output_text = ""
    try:
        # Load the model
        model_size = "small"
        model = WhisperModel(model_size, device="cpu", compute_type="int8")
        # Transcribe the audio
        segments, info = model.transcribe(audio_file, beam_size=5, language="en")
        # Combine the segments into a single string
        for segment in segments:
            output_text += segment.text + " "
        # Return the transcribed text
        return output_text, None
    except Exception as e:
        return None, str(e)

Here's a breakdown of the code, followed by a quick usage sketch:

  • The file imports the WhisperModel class from the faster_whisper library.
  • The function transcribe_audio_from_file takes one argument, audio_file, which is the path to the audio file to be transcribed.
  • Inside the function, an empty string output_text is initialized to store the transcribed text.
  • The function uses a try-except block to handle potential errors during transcription:
    • In the try block, the function first loads the FasterWhisper model with a specified size ("small" in this case) and sets the device to "cpu" and compute type to "int8".
    • The transcribe method of the WhisperModel instance is called to transcribe the audio file. The method takes the audio file, a beam size (5 in this case), and the language ("en" for English) as arguments.
    • The transcription result is stored in two variables: segments (an iterator over transcribed text segments; faster_whisper transcribes lazily as you consume it) and info (additional information about the transcription).
    • The function then iterates through the segments list, concatenating each segment's text with a space separator, and stores the result in the output_text variable.
    • Finally, the function returns the transcribed text (output_text) and None, signaling that no error occurred.
  • If an exception occurs during the transcription process, the except block catches the exception and the function returns None and the error message as a string.
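
To try the wrapper on its own, here's a minimal usage sketch; sample.mp3 is a hypothetical placeholder for any local audio file:

from transcribe_wrapper import transcribe_audio_from_file

# "sample.mp3" is a placeholder; point this at a real audio file
text, error = transcribe_audio_from_file("sample.mp3")
if error:
    print(f"Transcription failed: {error}")
else:
    print(text)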

Summarizing and Tagging with MistralAI

Create a mistral_wrapper.py file to handle the summarization and tagging of the transcribed text using MistralAI:

import json
import os
from llama_cpp import Llama


# Set up the Llama model with Mistral-7B instruct
llm = Llama(model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
            chat_format="llama-2",
            n_threads=os.cpu_count() // 2,  # half the available CPU cores
            n_ctx=4096)


# Function to summarize and generate tags for the transcription
def summarize_and_generate_tags(transcription):
    try:
        # Prepare the chat completion with the system message and user query
        result = llm.create_chat_completion(
            messages=[
                {"role": "system", "content": """You are a helpful assistant that receives transcriptions and generates 
                                                summaries and tags. They should be helpful and informative.
                                                Please return the output in the following JSON format only:
                                                {
                                                    "summary": "A summary of the transcription.",
                                                    "tags": ["tag1", "tag2", "tag3"]
                                                    "description": "A description of the transcription content."
                                                }
                                                """
                 },
                {
                    "role": "user",
                    "content": "Generate a summary and tags for the following transcription: " + transcription
                }
            ]
        )
        # Return the summary and tags from the chat completion converted to JSON
        return json.loads(result["choices"][0]["message"]["content"]), None
    except Exception as e:
        return None, str(e)

Here's a breakdown of the code, followed by a short usage example:

  • The code imports the standard json and os modules, along with the Llama class from the llama_cpp library.
  • A Llama instance named llm is created with the Mistral-7B instruct model, using the specified model path, chat format, number of threads (half the available CPU cores), and a context window of 4096 tokens.
  • The function summarize_and_generate_tags takes one argument, transcription, which is the text to be summarized and tagged.
  • Inside the function, a try-except block is used to handle potential errors during the summarization and tagging process:
    • In the try block, the function calls the create_chat_completion method of the llm instance, passing a list of messages that includes a system message with instructions and a user message with the transcription.
    • The method returns a chat completion, which is stored in the result variable.
    • The function then extracts the message content from the chat completion, parses the JSON string into a Python object with json.loads(), and returns that object along with None.
  • If an exception occurs during the summarization and tagging process, the except block catches the exception and the function returns None and the error message as a string.
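
To test the function outside the API, here's a short sketch; the sample transcription below is invented purely for illustration:

from mistral_wrapper import summarize_and_generate_tags

# Illustrative input; any transcribed text works here
sample = "Today we walked through building a FastAPI service that transcribes audio."
result, error = summarize_and_generate_tags(sample)
if error:
    print(f"Summarization failed: {error}")
else:
    print(result["summary"])
    print(result["tags"])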

Please note that you will need to download the model from HuggingFace first. In this case, we are using TheBloke's mistral-7b-instruct-v0.2.Q4_K_M.gguf file.
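
If you have the huggingface_hub package installed, one way to fetch the file is with hf_hub_download. The repo_id below is our assumption of TheBloke's repository name, so verify it on HuggingFace before running:

from huggingface_hub import hf_hub_download

# The repo_id is assumed; confirm the exact repository name on HuggingFace
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    local_dir=".",
)
print(model_path)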

Creating the Main Application

Create a main.py file for your FastAPI application: