In this article, we will build an API using FastAPI to manage the conversion of audio files into text (transcription), condense the text into a shorter version (summarization), and label the content (tagging).
We'll use FasterWhisper for transcription, and MistralAI for summarization and tagging, all running locally on the CPU.
This system takes advantage of the capabilities of modern machine learning models to process audio data quickly and efficiently.
Understanding the Components
FastAPI is a modern, high-performance web framework used for building APIs with Python 3.7 and above. It's based on standard Python type hints and is designed to be user-friendly and easy to learn. FastAPI provides high performance and automatically generates interactive API documentation using Swagger and ReDoc. It enables developers to quickly build robust, production-ready APIs with minimal code, taking advantage of asynchronous capabilities and automatic data validation provided by Pydantic.
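As a quick, hedged illustration of that style (the `Item` model and `/items/` route below are hypothetical, not part of this project), a minimal FastAPI app looks like this:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Pydantic model: incoming JSON bodies are validated against these type hints
class Item(BaseModel):
    name: str
    price: float

@app.post("/items/")
async def create_item(item: Item):
    # By the time this runs, FastAPI has already parsed and validated the body
    return {"name": item.name, "price": item.price}

Running it with uvicorn also gives you interactive Swagger docs at /docs for free.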
FasterWhisper is an improved version of OpenAI's Whisper model, created for fast and accurate automatic speech recognition (ASR). Whisper is a cutting-edge model that transcribes audio into text, capable of handling multiple languages, various accents, and different speaking styles. FasterWhisper enhances the original model's performance, making it suitable for real-time transcription applications where speed and accuracy are essential.
MistralAI Instruct, built on the Mistral-7B model, is a powerful language model that generates detailed and informative responses to user queries. It can be utilized for various natural language processing tasks, such as summarizing, tagging, and generating descriptions. MistralAI Instruct is skilled at understanding and processing complex texts, making it an ideal choice for applications that require nuanced and context-aware text generation. This model is particularly helpful for tasks that involve condensing large amounts of text into concise and meaningful summaries and tags. We will use the CPU version from TheBloke.
Get "Python's Magic Methods - Beyond __init__ and __str__"
Magic methods are not just syntactic sugar, they're powerful tools that can significantly improve the functionality and performance of your code. With this book, you'll learn how to use these tools correctly and unlock the full potential of Python.
Project Structure
Let's now set up our FastAPI application, which is composed of three main components: `main.py`, `mistral_wrapper.py`, and `transcribe_wrapper.py`.
Transcribing Audio with FasterWhisper
Create a `transcribe_wrapper.py` file to handle the transcription of audio files using FasterWhisper:
from faster_whisper import WhisperModel

# Transcribe the audio
def transcribe_audio_from_file(audio_file):
    output_text = ""
    try:
        # Load the model
        model_size = "small"
        model = WhisperModel(model_size, device="cpu", compute_type="int8")
        # Transcribe the audio
        segments, info = model.transcribe(audio_file, beam_size=5, language="en")
        # Combine the segments into a single string
        for segment in segments:
            output_text += segment.text + " "
        # Return the transcribed text
        return output_text, None
    except Exception as e:
        return None, str(e)
Here's a breakdown of the code:
- The code imports the `WhisperModel` class from the `faster_whisper` library.
- The function `transcribe_audio_from_file` takes one argument, `audio_file`, which is the path to the audio file to be transcribed.
- Inside the function, an empty string `output_text` is initialized to store the transcribed text.
- The function uses a try-except block to handle potential errors during transcription:
    - In the try block, the function first loads the FasterWhisper model with a specified size ("small" in this case), setting the device to "cpu" and the compute type to "int8".
    - The `transcribe` method of the `WhisperModel` instance is called to transcribe the audio file. The method takes the audio file, a beam size (5 in this case), and the language ("en" for English) as arguments.
    - The transcription result is stored in two variables: `segments` (a generator of transcribed text segments) and `info` (additional information about the transcription).
    - The function then iterates through `segments`, concatenating each segment's text with a space separator, and stores the result in the `output_text` variable.
    - Finally, the function returns the transcribed text (`output_text`) and `None`.
- If an exception occurs during transcription, the except block catches it and the function returns `None` and the error message as a string.
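To sanity-check the wrapper on its own before wiring it into the API, you could call it directly; the `sample.mp3` path below is just a placeholder:

from transcribe_wrapper import transcribe_audio_from_file

# Transcribe a local audio file and print either the text or the error
text, error = transcribe_audio_from_file("sample.mp3")
if error:
    print(f"Transcription failed: {error}")
else:
    print(text)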
Summarizing and Tagging with MistralAI
Create a `mistral_wrapper.py` file to handle the summarization and tagging of the transcribed text using MistralAI:
import json
import os

from llama_cpp import Llama

# Set up the Llama model with Mistral-7B Instruct
# (n_threads expects an integer, hence the floor division)
llm = Llama(
    model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    chat_format="llama-2",
    n_threads=os.cpu_count() // 2,
    n_ctx=4096,
)

# Function to summarize and generate tags for the transcription
def summarize_and_generate_tags(transcription):
    try:
        # Prepare the chat completion with the system message and user query
        result = llm.create_chat_completion(
            messages=[
                {
                    "role": "system",
                    "content": """You are a helpful assistant that receives transcriptions and generates
summaries and tags. They should be helpful and informative.
Please return the output in the following JSON format only:
{
    "summary": "A summary of the transcription.",
    "tags": ["tag1", "tag2", "tag3"],
    "description": "A description of the transcription content."
}
""",
                },
                {
                    "role": "user",
                    "content": "Generate a summary and tags for the following transcription: " + transcription,
                },
            ]
        )
        # Parse the model's JSON reply and return it alongside a null error
        return json.loads(result["choices"][0]["message"]["content"]), None
    except Exception as e:
        return None, str(e)
Here's a breakdown of the code:
- The code imports the standard `json` and `os` modules, and the `Llama` class from the `llama_cpp` library.
- A `Llama` instance named `llm` is created with the Mistral-7B Instruct model, using the specified model path, chat format, number of threads (half the available CPU cores), and a context window of 4096 tokens.
- The function `summarize_and_generate_tags` takes one argument, `transcription`, which is the text to be summarized and tagged.
- Inside the function, a try-except block is used to handle potential errors during the summarization and tagging process:
    - In the try block, the function calls the `create_chat_completion` method of the `llm` instance, passing a list of messages that includes a system message with instructions and a user message with the transcription.
    - The method returns a chat completion, which is stored in the `result` variable.
    - The function then extracts the message content from the chat completion, parses it from a string into a JSON object using `json.loads()`, and returns the summary and tags.
- If an exception occurs during the summarization and tagging process, the except block catches it and the function returns `None` and the error message as a string.
Please note that you will need to download the model first from HuggingFace. In this case, we are using the `mistral-7b-instruct-v0.2.Q4_K_M.gguf` file from TheBloke's repository.
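With the model file downloaded, you can exercise this wrapper on its own as well; the sample transcription string below is made up for illustration:

from mistral_wrapper import summarize_and_generate_tags

# Summarize and tag a short sample transcription
analysis, error = summarize_and_generate_tags(
    "Today we walked through setting up a home recording studio on a budget..."
)
if error:
    print(f"Summarization failed: {error}")
else:
    print(analysis["summary"])
    print(analysis["tags"])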
Setting Up the FastAPI Application
Create a `main.py` file for your FastAPI application:
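Since the wrappers already expose simple functions, `main.py` mostly just wires them to an upload endpoint. Here is a minimal sketch of what it could look like; the `/process-audio/` path, the temporary-file handling, and the response shape are illustrative assumptions rather than the article's exact code:

import os
import tempfile

from fastapi import FastAPI, HTTPException, UploadFile

from mistral_wrapper import summarize_and_generate_tags
from transcribe_wrapper import transcribe_audio_from_file

app = FastAPI()

# Note: file uploads require the python-multipart package to be installed
@app.post("/process-audio/")
async def process_audio(file: UploadFile):
    # Save the uploaded audio to a temporary file so FasterWhisper can read it
    suffix = os.path.splitext(file.filename or "")[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    try:
        # Step 1: transcribe the audio
        transcription, error = transcribe_audio_from_file(tmp_path)
        if error:
            raise HTTPException(status_code=500, detail=error)

        # Step 2: summarize and tag the transcription
        analysis, error = summarize_and_generate_tags(transcription)
        if error:
            raise HTTPException(status_code=500, detail=error)

        return {"transcription": transcription, **analysis}
    finally:
        # Clean up the temporary file whether or not processing succeeded
        os.remove(tmp_path)

You could then start the server with uvicorn main:app --reload and POST an audio file to /process-audio/ to receive the transcription, summary, tags, and description in a single response.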