Video to audio to transcript to summary using local AI: whisperfile and llama3.3
Using local LLMs to extract and summarize audio transcripts from YouTube and podcasts. Examples: DESeq2 tutorial from YouTube, and the Nextflow Podcast.
A few days ago I wrote about translating R package help documentation using a local LLM (e.g., llama3.x), when Mick Watson commented:
If we could get one that translates YouTube help videos into simple, textual steps, that would be amazing
I was already thinking of wiring up something like this using local AI models — something to summarize podcasts, conference recordings, etc. The relatively new (as of this writing) Gemini 2.0 Flash model will do this for you for YouTube videos. But what if you wanted to do this offline using a local LLM?
I’ll demo doing this with a video from YouTube (a DESeq2 tutorial) and a podcast (a recent episode of the Nextflow Podcast).
Demo: DESeq2 tutorial from YouTube
Here’s one way to do it; there are probably others. The basic steps:

1. Extract audio from the YouTube video with yt-dlp.
2. Create a transcript from the audio using Whisper via whisperfile.
3. Summarize the transcript with llama3.3 via Ollama.
There’s a huge caveat here — this is not actually doing anything with the video. This is extracting a summary from the audio only. So if the majority of the important content is in the video only, this won’t work so well.
Here’s the video we’ll be using. This was a video from Rob Aboukhalil about how to do differential expression analysis with DESeq2 published to the OMGenomics channel on December 17, 2024.
Extract audio with yt-dlp
yt-dlp is an amazing utility for downloading videos for offline use from thousands of websites, including YouTube.¹ You can download binaries from the GitHub page or install it from PyPI with pip. Instead of getting the video, we’re going to extract the audio only, in MP3 format, with -x --audio-format mp3:
yt-dlp -x --audio-format mp3 -o tmp.mp3 youtube.com/watch?v=NGbZmlGLG5w
That command will give you an MP3 file with the audio from the YouTube video you paste in.
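If you want a quick sanity check on what you downloaded, ffprobe (it ships with ffmpeg, which yt-dlp needs for the -x audio extraction anyway) will report the duration, bitrate, and codec:

# Inspect the extracted audio file (requires ffmpeg/ffprobe)
ffprobe -hide_banner tmp.mp3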
Extract a transcript with Whisper
Next we’re going to extract a transcript with whisperfile, a high-performance implementation of OpenAI’s Whisper created by Mozilla Ocho as part of the llamafile project. Like llamafile for LLMs, whisperfile packages a model as a single-file executable you can run locally. Here we’re going to use the tiny ~80 MB English model to extract a transcript.
First, get the whisper-tiny model as a llamafile, make it executable, then get some help.
# Get the whisper-tiny model as a llamafile
wget https://huggingface.co/Mozilla/whisperfile/resolve/main/whisper-tiny.en.llamafile
# Make it executable
chmod +x whisper-tiny.en.llamafile
# Get some help
./whisper-tiny.en.llamafile -h
Let’s first print the transcript using ANSI colors (-pc). Transcribed text appears in the terminal on a spectrum from green (high confidence) to red (low confidence).
./whisper-tiny.en.llamafile -pc -f tmp.mp3
That’s nice for eyeballing quality, but for the actual transcript let’s suppress the extra output (-np) and strip out the timestamps with a little Perl regex.
./whisper-tiny.en.llamafile -np -f tmp.mp3 \
| perl -pe 's/\[.+\]\ +//g' \
> transcript.txt
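For reference, each raw line from whisperfile looks roughly like the illustrative example below (the timestamps here are made up, and exact spacing varies); the Perl one-liner just strips the leading bracketed timestamp so only the spoken text remains.

# Illustrative raw whisperfile output line (not real output):
# [00:00:00.000 --> 00:00:05.000]   ...transcribed text...
# After the perl regex, only the text survives:
# ...transcribed text...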
Summarize with llama3.3
Now we’ll use some bash process substitution to stick a custom prompt in front of that transcript. That is, $prompt will contain “summarize this transcript: …”, where … is the transcript we extracted above.
prompt=$(cat <(echo "summarize this transcript: ") transcript.txt)
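If the process substitution feels opaque, plain command substitution builds an equivalent prompt (the only difference is a newline between the instruction and the transcript):

# Equivalent, without process substitution
prompt="summarize this transcript: $(cat transcript.txt)"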
Now we’ll summarize this text with llama3.3² using Ollama.
ollama run llama3.3 "$prompt" > summary.md
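As an aside, if you’d rather script this against Ollama’s local REST API than the CLI, something like the following should work too. This is a sketch assuming Ollama is serving on its default port 11434 and that you have jq installed:

# Build the JSON request safely with jq, POST it to Ollama,
# and pull the summary text out of the "response" field
curl -s http://localhost:11434/api/generate \
  -d "$(jq -n --arg p "$prompt" '{model: "llama3.3", prompt: $p, stream: false}')" \
  | jq -r '.response' > summary.md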
The output here is markdown text. Here’s what I got:
This is a comprehensive tutorial on RNA-seq analysis using various tools such as DESeq2, Biomart, and Circa. Here's a summary of the steps:

**Introduction**

- Introduction to RNA-seq and its applications
- Overview of the tutorial

**Data Preparation**

- Downloading the data from the Sequence Read Archive (SRA)
- Converting the data to FASTQ format
- Trimming adapters and low-quality bases using Trim Galore
- Aligning reads to the reference genome using HISAT2

**Differential Expression Analysis**

- Counting reads using featureCounts
- Importing counts into R using DESeq2
- Running DESeq2 to identify differentially expressed genes
- Filtering results based on fold change and p-value thresholds

**Validation of Results**

- Validating the results by checking the expression levels of known genes
- Comparing the results with existing literature or databases

**Visualization of Results**

- Using Biomart to retrieve gene annotations and genomic coordinates
- Merging the DESeq2 output with the Biomart data
- Saving the merged data as a CSV file
- Creating a Circos plot using Circos to visualize the genomic locations of differentially expressed genes

**Circos Plot**

- Creating a new plot in Circos
- Adding data tracks for log2 fold change, base mean read counts, and p-values
- Customizing the plot appearance and layout
- Annotating the plot with gene names and saving it as a JPEG file

**Conclusion**

Recap of the tutorial and encouragement to explore different datasets and tools. Some key takeaways from this tutorial include:

- The importance of proper data preparation, including adapter trimming and quality control.
- The use of DESeq2 for differential expression analysis and its ability to account for various sources of variability in RNA-seq data.
- The value of validating results by checking the expression levels of known genes and comparing with existing literature or databases.
- The power of visualization tools like Circos in communicating complex genomic data.

Overall, this tutorial provides a comprehensive introduction to RNA-seq analysis and visualization, covering key concepts, tools, and techniques used in the field.
Demo: Nextflow Podcast
You can also do this with podcasts. Here I’ll use a recent episode of the Nextflow Podcast (episode 47, published November 5, 2024), where Ben Sherman and Phil Ewels talk about the new Nextflow language server & VS Code extension. You can get this on Apple Podcasts, Spotify, YouTube, etc., but here I’ll use the Overcast version, https://overcast.fm/+qphE6P22Y, because I can get the MP3 file offline using yt-dlp. (Overcast is the best podcast player I’ve ever used; I’ve tried them all, and I’ve stuck with Overcast for years.)
The steps are the same as above. The same yt-dlp command will work here, saving the MP3 of the podcast for offline use. We’ll then extract a transcript and summarize it with llama3.3 (or llama3.2 if you don’t have the resources to run llama3.3).
# Get the mp3
yt-dlp -x --audio-format mp3 -o tmp.mp3 https://overcast.fm/+qphE6P22Y
# Get a transcript
./whisper-tiny.en.llamafile -np -f tmp.mp3 \
| perl -pe 's/\[.+\]\ +//g' \
> transcript.txt
# Create a prompt
prompt=$(cat <(echo "summarize this transcript: ") transcript.txt)
# Summarize with llama3.3 via Ollama
ollama run llama3.3 "$prompt" > summary.md
The tiny Whisper model makes a few transcription errors in places, but llama3.3 does fine in spite of them. Here’s the summary:
The conversation appears to be a discussion between two individuals, likely developers or researchers, about the development of a language server for Nextflow, a workflow management system. The language server provides features such as code completion, debugging, and error checking, which can improve the user experience and productivity.
Here are some key points from the conversation:

- **Language Server**: The language server is a new feature that provides code intelligence, such as code completion, debugging, and error checking, for Nextflow workflows.
- **Error Checking**: The language server performs error checking on Nextflow scripts, including syntax errors, undefined variables, and type errors.
- **Code Completion**: The language server provides code completion suggestions for Nextflow scripts, including function names, variable names, and config options.
- **Config Syntax**: The config syntax in Nextflow has been made more strict, with a focus on declare syntax rather than arbitrary Groovy code.
- **Documentation**: The language server provides documentation for Nextflow config options, which can be accessed by hovering over the option or clicking on a link to the Nextflow docs.
- **Future Developments**: The developers are planning to add more features to the language server, including type annotations and level 3 checking, which will provide even more advanced error checking and code completion capabilities.
- **Availability**: The language server is available as an update to the VS Code extension for Nextflow and can be installed from the Microsoft marketplace or the open VSX marketplace.
Overall, the conversation suggests that the language server is a significant improvement to the Nextflow workflow management system, providing users with more advanced features and better error checking capabilities. The developers are actively working on improving the language server and adding new features, which will further enhance the user experience.
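To make this repeatable, you could wrap the whole pipeline in a small shell function. This is just a sketch under the same assumptions as above (whisper-tiny.en.llamafile in the working directory, ffmpeg installed, Ollama running with llama3.3 pulled); the summarize_url name is mine:

summarize_url() {
  local url="$1"
  # 1. Extract audio with yt-dlp
  yt-dlp -x --audio-format mp3 -o tmp.mp3 "$url"
  # 2. Transcribe with whisperfile, stripping timestamps
  ./whisper-tiny.en.llamafile -np -f tmp.mp3 \
    | perl -pe 's/\[.+\]\ +//g' \
    > transcript.txt
  # 3. Summarize with llama3.3 via Ollama
  ollama run llama3.3 "summarize this transcript: $(cat transcript.txt)" > summary.md
}

# Usage, e.g.:
summarize_url "https://overcast.fm/+qphE6P22Y"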
¹ Standard “I am not a lawyer” disclaimer here. As I understand it, downloading videos or extracting audio for offline use isn’t a copyright violation, but it might run afoul of the terms of service of whatever website you’re downloading from, so check those first.
² Llama3.3 is a 70B model with performance on par with the much larger llama3.1-405B and with frontier models including GPT-4 and Claude 3.5 Sonnet. It can run on my MacBook Pro with 96GB RAM. If you don’t have these resources available, try the much smaller llama3.2 model. It’ll be faster, but results might not be as good.