NLP with Transformers: Advanced Techniques and Multimodal Applications, Chapter 144

Step 3: Transcribe Audio from Video

Extract audio from the video and transcribe it using Whisper, an advanced speech recognition model developed by OpenAI. Whisper excels at converting spoken words into text with high accuracy across multiple languages and accents.

The model tolerates a wide range of audio quality and background noise, which makes it well suited to video content. In this step, we separate the audio track from the video file and feed it to Whisper to produce a transcription with punctuation and capitalization. Whisper does not identify who is speaking, so speaker attribution requires a separate diarization step.

```python
import subprocess

import cv2
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load Whisper model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Extract audio from video
def extract_audio(video_path, output_audio_path="audio.wav"):
    # Read basic video metadata (informational; not needed for the extraction)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    duration = total_frames / fps if fps else 0.0  # avoid division by zero
    cap.release()

    # Convert to audio using ffmpeg (requires ffmpeg installed)
    # -y overwrites an existing output file; check=True raises if ffmpeg fails
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-q:a", "0", "-map", "a", output_audio_path],
        check=True,
    )
    return duration

# Transcribe audio (video_path is assumed to be defined in an earlier step)
audio_path = "audio.wav"
extract_audio(video_path, audio_path)
audio, rate = librosa.load(audio_path, sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
generated_ids = model.generate(inputs["input_features"])
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(f"Transcription: {transcription}")
```

The code handles audio extraction and transcription in three parts:

1. Library Imports and Model Setup

  • Uses librosa for audio processing and the Whisper model from the transformers library
  • Initializes the Whisper model and processor, specifically using the "whisper-small" variant for efficient processing
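
Whisper is published in several sizes on the Hugging Face Hub under the openai organization, and "whisper-small" is one of them. The sketch below shows one way to switch between sizes by name and to place the model on a GPU when one is available. The helper names `checkpoint_for` and `load_whisper` are hypothetical and are not part of the transformers library.

```python
# Sizes published as openai/whisper-<size> on the Hugging Face Hub.
WHISPER_SIZES = ("tiny", "base", "small", "medium", "large")


def checkpoint_for(size="small"):
    """Return the Hub checkpoint id for a Whisper size (hypothetical helper)."""
    if size not in WHISPER_SIZES:
        raise ValueError(f"unknown Whisper size: {size!r}")
    return f"openai/whisper-{size}"


def load_whisper(size="small"):
    """Load the processor and model, using a GPU when one is available."""
    # Imported lazily so checkpoint_for works without these packages installed.
    import torch
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    name = checkpoint_for(size)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = WhisperProcessor.from_pretrained(name)
    model = WhisperForConditionalGeneration.from_pretrained(name).to(device)
    model.eval()  # inference only
    return processor, model, device
```

Smaller checkpoints load and run faster at some cost in accuracy. If the model is moved to a GPU, the input features must be moved to the same device before calling `generate`.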

2. Audio Extraction Function

  • The extract_audio function converts the video to an audio file
  • It reads video metadata (fps and frame count) with OpenCV; these values are informational and are not needed for the extraction itself
  • It calls ffmpeg to extract the audio track and save it as a WAV file
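
Because Whisper expects 16 kHz audio, ffmpeg can also be asked to write a mono 16 kHz WAV directly, which skips the resampling work later. The following is a minimal sketch of that variant, not the chapter's own method; `build_ffmpeg_command` and `extract_audio_16k` are hypothetical names, and the flags used (`-y`, `-vn`, `-ac`, `-ar`) are standard ffmpeg options.

```python
import subprocess


def build_ffmpeg_command(video_path, output_audio_path="audio.wav", sample_rate=16000):
    """Build an ffmpeg command that writes a mono WAV at the given sample rate."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample to the target rate
        output_audio_path,
    ]


def extract_audio_16k(video_path, output_audio_path="audio.wav"):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_command(video_path, output_audio_path), check=True)
    return output_audio_path
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to inspect or log the exact command before running it.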

3. Audio Transcription Process

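  • The transcription workflow has three stages
  • librosa loads the extracted audio file and resamples it to 16 kHz, the sampling rate Whisper expects
  • The processor converts the waveform into input features, padding or truncating it to 30 seconds, and the model generates token IDs from them; clips longer than 30 seconds are therefore only partially transcribed by this code
  • batch_decode turns the model's token IDs into readable text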
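
The Whisper processor pads or truncates each input to 30 seconds, so a single `generate` call covers at most the first 30 seconds of a clip. One simple workaround, sketched here under the assumption that fixed, non-overlapping windows are acceptable, is to split the waveform and transcribe each window in turn. `chunk_bounds` and `transcribe_long` are hypothetical helpers, and the model is assumed to be on the CPU, matching the code above.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30):
    """Return (start, end) sample indices covering the audio in fixed windows."""
    if sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    step = sample_rate * chunk_seconds
    return [
        (start, min(start + step, num_samples))
        for start in range(0, num_samples, step)
    ]


def transcribe_long(audio, processor, model, sample_rate=16000):
    """Transcribe each 30-second window separately and join the text."""
    pieces = []
    for start, end in chunk_bounds(len(audio), sample_rate):
        inputs = processor(
            audio[start:end], sampling_rate=sample_rate, return_tensors="pt"
        )
        ids = model.generate(inputs["input_features"])
        text = processor.batch_decode(ids, skip_special_tokens=True)[0]
        pieces.append(text.strip())
    return " ".join(pieces)
```

With the variables from the code above, this would be called as `transcribe_long(audio, processor, model)`. Fixed windows can cut a word in half at a boundary, so overlapping windows or a library-provided long-form mode may give cleaner results.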

Whisper copes well with varied audio quality and produces punctuated text. It does not label speakers, so speaker attribution needs a separate diarization tool.