# Batch Transcription

This guide explains how to use the parallel batch transcription feature to process multiple audio/video files from Google Drive.
## Overview

The `batch-transcribe` command allows you to:
- Process all audio/video files from a catalog CSV
- Run transcriptions in parallel with multiple workers
- Automatically checkpoint progress for resuming interrupted jobs
- Extract and store media duration in the output
- Save transcriptions with full metadata to a results directory
## Prerequisites

1. **Google Drive Credentials**: You need OAuth2 credentials to access Google Drive.
   - Get credentials from the Google Cloud Console
   - Save them as `credentials.json` in the project root

2. **FFmpeg**: Required for extracting media duration.

   ```bash
   # Ubuntu/Debian
   sudo apt-get install ffmpeg

   # macOS
   brew install ffmpeg

   # Verify installation
   ffmpeg -version
   ```

3. **Catalog CSV**: A CSV file with Google Drive file metadata containing:
   - `file_id`: Google Drive file ID
   - `name`: File name
   - `mime_type`: MIME type (e.g., `audio/mpeg`, `video/mp4`)
   - `size_bytes`: File size in bytes
   - `parents`: Parent folder IDs (JSON array)
   - `web_content_link`: Download link
   - `duration_milliseconds` (optional): Pre-computed duration
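As a minimal sketch of how such a catalog can be filtered down to transcribable rows (the column names follow the list above; `load_media_rows` is an illustrative helper, not part of the CLI):

```python
import csv
import io

def is_audio_video(mime_type: str) -> bool:
    """Keep only rows whose MIME type marks an audio or video file."""
    return mime_type.startswith(("audio/", "video/"))

def load_media_rows(csv_text: str) -> list[dict]:
    """Read catalog CSV text and return only the audio/video rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if is_audio_video(row["mime_type"])]

catalog = (
    "file_id,name,mime_type,size_bytes\n"
    "abc123,audio.m4a,audio/mpeg,1024\n"
    "def456,notes.pdf,application/pdf,2048\n"
    "ghi789,clip.mp4,video/mp4,4096\n"
)
rows = load_media_rows(catalog)
print([r["file_id"] for r in rows])  # -> ['abc123', 'ghi789']
```

Non-media rows (documents, spreadsheets, etc.) are simply ignored, which is why a mixed Drive export can be passed in directly.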
## Basic Usage

### Simple Batch Transcription

Process all audio/video files with a single worker:

```bash
arandu batch-transcribe input/catalog.csv --credentials credentials.json
```

This will:

1. Read `input/catalog.csv`
2. Filter only audio and video files
3. Download and transcribe each file
4. Save results to the `results/` directory
5. Create a checkpoint at `results/checkpoint.json`
### Parallel Processing

Use multiple workers for faster processing:

```bash
arandu batch-transcribe input/catalog.csv \
  --credentials credentials.json \
  --workers 4
```

**Note:** Each worker loads its own model instance into memory. Consider available RAM/VRAM when choosing the worker count.
### Custom Output Directory

Specify a custom output directory:

```bash
arandu batch-transcribe input/catalog.csv \
  --output-dir transcriptions/ \
  --credentials credentials.json \
  --workers 2
```

### Using Different Models

Use a different Whisper model:

```bash
# Faster turbo model
arandu batch-transcribe input/catalog.csv \
  --model-id openai/whisper-large-v3-turbo \
  --credentials credentials.json \
  --workers 4

# Distilled model (smaller, faster)
arandu batch-transcribe input/catalog.csv \
  --model-id distil-whisper/distil-large-v3 \
  --credentials credentials.json \
  --workers 4
```

## Pipeline ID Tracking
Use a custom pipeline ID to group related processing steps:

```bash
arandu batch-transcribe input/catalog.csv \
  --id etno-project-001 \
  --credentials credentials.json \
  --workers 4
```

The pipeline ID:

- Creates a versioned results directory (e.g., `results/etno-project-001/transcription_YYYYMMDD_HHMMSS/`)
- Enables tracking of the entire pipeline run
- Can be used to link transcription with downstream QA/CEP processing
## Memory Optimization

Use quantization to reduce VRAM usage:

```bash
arandu batch-transcribe input/catalog.csv \
  --credentials credentials.json \
  --quantize \
  --workers 2
```

## Force CPU Processing

Process on CPU instead of GPU:

```bash
arandu batch-transcribe input/catalog.csv \
  --credentials credentials.json \
  --cpu \
  --workers 2
```

**Note:** CPU processing is slower but doesn't require GPU/VRAM.
## Language Configuration

Specify the transcription language for better accuracy and downstream processing:

```bash
# Portuguese transcriptions (ETno project)
arandu batch-transcribe input/catalog.csv \
  --credentials credentials.json \
  --language pt \
  --workers 4

# Spanish transcriptions
arandu batch-transcribe input/catalog.csv \
  --credentials credentials.json \
  --language es \
  --workers 4
```

**Important for downstream processing:** The `--language` option ensures transcription runs with the specified language and that the output includes `detected_language` for language-aware consumers such as the KG construction pipeline.

If the language is not specified, Whisper will auto-detect it, but setting it explicitly is recommended for:

- Better transcription accuracy
- Consistent language metadata for downstream processing
- More predictable routing in multilingual KG construction
## Checkpoint and Resume

Batch transcription automatically creates a checkpoint file that tracks:
- Completed files
- Failed files with error messages
- Total progress
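A rough sketch of how such a checkpoint might be consumed on resume (the field names `completed_files` and `failed_files` are assumed from the description above; the real file layout may differ):

```python
import json

# Assumed checkpoint shape, for illustration only.
checkpoint_json = """
{
  "completed_files": ["id_001", "id_002"],
  "failed_files": {"id_003": "No audio stream found"},
  "total_files": 5
}
"""

def remaining(checkpoint: dict, all_ids: list[str]) -> list[str]:
    """Files still to process: everything not already marked completed."""
    done = set(checkpoint["completed_files"])
    return [fid for fid in all_ids if fid not in done]

cp = json.loads(checkpoint_json)
todo = remaining(cp, ["id_001", "id_002", "id_003", "id_004", "id_005"])
print(todo)  # -> ['id_003', 'id_004', 'id_005']
```

Note that previously failed files are retried on resume; only completed files are skipped.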
### Resuming Interrupted Jobs

If the process is interrupted (Ctrl+C, crash, etc.), simply run the same command again:

```bash
arandu batch-transcribe input/catalog.csv \
  --credentials credentials.json \
  --workers 4
```

The system will:

1. Load the checkpoint file
2. Skip already completed files
3. Resume processing from where it left off
### Custom Checkpoint Location

Specify a custom checkpoint file location:

```bash
arandu batch-transcribe input/catalog.csv \
  --credentials credentials.json \
  --checkpoint my_checkpoint.json \
  --workers 4
```

### Starting Fresh

To start over from scratch, delete the checkpoint file:

```bash
rm results/checkpoint.json
arandu batch-transcribe input/catalog.csv --credentials credentials.json
```

## Output Format
Each transcribed file produces a JSON file with this structure:

```json
{
  "file_id": "1JtKnN2pQGmHEkSPniwES6RmWWp5BtKrU",
  "name": "audio.m4a",
  "mimeType": "audio/mpeg",
  "size_bytes": 143657567,
  "duration_milliseconds": 120000,
  "parents": ["1OusxPzsL5cVb06sMnyTqtCwlalvv2HDd"],
  "web_content_link": "https://drive.google.com/uc?id=...",
  "transcription_text": "Full transcription text here...",
  "detected_language": "pt",
  "language_probability": 0.98,
  "model_id": "openai/whisper-large-v3",
  "compute_device": "cuda:0",
  "processing_duration_sec": 45.2,
  "transcription_status": "completed",
  "created_at_enrichment": "2025-12-10T22:30:00",
  "segments": [
    {
      "text": "Segment text",
      "start": 0.0,
      "end": 3.5
    }
  ],
  "transcription_quality": {
    "script_match_score": 1.0,
    "repetition_score": 0.95,
    "segment_quality_score": 1.0,
    "content_density_score": 0.85,
    "overall_score": 0.94,
    "issues_detected": [],
    "quality_rationale": "High quality transcription"
  },
  "is_valid": true
}
```

**Quality fields:**

- `transcription_quality`: Quality scores from heuristic-based validation
- `is_valid`: Boolean indicating whether the transcription passes the quality threshold

Files are named `{file_id}_transcription.json`.
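A downstream consumer might gather the valid results like this (a sketch only: `collect_valid` is a hypothetical helper, and the 0.8 score floor is an arbitrary example, not a documented threshold):

```python
import json
from pathlib import Path

def collect_valid(results_dir: Path, min_score: float = 0.8) -> list[dict]:
    """Gather transcriptions that passed validation and meet a score floor."""
    kept = []
    for path in results_dir.glob("*_transcription.json"):
        data = json.loads(path.read_text(encoding="utf-8"))
        quality = data.get("transcription_quality", {})
        if data.get("is_valid") and quality.get("overall_score", 0) >= min_score:
            kept.append(data)
    return kept
```

The `{file_id}_transcription.json` naming convention makes the glob pattern above sufficient to find every result in the directory.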
## Performance Considerations

### Choosing Worker Count

**GPU Processing:**
- Each worker loads a full model copy into VRAM
- Typical VRAM usage per model:
- whisper-large-v3: ~3-6 GB
- whisper-large-v3-turbo: ~2-4 GB
- distil-whisper: ~1-2 GB
- Example: 24 GB GPU → 4 workers with whisper-large-v3
**CPU Processing:**
- Each worker loads model into RAM
- Typical RAM usage per model: 4-8 GB
- CPU processing is 5-10x slower than GPU
- Example: 32 GB RAM → 4 workers
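The worker-count arithmetic above amounts to dividing available memory by one model's footprint. A sketch, using the rough per-model figures from the lists (the helper name and the cap of 8 are illustrative):

```python
def suggest_workers(available_gb: float, per_model_gb: float, cap: int = 8) -> int:
    """Fit as many full model copies as memory allows, at least one worker."""
    return max(1, min(cap, int(available_gb // per_model_gb)))

# 24 GB GPU, whisper-large-v3 at ~5 GB per worker (mid-range of 3-6 GB)
print(suggest_workers(24, 5))   # -> 4, matching the GPU example above

# 32 GB RAM, CPU model at ~8 GB per worker (upper end of 4-8 GB)
print(suggest_workers(32, 8))   # -> 4, matching the CPU example above
```

In practice, leave some headroom for audio buffers and the OS rather than filling memory exactly to the limit.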
### Optimizing for Speed

- Use a GPU if available (much faster)
- Use turbo or distilled models for faster processing
- Maximize workers based on available VRAM/RAM
- Use quantization to fit more workers
### Optimizing for Accuracy

- Use large models (`whisper-large-v3`)
- Use fewer workers to leave memory headroom for the larger model
- Avoid quantization for best accuracy
## Troubleshooting

### "Incomplete download" errors

If you see errors like:

```
Incomplete download for 'video.mp4': expected 546.43 MB but got 123.45 MB (22.6% complete)
```

the download was interrupted or truncated. The system will:
- Automatically retry up to 5 times with exponential backoff (4s → 60s delays)
- Delete the incomplete file before each retry
- Log retry attempts with wait times
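That retry behavior can be sketched as follows. The 4s-doubling-to-60s schedule matches the description above; `download_with_retry` and its `download` callable are hypothetical, not the actual API:

```python
import time

def backoff_delays(retries: int = 5, base: float = 4.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff schedule: 4s, 8s, 16s, 32s, 60s (capped)."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]

def download_with_retry(download, expected_bytes: int, retries: int = 5) -> bytes:
    """Retry a download until its size matches the catalog's size_bytes.
    `download` is a hypothetical callable returning the raw file bytes."""
    last_error = None
    for delay in backoff_delays(retries):
        data = download()
        if len(data) == expected_bytes:
            return data  # size check passed; accept the download
        last_error = f"expected {expected_bytes} bytes, got {len(data)}"
        time.sleep(delay)  # discard incomplete data and wait before retrying
    raise RuntimeError(f"Incomplete download after {retries} attempts: {last_error}")
```

Comparing the byte count against the catalog's `size_bytes` is what lets truncated downloads be detected before transcription starts.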
Common causes:
- Network instability or timeouts
- Google Drive rate limiting (for large files or many requests)
- Temporary Google API issues
If retries fail:
- Check your network connection
- Wait a few minutes and run the command again (checkpoint will resume)
- For very large files, try processing during off-peak hours
### "Empty download" errors

If you see:

```
Download resulted in empty file for 'audio.mp3'
```

this typically indicates:
- Google Drive API permission issues
- The file was deleted or moved in Drive
- OAuth token expired
Solutions:

- Delete `token.json` and re-authenticate
- Verify the file exists and is accessible in Google Drive
- Check that your OAuth credentials have Drive read access
### "No audio stream found" errors

If you see:

```
No audio stream found in 'video.mp4'
```

the file has no audio track to transcribe. Common causes:
- Video file recorded without audio (e.g., screen recording with mic disabled)
- Corrupted audio track
- Unsupported audio codec
To inspect the file:

```bash
ffprobe -v error -show_streams your_file.mp4
```

The file will be marked as failed and skipped.
### "Failed to extract duration"

If FFmpeg is not installed or the file format is unsupported:

- Install FFmpeg
- Duration will be `null` in the output (processing continues)
- Pre-compute durations in the catalog CSV if needed
### "Out of memory" errors

- Reduce the number of workers
- Use quantization: `--quantize`
- Use a smaller model: `--model-id distil-whisper/distil-large-v3`
- Force CPU: `--cpu`
### "Soundfile is either not in the correct format or is malformed"

This error from the Whisper pipeline usually indicates:
- **Incomplete download**: the file didn't download fully (see "Incomplete download" above)
- **No audio stream**: the file has no audio track (see "No audio stream found" above)
- **Corrupted file**: the source file in Google Drive may be corrupted
The system now validates downloads and audio streams before transcription, so this error should be rare. If it occurs:

1. Check the checkpoint file for the specific file ID
2. Manually download the file from Google Drive and test it with `ffprobe`
3. If the source file is valid, report the issue
### "Credentials not found"

- Ensure `credentials.json` exists
- Use `--credentials` to specify the path
- Follow the Google Cloud Console setup for OAuth2
### Files being skipped

- Check the checkpoint file: `results/checkpoint.json`
- Files already in `completed_files` are skipped
- Delete the checkpoint to reprocess all files
## Advanced Usage

### Processing Specific File Types

Edit `src/arandu/core/batch.py` to modify the `AUDIO_VIDEO_MIME_TYPES` set:

```python
AUDIO_VIDEO_MIME_TYPES = {
    "audio/mpeg",
    "video/mp4",
    # Add or remove types as needed
}
```
### Custom Model Parameters

Modify the engine initialization in `transcribe_single_file()` to add custom parameters:

```python
engine = WhisperEngine(
    model_id=config.model_id,
    force_cpu=config.force_cpu,
    quantize=config.quantize,
    chunk_length_s=30,  # Custom chunk length
    stride_length_s=5,  # Custom stride length
)
```

## Examples
### Example 1: Small Dataset (< 10 files)

```bash
arandu batch-transcribe input/catalog.csv \
  --credentials credentials.json \
  --workers 1
```

### Example 2: Medium Dataset (10-100 files)

```bash
arandu batch-transcribe input/catalog.csv \
  --credentials credentials.json \
  --workers 4 \
  --quantize
```

### Example 3: Large Dataset (100+ files)

```bash
# Initial run
arandu batch-transcribe input/catalog.csv \
  --credentials credentials.json \
  --model-id openai/whisper-large-v3 \
  --workers 8 \
  --quantize

# If interrupted, resume by re-running the same command:
arandu batch-transcribe input/catalog.csv \
  --credentials credentials.json \
  --model-id openai/whisper-large-v3 \
  --workers 8 \
  --quantize
```

### Example 4: CPU Only (No GPU)

```bash
arandu batch-transcribe input/catalog.csv \
  --credentials credentials.json \
  --cpu \
  --workers 2 \
  --model-id distil-whisper/distil-large-v3
```

### Example 5: Portuguese Corpus (ETno Project)

For the ETno project with Portuguese transcriptions:

```bash
# Transcribe with explicit Portuguese language setting
arandu batch-transcribe input/etno_catalog.csv \
  --credentials credentials.json \
  --language pt \
  --model-id openai/whisper-large-v3 \
  --workers 4 \
  --quantize
```

After transcription, the output will contain `detected_language: "pt"` and `metadata.lang: "pt"`, which ensures proper routing to Portuguese prompts during KG construction.

## Monitoring Progress
The system logs progress in real time:

```
INFO - Loaded 817 audio/video files from catalog
INFO - Total files: 817
INFO - Already completed: 0
INFO - Remaining to process: 817
INFO - Using 4 parallel workers
INFO - Processing file: audio1.m4a (1JtK...)
INFO - ✓ Completed: audio1.m4a
INFO - Progress: 1/817 files
INFO - ✓ Completed: audio2.m4a
INFO - Progress: 2/817 files
...
```

At completion:

```
============================================================
Batch transcription completed!
Total files: 817
Successfully transcribed: 810
Failed: 7
Success rate: 99.1%
============================================================
```
## Support

For issues or questions:

1. Check logs for error messages
2. Review the checkpoint file: `results/checkpoint.json`
3. Verify Google Drive credentials
4. Ensure FFmpeg is installed
5. Check available VRAM/RAM vs. worker count