Transcription Quality Validation
Detect common Whisper failure modes in transcription output using a composable judge pipeline (heuristic checks + optional LLM criteria).
Overview
The transcription judge catches failures that Whisper produces silently:
- Pre-filter on length: short records that should be discarded before any other check (content_length_floor)
- Wrong language/script: Japanese characters when expecting Portuguese (strongest signal)
- Repeated words/phrases: e.g., “Obrigada” x30, looping phrase artifacts
- Suspicious segment patterns: uniform 1-second intervals, empty segments
- Abnormal content density: too few or too many words per minute
- Latin-script language drift (LLM stage): sustained English content in a Portuguese transcription
- Formulaic Whisper hallucinations (LLM stage): YouTube-style openings/closings, apology loops, channel signatures
Judging is a separate step in the pipeline — it runs after transcription via the arandu judge-transcription CLI, not inline at ingestion time. Verdicts are written back to each *_transcription.json record under the validation field.
How It Works
Multi-Stage Pipeline with Per-Criterion Thresholds
TranscriptionJudge runs a two-stage filter pipeline. Each stage applies individual thresholds per criterion; there is no weighted-average overall score. A record fails a stage as soon as any criterion’s score is below its threshold, and a record fails the pipeline as soon as it fails any stage.
| Stage | Criteria | Type | Skipped when |
|---|---|---|---|
| heuristic_filter | content_length_floor → script_match → repetition → content_density → segment_quality | Pure-Python heuristics, CPU-only | Never (always runs) |
| llm_filter | language_drift, hallucination_loop | LLM-based | No validator client configured, OR previous stage already rejected |
content_length_floor runs first — it short-circuits the rest of the heuristic stage when a record is too short to evaluate meaningfully.
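The stage and threshold semantics described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Criterion` dataclass, `run_stage`, `run_pipeline`, and the toy scoring lambdas are all hypothetical names, and the choice to keep scoring the remaining criteria after a non-short-circuit failure is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """Hypothetical criterion: a name, a pass threshold, and a scoring function."""
    name: str
    threshold: float
    score_fn: Callable[[str], float]
    short_circuit: bool = False  # stop the stage early when this criterion fails


def run_stage(criteria: list[Criterion], text: str) -> tuple[bool, dict[str, float]]:
    """Score criteria in order; the stage fails if any score is below its threshold."""
    scores: dict[str, float] = {}
    passed = True
    for criterion in criteria:
        score = criterion.score_fn(text)
        scores[criterion.name] = score
        if score < criterion.threshold:
            passed = False
            if criterion.short_circuit:
                break  # e.g. content_length_floor: skip the remaining checks
    return passed, scores


def run_pipeline(stages: dict[str, list[Criterion]], text: str) -> dict:
    """Run stages in insertion order and skip every stage after the first rejection."""
    stage_results: dict[str, dict] = {}
    for stage_name, criteria in stages.items():
        passed, scores = run_stage(criteria, text)
        stage_results[stage_name] = {"passed": passed, "scores": scores}
        if not passed:
            return {"passed": False, "stage_results": stage_results}
    return {"passed": True, "stage_results": stage_results}


# Toy scoring functions standing in for the real heuristics and LLM criteria.
stages = {
    "heuristic_filter": [
        Criterion("content_length_floor", 0.5,
                  lambda t: 1.0 if len(t.split()) >= 3 else 0.0, short_circuit=True),
        Criterion("repetition", 0.5, lambda t: 1.0),
    ],
    "llm_filter": [Criterion("language_drift", 0.8, lambda t: 0.9)],
}

too_short = run_pipeline(stages, "oi")
long_enough = run_pipeline(stages, "um dois tres quatro")
```

In this sketch the too-short input is rejected by the first criterion, so neither the remaining heuristics nor the LLM stage are evaluated, while the longer input reaches both stages.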
Schema
Judged records carry the full JudgePipelineResult payload under validation. The is_valid field is derived from validation.passed:

    {
      "validation": {
        "passed": true,
        "stage_results": {
          "heuristic_filter": {
            "passed": true,
            "criterion_scores": {
              "content_length_floor": { "score": 1.0, "passed": true, "rationale": "..." },
              "script_match": { "score": 1.0, "passed": true, "rationale": "..." },
              "repetition": { "score": 0.85, "passed": true, "rationale": "..." },
              "content_density": { "score": 0.95, "passed": true, "rationale": "..." },
              "segment_quality": { "score": 1.0, "passed": true, "rationale": "..." }
            }
          },
          "llm_filter": {
            "passed": true,
            "criterion_scores": {
              "language_drift": { "score": 0.92, "passed": true, "rationale": "..." },
              "hallucination_loop": { "score": 0.88, "passed": true, "rationale": "..." }
            }
          }
        }
      },
      "is_valid": true
    }

- is_valid: true means every criterion in every stage passed its threshold
- is_valid: false means at least one criterion was below its threshold
- is_valid: null means the record has not been judged yet
The llm_filter stage block is absent when the judge ran without a validator client (heuristic-only mode).
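As a rough illustration of consuming this schema, the sketch below lists every criterion that failed in a judged record. The helper name `failing_criteria` and the sample record are hypothetical; only the field names come from the schema above.

```python
import json


def failing_criteria(record: dict) -> list[tuple[str, str, float]]:
    """Return (stage, criterion, score) for every criterion that did not pass."""
    validation = record.get("validation")
    if validation is None:
        return []  # record has not been judged yet
    failures = []
    for stage_name, stage in validation.get("stage_results", {}).items():
        for name, result in stage.get("criterion_scores", {}).items():
            if not result.get("passed", False):
                failures.append((stage_name, name, result.get("score")))
    return failures


# Hypothetical heuristic-only record: no llm_filter block is present.
sample = json.loads("""
{
  "validation": {
    "passed": false,
    "stage_results": {
      "heuristic_filter": {
        "passed": false,
        "criterion_scores": {
          "script_match": {"score": 1.0, "passed": true, "rationale": "..."},
          "repetition": {"score": 0.0, "passed": false, "rationale": "..."}
        }
      }
    }
  },
  "is_valid": false
}
""")

failures = failing_criteria(sample)
unjudged = failing_criteria({"validation": None, "is_valid": None})
```

Because the helper iterates over whatever stages are present, it handles both heuristic-only records and records with an llm_filter block.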
Running the Judge
The judge is a standalone CLI step:

    # Heuristic-only mode (no LLM model needed)
    arandu judge-transcription results/

    # With LLM stage enabled
    arandu judge-transcription results/ --validator-model qwen3:14b

    # Resume after an interrupted run (default)
    arandu judge-transcription results/ --validator-model qwen3:14b

    # Force a fresh pass over every record
    arandu judge-transcription results/ --validator-model qwen3:14b --rejudge

See the CLI reference for the full flag list. Recommended workflow:
- Transcribe (arandu batch-transcribe …): records land with validation: null.
- Judge (arandu judge-transcription …): records get validation: <JudgePipelineResult> and is_valid is derived from it.
- Filter / inspect downstream (e.g. kg/cep/qa stages or report dashboards) using is_valid.
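For the final filtering step, a downstream consumer might partition records by their is_valid value. The sketch below is one way to do that with the standard library; the function name `partition_by_validity` is hypothetical, and it assumes each `*_transcription.json` file holds a single JSON object with a top-level `is_valid` field.

```python
import json
from pathlib import Path


def partition_by_validity(results_dir: str) -> dict[str, list[Path]]:
    """Group *_transcription.json files into valid / invalid / unjudged buckets."""
    buckets: dict[str, list[Path]] = {"valid": [], "invalid": [], "unjudged": []}
    for path in sorted(Path(results_dir).rglob("*_transcription.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        is_valid = record.get("is_valid")
        if is_valid is None:
            buckets["unjudged"].append(path)  # transcribed but not judged yet
        elif is_valid:
            buckets["valid"].append(path)
        else:
            buckets["invalid"].append(path)
    return buckets
```

Keeping the unjudged bucket separate avoids silently treating records that were never judged as either accepted or rejected.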
Validation Dimensions in Detail
1. Script/Charset Match
Checks whether the text uses the expected character set for the configured language.
For Latin-script languages (pt, en, es, fr, de, it), the validator uses unicodedata.name() to classify each alphabetic character as Latin, CJK, or other. This approach correctly handles:
- Latin Extended-B characters (e.g., ǎ, ǒ)
- Combining diacriticals used in Portuguese (e.g., ã, ç, ê)
- Mixed-script text
Scoring:
- 100% Latin text → score 1.0
- >50% CJK characters → score 0.0 (definitive wrong-language signal)
- Non-Latin ratio exceeds max_non_latin_ratio (default: 10%) → score 0.2
- No alphabetic content → score 0.5 (neutral)
Issue tags: wrong_script:cjk_detected, high_non_latin_ratio, no_alphabetic_content
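The scoring rules above can be approximated with a short sketch built on `unicodedata.name()`. This is an illustration under assumptions rather than the project's code: the function name, the set of name prefixes treated as CJK, and the score of 1.0 for text with a small non-Latin share below `max_non_latin_ratio` are all guesses.

```python
import unicodedata

# Assumption: these Unicode name prefixes are what counts as "CJK" here.
CJK_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def script_match_score(text: str, max_non_latin_ratio: float = 0.10) -> tuple[float, list[str]]:
    """Classify each alphabetic character by its Unicode name and score the mix."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.5, ["no_alphabetic_content"]

    cjk = 0
    non_latin = 0
    for ch in letters:
        name = unicodedata.name(ch, "")  # empty string for unnamed code points
        if name.startswith("LATIN"):
            continue
        non_latin += 1
        if name.startswith(CJK_PREFIXES):
            cjk += 1

    if cjk / len(letters) > 0.5:
        return 0.0, ["wrong_script:cjk_detected"]
    if non_latin / len(letters) > max_non_latin_ratio:
        return 0.2, ["high_non_latin_ratio"]
    return 1.0, []  # assumption: small non-Latin shares are not penalized


portuguese = script_match_score("Olá, tudo bem? Ação e coração.")
japanese = script_match_score("こんにちは世界")
cyrillic = script_match_score("Привет мир, hello")
digits_only = script_match_score("12345 !!!")
```

Classifying by Unicode name rather than by code-point range is what lets accented letters such as ã and ç count as Latin without a hand-maintained table.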
2. Repetition Detection
Detects both single-word floods and repeated multi-word phrases.
Word repetition: If the most common word exceeds max_word_repetition_ratio (default: 15%) of total words, it is flagged.
Phrase repetition: Checks 3-gram, 4-gram, and 5-gram patterns. If any phrase appears more than max_phrase_repetition_count (default: 4) times, it is flagged with its text coverage ratio.
Scoring: Based on the worst repetition ratio found (not the count of issues). Score = 1.0 - worst_ratio, clamped to [0.0, 1.0]. This ensures “Obrigada” x30 (ratio 1.0) scores 0.0.
Special case: Transcriptions shorter than 5 words receive a neutral score of 0.7 with a very_short_transcription issue.
Issue tags: high_word_repetition:<word>:<count>, repeated_phrase:<text>:<count>, very_short_transcription
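A minimal sketch of this scoring, assuming whitespace tokenization, lowercasing, and a coverage ratio of `count * n / total_words` for repeated phrases (none of which is confirmed by the source). The function name is hypothetical.

```python
from collections import Counter


def repetition_score(
    text: str,
    max_word_repetition_ratio: float = 0.15,
    max_phrase_repetition_count: int = 4,
) -> tuple[float, list[str]]:
    """Score = 1.0 - worst repetition ratio, clamped to [0.0, 1.0]."""
    words = text.lower().split()
    if len(words) < 5:
        return 0.7, ["very_short_transcription"]

    issues: list[str] = []
    worst_ratio = 0.0

    # Single-word flood: share of the most common word.
    word, count = Counter(words).most_common(1)[0]
    ratio = count / len(words)
    if ratio > max_word_repetition_ratio:
        issues.append(f"high_word_repetition:{word}:{count}")
        worst_ratio = max(worst_ratio, ratio)

    # Repeated phrases: most common 3-, 4-, and 5-gram.
    for n in (3, 4, 5):
        ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
        if not ngrams:
            continue
        phrase, count = ngrams.most_common(1)[0]
        if count > max_phrase_repetition_count:
            coverage = min(1.0, count * n / len(words))  # assumed coverage formula
            issues.append(f"repeated_phrase:{' '.join(phrase)}:{count}")
            worst_ratio = max(worst_ratio, coverage)

    return max(0.0, min(1.0, 1.0 - worst_ratio)), issues


looped = repetition_score("Obrigada " * 30)
varied = repetition_score("hoje eu fui ao mercado comprar frutas verduras pão leite queijo café")
short = repetition_score("oi tudo bem")
```

Scoring on the worst ratio rather than the number of issues means a single dominant loop drives the score to zero regardless of how many n-gram sizes flag it.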
3. Segment Pattern Analysis
Analyzes Whisper’s timestamp segments for anomalies that indicate processing artifacts.
Uniform intervals: Flags when 5+ consecutive segments have ~1-second start-time intervals (within uniform_interval_tolerance of ±0.1s). This pattern suggests Whisper fell into a timestamp loop.
Empty segments: Flags when more than 20% (max_empty_segment_ratio) of segments have no text content.
Scoring:
- No segments provided → score 1.0 (not penalized)
- Empty segment ratio exceeded → score reduced by 0.3
- Suspicious uniform intervals → score reduced by 0.5
Issue tags: suspicious_uniform_intervals:<count>, high_empty_segments:<empty>/<total>
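The two checks can be sketched as below. Assumptions are labeled in the code: segments are taken to be dicts with `start` (seconds) and `text` keys as in Whisper's output, penalties are subtracted from a starting score of 1.0, and a run is counted in segments. The function name and the `min_uniform_run` parameter are hypothetical.

```python
def segment_quality_score(
    segments: list[dict],
    uniform_interval_tolerance: float = 0.1,
    max_empty_segment_ratio: float = 0.2,
    min_uniform_run: int = 5,  # hypothetical name for the "5+ consecutive" rule
) -> tuple[float, list[str]]:
    """Penalize empty segments and runs of ~1-second start-time intervals."""
    if not segments:
        return 1.0, []  # no segments provided: not penalized

    score = 1.0  # assumption: penalties are subtracted from 1.0
    issues: list[str] = []

    empty = sum(1 for s in segments if not (s.get("text") or "").strip())
    if empty / len(segments) > max_empty_segment_ratio:
        score -= 0.3
        issues.append(f"high_empty_segments:{empty}/{len(segments)}")

    # Longest run of consecutive segments whose start times are ~1 second apart.
    starts = [s["start"] for s in segments]
    longest = run = 1
    for prev, cur in zip(starts, starts[1:]):
        if abs((cur - prev) - 1.0) <= uniform_interval_tolerance:
            run += 1
            longest = max(longest, run)
        else:
            run = 1
    if longest >= min_uniform_run:
        score -= 0.5
        issues.append(f"suspicious_uniform_intervals:{longest}")

    return max(0.0, score), issues


looping = segment_quality_score([{"start": float(i), "text": "a"} for i in range(6)])
sparse = segment_quality_score([
    {"start": 0.0, "text": "bom dia"},
    {"start": 2.5, "text": ""},
    {"start": 7.0, "text": "   "},
    {"start": 9.2, "text": "tudo bem"},
])
```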
4. Content Density
Checks whether the words-per-minute ratio falls within a reasonable range for spoken language.
Range: min_words_per_minute (default: 30) to max_words_per_minute (default: 300).
Scoring:
- Within range → score 1.0
- Below minimum → linearly scaled from 0.0 (at 0 wpm) to 1.0 (at min threshold)
- Above maximum → linearly scaled from 1.0 (at max threshold) to 0.0 (at 2x max)
- Duration unknown (null) → score 0.5 (neutral, doesn’t skew results)
- Invalid duration (0 or negative) → score 0.3
Issue tags: low_content_density:<wpm>_wpm, high_content_density:<wpm>_wpm, duration_unknown:neutral_score, invalid_duration
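The piecewise-linear scoring can be written out as a short sketch. The function name is hypothetical, and rounding the wpm value in the issue tag is an assumption about the tag format.

```python
from typing import Optional


def content_density_score(
    word_count: int,
    duration_ms: Optional[float],
    min_words_per_minute: float = 30.0,
    max_words_per_minute: float = 300.0,
) -> tuple[float, list[str]]:
    """Score words-per-minute against the configured range."""
    if duration_ms is None:
        return 0.5, ["duration_unknown:neutral_score"]
    if duration_ms <= 0:
        return 0.3, ["invalid_duration"]

    wpm = word_count / (duration_ms / 60_000)
    if wpm < min_words_per_minute:
        # Linear from 0.0 at 0 wpm to 1.0 at the minimum threshold.
        return wpm / min_words_per_minute, [f"low_content_density:{wpm:.0f}_wpm"]
    if wpm > max_words_per_minute:
        # Linear from 1.0 at the maximum threshold to 0.0 at 2x the maximum.
        score = max(0.0, 1.0 - (wpm - max_words_per_minute) / max_words_per_minute)
        return score, [f"high_content_density:{wpm:.0f}_wpm"]
    return 1.0, []


normal = content_density_score(150, 60_000)
sparse_speech = content_density_score(15, 60_000)
dense_speech = content_density_score(450, 60_000)
```

With the default bounds, 15 wpm sits halfway to the minimum and 450 wpm sits halfway between the maximum and twice the maximum, so both score 0.5 in this sketch.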
LLM-Based Criteria (Optional)
Two additional criteria run as a second pipeline stage when TranscriptionJudge is given an LLMClient. They target failure modes that pure-heuristic checks cannot detect.
- language_drift: detects when sustained content is in a different Latin-script language than expected (e.g., English content in a Portuguese transcription). The heuristic script_match criterion cannot catch this because English and Portuguese share the Latin alphabet. Default threshold: 0.8.
- hallucination_loop: detects formulaic Whisper hallucinations that slip past the heuristic repetition criterion, such as YouTube-style openings/closings, short sentence loops that appear only a handful of times, apology/filler loops, and channel-name “signatures”. Default threshold: 0.7.
Prompts live under prompts/judge/criteria/language_drift/{pt,en}/prompt.md and prompts/judge/criteria/hallucination_loop/{pt,en}/prompt.md. Thresholds live in each criterion’s config.json. Both are domain-neutral by design — they target generic transcription failure modes, not interview-specific content.
Pipeline Behavior
The pipeline is two filter stages in order:
- heuristic_filter: content length floor + script match + repetition + content density + segment quality. The content_length_floor criterion runs first and can short-circuit the rest of the heuristic stage.
- llm_filter: language drift + hallucination loop (only when an LLMClient is provided).
If the heuristic stage rejects — including an early rejection from content_length_floor — the LLM stage is skipped, so there are no wasted LLM calls on transcriptions already flagged by cheap checks.
Programmatic usage of TranscriptionJudge is documented in the Programmatic Usage section below.
Smoke-testing the Pipeline
scripts/test_transcription_judge.py exercises both stages against real transcription files:

    # Heuristics only
    uv run python scripts/test_transcription_judge.py \
        --input-dir results/test-cep-01/transcription/outputs \
        --files 5

    # Heuristics + LLM (Ollama)
    uv run python scripts/test_transcription_judge.py \
        --validator-model qwen3:14b \
        --input-dir results/test-cep-01/transcription/outputs \
        --files 5

    # Single-file mode (useful for reproducing a known-bad case)
    uv run python scripts/test_transcription_judge.py \
        --validator-model qwen3:14b \
        --file results/test-cep-01/transcription/outputs/<id>_transcription.json

Limitations of the LLM Criteria
These criteria do not catch all Whisper hallucinations. Be explicit about what they do and don’t detect before relying on them:
- hallucination_loop only catches formulaic/pattern hallucinations. It is designed to flag content that looks copied from Whisper’s training distribution (stock phrases, channel openings, short loops, implausibly repeated interjections). It does not reliably catch:
  - Plausible silence-fillers: a single coherent sentence Whisper invents from background noise when no speech occurred.
  - Low-SNR invention: phonetically close but wrong words across a real utterance. The output reads naturally.
  - Name/number substitutions: “João” → “Joaquim”, “15 anos” → “50 anos”.

  These are fundamentally undetectable from text alone; they require audio-aware signals (Whisper avg_logprob/no_speech_prob per segment, voice-activity detection, or multi-model cross-check). Adding audio-aware heuristics is tracked separately from this criterion.
- language_drift tolerates isolated loanwords and proper nouns by design. It flags sustained non-expected-language content, not single borrowed words, acronyms, or technical terms. Its ceiling is the LLM’s own competence in the target languages; exotic code-switching targets (e.g., indigenous languages) may be over-tolerated.
- Text-only ceiling. Neither criterion can distinguish “real but unusual speech” from “plausibly fabricated speech” without access to the audio. If an interview contains genuinely ordinary conversational content, the LLM judge has no grounds to flag it, which is the correct behavior but means well-hidden fabrications will pass.
- LLM cost + latency. Each transcription triggers two LLM calls (one per criterion). For large corpora, budget accordingly or keep the LLM stage disabled for the bulk judging pass (heuristic-only mode) and run it selectively via the smoke script.
- Threshold provenance. The 0.8 / 0.7 defaults started from the rubric design. Empirical calibration against the project’s Portuguese corpus is documented in docs/research/judge-pipeline-calibration.md, including the dual-class audit protocol (30% rejected + 15% admitted, Clopper-Pearson 95% CIs) used to validate them. Re-calibration is needed if the corpus or the validator model changes substantially.
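For readers unfamiliar with the Clopper-Pearson interval mentioned in the audit protocol, the sketch below computes it with the standard library by bisecting on the exact binomial CDF. It illustrates the general statistical method only; it is not taken from the calibration document, and the audit counts in the usage line are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target: float) -> float:
    """Bisect for p in [0, 1] where the decreasing function f(p) equals target."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    # Lower bound solves P(X >= k | p) = alpha / 2, i.e. cdf(k - 1) = 1 - alpha / 2.
    lower = 0.0 if k == 0 else _solve_decreasing(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound solves P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else _solve_decreasing(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Hypothetical audit: 27 of 30 sampled rejections confirmed as true failures.
low, high = clopper_pearson(27, 30)
```

The interval is conservative by construction, which suits small audit samples where a normal approximation would understate the uncertainty.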
Configuration Reference
The judge module uses the ARANDU_JUDGE_ env-var prefix for runtime knobs and per-criterion JSON files for criterion-level tuning.
Runtime settings (ARANDU_JUDGE_*)
| Setting | Type | Default | Env Var |
|---|---|---|---|
| language | str | "pt" | ARANDU_JUDGE_LANGUAGE |
| temperature | float | 0.3 | ARANDU_JUDGE_TEMPERATURE |
| max_tokens | int | 2048 | ARANDU_JUDGE_MAX_TOKENS |
| validator_model | str \| None | None | ARANDU_JUDGE_VALIDATOR_MODEL |
| validator_provider | str \| None | None (inferred) | ARANDU_JUDGE_VALIDATOR_PROVIDER |
| validator_base_url | str \| None | None | ARANDU_JUDGE_VALIDATOR_BASE_URL |
When validator_model is unset (and not provided via --validator-model), judge-transcription runs in heuristic-only mode and skips the LLM stage.
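To make the mapping between settings and environment variables concrete, here is a standard-library sketch that reads the table above into a dataclass. The `JudgeSettings` class and `load_judge_settings` function are hypothetical stand-ins; the project's actual settings loader may work differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "ARANDU_JUDGE_"


@dataclass(frozen=True)
class JudgeSettings:
    """Hypothetical mirror of the runtime settings table."""
    language: str = "pt"
    temperature: float = 0.3
    max_tokens: int = 2048
    validator_model: Optional[str] = None
    validator_provider: Optional[str] = None
    validator_base_url: Optional[str] = None

    @property
    def heuristic_only(self) -> bool:
        # Without a validator model, the LLM stage is skipped.
        return self.validator_model is None


def load_judge_settings(env: Mapping[str, str] = os.environ) -> JudgeSettings:
    """Read ARANDU_JUDGE_* variables, falling back to the documented defaults."""
    return JudgeSettings(
        language=env.get(PREFIX + "LANGUAGE", "pt"),
        temperature=float(env.get(PREFIX + "TEMPERATURE", "0.3")),
        max_tokens=int(env.get(PREFIX + "MAX_TOKENS", "2048")),
        validator_model=env.get(PREFIX + "VALIDATOR_MODEL") or None,
        validator_provider=env.get(PREFIX + "VALIDATOR_PROVIDER") or None,
        validator_base_url=env.get(PREFIX + "VALIDATOR_BASE_URL") or None,
    )


defaults = load_judge_settings({})
with_llm = load_judge_settings({
    "ARANDU_JUDGE_VALIDATOR_MODEL": "qwen3:14b",
    "ARANDU_JUDGE_TEMPERATURE": "0.1",
})
```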
Per-criterion thresholds and detection bounds
LLM criteria ship with a JSON config under prompts/judge/criteria/<name>/config.json; heuristic criteria use Python defaults baked into arandu.transcription.criteria.<name>. Detection thresholds (e.g. max_non_latin_ratio, max_word_repetition_ratio, words-per-minute bounds) are configured per criterion rather than via a global ARANDU_QUALITY_* block. See the criterion source files and JSON configs for the current defaults.
Example .env Configuration
    # Validator model for the LLM stage (omit to run heuristic-only)
    ARANDU_JUDGE_VALIDATOR_MODEL=qwen3:14b
    # Optional: pin a specific provider / base URL
    ARANDU_JUDGE_VALIDATOR_PROVIDER=ollama
    ARANDU_LLM_BASE_URL=http://localhost:11434/v1

    # Sampling for LLM criteria (low temperature for consistent evaluation)
    ARANDU_JUDGE_TEMPERATURE=0.3
    ARANDU_JUDGE_MAX_TOKENS=2048

    # Language used for criterion prompts and script-match expectations
    ARANDU_JUDGE_LANGUAGE=pt

Interpreting Results
Common Failure Patterns
| Pattern | Failing criterion | Root cause |
|---|---|---|
| Japanese text for Portuguese audio | script_match | Whisper language detection failure |
| “Obrigada” x30 | repetition | Whisper repetition loop |
| Uniform 1-second segments | segment_quality | Whisper timestamp artifact |
| 5 words in 60 seconds | content_length_floor (short-circuit) | Mostly silence or background noise |
| English narration in PT corpus | language_drift (LLM stage) | Whisper transcribed against the wrong language model |
| YouTube intro/outro phrasing | hallucination_loop (LLM stage) | Whisper hallucinated training-distribution text |
Threshold Tuning
Thresholds are per-criterion, not global. To make any single failure mode stricter or more lenient, edit the criterion’s config rather than chasing a single overall-score cutoff. The overall passed outcome flips as soon as any criterion’s score is below its threshold, so adjusting one criterion can be done without touching the others.
Programmatic Usage
    from arandu.shared.llm_client import LLMClient, LLMProvider
    from arandu.transcription.judge import TranscriptionJudge, build_validator_client

    # Heuristic-only mode (no LLM calls)
    judge = TranscriptionJudge(language="pt")
    result = judge.evaluate_transcription(
        text=record.transcription_text,
        duration_ms=record.duration_milliseconds,
        segments=record.segments or [],
    )

    # Heuristics + LLM stage
    validator = build_validator_client(model_id="qwen3:14b")  # respects ARANDU_LLM_BASE_URL
    judge = TranscriptionJudge(language="pt", validator_client=validator)
    result = judge.evaluate_transcription(text=..., duration_ms=..., segments=...)

    # Persist back into the EnrichedRecord; is_valid is derived from result.passed
    record.validation = result

Known Limitations
- Non-Latin script detection: Currently only detects CJK (Chinese, Japanese, Korean) as wrong-script for Latin languages. Arabic, Cyrillic, and other scripts are tracked via the generic non_latin_ratio threshold but lack dedicated detection.
- Language probability: Not used as a scoring dimension because valid Portuguese transcriptions can have language_probability: 0.0. Character-level script detection is more reliable.
- Content density with silence: Long audio with extended silence produces low wpm but may still be a valid transcription. Consider the context when interpreting density scores.
See also: Transcription | Configuration | Getting Started