
CLI Reference

Complete reference for the Arandu command-line interface.

  1. Command Overview
  2. Transcription Commands
  3. QA Generation Commands
  4. Utilities Commands
  5. Usage Examples
  6. Common Patterns

The Arandu CLI is built with Typer and provides rich terminal output using Rich.

Base Command: arandu

Command Categories:

  • Transcription: transcribe, drive-transcribe, batch-transcribe
  • Judging: judge-transcription, judge-qa
  • QA Generation: generate-cep-qa
  • Utilities: refresh-auth, info, list-runs, run-info, rebuild-index

Global Options:

  • --help - Show command help
  • --version - Show application version

transcribe

Transcribe a local audio or video file.

Usage:

arandu transcribe FILE_PATH [OPTIONS]

Arguments:

  • FILE_PATH - Path to the audio/video file to transcribe

Options:

  • --model-id, -m (str, default: openai/whisper-large-v3) - Hugging Face model ID for transcription
  • --output, -o (Path, default: auto-generated) - Output file path for transcription JSON
  • --quantize, -q (flag, default: False) - Enable 8-bit quantization to reduce VRAM usage
  • --cpu (flag, default: False) - Force CPU execution (disables CUDA/MPS)
  • --language, -l (str, default: auto-detect) - Language code (e.g., 'pt' for Portuguese)

Examples:

# Basic transcription
arandu transcribe audio.mp3
# With custom model
arandu transcribe audio.mp3 --model-id openai/whisper-large-v3
# With quantization (reduced VRAM)
arandu transcribe audio.mp3 --quantize
# Force CPU execution
arandu transcribe audio.mp3 --cpu
# Specify language
arandu transcribe audio.mp3 --language pt
# Custom output location
arandu transcribe audio.mp3 -o results/transcription.json

drive-transcribe

Transcribe a file from Google Drive. Downloads the file, transcribes it, and uploads the result to the same Drive folder.

Usage:

arandu drive-transcribe FILE_ID [OPTIONS]

Arguments:

  • FILE_ID - Google Drive file ID to transcribe

Options:

  • --model-id, -m (str, default: openai/whisper-large-v3) - Hugging Face model ID
  • --credentials, -c (Path, default: credentials.json) - Path to Google OAuth2 credentials file
  • --token, -t (Path, default: token.json) - Path to Google OAuth2 token file
  • --quantize, -q (flag, default: False) - Enable 8-bit quantization
  • --cpu (flag, default: False) - Force CPU execution
  • --language, -l (str, default: auto-detect) - Language code

Examples:

# Basic usage
arandu drive-transcribe 1abc123xyz --credentials credentials.json
# With custom model and quantization
arandu drive-transcribe 1abc123xyz --model-id openai/whisper-large-v3 --quantize

batch-transcribe

Batch transcribe audio/video files from a catalog CSV with parallel processing and automatic checkpoint/resume capability.

Usage:

arandu batch-transcribe CATALOG_FILE [OPTIONS]

Arguments:

  • CATALOG_FILE - Path to catalog CSV file with Google Drive file metadata

Required CSV Columns:

  • file_id - Google Drive file ID
  • name - File name
  • mime_type - MIME type
  • size_bytes - File size in bytes
  • parents - Parent folder IDs
  • web_content_link - Download link
  • duration_milliseconds (optional) - Media duration
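For reference, a minimal catalog.csv with these columns can be created like this (the file IDs, links, and sizes below are placeholders, not real Drive metadata):

```shell
# Create a minimal catalog.csv (values are placeholders, not real Drive IDs)
cat > catalog.csv <<'EOF'
file_id,name,mime_type,size_bytes,parents,web_content_link,duration_milliseconds
1abc123xyz,aula01.mp3,audio/mpeg,10485760,0folderid,https://drive.google.com/uc?id=1abc123xyz,1800000
2def456uvw,aula02.mp4,video/mp4,52428800,0folderid,https://drive.google.com/uc?id=2def456uvw,2700000
EOF
```

The resulting file can then be passed to arandu batch-transcribe as in the examples further down.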

Options:

  • --output-dir, -o (Path, default: ./results) - Output directory for transcription JSON files
  • --model-id, -m (str, default: openai/whisper-large-v3) - Hugging Face model ID
  • --credentials, -c (Path, default: credentials.json) - Path to Google OAuth2 credentials file
  • --token, -t (Path, default: token.json) - Path to Google OAuth2 token file
  • --workers, -w (int, default: 1) - Number of parallel workers
  • --checkpoint (Path, default: results/checkpoint.json) - Path to checkpoint file
  • --quantize, -q (flag, default: False) - Enable 8-bit quantization
  • --cpu (flag, default: False) - Force CPU execution
  • --language, -l (str, default: auto-detect) - Language code
  • --id (str, default: auto-generated) - Pipeline ID for grouping related steps

Examples:

# Basic batch transcription
arandu batch-transcribe input/catalog.csv --workers 4
# With custom output directory
arandu batch-transcribe input/catalog.csv -o transcriptions/ --workers 2
# With quantization and custom model
arandu batch-transcribe input/catalog.csv \
--model-id openai/whisper-large-v3 \
--quantize \
--workers 4
# Resume interrupted job (uses checkpoint automatically)
arandu batch-transcribe input/catalog.csv --workers 4
# With custom pipeline ID
arandu batch-transcribe input/catalog.csv --id my-project-001

The judge layer scores artifacts (transcriptions, QA pairs) with a composable two-stage pipeline: cheap heuristics first, optional LLM criteria second. The LLM stage is skipped automatically when the heuristic stage rejects.

judge-transcription

Judge transcription quality with heuristic and (optional) LLM criteria. Verdicts are written back into each *_transcription.json record under the validation field.

Usage:

arandu judge-transcription INPUT_DIR [OPTIONS]

Arguments:

  • INPUT_DIR - Directory containing *_transcription.json files

Options:

  • --language (str, default: pt) - Expected transcription language (pt or en)
  • --validator-model (str | None, default: from ARANDU_JUDGE_VALIDATOR_MODEL) - Model ID for the LLM filter stage; omit (and leave the env var unset) to run heuristic-only
  • --validator-provider (str, default: from ARANDU_JUDGE_VALIDATOR_PROVIDER) - LLM provider (openai, ollama, custom); custom requires --validator-base-url or ARANDU_LLM_BASE_URL
  • --validator-base-url (str, default: from ARANDU_JUDGE_VALIDATOR_BASE_URL) - Base URL for the validator provider
  • --validator-temperature (float, default: from ARANDU_JUDGE_TEMPERATURE (0.3)) - Sampling temperature for LLM criteria
  • --validator-max-tokens (int, default: from ARANDU_JUDGE_MAX_TOKENS (2048)) - Max tokens for LLM criterion responses
  • --rejudge / --resume (flag, default: --resume) - --rejudge re-evaluates every record from scratch; --resume (default) skips records already carrying a validation payload

Heuristic stage: content_length_floor (runs first, can short-circuit) → script_match → repetition → content_density → segment_quality.

LLM stage: language_drift + hallucination_loop. Skipped when no validator model is configured.

Examples:

# Heuristic-only (no LLM model needed) — fastest, runs entirely on CPU.
arandu judge-transcription results/
# Heuristics + LLM via Ollama
arandu judge-transcription results/ --validator-model qwen3:14b
# Heuristics + LLM via an OpenAI-compatible custom endpoint
ARANDU_LLM_BASE_URL=https://my-llm.example.com/v1 \
arandu judge-transcription results/ --validator-model openai/gpt-4.1-mini
# Force a fresh pass over every record (default is resume)
arandu judge-transcription results/ --validator-model qwen3:14b --rejudge
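Since verdicts are persisted under the validation field of each record, a post-judging inspection only needs to read the JSON back. A minimal sketch (it prints the payload as stored and assumes nothing about its inner structure):

```shell
# Print the validation payload of every judged record under results/.
# The payload's inner structure is not documented here; this dumps it as stored.
python3 - <<'EOF'
import json
import pathlib

for path in sorted(pathlib.Path("results").glob("*_transcription.json")):
    record = json.loads(path.read_text())
    print(path.name, "->", record.get("validation", "<not judged yet>"))
EOF
```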

judge-qa

Judge QA pair quality (faithfulness, Bloom calibration, informativeness, self-containedness). Verdicts are persisted onto each QA pair.

Usage:

arandu judge-qa INPUT_DIR [OPTIONS]

Arguments:

  • INPUT_DIR - Directory containing CEP QA pair JSON files

Options:

  • --model, -m (str | None, default: from ARANDU_JUDGE_VALIDATOR_MODEL) - Model ID for judge evaluation; required (via CLI flag or env var)
  • --provider (str | None, default: from ARANDU_JUDGE_VALIDATOR_PROVIDER) - LLM provider (openai, ollama, custom); inferred from ARANDU_LLM_BASE_URL when unset (custom if set, else ollama)
  • --base-url (str | None, default: from ARANDU_JUDGE_VALIDATOR_BASE_URL / ARANDU_LLM_BASE_URL) - Custom base URL for OpenAI-compatible endpoints (required for --provider custom)
  • --language, -l (str, default: pt) - Expected QA language (pt or en)
  • --files (int | None, default: all) - Maximum number of QA files to sample
  • --pairs (int | None, default: all) - Maximum QA pairs to judge per file
  • --rejudge / --resume (flag, default: --resume) - --rejudge re-evaluates every pair from scratch; --resume (default) skips pairs already carrying a validation payload

Examples:

# Use ARANDU_JUDGE_VALIDATOR_* env vars from .env
arandu judge-qa cep_dataset/
# Explicit Ollama
arandu judge-qa cep_dataset/ --provider ollama --model qwen3:14b
# OpenAI-compatible custom endpoint (Gemini)
arandu judge-qa cep_dataset/ --provider custom --model gemini-2.5-flash \
--base-url https://generativelanguage.googleapis.com/v1beta/openai/
# Sample-bounded run
arandu judge-qa cep_dataset/ --files 2 --pairs 3

generate-cep-qa

Generate CEP (Cognitive Elicitation Pipeline) QA pairs from transcriptions with Bloom-level scaffolding and LLM-as-a-Judge validation.

Usage:

arandu generate-cep-qa INPUT_DIR [OPTIONS]

Arguments:

  • INPUT_DIR - Directory containing transcription JSON files

Options:

  • --output-dir, -o (Path, default: qa_dataset) - Output directory for CEP QA dataset JSON files
  • --provider (str, default: ollama) - LLM provider: openai, ollama, custom
  • --model-id, -m (str, default: qwen3:14b) - Model ID for QA generation
  • --workers, -w (int, default: 2) - Number of parallel workers
  • --questions (int, default: 10) - Number of QA pairs per document (1-50)
  • --temperature (float, default: 0.7) - LLM temperature for generation (0.0-2.0)
  • --ollama-url (str, default: http://localhost:11434/v1) - Ollama API base URL
  • --base-url (str, default: None) - Custom base URL for OpenAI-compatible endpoints
  • --language, -l (str, default: pt) - Language for prompts: 'pt' or 'en'
  • --validate / --no-validate (flag, default: True) - Enable LLM-as-a-Judge validation
  • --validator-model (str, default: qwen3:14b) - Model ID for validation
  • --bloom-dist (str, default: None) - Bloom level distribution (e.g., 'remember:0.2,understand:0.3')
  • --jsonl / --no-jsonl (flag, default: False) - Export QA pairs to JSONL format for training
  • --id (str, default: auto-resolved) - Pipeline ID (auto-resolves transcription outputs)

Examples:

# Basic usage with Ollama
arandu generate-cep-qa results/ -o qa_dataset/ --workers 4
# With custom Bloom distribution
arandu generate-cep-qa results/ \
--bloom-dist "remember:0.2,understand:0.3,analyze:0.3,evaluate:0.2" \
--questions 15
# With OpenAI
arandu generate-cep-qa results/ \
--provider openai \
--model-id gpt-4o-mini \
--workers 2
# Without validation (faster)
arandu generate-cep-qa results/ \
--no-validate \
--workers 4
# With custom validator model
arandu generate-cep-qa results/ \
--validator-model qwen3:14b \
--questions 12
# Export to JSONL for KGQA training
arandu generate-cep-qa results/ \
--jsonl \
--questions 20
# English prompts
arandu generate-cep-qa results/ \
--language en \
--questions 10
# With pipeline ID
arandu generate-cep-qa results/ --id my-project-001
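Before launching a long run it can help to sanity-check a --bloom-dist string. A small sketch, assuming the weights are expected to sum to 1.0 (arandu's own validation rules may be looser):

```shell
# Parse a --bloom-dist string and check the weights sum to 1.0
# (assumption: the CLI's own validation may differ).
DIST="remember:0.2,understand:0.3,analyze:0.3,evaluate:0.2"
python3 - "$DIST" <<'EOF'
import sys

levels = dict(pair.split(":") for pair in sys.argv[1].split(","))
total = sum(float(weight) for weight in levels.values())
print(f"{levels} sum={total:.2f}")
assert abs(total - 1.0) < 1e-6, "Bloom weights should sum to 1.0"
EOF
```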

Output Structure:

qa_dataset/
├── cep_qa_1abc123xyz.json
├── cep_qa_2def456uvw.json
└── cep_qa_checkpoint.json

refresh-auth

Fully refresh the Google OAuth2 authentication token: deletes the existing token and initiates a fresh OAuth2 authorization flow.

Usage:

arandu refresh-auth [OPTIONS]

Options:

  • --credentials, -c (Path, default: credentials.json) - Path to Google OAuth2 credentials file
  • --token, -t (Path, default: token.json) - Path to token file to refresh

Example:

arandu refresh-auth --credentials credentials.json --token token.json

info

Display system information and hardware capabilities.

Usage:

arandu info

Output:

  • Application version
  • Device type (CPU/CUDA/MPS)
  • CUDA/MPS availability
  • PyTorch version and configuration
  • GPU memory information (if available)

Example:

arandu info

list-runs

List all pipeline runs with status and metadata.

Usage:

arandu list-runs [OPTIONS]

Options:

  • --pipeline, -p (str, default: all pipelines) - Filter by pipeline type: transcription, qa, cep, kg, evaluation
  • --results-dir, -r (Path, default: ./results) - Base results directory

Examples:

# List all runs
arandu list-runs
# Filter by pipeline type
arandu list-runs --pipeline transcription
# Custom results directory
arandu list-runs --results-dir /path/to/results

run-info

Display detailed information about a specific run, including execution environment, hardware info, configuration, and processing statistics.

Usage:

arandu run-info RUN_ID [OPTIONS]

Arguments:

  • RUN_ID - Run ID to display, or “latest” for the most recent run

Options:

  • --pipeline, -p (str, default: transcription) - Pipeline type (required when using 'latest')
  • --results-dir, -r (Path, default: ./results) - Base results directory

Examples:

# Display specific run
arandu run-info transcription_20260211_143022
# Display latest transcription run
arandu run-info latest --pipeline transcription
# Display latest CEP run
arandu run-info latest --pipeline cep

rebuild-index

Rebuild index.json from existing run directories by scanning all pipeline ID directories for run_metadata.json files.

Usage:

arandu rebuild-index [OPTIONS]

Options:

  • --results-dir, -r (Path, default: ./results) - Base results directory

Example:

arandu rebuild-index --results-dir /path/to/results

Complete pipeline from transcription to CEP QA generation:

# Step 1: Batch transcribe files
arandu batch-transcribe input/catalog.csv \
--workers 4 \
--quantize \
--id etno-001
# Step 2: Judge transcription quality
arandu judge-transcription results/
# Step 3: Generate CEP QA pairs
arandu generate-cep-qa results/ \
--workers 4 \
--questions 12 \
--language pt \
--id etno-001
# Step 4: List all runs
arandu list-runs
# Step 5: View run details
arandu run-info latest --pipeline cep

All batch commands support automatic checkpointing:

# Start batch transcription
arandu batch-transcribe input/catalog.csv --workers 4
# If interrupted, resume automatically
arandu batch-transcribe input/catalog.csv --workers 4
# Will skip already processed files
# CEP generation also supports resume
arandu generate-cep-qa results/ --workers 4

Use different LLM providers and models:

# With Ollama (default)
arandu generate-cep-qa results/ \
--provider ollama \
--model-id qwen3:14b \
--workers 4
# With OpenAI
export OPENAI_API_KEY=sk-...
arandu generate-cep-qa results/ \
--provider openai \
--model-id gpt-4o-mini \
--workers 2
# With custom OpenAI-compatible endpoint
arandu generate-cep-qa results/ \
--provider custom \
--base-url https://my-vllm-server/v1 \
--model-id llama3.1:70b

Command-line arguments override environment variables:

# Config says ollama, but we override to openai
export ARANDU_QA_PROVIDER=ollama
arandu generate-cep-qa results/ --provider openai

Set defaults via environment instead of CLI:

# Transcription settings
export ARANDU_MODEL_ID=openai/whisper-large-v3
export ARANDU_WORKERS=4
export ARANDU_QUANTIZE=true
# QA settings
export ARANDU_QA_PROVIDER=ollama
export ARANDU_QA_MODEL_ID=qwen3:14b
export ARANDU_QA_QUESTIONS_PER_DOCUMENT=12
# CEP settings
export ARANDU_CEP_ENABLE_VALIDATION=true
export ARANDU_CEP_VALIDATOR_MODEL_ID=qwen3:14b
# Now run with defaults
arandu batch-transcribe input/catalog.csv
arandu generate-cep-qa results/

Use consistent pipeline IDs across related steps:

# All steps use same pipeline ID
PIPELINE_ID="etno-project-001"
arandu batch-transcribe input/catalog.csv --id $PIPELINE_ID
arandu judge-transcription results/
arandu generate-cep-qa results/ --id $PIPELINE_ID
# View all runs for this pipeline
arandu list-runs
arandu run-info $PIPELINE_ID

LLM Provider Not Available:

Error: Ollama server not reachable at http://localhost:11434
Solution: Start Ollama with 'ollama serve'

API Key Missing:

Error: OPENAI_API_KEY environment variable not set
Solution: export OPENAI_API_KEY=sk-...

Input Directory Empty:

Error: No transcription files found in results/
Solution: Check directory path and ensure files have .json extension

Checkpoint Corruption:

Error: Checkpoint file corrupted
Solution: Delete checkpoint.json and restart the command

Test on sample data before processing full corpus:

# Test on 5 files first
mkdir samples && ls results/*.json | head -5 | xargs -I {} cp {} samples/
arandu generate-cep-qa samples/ -o qa_test/

Checkpoints (written to results/checkpoint.json by default) enable automatic resume:
# Runs create checkpoints automatically
arandu batch-transcribe input/catalog.csv --workers 4
# Resume automatically if interrupted
arandu batch-transcribe input/catalog.csv --workers 4
# Skips already processed files
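To see what a checkpoint has recorded so far, pretty-print it. This sketch relies only on the file being valid JSON, not on any particular schema:

```shell
# Inspect the checkpoint file; assumes nothing beyond it being valid JSON.
if [ -f results/checkpoint.json ]; then
    python3 -m json.tool results/checkpoint.json
else
    echo "no checkpoint yet"
fi
```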

Use pipeline tracking commands:

# List all runs
arandu list-runs
# View latest run details
arandu run-info latest --pipeline transcription
# Check run statistics
arandu run-info latest --pipeline cep

Adjust workers based on available resources:

# CPU-bound tasks: Use available cores
arandu generate-cep-qa results/ --workers $(nproc)
# Memory-constrained: Reduce workers
arandu batch-transcribe catalog.csv --workers 2

Always judge transcriptions before downstream tasks:

# Judge first
arandu judge-transcription results/
# Then generate QA
arandu generate-cep-qa results/ --workers 4

Document Version: 2.0
Last Updated: 2026-02-11
Status: Aligned with codebase v0.1.0