# CLI Reference
Complete reference for all command-line interface commands in Arandu.
## Table of Contents

- Command Overview
- Transcription Commands
- Judging Commands
- QA Generation Commands
- Utilities Commands
- Usage Examples
- Common Patterns
- Error Handling
- Tips and Best Practices
## Command Overview

The Arandu CLI is built with Typer and provides rich terminal output using Rich.
Base Command: `arandu`

Command Categories:

- Transcription: `transcribe`, `drive-transcribe`, `batch-transcribe`
- Judging: `judge-transcription`, `judge-qa`
- QA Generation: `generate-cep-qa`
- Utilities: `refresh-auth`, `info`, `list-runs`, `run-info`, `rebuild-index`
Global Options:

- `--help` - Show command help
- `--version` - Show application version
## Transcription Commands

### transcribe

Transcribe a local audio or video file.
Usage:

```sh
arandu transcribe FILE_PATH [OPTIONS]
```

Arguments:

- `FILE_PATH` - Path to the audio/video file to transcribe
Options:

| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| `--model-id` | `-m` | str | `openai/whisper-large-v3` | Hugging Face model ID for transcription |
| `--output` | `-o` | Path | Auto-generated | Output file path for transcription JSON |
| `--quantize` | `-q` | flag | False | Enable 8-bit quantization to reduce VRAM usage |
| `--cpu` | | flag | False | Force CPU execution (disables CUDA/MPS) |
| `--language` | `-l` | str | Auto-detect | Language code (e.g., `pt` for Portuguese) |
Examples:

```sh
# Basic transcription
arandu transcribe audio.mp3

# With custom model
arandu transcribe audio.mp3 --model-id openai/whisper-large-v3

# With quantization (reduced VRAM)
arandu transcribe audio.mp3 --quantize

# Force CPU execution
arandu transcribe audio.mp3 --cpu

# Specify language
arandu transcribe audio.mp3 --language pt

# Custom output location
arandu transcribe audio.mp3 -o results/transcription.json
```

### drive-transcribe
Transcribe a file from Google Drive. Downloads the file, transcribes it, and uploads the result to the same Drive folder.
Usage:

```sh
arandu drive-transcribe FILE_ID [OPTIONS]
```

Arguments:

- `FILE_ID` - Google Drive file ID to transcribe
Options:

| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| `--model-id` | `-m` | str | `openai/whisper-large-v3` | Hugging Face model ID |
| `--credentials` | `-c` | Path | `credentials.json` | Path to Google OAuth2 credentials file |
| `--token` | `-t` | Path | `token.json` | Path to Google OAuth2 token file |
| `--quantize` | `-q` | flag | False | Enable 8-bit quantization |
| `--cpu` | | flag | False | Force CPU execution |
| `--language` | `-l` | str | Auto-detect | Language code |
Examples:

```sh
# Basic usage
arandu drive-transcribe 1abc123xyz --credentials credentials.json

# With custom model and quantization
arandu drive-transcribe 1abc123xyz --model-id openai/whisper-large-v3 --quantize
```

### batch-transcribe
Batch transcribe audio/video files from a catalog CSV with parallel processing and automatic checkpoint/resume capability.
Usage:

```sh
arandu batch-transcribe CATALOG_FILE [OPTIONS]
```

Arguments:

- `CATALOG_FILE` - Path to catalog CSV file with Google Drive file metadata

Required CSV Columns:

- `file_id` - Google Drive file ID
- `name` - File name
- `mime_type` - MIME type
- `size_bytes` - File size in bytes
- `parents` - Parent folder IDs
- `web_content_link` - Download link
- `duration_milliseconds` (optional) - Media duration
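A minimal catalog with those columns might look like the fragment below. All values here are illustrative placeholders, not real Drive IDs or links:

```csv
file_id,name,mime_type,size_bytes,parents,web_content_link,duration_milliseconds
1abc123xyz,interview_01.mp3,audio/mpeg,10485760,0FolderIdExample,https://drive.google.com/uc?id=1abc123xyz,1800000
2def456uvw,lecture_02.mp4,video/mp4,52428800,0FolderIdExample,https://drive.google.com/uc?id=2def456uvw,3600000
```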
Options:

| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| `--output-dir` | `-o` | Path | `./results` | Output directory for transcription JSON files |
| `--model-id` | `-m` | str | `openai/whisper-large-v3` | Hugging Face model ID |
| `--credentials` | `-c` | Path | `credentials.json` | Path to Google OAuth2 credentials file |
| `--token` | `-t` | Path | `token.json` | Path to Google OAuth2 token file |
| `--workers` | `-w` | int | 1 | Number of parallel workers |
| `--checkpoint` | | Path | `results/checkpoint.json` | Path to checkpoint file |
| `--quantize` | `-q` | flag | False | Enable 8-bit quantization |
| `--cpu` | | flag | False | Force CPU execution |
| `--language` | `-l` | str | Auto-detect | Language code |
| `--id` | | str | Auto-generated | Pipeline ID for grouping related steps |
Examples:

```sh
# Basic batch transcription
arandu batch-transcribe input/catalog.csv --workers 4

# With custom output directory
arandu batch-transcribe input/catalog.csv -o transcriptions/ --workers 2

# With quantization and custom model
arandu batch-transcribe input/catalog.csv \
  --model-id openai/whisper-large-v3 \
  --quantize \
  --workers 4

# Resume interrupted job (uses checkpoint automatically)
arandu batch-transcribe input/catalog.csv --workers 4

# With custom pipeline ID
arandu batch-transcribe input/catalog.csv --id my-project-001
```

## Judging Commands
The judge layer scores artifacts (transcriptions, QA pairs) with a composable two-stage pipeline: cheap heuristics first, optional LLM criteria second. The LLM stage is skipped automatically when the heuristic stage rejects.
### judge-transcription

Judge transcription quality with heuristic and (optional) LLM criteria. Verdicts are written back into each `*_transcription.json` record under the `validation` field.
Usage:

```sh
arandu judge-transcription INPUT_DIR [OPTIONS]
```

Arguments:

- `INPUT_DIR` - Directory containing `*_transcription.json` files
Options:

| Option | Type | Default | Description |
|---|---|---|---|
| `--language` | str | `pt` | Expected transcription language (`pt` or `en`) |
| `--validator-model` | str | None, or from `ARANDU_JUDGE_VALIDATOR_MODEL` | Model ID for the LLM filter stage. Omit (and leave the env var unset) to run heuristic-only |
| `--validator-provider` | str | from `ARANDU_JUDGE_VALIDATOR_PROVIDER` | LLM provider (`openai`, `ollama`, `custom`). `custom` requires `--validator-base-url` or `ARANDU_LLM_BASE_URL` |
| `--validator-base-url` | str | from `ARANDU_JUDGE_VALIDATOR_BASE_URL` | Base URL for the validator provider |
| `--validator-temperature` | float | from `ARANDU_JUDGE_TEMPERATURE` (0.3) | Sampling temperature for LLM criteria |
| `--validator-max-tokens` | int | from `ARANDU_JUDGE_MAX_TOKENS` (2048) | Max tokens for LLM criterion responses |
| `--rejudge` / `--resume` | flag | `--resume` | `--rejudge` re-evaluates every record from scratch; `--resume` (default) skips records already carrying a `validation` payload |
Heuristic stage: `content_length_floor` (runs first, can short-circuit) → `script_match` → `repetition` → `content_density` → `segment_quality`.

LLM stage: `language_drift` + `hallucination_loop`. Skipped when no validator model is configured.
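The two-stage control flow can be sketched in Python. This is a minimal illustration of the short-circuit pattern with hypothetical names, not Arandu's internals:

```python
def judge(record, heuristics, llm_criteria, validator=None):
    """Run cheap heuristic criteria first; any rejection short-circuits,
    so the expensive LLM stage never runs for an obviously bad record."""
    for criterion in heuristics:
        ok, reason = criterion(record)
        if not ok:
            return {"verdict": "reject", "stage": "heuristic", "reason": reason}
    # LLM stage is skipped entirely when no validator model is configured.
    if validator is None:
        return {"verdict": "accept", "stage": "heuristic", "reason": None}
    for criterion in llm_criteria:
        ok, reason = criterion(record, validator)
        if not ok:
            return {"verdict": "reject", "stage": "llm", "reason": reason}
    return {"verdict": "accept", "stage": "llm", "reason": None}


# Toy criteria standing in for content_length_floor and language_drift.
def content_length_floor(record):
    return (len(record["text"]) >= 20, "transcript too short")

def language_drift(record, validator):
    return (validator(record["text"]), "language drifted from expected")

short = judge({"text": "hi"}, [content_length_floor], [language_drift],
              validator=lambda text: True)
print(short["verdict"], short["stage"])  # reject heuristic
```

Because `content_length_floor` rejects first, the (simulated) LLM validator is never called, which is the cost-saving point of the two-stage design.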
Examples:

```sh
# Heuristic-only (no LLM model needed); fastest, runs entirely on CPU
arandu judge-transcription results/

# Heuristics + LLM via Ollama
arandu judge-transcription results/ --validator-model qwen3:14b

# Heuristics + LLM via an OpenAI-compatible custom endpoint
ARANDU_LLM_BASE_URL=https://my-llm.example.com/v1 \
  arandu judge-transcription results/ --validator-model openai/gpt-4.1-mini

# Force a fresh pass over every record (default is resume)
arandu judge-transcription results/ --validator-model qwen3:14b --rejudge
```

### judge-qa
Judge QA pair quality (faithfulness, Bloom calibration, informativeness, self-containedness). Verdicts are persisted onto each QA pair.
Usage:

```sh
arandu judge-qa INPUT_DIR [OPTIONS]
```

Arguments:

- `INPUT_DIR` - Directory containing CEP QA pair JSON files
Options:

| Option | Type | Default | Description |
|---|---|---|---|
| `--model` / `-m` | str | None, or from `ARANDU_JUDGE_VALIDATOR_MODEL` | Model ID for judge evaluation. Required (CLI flag or env var) |
| `--provider` | str | None, or from `ARANDU_JUDGE_VALIDATOR_PROVIDER` | LLM provider (`openai`, `ollama`, `custom`). Inferred from `ARANDU_LLM_BASE_URL` when unset (`custom` if set, else `ollama`) |
| `--base-url` | str | None, or from `ARANDU_JUDGE_VALIDATOR_BASE_URL` / `ARANDU_LLM_BASE_URL` | Custom base URL for OpenAI-compatible endpoints (required for `--provider custom`) |
| `--language` / `-l` | str | `pt` | Expected QA language (`pt` or `en`) |
| `--files` | int | None (all) | Maximum number of QA files to sample |
| `--pairs` | int | None (all) | Maximum QA pairs to judge per file |
| `--rejudge` / `--resume` | flag | `--resume` | `--rejudge` re-evaluates every pair from scratch; `--resume` (default) skips pairs already carrying a `validation` payload |
Examples:

```sh
# Use ARANDU_JUDGE_VALIDATOR_* env vars from .env
arandu judge-qa cep_dataset/

# Explicit Ollama
arandu judge-qa cep_dataset/ --provider ollama --model qwen3:14b

# OpenAI-compatible custom endpoint (Gemini)
arandu judge-qa cep_dataset/ --provider custom --model gemini-2.5-flash \
  --base-url https://generativelanguage.googleapis.com/v1beta/openai/

# Sample-bounded run
arandu judge-qa cep_dataset/ --files 2 --pairs 3
```

## QA Generation Commands

### generate-cep-qa
Generate CEP (Cognitive Elicitation Pipeline) QA pairs from transcriptions with Bloom-level scaffolding and LLM-as-a-Judge validation.
Usage:

```sh
arandu generate-cep-qa INPUT_DIR [OPTIONS]
```

Arguments:

- `INPUT_DIR` - Directory containing transcription JSON files
Options:

| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| `--output-dir` | `-o` | Path | `qa_dataset` | Output directory for CEP QA dataset JSON files |
| `--provider` | | str | `ollama` | LLM provider: `openai`, `ollama`, `custom` |
| `--model-id` | `-m` | str | `qwen3:14b` | Model ID for QA generation |
| `--workers` | `-w` | int | 2 | Number of parallel workers |
| `--questions` | | int | 10 | Number of QA pairs per document (1-50) |
| `--temperature` | | float | 0.7 | LLM temperature for generation (0.0-2.0) |
| `--ollama-url` | | str | `http://localhost:11434/v1` | Ollama API base URL |
| `--base-url` | | str | None | Custom base URL for OpenAI-compatible endpoints |
| `--language` | `-l` | str | `pt` | Language for prompts: `pt` or `en` |
| `--validate/--no-validate` | | flag | True | Enable LLM-as-a-Judge validation |
| `--validator-model` | | str | `qwen3:14b` | Model ID for validation |
| `--bloom-dist` | | str | None | Bloom level distribution (e.g., `remember:0.2,understand:0.3`) |
| `--jsonl/--no-jsonl` | | flag | False | Export QA pairs to JSONL format for training |
| `--id` | | str | Auto-resolved | Pipeline ID (auto-resolves transcription outputs) |
Examples:

```sh
# Basic usage with Ollama
arandu generate-cep-qa results/ -o qa_dataset/ --workers 4

# With custom Bloom distribution
arandu generate-cep-qa results/ \
  --bloom-dist "remember:0.2,understand:0.3,analyze:0.3,evaluate:0.2" \
  --questions 15

# With OpenAI
arandu generate-cep-qa results/ \
  --provider openai \
  --model-id gpt-4o-mini \
  --workers 2

# Without validation (faster)
arandu generate-cep-qa results/ \
  --no-validate \
  --workers 4

# With custom validator model
arandu generate-cep-qa results/ \
  --validator-model qwen3:14b \
  --questions 12

# Export to JSONL for KGQA training
arandu generate-cep-qa results/ \
  --jsonl \
  --questions 20

# English prompts
arandu generate-cep-qa results/ \
  --language en \
  --questions 10

# With pipeline ID
arandu generate-cep-qa results/ --id my-project-001
```

Output Structure:

```
qa_dataset/
├── cep_qa_1abc123xyz.json
├── cep_qa_2def456uvw.json
└── cep_qa_checkpoint.json
```

## Utilities Commands
### refresh-auth

Fully refresh the Google OAuth2 authentication token. Deletes the existing token and initiates a fresh OAuth2 authorization flow.
Usage:

```sh
arandu refresh-auth [OPTIONS]
```

Options:

| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| `--credentials` | `-c` | Path | `credentials.json` | Path to Google OAuth2 credentials file |
| `--token` | `-t` | Path | `token.json` | Path to token file to refresh |
Example:

```sh
arandu refresh-auth --credentials credentials.json --token token.json
```

### info

Display system information and hardware capabilities.
Usage:

```sh
arandu info
```

Output:
- Application version
- Device type (CPU/CUDA/MPS)
- CUDA/MPS availability
- PyTorch version and configuration
- GPU memory information (if available)
Example:

```sh
arandu info
```

### list-runs

List all pipeline runs with status and metadata.
Usage:

```sh
arandu list-runs [OPTIONS]
```

Options:

| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| `--pipeline` | `-p` | str | All pipelines | Filter by pipeline type: `transcription`, `qa`, `cep`, `kg`, `evaluation` |
| `--results-dir` | `-r` | Path | `./results` | Base results directory |
Examples:

```sh
# List all runs
arandu list-runs

# Filter by pipeline type
arandu list-runs --pipeline transcription

# Custom results directory
arandu list-runs --results-dir /path/to/results
```

### run-info

Display detailed information about a specific run, including execution environment, hardware info, configuration, and processing statistics.
Usage:

```sh
arandu run-info RUN_ID [OPTIONS]
```

Arguments:

- `RUN_ID` - Run ID to display, or `latest` for the most recent run

Options:

| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| `--pipeline` | `-p` | str | `transcription` | Pipeline type (required when using `latest`) |
| `--results-dir` | `-r` | Path | `./results` | Base results directory |
Examples:

```sh
# Display specific run
arandu run-info transcription_20260211_143022

# Display latest transcription run
arandu run-info latest --pipeline transcription

# Display latest CEP run
arandu run-info latest --pipeline cep
```

### rebuild-index

Rebuild `index.json` from existing run directories by scanning all pipeline ID directories for `run_metadata.json` files.
Usage:

```sh
arandu rebuild-index [OPTIONS]
```

Options:

| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| `--results-dir` | `-r` | Path | `./results` | Base results directory |
Example:

```sh
arandu rebuild-index --results-dir /path/to/results
```

## Usage Examples

### End-to-End Pipeline

Complete pipeline from transcription to CEP QA generation:
```sh
# Step 1: Batch transcribe files
arandu batch-transcribe input/catalog.csv \
  --workers 4 \
  --quantize \
  --id etno-001

# Step 2: Judge transcription quality
arandu judge-transcription results/

# Step 3: Generate CEP QA pairs
arandu generate-cep-qa results/ \
  --workers 4 \
  --questions 12 \
  --language pt \
  --id etno-001

# Step 4: List all runs
arandu list-runs

# Step 5: View run details
arandu run-info latest --pipeline cep
```

### Resume Interrupted Job
All batch commands support automatic checkpointing:
```sh
# Start batch transcription
arandu batch-transcribe input/catalog.csv --workers 4

# If interrupted, resume automatically
arandu batch-transcribe input/catalog.csv --workers 4
# Will skip already processed files

# CEP generation also supports resume
arandu generate-cep-qa results/ --workers 4
```

### Custom LLM Configuration
Use different LLM providers and models:
```sh
# With Ollama (default)
arandu generate-cep-qa results/ \
  --provider ollama \
  --model-id qwen3:14b \
  --workers 4

# With OpenAI
export OPENAI_API_KEY=sk-...
arandu generate-cep-qa results/ \
  --provider openai \
  --model-id gpt-4o-mini \
  --workers 2

# With custom OpenAI-compatible endpoint
arandu generate-cep-qa results/ \
  --provider custom \
  --base-url https://my-vllm-server/v1 \
  --model-id llama3.1:70b
```

## Common Patterns

### Configuration Override
Command-line arguments override environment variables:
```sh
# Config says ollama, but we override to openai
export ARANDU_QA_PROVIDER=ollama
arandu generate-cep-qa results/ --provider openai
```

### Environment Variables

Set defaults via environment instead of CLI:
```sh
# Transcription settings
export ARANDU_MODEL_ID=openai/whisper-large-v3
export ARANDU_WORKERS=4
export ARANDU_QUANTIZE=true

# QA settings
export ARANDU_QA_PROVIDER=ollama
export ARANDU_QA_MODEL_ID=qwen3:14b
export ARANDU_QA_QUESTIONS_PER_DOCUMENT=12

# CEP settings
export ARANDU_CEP_ENABLE_VALIDATION=true
export ARANDU_CEP_VALIDATOR_MODEL_ID=qwen3:14b

# Now run with defaults
arandu batch-transcribe input/catalog.csv
arandu generate-cep-qa results/
```

### Pipeline ID Tracking
Use consistent pipeline IDs across related steps:
```sh
# All steps use the same pipeline ID
PIPELINE_ID="etno-project-001"

arandu batch-transcribe input/catalog.csv --id $PIPELINE_ID
arandu judge-transcription results/
arandu generate-cep-qa results/ --id $PIPELINE_ID

# View all runs for this pipeline
arandu list-runs
arandu run-info $PIPELINE_ID
```

## Error Handling
### Common Errors

LLM Provider Not Available:
```
Error: Ollama server not reachable at http://localhost:11434
Solution: Start Ollama with 'ollama serve'
```

API Key Missing:

```
Error: OPENAI_API_KEY environment variable not set
Solution: export OPENAI_API_KEY=sk-...
```

Input Directory Empty:

```
Error: No transcription files found in results/
Solution: Check directory path and ensure files have .json extension
```

Checkpoint Corruption:

```
Error: Checkpoint file corrupted
Solution: Delete checkpoint.json and restart the command
```

## Tips and Best Practices

### 1. Start Small

Test on sample data before processing the full corpus:
```sh
# Test on 5 files first
mkdir samples && ls results/*.json | head -5 | xargs -I {} cp {} samples/
arandu generate-cep-qa samples/ -o qa_test/
```

### 2. Use Checkpoints

Checkpoints enable automatic resume:
```sh
# Runs create checkpoints automatically
arandu batch-transcribe input/catalog.csv --workers 4

# Resume automatically if interrupted
arandu batch-transcribe input/catalog.csv --workers 4
# Skips already processed files
```

### 3. Monitor Progress

Use pipeline tracking commands:
```sh
# List all runs
arandu list-runs

# View latest run details
arandu run-info latest --pipeline transcription

# Check run statistics
arandu run-info latest --pipeline cep
```

### 4. Optimize Workers

Adjust workers based on available resources:
```sh
# CPU-bound tasks: use available cores
arandu generate-cep-qa results/ --workers $(nproc)

# Memory-constrained: reduce workers
arandu batch-transcribe catalog.csv --workers 2
```

### 5. Judge Quality

Always judge transcriptions before downstream tasks:
```sh
# Judge first
arandu judge-transcription results/

# Then generate QA
arandu generate-cep-qa results/ --workers 4
```

Document Version: 2.0
Last Updated: 2026-02-11
Status: Aligned with codebase v0.1.0