
Configuration Reference

Complete reference for all configuration settings in the Arandu pipeline.

  1. Configuration System Overview
  2. TranscriberConfig
  3. QAConfig
  4. CEPConfig
  5. KGConfig
  6. EvaluationConfig
  7. LLMConfig
  8. ResultsConfig
  9. TranscriptionQualityConfig
  10. Environment Variables
  11. Configuration Examples

Configuration System Overview

The Arandu project uses Pydantic Settings for configuration management with hierarchical loading:

  1. Command-line arguments (highest priority)
  2. Environment variables with config-specific prefixes
  3. .env file in project root
  4. Default values in config.py (lowest priority)
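The precedence above can be sketched as a simple first-non-None lookup. This is an illustrative model of the behavior, not the actual Pydantic Settings internals, and `resolve_setting` is a hypothetical helper:

```python
# Illustrative sketch of the four-level precedence: the first source
# that is set wins. Not the actual Pydantic Settings implementation.
def resolve_setting(cli=None, env=None, dotenv=None, default=None):
    """Return the highest-priority value that is set."""
    for value in (cli, env, dotenv, default):
        if value is not None:
            return value
    return None

# With no CLI flag, the environment variable wins over .env and defaults.
provider = resolve_setting(cli=None, env="openai", dotenv="ollama", default="ollama")
```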

Configuration File: src/arandu/config.py

Architecture: The system uses 8 separate configuration classes, each with its own environment variable prefix:

  • TranscriberConfig - Prefix: ARANDU_
  • QAConfig - Prefix: ARANDU_QA_
  • CEPConfig - Prefix: ARANDU_CEP_
  • KGConfig - Prefix: ARANDU_KG_
  • EvaluationConfig - Prefix: ARANDU_EVAL_
  • LLMConfig - No prefix (uses aliases like OPENAI_API_KEY)
  • ResultsConfig - Prefix: ARANDU_RESULTS_
  • TranscriptionQualityConfig - Prefix: ARANDU_QUALITY_

Usage:

```python
from arandu.config import TranscriberConfig, QAConfig

transcriber_config = TranscriberConfig()
qa_config = QAConfig()
print(transcriber_config.model_id)  # openai/whisper-large-v3
print(qa_config.provider)  # ollama
```

TranscriberConfig

Configuration settings for the transcription pipeline.

Environment Prefix: ARANDU_

Model Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | str | "openai/whisper-large-v3" | Hugging Face model ID for Whisper transcription |
| language | str \| None | None | Language code (e.g., 'pt'). If None, auto-detect |
| return_timestamps | bool | True | Return timestamps for transcription segments |
| chunk_length_s | int | 30 | Audio chunk length in seconds |
| stride_length_s | int | 5 | Stride length in seconds between chunks |

Hardware Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| force_cpu | bool | False | Force CPU execution instead of GPU |
| quantize | bool | False | Enable 8-bit quantization to reduce VRAM |
| quantize_bits | int | 8 | Number of bits for quantization |

Google Drive Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| credentials | str | "credentials.json" | Path to Google OAuth2 credentials file |
| token | str | "token.json" | Path to Google OAuth2 token file |
| scopes | list[str] | ["https://www.googleapis.com/auth/drive"] | OAuth2 scopes for Google Drive API |

Note: credentials_file and token_file are backward-compatible property aliases for credentials and token.
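Such backward-compatible aliases are typically plain read-only properties. A minimal sketch, using a hypothetical `DriveSettings` class rather than the actual TranscriberConfig implementation:

```python
# Minimal sketch of a backward-compatible property alias
# (illustrative; not the actual TranscriberConfig code).
class DriveSettings:
    def __init__(self, credentials="credentials.json", token="token.json"):
        self.credentials = credentials
        self.token = token

    @property
    def credentials_file(self):  # old name, kept for compatibility
        return self.credentials

    @property
    def token_file(self):  # old name, kept for compatibility
        return self.token
```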

Batch Processing

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| workers | int | 1 | Number of parallel workers for batch processing |
| catalog_file | str | "catalog.csv" | Name of the catalog CSV file |

Paths

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| input_dir | str | "./input" | Directory containing input files |
| results_dir | str | "./results" | Directory for transcription results |
| credentials_dir | str | "./" | Directory containing credentials and token files |
| hf_cache_dir | str | "./cache/huggingface" | Hugging Face cache directory for model storage |

Processing

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| temp_dir | str | "/tmp/arandu" (platform-specific) | Temporary directory for file processing |
| max_retries | int | 3 | Maximum number of retry attempts for failed operations |
| retry_delay | float | 1.0 | Delay in seconds between retry attempts |
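A typical consumer of max_retries and retry_delay is a retry loop like the following sketch. This is illustrative only; the pipeline's actual retry logic may differ:

```python
import time


def with_retries(operation, max_retries=3, retry_delay=1.0):
    """Run `operation`, retrying up to `max_retries` attempts total,
    sleeping `retry_delay` seconds between attempts. Illustrative sketch.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                time.sleep(retry_delay)
    raise last_error
```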

Example Configuration:

```python
from arandu.config import TranscriberConfig

config = TranscriberConfig()

# Or with custom settings:
config = TranscriberConfig(
    model_id="openai/whisper-large-v3",
    force_cpu=False,
    workers=4,
)
```

QAConfig

Configuration settings for the QA generation pipeline.

Environment Prefix: ARANDU_QA_

LLM Provider Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| provider | str | "ollama" | LLM provider: "openai", "ollama", "custom" |
| model_id | str | "qwen3:14b" | Model ID for QA generation |
| ollama_url | str | "http://localhost:11434/v1" | Ollama API base URL |
| base_url | str \| None | None | Custom base URL for OpenAI-compatible endpoints |

Generation Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| questions_per_document | int | 10 | Number of QA pairs to generate per document (min: 1, max: 50) |
| temperature | float | 0.7 | Temperature for QA generation LLM (range: 0.0-2.0) |
| max_tokens | int | 2048 | Max tokens for QA generation LLM (min: 1) |

Output Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| output_dir | Path | Path("qa_dataset") | Output directory for QA datasets |

Language and Worker Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| language | str | "pt" | Language code for QA generation prompts (ISO 639-1: 'en' or 'pt') |
| workers | int | 2 | Number of parallel workers for QA generation |

Example Configuration:

```python
from arandu.config import QAConfig

config = QAConfig(
    provider="ollama",
    model_id="qwen3:14b",
    questions_per_document=15,
    language="pt",
)
```

CEPConfig

Configuration settings for the CEP (Cognitive Elicitation Pipeline) with Bloom's Taxonomy scaffolding and LLM-as-a-Judge validation.

Environment Prefix: ARANDU_CEP_

Module Toggles

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| enable_reasoning_traces | bool | True | Enable reasoning trace generation for answers |
| enable_validation | bool | True | Enable LLM-as-a-Judge validation (requires additional LLM calls) |

Bloom Scaffolding Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| bloom_levels | list[str] | ["remember", "understand", "analyze", "evaluate"] | Bloom levels for question generation |
| bloom_distribution | dict[str, float] | {"remember": 0.2, "understand": 0.3, "analyze": 0.3, "evaluate": 0.2} | Distribution per level (must sum to 1.0) |
| enable_scaffolding_context | bool | True | Pass previously generated QA pairs as context to higher Bloom levels |
| max_scaffolding_pairs | int | 10 | Max prior QA pairs to include as scaffolding context (min: 1, max: 50) |

Valid Bloom Levels: remember, understand, apply, analyze, evaluate, create
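How bloom_distribution translates into per-level question counts for a document can be sketched as follows. The largest-remainder rounding used here is an assumption for illustration, not necessarily what CEP does internally:

```python
# Illustrative: allocate questions_per_document across Bloom levels
# according to bloom_distribution. Rounding scheme is an assumption.
def allocate_questions(distribution, total):
    counts = {level: int(share * total) for level, share in distribution.items()}
    # Hand out any remainder to the levels with the largest fractional parts.
    remainder = total - sum(counts.values())
    by_fraction = sorted(
        distribution,
        key=lambda lv: distribution[lv] * total - counts[lv],
        reverse=True,
    )
    for level in by_fraction[:remainder]:
        counts[level] += 1
    return counts


counts = allocate_questions(
    {"remember": 0.2, "understand": 0.3, "analyze": 0.3, "evaluate": 0.2}, 10
)
# → {"remember": 2, "understand": 3, "analyze": 3, "evaluate": 2}
```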

Reasoning Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| max_hop_count | int | 3 | Maximum reasoning hops to detect for multi-hop questions (min: 1, max: 5) |

Module III - LLM-as-a-Judge Validation Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| validator_provider | str | "ollama" | LLM provider for validation: "openai", "ollama", "custom" |
| validator_model_id | str | "qwen3:14b" | Model ID for LLM-as-a-Judge validation |
| validator_temperature | float | 0.3 | Temperature for validator (low for consistent evaluation, range: 0.0-1.0) |
| validation_threshold | float | 0.6 | Minimum overall score to pass validation (range: 0.0-1.0) |

Scoring Weights

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| faithfulness_weight | float | 0.4 | Weight for faithfulness score in overall calculation (range: 0.0-1.0) |
| bloom_calibration_weight | float | 0.3 | Weight for Bloom calibration score in overall calculation (range: 0.0-1.0) |
| informativeness_weight | float | 0.3 | Weight for informativeness score in overall calculation (range: 0.0-1.0) |

Note: The three scoring weights (faithfulness_weight, bloom_calibration_weight, informativeness_weight) must sum to 1.0. A @model_validator enforces this constraint.
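The weights combine the three sub-scores into the overall score that is compared against validation_threshold. A minimal sketch of that calculation using the defaults above:

```python
# Illustrative: combine the three LLM-as-a-Judge sub-scores into the
# overall score checked against validation_threshold.
def overall_score(faithfulness, bloom_calibration, informativeness,
                  weights=(0.4, 0.3, 0.3)):
    w_f, w_b, w_i = weights
    assert abs(w_f + w_b + w_i - 1.0) < 1e-9, "weights must sum to 1.0"
    return w_f * faithfulness + w_b * bloom_calibration + w_i * informativeness


# 0.4*0.8 + 0.3*0.6 + 0.3*0.5 = 0.65 >= 0.6, so this QA pair passes.
passes = overall_score(0.8, 0.6, 0.5) >= 0.6
```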

Language Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| language | str | "pt" | Language for CEP prompts (ISO 639-1: 'pt' or 'en') |

Example Configuration:

```python
from arandu.config import CEPConfig

config = CEPConfig(
    bloom_levels=["remember", "understand", "analyze"],
    bloom_distribution={"remember": 0.3, "understand": 0.4, "analyze": 0.3},
    enable_scaffolding_context=True,
    validation_threshold=0.7,
)
```

KGConfig

Configuration settings for the knowledge graph construction pipeline.

Environment Prefix: ARANDU_KG_

LLM Provider Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| provider | str | "ollama" | LLM provider: "openai", "ollama", "custom" |
| model_id | str | "llama3.1:8b" | Model ID for KG construction |
| ollama_url | str | "http://localhost:11434/v1" | Ollama API base URL for KG construction |
| base_url | str \| None | None | Custom base URL for OpenAI-compatible endpoints |

Backend Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| backend | str | "atlas" | KGC backend: "atlas" (AutoSchemaKG) |
| backend_options | dict | {} | Backend-specific options (e.g., chunk_size, batch_size_triple, max_workers) |

Backend Validation: Must match pattern ^(atlas)$

LLM Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| temperature | float | 0.5 | Temperature for KG construction LLM (lower = more consistent, range: 0.0-2.0) |

Language Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| language | str | "pt" | Language code for extraction prompts (ISO 639-1): "pt", "en" |

Prompts are stored in prompts/kg/atlas/ using language-keyed JSON files. The atlas backend loads the appropriate language at runtime.

Output Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| output_dir | Path | Path("knowledge_graphs") | Output directory for knowledge graphs |

Example Configuration:

```python
from arandu.config import KGConfig

config = KGConfig(
    backend="atlas",
    provider="ollama",
    model_id="qwen3:14b",
    language="pt",
    temperature=0.5,
    backend_options={"chunk_size": 4096, "max_workers": 4},
)
```

Atlas Backend Options (passed via backend_options):

| Option | Default | Description |
| --- | --- | --- |
| batch_size_triple | 3 | Batch size for triple extraction |
| batch_size_concept | 16 | Batch size for concept generation |
| chunk_size | 8192 | Characters per text chunk |
| max_new_tokens | 2048 | Max tokens for LLM generation |
| include_concept | true | Whether to run concept generation |
| max_workers | 3 | Thread pool size for API calls |
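What chunk_size controls can be sketched as a plain fixed-width split of the input text before triple extraction. This is an assumption for illustration; the atlas backend may chunk more cleverly, e.g. on sentence boundaries:

```python
# Illustrative: split a document into chunk_size-character pieces
# before feeding them to triple extraction.
def chunk_text(text, chunk_size=8192):
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = chunk_text("a" * 20000, chunk_size=8192)  # → chunks of 8192, 8192, 3616 chars
```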

EvaluationConfig

Configuration settings for the evaluation pipeline.

Environment Prefix: ARANDU_EVAL_

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| metrics | list[str] | ["qa", "entity", "relation", "semantic"] | Metrics to compute |
| embedding_model | str | "sentence-transformers/all-MiniLM-L6-v2" | Sentence transformer model for semantic embeddings |

Valid Metrics: qa, entity, relation, semantic

  • qa - QA-based metrics (EM, F1, BLEU)
  • entity - Entity coverage metrics
  • relation - Relation density metrics
  • semantic - Semantic quality metrics
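For the qa metrics, F1 conventionally means SQuAD-style token-level F1. A sketch of that formulation (whether Arandu normalizes text in exactly this way is an assumption):

```python
# Illustrative token-level F1 between a predicted and reference answer,
# following the standard SQuAD-style formulation.
from collections import Counter


def token_f1(prediction, reference):
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```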
Output Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| output_dir | Path | Path("evaluation") | Output directory for evaluation reports |

Input Directories

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| qa_dir | Path | Path("qa_dataset") | Directory containing QA dataset |
| kg_dir | Path | Path("knowledge_graphs") | Directory containing knowledge graphs |
| results_dir | Path | Path("results") | Directory containing transcription results |

Example Configuration:

```python
from arandu.config import EvaluationConfig

config = EvaluationConfig(
    metrics=["qa", "entity", "semantic"],
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)
```

LLMConfig

Shared LLM configuration for API keys and base-URL settings used across all pipelines.

Environment Prefix: None (uses field aliases)

| Setting | Type | Default | Alias | Description |
| --- | --- | --- | --- | --- |
| openai_api_key | str \| None | None | OPENAI_API_KEY | OpenAI API key |
| base_url | str \| None | None | ARANDU_LLM_BASE_URL | Custom base URL for OpenAI-compatible endpoints |

Example Configuration:

```python
from arandu.config import LLMConfig

config = LLMConfig()
# Loaded from OPENAI_API_KEY and ARANDU_LLM_BASE_URL env vars
```

Environment Variables:

```shell
export OPENAI_API_KEY=sk-...
export ARANDU_LLM_BASE_URL=https://my-custom-endpoint/v1
```

ResultsConfig

Configuration for versioned results management.

Environment Prefix: ARANDU_RESULTS_

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| base_dir | Path | Path("./results") | Base directory for versioned results |
| enable_versioning | bool | True | Enable versioned result directories |
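A sketch of what versioned result directories typically mean: one timestamped subdirectory per run under base_dir. The timestamp naming scheme below is an assumption, not the documented Arandu layout:

```python
# Illustrative: build a per-run output directory under base_dir when
# versioning is enabled. Naming scheme is an assumption.
from datetime import datetime
from pathlib import Path


def run_output_dir(base_dir, enable_versioning=True, now=None):
    base = Path(base_dir)
    if not enable_versioning:
        return base
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return base / stamp
```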

Example Configuration:

```python
from pathlib import Path

from arandu.config import ResultsConfig

config = ResultsConfig(
    base_dir=Path("/data/results"),
    enable_versioning=True,
)
```

TranscriptionQualityConfig

Configuration for transcription quality validation with heuristic quality checks.

Environment Prefix: ARANDU_QUALITY_

General Settings

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | True | Enable transcription quality validation |
| quality_threshold | float | 0.5 | Minimum quality score to mark transcription as valid (range: 0.0-1.0) |
| expected_language | str | "pt" | Expected language code (e.g., 'pt', 'en') |

Scoring Weights

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| script_match_weight | float | 0.35 | Weight for script/charset match check |
| repetition_weight | float | 0.30 | Weight for repetition detection |
| segment_quality_weight | float | 0.20 | Weight for segment pattern analysis |
| content_density_weight | float | 0.15 | Weight for content density check |

Note: The four dimension weights must sum to 1.0. A @model_validator enforces this constraint at initialization.

Thresholds

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| max_non_latin_ratio | float | 0.1 | Maximum ratio of non-Latin characters for Latin languages |
| max_word_repetition_ratio | float | 0.15 | Maximum ratio of most repeated word |
| max_phrase_repetition_count | int | 4 | Maximum allowed repetitions of same phrase |
| suspicious_uniform_intervals | int | 5 | Number of consecutive uniform 1-second intervals to flag |
| min_words_per_minute | float | 30.0 | Minimum words per minute threshold |
| max_words_per_minute | float | 300.0 | Maximum words per minute threshold |
| max_empty_segment_ratio | float | 0.2 | Maximum ratio of empty segments before flagging |
| uniform_interval_tolerance | float | 0.1 | Tolerance (±seconds) for detecting uniform 1-second intervals |
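As an illustration of the kind of heuristic max_word_repetition_ratio drives, the sketch below flags transcripts dominated by a single repeated word, a common failure mode in Whisper output. Arandu's exact tokenization and normalization are assumptions here:

```python
# Illustrative: ratio of the most frequent word to the total word count.
# A value above max_word_repetition_ratio suggests a degenerate transcript.
from collections import Counter


def word_repetition_ratio(text):
    words = text.lower().split()
    if not words:
        return 0.0
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words)


# 5 of 6 words are "ok": ratio ~0.83, well above the 0.15 default.
suspicious = word_repetition_ratio("ok ok ok ok ok so") > 0.15
```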

Example Configuration:

```python
from arandu.config import TranscriptionQualityConfig

config = TranscriptionQualityConfig(
    enabled=True,
    quality_threshold=0.6,
    expected_language="pt",
)
```

See also: Transcription Validation Guide for full usage details.


Environment Variables

Configuration settings are loaded from environment variables with config-specific prefixes.

| Config Class | Prefix | Example |
| --- | --- | --- |
| TranscriberConfig | ARANDU_ | ARANDU_MODEL_ID |
| QAConfig | ARANDU_QA_ | ARANDU_QA_PROVIDER |
| CEPConfig | ARANDU_CEP_ | ARANDU_CEP_ENABLE_VALIDATION |
| KGConfig | ARANDU_KG_ | ARANDU_KG_PROVIDER |
| EvaluationConfig | ARANDU_EVAL_ | ARANDU_EVAL_METRICS |
| LLMConfig | (No prefix) | OPENAI_API_KEY, ARANDU_LLM_BASE_URL |
| ResultsConfig | ARANDU_RESULTS_ | ARANDU_RESULTS_BASE_DIR |
| TranscriptionQualityConfig | ARANDU_QUALITY_ | ARANDU_QUALITY_ENABLED |
Environment variables follow the pattern:

```shell
<PREFIX><SETTING_NAME>=<value>
```
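The mapping from a prefixed setting to its environment variable name is mechanical: the class prefix plus the upper-cased field name. A trivial sketch:

```python
# Illustrative: derive the environment variable name for a setting.
def env_var_name(prefix, field_name):
    return prefix + field_name.upper()


env_var_name("ARANDU_QA_", "questions_per_document")  # → "ARANDU_QA_QUESTIONS_PER_DOCUMENT"
```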

TranscriberConfig (ARANDU_):

```shell
export ARANDU_MODEL_ID=openai/whisper-large-v3
export ARANDU_FORCE_CPU=false
export ARANDU_WORKERS=4
export ARANDU_RETRY_DELAY=1.0
```

QAConfig (ARANDU_QA_):

```shell
export ARANDU_QA_PROVIDER=openai
export ARANDU_QA_MODEL_ID=gpt-4o-mini
export ARANDU_QA_QUESTIONS_PER_DOCUMENT=15
export ARANDU_QA_OLLAMA_URL=http://localhost:11434/v1
export ARANDU_QA_LANGUAGE=pt
```

CEPConfig (ARANDU_CEP_):

```shell
export ARANDU_CEP_ENABLE_VALIDATION=true
export ARANDU_CEP_BLOOM_LEVELS=remember,understand,analyze
export ARANDU_CEP_VALIDATION_THRESHOLD=0.7
export ARANDU_CEP_VALIDATOR_PROVIDER=ollama
```

KGConfig (ARANDU_KG_):

```shell
export ARANDU_KG_PROVIDER=openai
export ARANDU_KG_MODEL_ID=gpt-4o
export ARANDU_KG_MERGE_GRAPHS=true
export ARANDU_KG_LANGUAGE=pt
export ARANDU_KG_OLLAMA_URL=http://localhost:11434/v1
```

EvaluationConfig (ARANDU_EVAL_):

```shell
export ARANDU_EVAL_METRICS=qa,entity,relation,semantic
export ARANDU_EVAL_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
```

LLMConfig (No prefix, uses aliases):

```shell
export OPENAI_API_KEY=sk-...
export ARANDU_LLM_BASE_URL=https://my-custom-endpoint/v1
```

ResultsConfig (ARANDU_RESULTS_):

```shell
export ARANDU_RESULTS_BASE_DIR=/data/results
export ARANDU_RESULTS_ENABLE_VERSIONING=true
```

TranscriptionQualityConfig (ARANDU_QUALITY_):

```shell
export ARANDU_QUALITY_ENABLED=true
export ARANDU_QUALITY_QUALITY_THRESHOLD=0.6
export ARANDU_QUALITY_EXPECTED_LANGUAGE=pt
```

Sensitive values should be set as environment variables only (never commit to git):

```shell
export OPENAI_API_KEY=sk-...
export ARANDU_LLM_BASE_URL=https://my-custom-endpoint/v1
```

These can also be set in the .env file:

```shell
# .env file
OPENAI_API_KEY=sk-...
ARANDU_LLM_BASE_URL=https://my-custom-endpoint/v1
```

Note: The .env file should be added to .gitignore.


Configuration Examples

Example 1: Local Development with Ollama

.env file:

```shell
# Transcription
ARANDU_MODEL_ID=openai/whisper-large-v3
ARANDU_WORKERS=2
ARANDU_RETRY_DELAY=1.0

# QA Generation
ARANDU_QA_PROVIDER=ollama
ARANDU_QA_MODEL_ID=qwen3:14b
ARANDU_QA_OLLAMA_URL=http://localhost:11434/v1
ARANDU_QA_QUESTIONS_PER_DOCUMENT=10
ARANDU_QA_LANGUAGE=pt

# CEP (Cognitive Elicitation Pipeline)
ARANDU_CEP_ENABLE_VALIDATION=true
ARANDU_CEP_VALIDATION_THRESHOLD=0.6

# KG Construction
ARANDU_KG_PROVIDER=ollama
ARANDU_KG_MODEL_ID=llama3.1:8b
ARANDU_KG_OLLAMA_URL=http://localhost:11434/v1
ARANDU_KG_MERGE_GRAPHS=true
ARANDU_KG_OUTPUT_FORMAT=graphml
ARANDU_KG_LANGUAGE=pt

# Evaluation
ARANDU_EVAL_METRICS=qa,entity,relation,semantic

# Transcription Quality Validation
ARANDU_QUALITY_ENABLED=true
ARANDU_QUALITY_QUALITY_THRESHOLD=0.5
```

CLI Usage:

```shell
# QA generation (uses .env settings)
arandu generate-cep-qa results/

# Override specific settings
arandu generate-cep-qa results/ --questions 15 --provider ollama
```

Example 2: Production with OpenAI

.env file:

```shell
# API Keys
OPENAI_API_KEY=sk-your-key-here

# QA Generation with OpenAI
ARANDU_QA_PROVIDER=openai
ARANDU_QA_MODEL_ID=gpt-4o-mini
ARANDU_QA_QUESTIONS_PER_DOCUMENT=12
ARANDU_QA_TEMPERATURE=0.7
ARANDU_QA_LANGUAGE=en

# CEP with OpenAI
ARANDU_CEP_ENABLE_VALIDATION=true
ARANDU_CEP_VALIDATOR_PROVIDER=openai
ARANDU_CEP_VALIDATOR_MODEL_ID=gpt-4o-mini
ARANDU_CEP_VALIDATION_THRESHOLD=0.7

# KG Construction with OpenAI
ARANDU_KG_PROVIDER=openai
ARANDU_KG_MODEL_ID=gpt-4o
ARANDU_KG_TEMPERATURE=0.5
ARANDU_KG_MERGE_GRAPHS=true

# Results versioning
ARANDU_RESULTS_BASE_DIR=/data/transcriptions
ARANDU_RESULTS_ENABLE_VERSIONING=true

# Output directories
ARANDU_QA_OUTPUT_DIR=/data/qa_dataset
ARANDU_KG_OUTPUT_DIR=/data/knowledge_graphs
ARANDU_EVAL_OUTPUT_DIR=/data/evaluation
```

Example 3: Hybrid Approach (OpenAI + Ollama)


.env file:

```shell
# API Keys
OPENAI_API_KEY=sk-your-key-here

# QA with OpenAI (higher quality)
ARANDU_QA_PROVIDER=openai
ARANDU_QA_MODEL_ID=gpt-4o
ARANDU_QA_QUESTIONS_PER_DOCUMENT=15

# KG with Ollama (cost-effective)
ARANDU_KG_PROVIDER=ollama
ARANDU_KG_MODEL_ID=llama3.1:8b
ARANDU_KG_OLLAMA_URL=http://localhost:11434/v1

# Workers
ARANDU_WORKERS=4
```

Example 4: SLURM Cluster Deployment

SLURM script (run_qa_generation.slurm):

```shell
#!/bin/bash
#SBATCH --job-name=arandu-qa
#SBATCH --partition=grace
#SBATCH --cpus-per-task=16

# Set configuration via environment
export ARANDU_QA_PROVIDER=ollama
export ARANDU_QA_MODEL_ID=qwen3:14b
export ARANDU_QA_OLLAMA_URL=http://localhost:11434/v1
export ARANDU_QA_WORKERS=8
export ARANDU_QA_QUESTIONS_PER_DOCUMENT=12

# Use $SCRATCH for I/O
export ARANDU_RESULTS_BASE_DIR=$SCRATCH/results
export ARANDU_QA_OUTPUT_DIR=$SCRATCH/qa_dataset

# Run via Docker
source scripts/slurm/job_common.sh
docker compose --profile qa up arandu-qa --abort-on-container-exit
```

Example 5: Docker Compose Overrides

docker-compose.override.yml (local overrides):

```yaml
version: '3.8'
services:
  arandu-qa:
    environment:
      - ARANDU_QA_PROVIDER=ollama
      - ARANDU_QA_MODEL_ID=qwen3:14b
      - ARANDU_QA_OLLAMA_URL=http://host.docker.internal:11434/v1
      - ARANDU_QA_WORKERS=4
      - ARANDU_QA_QUESTIONS_PER_DOCUMENT=10
    volumes:
      - ./results:/app/results:ro
      - ./qa_dataset:/app/qa_dataset:rw
  arandu-kg:
    environment:
      - ARANDU_KG_PROVIDER=ollama
      - ARANDU_KG_MODEL_ID=llama3.1:8b
      - ARANDU_KG_OLLAMA_URL=http://host.docker.internal:11434/v1
      - ARANDU_KG_MERGE_GRAPHS=true
      - ARANDU_KG_WORKERS=2
    volumes:
      - ./results:/app/results:ro
      - ./knowledge_graphs:/app/knowledge_graphs:rw
```

Example 6: CEP with Advanced Bloom Scaffolding


.env file:

```shell
# CEP Settings
ARANDU_CEP_ENABLE_REASONING_TRACES=true
ARANDU_CEP_ENABLE_VALIDATION=true
ARANDU_CEP_ENABLE_SCAFFOLDING_CONTEXT=true
ARANDU_CEP_MAX_SCAFFOLDING_PAIRS=15

# Bloom Levels (custom distribution)
ARANDU_CEP_BLOOM_LEVELS=remember,understand,analyze,evaluate

# LLM-as-a-Judge Validation
ARANDU_CEP_VALIDATOR_PROVIDER=ollama
ARANDU_CEP_VALIDATOR_MODEL_ID=qwen3:14b
ARANDU_CEP_VALIDATOR_TEMPERATURE=0.3
ARANDU_CEP_VALIDATION_THRESHOLD=0.7

# Scoring weights (must sum to 1.0)
ARANDU_CEP_FAITHFULNESS_WEIGHT=0.4
ARANDU_CEP_BLOOM_CALIBRATION_WEIGHT=0.3
ARANDU_CEP_INFORMATIVENESS_WEIGHT=0.3
```

The configuration system includes validation rules enforced by Pydantic.

```python
# From QAConfig
questions_per_document: int = Field(
    default=10,
    ge=1,   # Must be >= 1
    le=50,  # Must be <= 50
)

# From KGConfig
output_format: str = Field(
    default="graphml",
    pattern="^(graphml|json)$",  # Must match pattern
)

# From CEPConfig
@field_validator("bloom_levels")
@classmethod
def validate_bloom_levels(cls, v: list[str]) -> list[str]:
    valid_levels = {"remember", "understand", "apply", "analyze", "evaluate", "create"}
    for level in v:
        if level not in valid_levels:
            raise ValueError(f"Invalid Bloom level: {level!r}")
    return v

# From TranscriptionQualityConfig
@model_validator(mode="after")
def validate_scoring_weights(self) -> "TranscriptionQualityConfig":
    total = (
        self.script_match_weight
        + self.repetition_weight
        + self.segment_quality_weight
        + self.content_density_weight
    )
    if not (0.99 <= total <= 1.01):
        raise ValueError(f"Quality scoring weights must sum to 1.0, got {total:.3f}")
    return self
```

Pydantic Settings loads configuration in the following priority order (highest to lowest):

  1. Command-line arguments (highest priority):

     ```shell
     arandu generate-cep-qa results/ --provider ollama --questions 15
     ```

  2. Environment variables:

     ```shell
     export ARANDU_QA_PROVIDER=openai
     export ARANDU_QA_MODEL_ID=gpt-4o-mini
     ```

  3. .env file in project root:

     ```shell
     ARANDU_QA_PROVIDER=ollama
     ARANDU_QA_MODEL_ID=qwen3:14b
     ```

  4. Default values in config.py (lowest priority):

     ```python
     provider: str = Field(default="ollama")
     model_id: str = Field(default="qwen3:14b")
     ```

Result: Command-line arguments override everything else. Environment variables override .env file and defaults.


Best Practices

  1. Use .env for local development

    • Easy to manage and version control (with .env.example)
    • Keep .env in .gitignore
    • Commit .env.example with safe defaults
  2. Use environment variables for production

    • Better for CI/CD and containerized environments
    • Secrets management integration (e.g., Kubernetes Secrets)
    • No risk of committing sensitive data
  3. Use command-line arguments for one-off overrides

    • Quick testing without changing configuration
    • Scripting and automation
    • Debugging specific settings
  4. Never commit API keys

    • Always use environment variables or secrets management
    • Add .env to .gitignore
    • Use .env.example for documentation
  5. Document configuration changes

    • Update .env.example when adding new settings
    • Add comments explaining non-obvious settings
    • Document valid values and ranges
  6. Validate configuration early

    • Let Pydantic validation catch errors at startup
    • Use type hints and constraints
    • Write tests for custom validators

.env.example (committed to git):

```shell
# ============================================================================
# Arandu Configuration Template
# Copy this file to .env and fill in your values
# ============================================================================

# ============================================================================
# TranscriberConfig (ARANDU_)
# ============================================================================
# Model settings
ARANDU_MODEL_ID=openai/whisper-large-v3
ARANDU_LANGUAGE= # Optional: pt, en, etc. (auto-detect if not set)
ARANDU_CHUNK_LENGTH_S=30
ARANDU_STRIDE_LENGTH_S=5
# Hardware settings
ARANDU_FORCE_CPU=false
ARANDU_QUANTIZE=false
# Google Drive settings
ARANDU_CREDENTIALS=credentials.json
ARANDU_TOKEN=token.json
# Batch processing
ARANDU_WORKERS=2
# Paths
ARANDU_INPUT_DIR=./input
ARANDU_RESULTS_DIR=./results
ARANDU_HF_CACHE_DIR=./cache/huggingface
# Processing
ARANDU_MAX_RETRIES=3
ARANDU_RETRY_DELAY=1.0

# ============================================================================
# QAConfig (ARANDU_QA_)
# ============================================================================
# LLM Provider
ARANDU_QA_PROVIDER=ollama # openai, ollama, custom
ARANDU_QA_MODEL_ID=qwen3:14b
ARANDU_QA_OLLAMA_URL=http://localhost:11434/v1
# ARANDU_QA_BASE_URL= # For custom OpenAI-compatible endpoints
# Generation
ARANDU_QA_QUESTIONS_PER_DOCUMENT=10
ARANDU_QA_TEMPERATURE=0.7
ARANDU_QA_MAX_TOKENS=2048
# Output
ARANDU_QA_OUTPUT_DIR=qa_dataset
# Language
ARANDU_QA_LANGUAGE=pt # pt or en
ARANDU_QA_WORKERS=2

# ============================================================================
# CEPConfig (ARANDU_CEP_)
# ============================================================================
# Module toggles
ARANDU_CEP_ENABLE_REASONING_TRACES=true
ARANDU_CEP_ENABLE_VALIDATION=true
# Bloom scaffolding
ARANDU_CEP_BLOOM_LEVELS=remember,understand,analyze,evaluate
ARANDU_CEP_ENABLE_SCAFFOLDING_CONTEXT=true
ARANDU_CEP_MAX_SCAFFOLDING_PAIRS=10
# Reasoning
ARANDU_CEP_MAX_HOP_COUNT=3
# LLM-as-a-Judge validation
ARANDU_CEP_VALIDATOR_PROVIDER=ollama
ARANDU_CEP_VALIDATOR_MODEL_ID=qwen3:14b
ARANDU_CEP_VALIDATOR_TEMPERATURE=0.3
ARANDU_CEP_VALIDATION_THRESHOLD=0.6
# Scoring weights (must sum to 1.0)
ARANDU_CEP_FAITHFULNESS_WEIGHT=0.4
ARANDU_CEP_BLOOM_CALIBRATION_WEIGHT=0.3
ARANDU_CEP_INFORMATIVENESS_WEIGHT=0.3
# Language
ARANDU_CEP_LANGUAGE=pt

# ============================================================================
# KGConfig (ARANDU_KG_)
# ============================================================================
# LLM Provider
ARANDU_KG_PROVIDER=ollama # openai, ollama, custom
ARANDU_KG_MODEL_ID=llama3.1:8b
ARANDU_KG_OLLAMA_URL=http://localhost:11434/v1
# ARANDU_KG_BASE_URL= # For custom OpenAI-compatible endpoints
# Graph settings
ARANDU_KG_MERGE_GRAPHS=true
ARANDU_KG_OUTPUT_FORMAT=graphml # graphml or json
ARANDU_KG_SCHEMA_MODE=dynamic # dynamic or predefined
# LLM settings
ARANDU_KG_TEMPERATURE=0.5
# Language and prompts
ARANDU_KG_LANGUAGE=pt
ARANDU_KG_PROMPT_PATH=prompts/pt_prompts.json
# Output
ARANDU_KG_OUTPUT_DIR=knowledge_graphs
ARANDU_KG_WORKERS=2

# ============================================================================
# EvaluationConfig (ARANDU_EVAL_)
# ============================================================================
# Metrics
ARANDU_EVAL_METRICS=qa,entity,relation,semantic
ARANDU_EVAL_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
# Output
ARANDU_EVAL_OUTPUT_DIR=evaluation
# Input directories (optional overrides)
# ARANDU_EVAL_QA_DIR=qa_dataset
# ARANDU_EVAL_KG_DIR=knowledge_graphs
# ARANDU_EVAL_RESULTS_DIR=results

# ============================================================================
# LLMConfig (No prefix - uses aliases)
# ============================================================================
# API Keys (SENSITIVE - do not commit)
# OPENAI_API_KEY=sk-...
# Custom endpoints
# ARANDU_LLM_BASE_URL=http://localhost:11434/v1

# ============================================================================
# ResultsConfig (ARANDU_RESULTS_)
# ============================================================================
ARANDU_RESULTS_BASE_DIR=./results
ARANDU_RESULTS_ENABLE_VERSIONING=true

# ============================================================================
# TranscriptionQualityConfig (ARANDU_QUALITY_)
# ============================================================================
# General
ARANDU_QUALITY_ENABLED=true
ARANDU_QUALITY_QUALITY_THRESHOLD=0.5
ARANDU_QUALITY_EXPECTED_LANGUAGE=pt
# Scoring weights (must sum to 1.0)
ARANDU_QUALITY_SCRIPT_MATCH_WEIGHT=0.35
ARANDU_QUALITY_REPETITION_WEIGHT=0.30
ARANDU_QUALITY_SEGMENT_QUALITY_WEIGHT=0.20
ARANDU_QUALITY_CONTENT_DENSITY_WEIGHT=0.15
# Thresholds (advanced - usually keep defaults)
# ARANDU_QUALITY_MAX_NON_LATIN_RATIO=0.1
# ARANDU_QUALITY_MAX_WORD_REPETITION_RATIO=0.15
# ARANDU_QUALITY_MAX_PHRASE_REPETITION_COUNT=4
# ARANDU_QUALITY_SUSPICIOUS_UNIFORM_INTERVALS=5
# ARANDU_QUALITY_MIN_WORDS_PER_MINUTE=30.0
# ARANDU_QUALITY_MAX_WORDS_PER_MINUTE=300.0
# ARANDU_QUALITY_MAX_EMPTY_SEGMENT_RATIO=0.2
# ARANDU_QUALITY_UNIFORM_INTERVAL_TOLERANCE=0.1
```

Document Version: 2.0
Last Updated: 2025-01-24
Changes: Complete rewrite to reflect actual implementation with 8 separate config classes