# CEP QA Generation

Generate cognitively calibrated question-answer pairs using the Cognitive Elicitation Pipeline (CEP) with Bloom’s Taxonomy scaffolding and LLM-as-a-Judge validation.

## Overview

The CEP pipeline extends the standard QA generation with three specialized modules:
- Module I: Bloom Scaffolding - Generates questions distributed across Bloom’s taxonomy levels
- Module II: Reasoning & Grounding - Enriches higher-level questions with reasoning traces
- Module III: LLM-as-a-Judge - Validates QA pairs for faithfulness, Bloom calibration, and informativeness
## Key Features

- Cognitive Scaffolding: Questions progress from basic recall to advanced analysis
- Bloom Taxonomy Levels: remember, understand, apply, analyze, evaluate, create
- Reasoning Traces: Explicit logical connections for complex questions
- Multi-hop Detection: Identifies questions that require connecting information from distant parts of the text
- Tacit Knowledge Extraction: Surfaces implicit domain knowledge
- Quality Validation: LLM-based scoring for faithfulness and informativeness
## Prerequisites

- Transcription results in `results/` directory
- Docker with Compose v2
- LLM provider (Ollama recommended)
## Quick Start

### Using Docker Compose

```bash
# Start CEP QA generation with Ollama sidecar
docker compose --profile cep up
```

### Using SLURM

```bash
# Grace partition (NVIDIA L40S)
sbatch scripts/slurm/cep/grace.slurm

# Tupi partition (NVIDIA RTX 4090)
sbatch scripts/slurm/cep/tupi.slurm

# Sirius partition (AMD, CPU mode)
sbatch scripts/slurm/cep/sirius.slurm
```

### Using CLI

```bash
# Basic usage (validation enabled by default)
arandu generate-cep-qa results/ --output-dir cep_dataset/

# Disable validation for faster processing
arandu generate-cep-qa results/ --no-validate --output-dir cep_dataset/

# Export to JSONL format
arandu generate-cep-qa results/ --jsonl --output-dir cep_dataset/

# With pipeline ID for tracking
arandu generate-cep-qa results/ --id etno-project-001
```

## Configuration

### Environment Variables
| Variable | Default | Description |
|---|---|---|
| `ARANDU_QA_PROVIDER` | `ollama` | LLM provider: `openai`, `ollama`, `custom` |
| `ARANDU_QA_MODEL_ID` | `qwen3:14b` | Model for QA generation |
| `ARANDU_QA_OLLAMA_URL` | `http://ollama:11434/v1` | Ollama API URL |
| `ARANDU_QA_QUESTIONS_PER_DOCUMENT` | `10` | QA pairs per document |
| `ARANDU_QA_TEMPERATURE` | `0.7` | LLM temperature (0.0-2.0) |
| `ARANDU_QA_WORKERS` | `2` | Parallel workers |
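Since all generation settings are plain environment variables, a client script can read them directly. A minimal sketch with the documented defaults (the `load_qa_settings` helper is illustrative, not part of the arandu API):

```python
import os

def load_qa_settings() -> dict:
    """Read the QA generation variables, falling back to the documented defaults."""
    return {
        "provider": os.getenv("ARANDU_QA_PROVIDER", "ollama"),
        "model_id": os.getenv("ARANDU_QA_MODEL_ID", "qwen3:14b"),
        "ollama_url": os.getenv("ARANDU_QA_OLLAMA_URL", "http://ollama:11434/v1"),
        "questions_per_document": int(os.getenv("ARANDU_QA_QUESTIONS_PER_DOCUMENT", "10")),
        "temperature": float(os.getenv("ARANDU_QA_TEMPERATURE", "0.7")),
        "workers": int(os.getenv("ARANDU_QA_WORKERS", "2")),
    }
```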
### CEP-Specific Variables

| Variable | Default | Description |
|---|---|---|
| `ARANDU_CEP_ENABLE_VALIDATION` | `true` | Enable LLM-as-a-Judge validation |
| `ARANDU_CEP_VALIDATOR_PROVIDER` | `ollama` | Validator LLM provider |
| `ARANDU_CEP_VALIDATOR_MODEL_ID` | `qwen3:14b` | Validator model |
| `ARANDU_CEP_LANGUAGE` | `pt` | Prompt language (`pt` or `en`) |
| `ARANDU_CEP_ENABLE_SCAFFOLDING_CONTEXT` | `true` | Pass prior QA pairs to higher Bloom levels |
| `ARANDU_CEP_MAX_SCAFFOLDING_PAIRS` | `10` | Max prior QA pairs to include as context |
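The last two variables bound how much earlier material a higher-level prompt sees. A sketch of how that truncation might behave, using a hypothetical helper rather than the pipeline's actual code:

```python
def scaffolding_context(prior_pairs: list[dict],
                        enabled: bool = True,
                        max_pairs: int = 10) -> list[dict]:
    """Return prior QA pairs to prepend as context for higher Bloom levels,
    keeping only the most recent `max_pairs` (illustrative, not arandu's code)."""
    if not enabled:
        return []
    return prior_pairs[-max_pairs:]
```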
### Bloom Distribution

Default distribution allocates questions across cognitive levels:

| Level | Default Weight | Description |
|---|---|---|
| `remember` | 20% | Recall explicit facts |
| `understand` | 30% | Explain and interpret concepts |
| `analyze` | 30% | Identify relationships and patterns |
| `evaluate` | 20% | Make judgments and justify decisions |
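To see how the weights become per-document counts, here is an illustrative largest-remainder allocation; the pipeline's actual rounding rule is not documented, so treat this as a sketch:

```python
import math

def allocate_questions(weights: dict[str, float], total: int) -> dict[str, int]:
    """Split `total` questions across Bloom levels by weight,
    using largest-remainder rounding so the counts always sum to `total`."""
    raw = {level: w * total for level, w in weights.items()}
    counts = {level: math.floor(v) for level, v in raw.items()}
    remainder = total - sum(counts.values())
    # Hand leftover questions to the levels with the largest fractional parts
    for level in sorted(raw, key=lambda l: raw[l] - counts[l], reverse=True)[:remainder]:
        counts[level] += 1
    return counts

# Default distribution, 10 questions per document
print(allocate_questions(
    {"remember": 0.2, "understand": 0.3, "analyze": 0.3, "evaluate": 0.2}, 10
))  # {'remember': 2, 'understand': 3, 'analyze': 3, 'evaluate': 2}
```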
Customize via CLI:

```bash
arandu generate-cep-qa results/ \
  --bloom-dist "remember:0.1,understand:0.2,analyze:0.4,evaluate:0.3"
```

### Example .env Configuration

```bash
# QA Generation Settings
ARANDU_QA_PROVIDER=ollama
ARANDU_QA_MODEL_ID=qwen3:14b
ARANDU_QA_QUESTIONS_PER_DOCUMENT=12
ARANDU_QA_TEMPERATURE=0.7
ARANDU_QA_WORKERS=4

# CEP-Specific Settings
ARANDU_CEP_ENABLE_VALIDATION=true
ARANDU_CEP_VALIDATOR_MODEL_ID=qwen3:14b
ARANDU_CEP_LANGUAGE=pt

# Directories
ARANDU_RESULTS_DIR=./results
ARANDU_CEP_DIR=./cep_dataset
```

## Usage Examples
### Basic CEP Generation

```bash
# Default configuration (Portuguese prompts, validation enabled)
docker compose --profile cep up
```

### Without LLM-as-a-Judge Validation

```bash
# Disable validation for faster processing
ARANDU_CEP_ENABLE_VALIDATION=false docker compose --profile cep up
```

### GPU-Accelerated Ollama

```bash
# Use GPU profile for faster inference
docker compose --profile cep-gpu up
```

### English Language Prompts

```bash
# Generate QA pairs with English prompts
ARANDU_CEP_LANGUAGE=en docker compose --profile cep up
```

### Export to JSONL

```bash
# Generate and export to JSONL format for training
arandu generate-cep-qa results/ --jsonl --output-dir cep_dataset/
```

## Output Format

CEP records are saved as JSON files in `cep_dataset/`:
```
cep_dataset/
├── <file_id_1>_cep_qa.json
├── <file_id_1>_cep_qa.jsonl   # JSONL export for this record
├── <file_id_2>_cep_qa.json
├── <file_id_2>_cep_qa.jsonl   # JSONL export for this record
└── cep_checkpoint.json        # For resumption
```

### QARecordCEP Schema
```json
{
  "source_file_id": "1abc123xyz",
  "source_filename": "interview_2023.mp3",
  "transcription_text": "O pescador contou que quando o rio sobe...",
  "qa_pairs": [
    {
      "question": "O que o pescador faz quando o rio sobe?",
      "answer": "Guarda o barco para evitar perda",
      "context": "Se o rio sobe rápido, guardo o barco para evitar perda",
      "question_type": "factual",
      "confidence": 0.92,
      "bloom_level": "remember",
      "reasoning_trace": null,
      "is_multi_hop": false,
      "hop_count": null,
      "tacit_inference": null
    },
    {
      "question": "Por que o pescador guarda o barco quando o rio sobe?",
      "answer": "Para evitar perda do equipamento devido ao risco de enchente",
      "context": "Se o rio sobe rápido, guardo o barco para evitar perda",
      "question_type": "conceptual",
      "confidence": 0.88,
      "bloom_level": "analyze",
      "reasoning_trace": "Fato: rio sobe → Ação: guardar barco → Razão: evitar perda",
      "is_multi_hop": false,
      "hop_count": null,
      "tacit_inference": "Subida rápida do rio indica risco iminente de enchente",
      "generation_prompt": "Gere uma pergunta de análise que identifique relações de causa e efeito..."
    }
  ],
  "model_id": "qwen3:14b",
  "provider": "ollama",
  "generation_timestamp": "2026-02-03T10:30:00Z",
  "total_pairs": 12,
  "validated_pairs": 10,
  "bloom_distribution": {
    "remember": 3,
    "understand": 4,
    "analyze": 3,
    "evaluate": 2
  },
  "validation_summary": {
    "avg_faithfulness": 0.85,
    "avg_bloom_calibration": 0.78,
    "avg_informativeness": 0.72,
    "avg_overall_score": 0.79,
    "validation_pass_rate": 0.83
  },
  "cep_version": "1.0"
}
```

### JSONL Export Format
Each line contains one QA pair:

```json
{"question": "O que aconteceu?", "answer": "...", "context": "...", "bloom_level": "remember", "confidence": 0.92}
{"question": "Por que isso aconteceu?", "answer": "...", "context": "...", "bloom_level": "analyze", "reasoning_trace": "..."}
```
## Bloom Taxonomy Levels

### remember

- Description: Recall explicit facts from the text
- Question Starters: O que, Quem, Quando, Onde, Qual
- Example: “O que o pescador faz quando o rio sobe?”
### understand

- Description: Explain, interpret, or summarize concepts
- Question Starters: Explique, Descreva, Por que (basic), Como funciona
- Example: “Explique o processo de preparação do barco.”
### apply

- Description: Use knowledge in new situations
- Question Starters: Como você aplicaria, O que aconteceria se
- Example: “Como o pescador aplicaria este conhecimento em outra situação?”
### analyze

- Description: Identify relationships, patterns, and connections
- Question Starters: Por que, Quais são as causas, Compare, Relacione
- Example: “Por que a subida do rio representa um risco para o equipamento?”
### evaluate

- Description: Make judgments and justify decisions
- Question Starters: Avalie, Justifique, Qual a importância, É adequado
- Example: “A decisão de guardar o barco foi acertada? Justifique.”
### create

- Description: Propose solutions or create something new
- Question Starters: Proponha, Como melhorar, Elabore, Sugira
- Example: “Proponha uma solução alternativa para proteger o barco.”
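The starter lists above can double as a cheap sanity check on generated questions. A hypothetical helper, not part of arandu; starters overlap across levels (e.g. “Por que”), so treat this as a weak heuristic only:

```python
# Portuguese question starters per Bloom level, abbreviated from the lists above
STARTERS = {
    "remember": ("O que", "Quem", "Quando", "Onde", "Qual"),
    "understand": ("Explique", "Descreva", "Como funciona"),
    "apply": ("Como você aplicaria", "O que aconteceria se"),
    "analyze": ("Por que", "Quais são as causas", "Compare", "Relacione"),
    "evaluate": ("Avalie", "Justifique", "Qual a importância", "É adequado"),
    "create": ("Proponha", "Como melhorar", "Elabore", "Sugira"),
}

def starter_matches(question: str, bloom_level: str) -> bool:
    """Heuristic: does the question open with a starter typical of its declared level?"""
    return question.startswith(STARTERS.get(bloom_level, ()))
```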
## Validation Criteria

When validation is enabled, each QA pair is scored on three criteria:

### Faithfulness (40% weight)

Is the answer grounded in the provided context?
| Score | Description |
|---|---|
| 1.0 | Answer completely grounded in text |
| 0.8 | Answer well grounded with minimal inferences |
| 0.6 | Answer mostly grounded with some non-trivial inferences |
| 0.4 | Answer partially grounded with significant inferences |
| 0.2 | Answer weakly grounded |
| 0.0 | Answer not grounded, hallucinated, or contradictory |
### Bloom Calibration (30% weight)

Does the question match the proposed cognitive level?
| Score | Description |
|---|---|
| 1.0 | Perfectly calibrated - requires exactly the declared level |
| 0.8 | Well calibrated - requires predominantly the declared level |
| 0.6 | Reasonably calibrated with some overlap |
| 0.4 | Undercalibrated - requires lower cognitive level |
| 0.0 | Completely miscalibrated |
### Informativeness (30% weight)

Does the answer reveal non-obvious or tacit knowledge?
| Score | Description |
|---|---|
| 1.0 | Reveals significant tacit knowledge or practical know-how |
| 0.8 | Reveals useful and non-obvious knowledge |
| 0.6 | Reveals moderately useful contextual information |
| 0.4 | Common but well-articulated information |
| 0.0 | Trivial or obvious information |
The overall score is a weighted average. QA pairs below the threshold (default 0.6) are marked as invalid.
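Spelled out, the 40/30/30 weighting and the default threshold behave as follows (an illustrative sketch, not the validator's actual code):

```python
def overall_score(faithfulness: float, bloom_calibration: float,
                  informativeness: float) -> float:
    """Weighted average over the three validation criteria (40/30/30)."""
    return 0.4 * faithfulness + 0.3 * bloom_calibration + 0.3 * informativeness

def is_valid(score: float, threshold: float = 0.6) -> bool:
    """Pairs scoring below the threshold are marked as invalid."""
    return score >= threshold

# e.g. a well-grounded but only reasonably calibrated pair
score = overall_score(faithfulness=0.8, bloom_calibration=0.6, informativeness=0.6)
print(round(score, 2), is_valid(score))  # 0.68 True
```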
## Programmatic Usage

```python
from arandu.schemas import QARecordCEP, QAPairCEP
from arandu.config import CEPConfig, QAConfig, get_cep_config

# Load existing CEP record
record = QARecordCEP.load("cep_dataset/1abc123xyz_cep_qa.json")

# Access QA pairs with Bloom levels
for qa in record.qa_pairs:
    print(f"Level: {qa.bloom_level}")
    print(f"Q: {qa.question}")
    print(f"A: {qa.answer}")
    if qa.reasoning_trace:
        print(f"Reasoning: {qa.reasoning_trace}")
    print()

# Filter by Bloom level
analyze_pairs = [qa for qa in record.qa_pairs if qa.bloom_level == "analyze"]

# Check Bloom distribution
print(f"Distribution: {record.bloom_distribution}")

# Export to JSONL string
jsonl_content = record.to_jsonl()

# Export to JSONL file
record.to_jsonl("output.jsonl")

# Get CEP configuration
cep_config = get_cep_config()
print(f"Validation enabled: {cep_config.enable_validation}")
print(f"Bloom levels: {cep_config.bloom_levels}")
```

## Monitoring Progress
### Docker Logs

```bash
# Watch CEP generation logs
docker compose --profile cep logs -f arandu-cep

# Check Ollama status
docker compose --profile cep logs ollama
```

### SLURM Logs

```bash
# Monitor CEP job output
tail -f logs/cep_<partition>_<jobid>.out

# Check job status
squeue -u $USER
```

## Resumption
The pipeline automatically checkpoints progress. To resume an interrupted job:

```bash
# Simply restart - checkpoint is detected automatically
docker compose --profile cep up
```

The checkpoint file (`cep_dataset/cep_checkpoint.json`) tracks:
- Completed documents
- Failed documents (for retry)
- Processing statistics
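Since the checkpoint is plain JSON, progress can be inspected while a job runs. A sketch assuming the completed and failed documents are stored as lists under `completed`/`failed` keys (hypothetical field names; adjust to the real schema):

```python
import json

def summarize_checkpoint(path: str) -> str:
    """Build a one-line progress summary from a CEP checkpoint file."""
    with open(path, encoding="utf-8") as fh:
        ckpt = json.load(fh)
    # "completed" / "failed" are assumed field names, not documented here
    done = len(ckpt.get("completed", []))
    failed = len(ckpt.get("failed", []))
    return f"{done} completed, {failed} failed"

# e.g. print(summarize_checkpoint("cep_dataset/cep_checkpoint.json"))
```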
## Best Practices

### Model Selection

- `qwen3:14b`: Default. Best balance of pt-BR support, JSON reliability, and inference speed (24 GB GPU)
- `qwen3:8b`: Lighter alternative (~5 GB VRAM) for development or constrained GPUs
- `llama3.3:70b`: Highest quality for final production runs (48 GB+ GPU)
### Validation Strategy

- Keep validation enabled (default) for quality assurance
- Disable validation (`--no-validate`) for faster iteration during development
- Use a validation threshold of 0.6-0.7 to balance coverage and quality
### Bloom Distribution

- Start with the default distribution
- Increase `analyze`/`evaluate` for research datasets
- Increase `remember`/`understand` for educational datasets
### Language Selection

- Use `pt` (Portuguese) for PT-BR content (default)
- Use `en` (English) for English content
- Prompts are optimized for each language
## Troubleshooting

### No QA Pairs Generated

```bash
# Check if transcription results exist
ls -la results/*.json

# Verify Ollama is running
docker compose --profile cep exec ollama ollama list

# Check for minimum context length
# Text must be at least 100 characters
```

### Low Bloom Calibration Scores
- Ensure the model has good reasoning capabilities
- Try a larger model: `ARANDU_QA_MODEL_ID=llama3.3:70b`
- Lower the temperature for more consistent output: `ARANDU_QA_TEMPERATURE=0.5`
### Validation Rejecting Most Pairs

- Lower the threshold: the default is 0.6; try 0.5
- Check the validator model's capability
- Verify the prompt language matches the content language
### Ollama Connection Issues

```bash
# Restart Ollama service
docker compose --profile cep restart ollama

# Check Ollama health
docker compose --profile cep exec ollama curl http://localhost:11434/api/tags

# Pull model if missing
docker compose --profile cep exec ollama ollama pull qwen3:14b
```

See also: QA Generation | Configuration | Docker Deployment