
Evaluation

Evaluate the quality of QA datasets and knowledge graphs using targeted metrics for each pipeline.

The evaluation system provides metrics tailored to each pipeline:

| Pipeline | Metrics | Focus |
| --- | --- | --- |
| QA Pipeline | `qa` | Answer quality, retrieval readiness |
| KG Pipeline | `entity`, `relation`, `semantic` | Graph quality, knowledge coverage |

You can run evaluations independently for each pipeline or combine them.

Prerequisites:

  • Docker with Compose v2
  • (Optional) Transcription results in results/

For QA Evaluation:

  • QA dataset in qa_dataset/ directory

For KG Evaluation:

  • Knowledge graph in knowledge_graphs/

Note: Evaluation does not require an LLM provider; it uses local computation only.


Evaluate the quality of generated question-answer pairs.

| Metric | Range | Description | Good Value |
| --- | --- | --- | --- |
| Exact Match | 0.0-1.0 | Percentage of exact answer matches | > 0.6 |
| F1 Score | 0.0-1.0 | Token-level F1 between predicted and gold | > 0.7 |
| BLEU Score | 0-100 | N-gram overlap score | > 60 |
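For intuition, here is a minimal sketch of how exact match and token-level F1 are typically computed (standard SQuAD-style definitions; the pipeline's own implementation may normalize answers differently):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    # 1.0 only if the normalized strings are identical
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    # Harmonic mean of token precision and recall over the shared tokens
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the festival happens in june", "the festival is held in june"))  # ~0.73
```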
```sh
# Evaluate QA quality only
ARANDU_EVALUATION_METRICS=qa docker compose --profile evaluate up
```
```sh
# Or via SLURM
EVAL_METRICS=qa sbatch scripts/slurm/evaluation/run_evaluation.slurm
```
Outputs:

```
evaluation/
├── report.json                  # Main evaluation report
└── qa_metrics_<timestamp>.json  # Detailed QA metrics
```
Example report.json:

```json
{
  "dataset_name": "etno_qa_evaluation",
  "evaluation_timestamp": "2026-01-26T18:30:00Z",
  "total_documents": 187,
  "total_qa_pairs": 2244,
  "qa_exact_match": 0.68,
  "qa_f1_score": 0.79,
  "qa_bleu_score": 72.3,
  "overall_score": 0.74
}
```
```sh
# Verify QA dataset exists
ls -la qa_dataset/*.json
```
| Issue | Indicator | Solution |
| --- | --- | --- |
| Low F1 | F1 < 0.6 | Use a better LLM model |
| Poor match | EM < 0.5 | Lower temperature (0.3-0.5) |
| Inconsistent | High variance | Increase questions per document |

Evaluate the quality of knowledge graphs through entity, relation, and semantic metrics.

Entity metrics:

| Metric | Range | Description | Good Value |
| --- | --- | --- | --- |
| Total Entities | 0+ | Count of all entity mentions | - |
| Unique Entities | 0+ | Count of distinct entities | - |
| Entity Density | 0+ | Entities per 100 tokens | 10-20 |
| Entity Diversity | 0.0-1.0 | Unique/total ratio | 0.3-0.6 |
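The density and diversity figures follow directly from these definitions. A minimal sketch with hypothetical entity mentions (the real evaluator reads them from the KG pipeline's output):

```python
# Hypothetical extraction result: all entity mentions from one document
entities = ["Yemanjá", "Salvador", "Yemanjá", "Festa de Iemanjá"]
token_count = 32  # tokens in the source text

entity_density = len(entities) / token_count * 100     # mentions per 100 tokens
entity_diversity = len(set(entities)) / len(entities)  # unique / total

print(f"density={entity_density:.1f}, diversity={entity_diversity:.2f}")
# density=12.5, diversity=0.75
```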
Relation metrics:

| Metric | Range | Description | Good Value |
| --- | --- | --- | --- |
| Total Relations | 0+ | Count of all relations | - |
| Unique Relations | 0+ | Count of relation types | - |
| Relation Density | 0+ | Relations per entity | 1.5-3.0 |
| Average Degree | 0+ | Mean node connections | > 2.0 |
| Connected Components | 1+ | Number of graph components | 1-5 |
| Graph Density | 0.0-1.0 | Edge/possible-edge ratio | 0.001-0.01 |
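The connectivity figures can be recomputed from the saved graph. A sketch using networkx (an assumption; the evaluator may use different tooling), with the GraphML path from the verification command further below:

```python
import networkx as nx

# Load the corpus graph produced by the KG pipeline
G = nx.read_graphml("knowledge_graphs/corpus_graph.graphml")
U = G.to_undirected()  # component analysis needs an undirected view

n, m = G.number_of_nodes(), G.number_of_edges()
print(f"average_degree={2 * m / n:.2f}")  # each edge contributes two endpoints
print(f"connected_components={nx.number_connected_components(U)}")
print(f"largest_component_size={len(max(nx.connected_components(U), key=len))}")
print(f"density={nx.density(G):.4f}")  # edges / possible edges
```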
Semantic metrics:

| Metric | Range | Description | Good Value |
| --- | --- | --- | --- |
| Coherence Score | 0.0-1.0 | Semantic consistency | > 0.7 |
| Information Density | 0+ | (Entities + Relations) / text length | > 0.03 |
| Knowledge Coverage | 0.0-1.0 | Entities covered by QA | > 0.5 |
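Coherence is the least self-explanatory metric. One common formulation, shown here as a sketch and not necessarily the pipeline's exact definition, is the mean pairwise cosine similarity between embeddings of the extracted statements, using the configured embedding model:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Default model from ARANDU_EMBEDDING_MODEL; the texts are hypothetical examples
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [
    "Yemanjá is honored with offerings at the sea",
    "The festival in Salvador celebrates Yemanjá every February",
]
emb = model.encode(texts, normalize_embeddings=True)  # unit-norm vectors

sim = emb @ emb.T                              # cosine similarity matrix
pairs = sim[np.triu_indices(len(texts), k=1)]  # upper triangle: distinct pairs
print(f"coherence={pairs.mean():.2f}")
```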
```sh
# Evaluate graph quality only
ARANDU_EVALUATION_METRICS=entity,relation docker compose --profile evaluate up

# Include semantic metrics (requires embedding computation)
ARANDU_EVALUATION_METRICS=entity,relation,semantic docker compose --profile evaluate up
```
```sh
# Or via SLURM
EVAL_METRICS=entity,relation,semantic sbatch scripts/slurm/evaluation/run_evaluation.slurm
```
Outputs:

```
evaluation/
├── report.json                        # Main evaluation report
├── entity_metrics_<timestamp>.json    # Entity coverage details
└── relation_metrics_<timestamp>.json  # Relation and connectivity details
```
Example report.json:

```json
{
  "dataset_name": "etno_kg_evaluation",
  "evaluation_timestamp": "2026-01-26T18:30:00Z",
  "total_documents": 187,
  "entity_coverage": {
    "total_entities": 4823,
    "unique_entities": 1523,
    "entity_density": 12.4,
    "entity_diversity": 0.316,
    "entity_type_distribution": {
      "PERSON": 342,
      "LOCATION": 189,
      "EVENT": 423,
      "CONCEPT": 569
    }
  },
  "relation_metrics": {
    "total_relations": 3847,
    "unique_relations": 1204,
    "relation_density": 2.53,
    "relation_diversity": 0.313,
    "graph_connectivity": {
      "average_degree": 5.05,
      "connected_components": 3,
      "largest_component_size": 1421,
      "density": 0.0033
    }
  },
  "semantic_quality": {
    "coherence_score": 0.78,
    "information_density": 0.042,
    "knowledge_coverage": 0.64
  },
  "overall_score": 0.71
}
```
```sh
# Verify knowledge graph exists
ls -la knowledge_graphs/corpus_graph.graphml
```
| Issue | Indicator | Solution |
| --- | --- | --- |
| Too few entities | Density < 5 | Use a larger model, check prompts |
| Low diversity | Diversity < 0.2 | Reduce repetition in extraction |
| Disconnected graph | Components > 10 | Improve relation extraction |
| Low coherence | Score < 0.5 | Check language settings |
If semantic evaluation is slow:

```sh
# Use a smaller embedding model
export ARANDU_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Or run entity/relation metrics first (fast)
ARANDU_EVALUATION_METRICS=entity,relation docker compose --profile evaluate up
```

Run both QA and KG evaluation together.

```sh
# All metrics
docker compose --profile evaluate up

# Or via SLURM
sbatch scripts/slurm/evaluation/run_evaluation.slurm
```
Environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| `ARANDU_EVALUATION_METRICS` | `qa,entity,relation,semantic` | Metrics to compute |
| `ARANDU_EMBEDDING_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Model for semantic embeddings |
| `ARANDU_QA_DIR` | `./qa_dataset` | QA dataset directory |
| `ARANDU_KG_DIR` | `./knowledge_graphs` | Knowledge graph directory |
| `ARANDU_EVAL_DIR` | `./evaluation` | Output directory |
Metric categories:

| Category | Metrics | Pipeline |
| --- | --- | --- |
| `qa` | Exact Match, F1, BLEU | QA Pipeline |
| `entity` | Coverage, Density, Diversity | KG Pipeline |
| `relation` | Density, Connectivity | KG Pipeline |
| `semantic` | Coherence, Info Density, Coverage | KG Pipeline |

The overall score is computed as a weighted average:

```python
overall_score = (
    0.3 * qa_f1_score +
    0.2 * entity_diversity +
    0.2 * (relation_density / 3.0) +  # normalized
    0.3 * coherence_score
)
```

When evaluation is run for only one pipeline, the overall score reflects only the metrics that were computed.
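One plausible fallback for such partial runs, sketched below, renormalizes the weights over whichever component metrics are present; the evaluator's actual behavior may differ:

```python
def partial_overall_score(metrics: dict) -> float:
    """Weighted average over available components, weights renormalized.

    This is an assumption about the fallback, not the documented formula.
    """
    weights = {
        "qa_f1_score": 0.3,
        "entity_diversity": 0.2,
        "relation_density_norm": 0.2,  # relation_density / 3.0
        "coherence_score": 0.3,
    }
    present = {k: w for k, w in weights.items() if k in metrics}
    return sum(w * metrics[k] for k, w in present.items()) / sum(present.values())

# KG-only run: the QA weight is dropped, the remaining 0.7 is scaled back to 1.0
print(partial_overall_score({
    "entity_diversity": 0.32,
    "relation_density_norm": 2.53 / 3.0,
    "coherence_score": 0.78,
}))
```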

Reference targets:

| Metric | Good Value |
| --- | --- |
| `overall_score` | > 0.70 |
| `qa_f1_score` | > 0.75 |
| `entity_diversity` | 0.3-0.6 |
| `relation_density` | 1.5-3.0 |
| `coherence_score` | > 0.75 |
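A quick way to check a run against these targets (a sketch; it uses the flat keys from the QA report example above, since the KG metrics live in nested objects):

```python
import json

# Minimum targets for the flat, top-level report keys
TARGETS = {"overall_score": 0.70, "qa_f1_score": 0.75}

with open("evaluation/report.json") as f:
    report = json.load(f)

for metric, minimum in TARGETS.items():
    value = report.get(metric)
    if value is None:
        print(f"{metric}: not computed in this run")
    else:
        status = "OK" if value > minimum else "below target"
        print(f"{metric}: {value:.2f} ({status}, target > {minimum})")
```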

You can load and inspect reports from Python:

```python
from arandu.schemas import EvaluationReport

# Load evaluation report
report = EvaluationReport.load("evaluation/report.json")
print(f"Overall Score: {report.overall_score:.3f}")
print(f"Total Documents: {report.total_documents}")

# QA metrics (if available)
if report.qa_f1_score:
    print(f"QA F1: {report.qa_f1_score:.3f}")

# KG metrics (if available)
if report.entity_coverage:
    print(f"Entity Diversity: {report.entity_coverage.entity_diversity:.3f}")
if report.relation_metrics:
    print(f"Relation Density: {report.relation_metrics.relation_density:.3f}")
```
Reports can also be constructed directly:

```python
from arandu.schemas import (
    EvaluationReport,
    EntityCoverageResult,
    RelationMetricsResult,
    SemanticQualityResult,
)

# QA-only report
qa_report = EvaluationReport(
    dataset_name="qa_evaluation",
    total_documents=50,
    total_qa_pairs=500,
    qa_f1_score=0.75,
    qa_exact_match=0.68,
    qa_bleu_score=71.5,
)

# KG-only report
kg_report = EvaluationReport(
    dataset_name="kg_evaluation",
    total_documents=50,
    entity_coverage=EntityCoverageResult(
        total_entities=1000,
        unique_entities=300,
        entity_density=15.0,
        entity_type_distribution={"PERSON": 100, "LOCATION": 200},
    ),
    relation_metrics=RelationMetricsResult(
        total_relations=800,
        unique_relations=250,
        relation_density=2.67,
    ),
)

# Save reports
qa_report.save("evaluation/qa_report.json")
kg_report.save("evaluation/kg_report.json")
```

```sh
# Watch evaluation logs
docker compose --profile evaluate logs -f arandu-eval
```
```sh
# Monitor SLURM job output
tail -f logs/arandu-eval_<jobid>.out
```

For QA evaluation:

  1. Generate enough QA pairs (10+ per document)
  2. Use consistent LLM settings for generation
  3. Compare across different model configurations

For KG evaluation:

  1. Verify the graph file exists before evaluation
  2. Run entity/relation metrics first (they are fast)
  3. Use multilingual embedding models for non-English content
    • paraphrase-multilingual-MiniLM-L12-v2 for Portuguese

For iterating on quality:

  1. Save reports from different configurations
  2. Track improvements over iterations
  3. Use the recommendations to guide improvements

See also: QA Generation | KG Construction | Configuration