# KG Construction
Build knowledge graphs from transcription results using AutoSchemaKG for entity and relation extraction.
## Overview

The KG construction pipeline extracts entities and relations from transcribed text to build knowledge graphs. Features include:
- Entity Extraction: People, locations, organizations, events, dates, concepts
- Relation Extraction: Semantic relationships between entities
- Dynamic Schema: AutoSchemaKG infers schema from data
- Metadata Enrichment: Automatic injection of source metadata into extraction context
- GraphML Export: NetworkX-compatible format for analysis
## Prerequisites

- Transcription results in the `results/` directory
- Docker with Compose v2
- LLM provider (Ollama recommended)
- (Optional) GPU for faster Ollama inference
## Quick Start

### Using Docker Compose

```bash
# Start KG construction with Ollama sidecar
docker compose --profile kg up
```

### Using SLURM

```bash
sbatch scripts/slurm/kg/tupi.slurm
```

## Configuration
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `ARANDU_KG_BACKEND` | `atlas` | KGC backend: `atlas` (AutoSchemaKG) |
| `ARANDU_KG_PROVIDER` | `ollama` | LLM provider: `openai`, `ollama`, `custom` |
| `ARANDU_KG_MODEL_ID` | `llama3.1:8b` | Model for extraction |
| `ARANDU_KG_OLLAMA_URL` | `http://localhost:11434/v1` | Ollama API URL |
| `ARANDU_KG_BASE_URL` | (none) | Custom OpenAI-compatible endpoint |
| `ARANDU_KG_LANGUAGE` | `pt` | Language code (ISO 639-1): `pt`, `en` |
| `ARANDU_KG_TEMPERATURE` | `0.5` | LLM temperature (0.0-2.0; lower = more consistent) |
| `ARANDU_KG_OUTPUT_DIR` | `knowledge_graphs` | Output directory for graph artifacts |
### Language Support

The pipeline supports multilingual extraction via language-specific prompts:

| Code | Language | Prompts Directory |
|---|---|---|
| `pt` | Portuguese | `prompts/kg/atlas/` (language-keyed JSON) |
| `en` | English | `prompts/kg/atlas/` (language-keyed JSON) |
### Example .env Configuration

```bash
# KG Construction Settings
ARANDU_KG_BACKEND=atlas
ARANDU_KG_PROVIDER=ollama
ARANDU_KG_MODEL_ID=llama3.1:8b
ARANDU_KG_LANGUAGE=pt
ARANDU_KG_TEMPERATURE=0.5

# Directories
ARANDU_RESULTS_DIR=./results
ARANDU_KG_OUTPUT_DIR=./knowledge_graphs
```

## Usage Examples
### Basic Usage

```bash
# Default configuration
docker compose --profile kg up
```

### Custom Model

```bash
# Use larger model for better extraction
ARANDU_KG_MODEL_ID=llama3.1:70b docker compose --profile kg up
```

### Different Language

```bash
# Extract from English transcriptions
ARANDU_KG_LANGUAGE=en docker compose --profile kg up
```

### Using OpenAI

```bash
# Use OpenAI for extraction
export ARANDU_KG_PROVIDER=openai
export ARANDU_KG_MODEL_ID=gpt-4o
export OPENAI_API_KEY=sk-...
docker compose --profile kg up
```

### SLURM with Custom Settings

```bash
# Submit to specific partition
ARANDU_KG_MODEL_ID=qwen3:14b PIPELINE_ID=test-cep-01 \
  sbatch scripts/slurm/kg/tupi.slurm
```

## Output Format
Knowledge graphs are saved in the output directory:

```
<output_dir>/
├── atlas_input/
│   └── transcriptions.json              # Input prepared for atlas-rag
├── atlas_output/
│   ├── kg_extraction/                   # Raw extraction results
│   ├── triples_csv/                     # Extracted triples as CSV
│   └── <model>_<timestamp>.graphml      # Final knowledge graph
└── <model>_<timestamp>.metadata.json    # Provenance sidecar
```

### GraphML Structure
The GraphML files are NetworkX-compatible and contain:

**Nodes (Entities):**

- `id`: Unique identifier
- `label`: Entity text
- `type`: Entity type (PERSON, LOCATION, ORGANIZATION, EVENT, DATE, CONCEPT)

**Edges (Relations):**

- `source`: Source entity ID
- `target`: Target entity ID
- `relation`: Relation type (LOCATED_IN, CAUSED_BY, AFFECTED_BY, etc.)
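For orientation, a node and an edge in the raw XML might look like the fragment below. This is an illustrative sketch, not verbatim pipeline output: NetworkX typically writes opaque data keys (`d0`, `d1`, ...) that are mapped to attribute names in `<key>` declarations at the top of the file.

```xml
<key id="d0" for="node" attr.name="label" attr.type="string"/>
<key id="d1" for="node" attr.name="type" attr.type="string"/>
<key id="d2" for="edge" attr.name="relation" attr.type="string"/>
<node id="n42">
  <data key="d0">Maria Silva</data>
  <data key="d1">PERSON</data>
</node>
<node id="n7">
  <data key="d0">the community</data>
  <data key="d1">PERSON</data>
</node>
<edge source="n42" target="n7">
  <data key="d2">BELONGS_TO</data>
</edge>
```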
### Metadata Schema

```json
{
  "graph_id": "test-cep-01",
  "source_documents": ["1abc123xyz", "2def456uvw"],
  "model_id": "qwen3:14b",
  "provider": "ollama",
  "language": "pt",
  "created_at": "2026-02-26T15:45:00Z",
  "total_documents": 2,
  "total_nodes": 342,
  "total_edges": 187,
  "backend_version": "atlas-rag==0.0.5"
}
```

## Working with Graphs

### Loading with NetworkX

```python
import networkx as nx

# Load the corpus graph
graph = nx.read_graphml("knowledge_graphs/corpus_graph.graphml")

# Basic statistics
print(f"Nodes: {graph.number_of_nodes()}")
print(f"Edges: {graph.number_of_edges()}")
print(f"Density: {nx.density(graph):.4f}")

# List entity types
entity_types = {}
for node, attrs in graph.nodes(data=True):
    etype = attrs.get('type', 'UNKNOWN')
    entity_types[etype] = entity_types.get(etype, 0) + 1
print(f"Entity types: {entity_types}")
```

### Querying the Graph

```python
import networkx as nx

graph = nx.read_graphml("knowledge_graphs/corpus_graph.graphml")

# Find all PERSON entities
persons = [n for n, d in graph.nodes(data=True) if d.get('type') == 'PERSON']
print(f"Found {len(persons)} persons")

# Get neighbors of a node
node = persons[0]
neighbors = list(graph.neighbors(node))
print(f"Neighbors of {node}: {neighbors}")

# Find paths between entities
if len(persons) > 1:
    try:
        path = nx.shortest_path(graph, persons[0], persons[1])
        print(f"Path: {' -> '.join(path)}")
    except nx.NetworkXNoPath:
        print("No path found")
```

### Loading Metadata

```python
from arandu.schemas import KGMetadata

# Load metadata
metadata = KGMetadata.load("knowledge_graphs/corpus_graph_metadata.json")

print(f"Graph ID: {metadata.graph_id}")
print(f"Source documents: {len(metadata.source_documents)}")
print(f"Language: {metadata.language}")
print(f"Model: {metadata.model_id}")
```

## Entity Types
AutoSchemaKG extracts the following entity types:

| Type | Description | Examples |
|---|---|---|
| PERSON | People, groups | "Maria Silva", "the community" |
| LOCATION | Geographic places | "Rio de Janeiro", "the riverbank" |
| ORGANIZATION | Institutions | "IBAMA", "the university" |
| EVENT | Occurrences | "the 2023 flood", "the meeting" |
| DATE | Temporal references | "January 2023", "last year" |
| CONCEPT | Abstract ideas | "environmental protection", "tradition" |
## Relation Types

Common relation types extracted:

| Relation | Description | Example |
|---|---|---|
| LOCATED_IN | Spatial containment | "community LOCATED_IN riverbank" |
| CAUSED_BY | Causal relationship | "flood CAUSED_BY heavy rain" |
| AFFECTED_BY | Impact relationship | "crops AFFECTED_BY drought" |
| OCCURRED_IN | Temporal/spatial | "meeting OCCURRED_IN January" |
| BELONGS_TO | Membership | "Maria BELONGS_TO community" |
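To see which relations dominate a given graph, the `relation` edge attribute described under GraphML Structure can be tallied directly. A minimal sketch:

```python
from collections import Counter

import networkx as nx

graph = nx.read_graphml("knowledge_graphs/corpus_graph.graphml")

# Tally relation types across all edges
relation_counts = Counter(
    attrs.get("relation", "UNKNOWN")
    for _, _, attrs in graph.edges(data=True)
)
for relation, count in relation_counts.most_common():
    print(f"{relation}: {count}")
```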
## Monitoring Progress

### Docker Logs

```bash
# Watch KG construction logs
docker compose --profile kg logs -f arandu-kg

# Check Ollama status
docker compose --profile kg logs ollama
```

### SLURM Logs

```bash
# Monitor job output
tail -f logs/arandu-kg_<jobid>.out
```

## Atlas Backend Metadata Injection

When transcription records have source metadata (participant name, location, date, event context), the atlas backend automatically prepends a translated metadata header to every chunk sent to the LLM for triple extraction. This provides provenance context that improves entity and relation quality.
### How It Works

- During input preparation, each document's `SourceMetadata` fields are formatted into a header using translated labels from `prompts/kg/atlas/metadata_labels.json`
- The header is stored in the document's metadata dict as `_metadata_header`
- A custom `DatasetProcessor` subclass intercepts atlas-rag's chunking step
- After text is split into chunks, the header is prepended to each chunk's text (see the sketch after this list)
- The LLM sees the full provenance context in every extraction prompt
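A condensed sketch of the first and fourth steps; the function names and the `meta` dict shape are illustrative, not the actual arandu internals:

```python
def format_metadata_header(meta: dict, labels: dict) -> str:
    """Render source metadata fields with translated labels."""
    lines = [labels["header"]]  # e.g. "[Contexto da Entrevista]"
    for field in ("participant", "location", "date", "context"):
        if meta.get(field):
            lines.append(f"{labels[field]}: {meta[field]}")
    return "\n".join(lines)


def prepend_to_chunks(chunks: list[str], header: str, labels: dict) -> list[str]:
    """Prepend the provenance header to every chunk before extraction."""
    return [f"{header}\n\n{labels['transcription']}\n{chunk}" for chunk in chunks]
```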
### Example Chunk (Portuguese)

```
[Contexto da Entrevista]
Participante: João da Silva
Local: Barra de Pelotas
Data: 2024-03-15
Contexto: Audiência Câmara de Vereadores

[Transcrição]
...então a água subiu muito rápido, em menos de duas horas já estava...
```

### Example Chunk (English)

```
[Interview Context]
Participant: João da Silva
Location: Barra de Pelotas
Date: 2024-03-15
Event Context: Audiência Câmara de Vereadores

[Transcription]
...so the water rose very fast, in less than two hours it was already...
```

### Adding a New Language

Add a new key to `prompts/kg/atlas/metadata_labels.json`:

```json
{
  "es": {
    "header": "[Contexto de la Entrevista]",
    "transcription": "[Transcripción]",
    "participant": "Participante",
    "location": "Ubicación",
    "date": "Fecha",
    "context": "Contexto del Evento",
    "researcher": "Investigador(a)",
    "sequence": "Secuencia"
  }
}
```

Then add `"es"` to `KGConfig.validate_language` in `src/arandu/config.py`.

### Disabling Metadata Injection

If source metadata is not available on the transcription records (i.e., `source_metadata` is `None`), the header is simply omitted and chunks are passed through unmodified. No configuration flag is needed.
## Batch-Level Resume

When a SLURM job times out or is interrupted during triple extraction, the atlas backend automatically resumes from the last completed batch on the next run. No manual intervention is needed.

### How It Works

- Atlas-rag writes extraction results as JSONL lines to `atlas_output/kg_extraction/`
- On the next run, the backend counts existing JSONL lines and divides by `batch_size_triple` to determine completed batches (see the sketch after this list)
- Any partial last batch (incomplete lines from a mid-batch interruption) is trimmed to avoid duplicates
- `ProcessingConfig.resume_from` is set to skip already-processed batches
- Atlas-rag creates a new timestamped output file for the remaining batches
- Downstream steps (`json2csv`, concept generation, GraphML export) read all files in the directory and merge results automatically
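The arithmetic behind the resume point is simple enough to sketch; the helper name and paths here are illustrative, not the actual arandu code:

```python
from pathlib import Path

def compute_resume_point(extraction_dir: Path, batch_size_triple: int) -> int:
    """Derive the number of completed batches from existing JSONL output."""
    total_lines = 0
    for jsonl_file in sorted(extraction_dir.glob("*.jsonl")):
        with jsonl_file.open() as f:
            total_lines += sum(1 for _ in f)
    # Floor division ignores a partial last batch; the backend additionally
    # trims those partial lines from the file so they are not duplicated.
    return total_lines // batch_size_triple  # -> ProcessingConfig.resume_from

# Example: 9 JSONL lines with batch_size_triple=3 -> resume point 3
print(compute_resume_point(Path("atlas_output/kg_extraction"), 3))
```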
### Requirements

- The `atlas_output/` directory from the previous run must be preserved (same `output_dir`)
- The input data must be identical (same records produce the same chunks and batch boundaries)
- `batch_size_triple` must be the same between runs
### Example

```bash
# First run — times out after processing 9 chunks (3 batches of 3)
PIPELINE_ID=test-cep-01 sbatch scripts/slurm/kg/tupi.slurm

# Resubmit — automatically detects 3 completed batches, resumes from batch 4
PIPELINE_ID=test-cep-01 sbatch scripts/slurm/kg/tupi.slurm
```

The logs will show:
```
Resuming from batch 3 (9 chunks already processed)
```

## Best Practices

1. **Model Selection**
   - Use `llama3.1:8b` for balanced speed/quality
   - Use `llama3.1:70b` for higher quality extraction
   - Lower the temperature (0.3-0.5) for more consistent extraction

2. **Language Settings**
   - Match `ARANDU_KG_LANGUAGE` to your transcription language
   - Use appropriate prompt templates

3. **Backend Options** (see the example below)
   - Adjust `chunk_size` for memory constraints
   - Tune `batch_size_triple` for API rate limits
   - Set `max_workers` to control parallelism
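Backend options are passed as a JSON object via `ARANDU_KG_BACKEND_OPTIONS`, the same mechanism used in Troubleshooting below. A sketch combining the three knobs, assuming the atlas backend accepts all of them as top-level keys; the values are illustrative starting points, not tuned defaults:

```bash
# Smaller chunks, smaller extraction batches, limited parallelism
export ARANDU_KG_BACKEND_OPTIONS='{"chunk_size": 4096, "batch_size_triple": 3, "max_workers": 2}'
docker compose --profile kg up
```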
## Troubleshooting

### Empty Graph

```bash
# Check if transcription results exist
ls -la results/*.json

# Verify transcription has content
cat results/<file_id>.json | jq '.transcription_text | length'
```

### Poor Entity Extraction

- Use larger model: `ARANDU_KG_MODEL_ID=llama3.1:70b`
- Lower temperature: `ARANDU_KG_TEMPERATURE=0.3`
- Verify language setting matches transcription language

### Memory Issues

```bash
# Use smaller model
export ARANDU_KG_MODEL_ID=llama3.2:3b

# Reduce chunk size via backend options
export ARANDU_KG_BACKEND_OPTIONS='{"chunk_size": 4096}'
```

### Ollama Timeout

```bash
# Increase keep-alive time
export OLLAMA_KEEP_ALIVE=30m

# Restart with more memory
docker compose --profile kg down
docker compose --profile kg up
```

See also: QA Generation | Evaluation | Configuration