
ApeRAG Graph Index Creation Process Technical Documentation

Overview

The ApeRAG Graph Index creation process is the core pipeline of the entire knowledge graph construction system, responsible for transforming raw documents into structured knowledge graphs. This process is based on the LightRAG framework with extensive refactoring and optimization.

Technical Improvements Overview

The original LightRAG has several limitations: a stateful design whose global state management causes concurrency conflicts, no effective concurrency control mechanism, stability and consistency problems in the storage layer, and coarse-grained locking that hurts performance. Most critically, the original LightRAG does not support data isolation: the nodes and edges of every collection are stored in the same global space, so data from different users and projects conflicts with and pollutes each other, making true multi-tenant support impossible.
We have conducted large-scale refactoring to address these issues:
  • Complete rewrite to stateless architecture: Each task uses independent instances, completely resolving concurrency conflicts
  • Introduced workspace data isolation mechanism: Each collection has independent data space, completely solving data conflicts and pollution
  • Self-developed Concurrent Control model: Implementing fine-grained lock management for high concurrency processing
  • Optimized lock granularity: From coarse-grained global locks to entity-level and relationship-level precise locking
  • Refactored storage layer: Support for Neo4j, PostgreSQL and other graph database backends, implementing reliable multi-storage consistency guarantees
  • Connected component concurrency optimization: Intelligent concurrency strategy based on graph topology analysis
The Graph Index creation process includes the following core stages:
  • Task reception and instance creation: Celery task scheduling, LightRAG instance initialization
  • Document chunking processing: Intelligent chunking algorithms maintaining semantic coherence
  • Entity relationship extraction: LLM-based entity and relationship identification
  • Connected component analysis: Topological analysis of entity relationship networks
  • Grouped concurrent processing: Processing by connected components to improve performance
  • Node edge merging: Entity deduplication, relationship aggregation, description summarization
  • Multi-storage writing: Consistent writing to vector databases and graph databases

Architecture Overview

Core Design Principles

1. Stateless Architecture Refactoring

The original LightRAG uses global state management, leading to severe concurrency conflicts, with multiple tasks sharing the same instance causing data pollution. More seriously, all collection graph data is stored in the same global namespace, where entities and relationships from different projects interfere with each other, making true multi-tenant isolation impossible.
We completely rewrote LightRAG's instance management code, implementing a stateless design: each Celery task creates an independent LightRAG instance, achieving collection-level data isolation through the workspace parameter. Each collection's graph data is stored in independent namespaces (e.g., entity:{entity_name}:{workspace}), supporting Neo4j, PostgreSQL and other graph database backends, with strict instance lifecycle management to ensure no resource leaks.
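
As a minimal illustration of the workspace isolation described above, per-collection keys can be composed as follows (the `entity:{entity_name}:{workspace}` pattern is from the design; the helper names are hypothetical):

```python
# Illustrative sketch of workspace-scoped key composition.
# Helper names are hypothetical; the key patterns follow the design above.

def entity_key(entity_name: str, workspace: str) -> str:
    """Namespace an entity under its collection's workspace."""
    return f"entity:{entity_name}:{workspace}"

def relationship_key(src: str, tgt: str, workspace: str) -> str:
    """Namespace a relationship under its collection's workspace."""
    return f"relationship:{src}:{tgt}:{workspace}"

# Two collections indexing the same entity name never collide:
a = entity_key("Apple Inc", "workspace_a")
b = entity_key("Apple Inc", "workspace_b")
assert a != b
```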

2. Staged Pipeline Processing

Document processing and graph indexing separation:
  • ainsert_and_chunk_document: Responsible for document chunking and storage
  • aprocess_graph_indexing: Responsible for graph index construction
  • Advantages: Modular design, easy to test and maintain

3. Connected Component Concurrency Optimization

The original LightRAG lacks an effective concurrency strategy: its simple global locks create performance bottlenecks and prevent full use of multi-core CPU resources.
We designed a connected component discovery algorithm based on graph theory, decomposing entity relationship networks into independent processing components. Through topology analysis-driven intelligent grouped concurrency, different connected components can be processed completely in parallel, achieving zero lock conflict design.
The core algorithm approach is: construct adjacency graphs of entity relationships, use BFS traversal to discover all connected components, group entities belonging to different connected components into independent processing tasks, thus achieving true parallel processing.
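
The core algorithm approach above can be sketched in Python (a simplified illustration, not the actual ApeRAG implementation):

```python
from collections import defaultdict, deque

def connected_components(entities, relationships):
    """Group entities into connected components via BFS over the
    entity-relationship adjacency graph.

    entities:       iterable of entity names
    relationships:  iterable of (src, tgt) pairs
    """
    adjacency = defaultdict(set)
    for src, tgt in relationships:
        adjacency[src].add(tgt)
        adjacency[tgt].add(src)

    components, visited = [], set()
    for entity in entities:
        if entity in visited:
            continue
        component, queue = [], deque([entity])
        visited.add(entity)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

# Two independent groups -> two components that can be merged in parallel.
groups = connected_components(
    ["alice", "bob", "carol", "ledger"],
    [("alice", "bob"), ("carol", "ledger")],
)
assert len(groups) == 2
```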

4. Fine-grained Concurrency Control Mechanism

The original LightRAG lacks effective concurrency control: the consistency of storage operations cannot be guaranteed, and data races and deadlocks occur frequently.
We implemented a Concurrent Control model from scratch, establishing a fine-grained lock manager that supports precise locking at entity and relationship levels. Lock naming uses workspace isolation design: entity:{entity_name}:{workspace} and relationship:{src}:{tgt}:{workspace}. We designed intelligent lock strategies, only locking during merge writes, completely lock-free during entity extraction stages, and preventing deadlocks through sorted lock acquisition mechanisms to avoid circular waiting.
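
The sorted lock acquisition idea can be sketched as follows (a simplified illustration; the real Concurrent Control model is more elaborate, and the class and function names here are hypothetical):

```python
import asyncio

class KeyedLockManager:
    """Simplified sketch of a fine-grained lock manager: one asyncio.Lock
    per key, acquired in sorted order to rule out circular waiting."""

    def __init__(self):
        self._locks: dict[str, asyncio.Lock] = {}

    def _lock_for(self, key: str) -> asyncio.Lock:
        return self._locks.setdefault(key, asyncio.Lock())

    async def acquire_all(self, keys):
        # Sorting gives every task the same global acquisition order,
        # which prevents deadlock by breaking the circular-wait condition.
        ordered = sorted(set(keys))
        for key in ordered:
            await self._lock_for(key).acquire()
        return ordered

    def release_all(self, keys):
        for key in reversed(keys):
            self._locks[key].release()

async def locked_merge(manager, workspace, names):
    """Hold all entity locks for a merge write, then release them."""
    keys = [f"entity:{name}:{workspace}" for name in names]
    held = await manager.acquire_all(keys)
    try:
        return held  # merge writes would happen here
    finally:
        manager.release_all(held)

held = asyncio.run(locked_merge(KeyedLockManager(), "ws1", ["b", "a"]))
assert held == ["entity:a:ws1", "entity:b:ws1"]
```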

Specific Execution Pipeline Example

Complete Graph Index Creation Process

Taking a single document's graph index creation as an example, the entire processing pipeline includes the following key stages:
  • Task Reception Layer: Celery tasks receive Graph index creation requests, calling LightRAG Manager
  • LightRAG Manager Layer: Create independent LightRAG instances for each task, ensuring stateless processing
  • Document Chunking Stage:
    • Content cleaning and preprocessing
    • Intelligent chunking based on token count (supporting overlap)
    • Generate unique chunk IDs and metadata
    • Serial writing to vector storage and text storage
  • Graph Index Construction Stage:
    • Call LLM for concurrent entity relationship extraction
    • Connected component analysis and grouping processing
    • Statistical extraction results
  • Entity Relationship Extraction Stage:
    • Build LLM prompt templates
    • Use semaphores to control concurrency
    • Support optional gleaning extraction
    • Parse extraction results into structured data
  • Connected Component Grouping Processing:
    • Discover connected components and create processing tasks
    • Filter entities and relationships belonging to each component
    • Use semaphores to control component concurrent processing
  • Node Edge Merging Stage:
    • Collect same-named entities and same-direction relationships
    • Use fine-grained locks for concurrent merging
    • Synchronously update graph database and vector database

Core Data Flow Diagram

The Graph Index creation process is essentially a complex data transformation pipeline. The following data flow diagram shows the complete data transformation process from raw documents to structured knowledge graphs:

Data Flow Transformation Process Analysis

🚀 Document Input → Structured Chunking

Raw documents undergo format cleaning and noise filtering, using intelligent chunking algorithms to segment according to semantic boundaries and token limits, generating chunk collections with unique identifiers and metadata. This step ensures data quality and traceability for subsequent processing.
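
A hedged sketch of token-based chunking with overlap; whitespace tokenization stands in for the real tokenizer, and the parameter defaults and ID scheme are illustrative:

```python
import hashlib

def chunk_document(content: str, doc_id: str, max_tokens: int = 1200,
                   overlap_tokens: int = 100):
    """Split a cleaned document into overlapping chunks with stable IDs.

    Whitespace tokenization is a stand-in for the real tokenizer;
    parameter names and the ID scheme are illustrative."""
    tokens = content.split()
    step = max_tokens - overlap_tokens
    chunks = []
    for index, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + max_tokens]
        text = " ".join(window)
        # Hash of (doc id, position, content) gives a unique, stable chunk ID.
        chunk_id = "chunk-" + hashlib.md5(
            f"{doc_id}:{index}:{text}".encode()).hexdigest()
        chunks.append({
            "id": chunk_id,
            "content": text,
            "tokens": len(window),
            "chunk_order_index": index,
            "full_doc_id": doc_id,
        })
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks
```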

🔬 Chunk Data → AI Extraction Results

Chunk data is analyzed through concurrent LLM calls, simultaneously identifying entities (people, organizations, concepts, etc.) and the semantic relationships between them. This stage produces raw, non-deduplicated entity and relationship lists, providing the raw material for subsequent graph construction.

🧠 Extraction Results → Topology Grouping

Based on extracted entity relationships, adjacency graph networks are constructed, using BFS algorithms to discover connected components and identify groups of mutually related entities. For example: technology team-related entities form one group, finance department-related entities form another group. This topology analysis lays the foundation for parallel processing.

⚡ Topology Grouping → Intelligent Merging

Different connected components can be processed completely in parallel, with same-named entities undergoing intelligent deduplication and information aggregation, and same-direction relationships undergoing weight accumulation and description merging. This process integrates fragmented information into complete knowledge units.
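
The dedup-and-aggregate step might look like the following simplified sketch (field names mirror the data model described elsewhere in this document; the merge policy shown is illustrative):

```python
GRAPH_FIELD_SEP = "|"  # delimiter used to aggregate multi-source fields

def merge_entities(records):
    """Merge same-named entity records: concatenate distinct descriptions
    and source ids. A simplified illustration of the merge policy."""
    merged = {}
    for rec in records:
        name = rec["entity_name"]
        if name not in merged:
            merged[name] = dict(rec)
            continue
        current = merged[name]
        for field in ("description", "source_id"):
            parts = current[field].split(GRAPH_FIELD_SEP)
            if rec[field] not in parts:  # skip exact duplicates
                current[field] = GRAPH_FIELD_SEP.join(parts + [rec[field]])
    return merged

def merge_relationships(records):
    """Merge same-direction relationships: accumulate their weights."""
    merged = {}
    for rec in records:
        key = (rec["src_id"], rec["tgt_id"])
        if key not in merged:
            merged[key] = dict(rec)
        else:
            merged[key]["weight"] += rec["weight"]
    return merged
```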

๐Ÿ“ Merged Data โ†’ Content Optimization

Length checks are performed on merged entity and relationship descriptions, with overly long content undergoing intelligent summary compression through LLM, ensuring a balance between information density and storage efficiency. Short content is retained directly, while long content is intelligently summarized.

💾 Optimized Content → Multi-dimensional Storage

Final knowledge content is written simultaneously to multiple storage systems:
  • Graph Database: Stores entity nodes and relationship edges, supporting graph queries
  • Vector Database: Stores semantic vectors, supporting similarity search
  • Text Storage: Retains original chunks, supporting full-text search
This multi-dimensional storage architecture ensures optimal performance of knowledge graphs in different query scenarios.

Data Flow Optimization Features

1. Fine-grained Concurrency Control

We implemented precise locking mechanisms at entity and relationship levels: entity:{entity_name}:{workspace} and relationship:{src}:{tgt}:{workspace}, minimizing lock scope to only during merge writes, with completely parallel entity extraction stages. Through sorted lock acquisition order, we effectively prevent circular waiting and deadlocks.

2. Connected Component-driven Concurrency Optimization

We designed topology analysis based on BFS algorithms, discovering independent entity relationship networks and grouping them for parallel processing. Different connected components are processed completely independently, achieving zero lock competition, while processing in batches by component effectively controls memory peaks.

3. Intelligent Data Merging Strategy

We implemented intelligent entity deduplication based on entity_name, supporting intelligent concatenation and summarization of multiple description fragments, quantitative accumulation of relationship strength, and established complete data lineage recording mechanisms.

Performance Optimization Strategies

1. Connected Component Optimization

Topology-driven Concurrency Strategy:
  • Independent Processing: Different connected components processed completely in parallel
  • Lock Competition Minimization: Entities within components don't conflict across components
  • Memory Efficiency: Batch processing by component, controlling memory usage
The system automatically collects connected component distribution statistics, including total components, maximum component size, average component size, single-entity component count, and large component count, for performance tuning and resource allocation.
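
These statistics can be computed directly from the component list; a minimal sketch (the threshold that makes a component "large" is an assumed parameter):

```python
def component_stats(components, large_threshold: int = 10):
    """Compute the distribution statistics listed above for a set of
    connected components (each a list of entity names).

    large_threshold is an illustrative assumption, not a documented value."""
    sizes = [len(c) for c in components]
    return {
        "total_components": len(sizes),
        "max_component_size": max(sizes, default=0),
        "avg_component_size": (sum(sizes) / len(sizes)) if sizes else 0.0,
        "single_entity_components": sum(1 for s in sizes if s == 1),
        "large_components": sum(1 for s in sizes if s >= large_threshold),
    }

stats = component_stats([["a"], ["b", "c", "d"], ["e", "f"]])
assert stats["total_components"] == 3
assert stats["single_entity_components"] == 1
```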

2. LLM Call Optimization

Batch Processing and Caching Strategy:
  • Concurrency Control: Use semaphores to limit concurrent LLM calls
  • Batch Optimization: Batch processing of similar content
  • Caching Mechanism: Reuse of entity description summary caching
The system intelligently checks description length, automatically calling LLM to generate summaries when exceeding token thresholds, supporting summary result caching reuse for improved efficiency.
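
The semaphore-based concurrency limit can be sketched with asyncio (`fake_llm` stands in for the real LLM call; the limit value is illustrative):

```python
import asyncio

async def extract_all(chunks, llm_call, max_concurrency: int = 8):
    """Run entity extraction over all chunks with a bounded number of
    in-flight LLM calls. llm_call is any async function."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(chunk):
        async with semaphore:  # at most max_concurrency calls in flight
            return await llm_call(chunk)

    return await asyncio.gather(*(bounded(c) for c in chunks))

# Usage with a stand-in for the real LLM call:
async def fake_llm(chunk):
    await asyncio.sleep(0)
    return f"entities({chunk})"

results = asyncio.run(extract_all(["c1", "c2", "c3"], fake_llm, max_concurrency=2))
assert results == ["entities(c1)", "entities(c2)", "entities(c3)"]
```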

3. Storage Write Optimization

Batch Writing and Connection Reuse:
  • Batch Operations: Reduce database round trips
  • Connection Pooling: Reuse database connections
  • Async Writing: Parallel writing to different storage systems

4. Memory Management Optimization

Streaming Processing and Memory Control:
  • Chunk Processing: Streaming chunking of large documents
  • Timely Release: Immediate memory release after processing completion
  • Monitoring Alerts: Memory usage monitoring

Code Organization Structure

Directory Structure

aperag/
├── graph/                        # Graph indexing core module
│   ├── lightrag_manager.py       # LightRAG manager (Celery entry)
│   └── lightrag/                 # LightRAG core implementation
│       ├── lightrag.py           # Main LightRAG class
│       ├── operate.py            # Core operation functions
│       ├── base.py               # Base interface definitions
│       ├── utils.py              # Utility functions
│       ├── prompt.py             # Prompt templates
│       └── kg/                   # Knowledge graph storage implementations
│           ├── neo4j_sync_impl.py    # Neo4j sync implementation
│           └── postgres_sync_impl.py # PostgreSQL sync implementation
├── concurrent_control/           # Concurrency control module
│   ├── manager.py                # Lock manager
│   └── protocols.py              # Lock interface definitions
└── tasks/                        # Task module
    └── document.py               # Document processing business logic

config/
└── celery_tasks.py               # Celery task definitions

Core Interface Design

LightRAG Management Interface

Responsible for instance creation, document processing and deletion entry management, as well as dynamic generation of embedding and LLM functions.

LightRAG Core Interface

Implements core functions such as document chunk storage, graph index construction, document deletion, and connected component discovery and grouping.

Operation Function Interface

Provides entity extraction, node edge merging, chunk processing and other low-level operation functions, supporting asynchronous concurrent execution.

Data Structure Design

Core Data Models

The system uses unified data structure design:
  • Chunk Data: Contains token count, content, sequence index, document ID and file path
  • Entity Data: Contains entity name, type, description, source ID and creation timestamp
  • Relationship Data: Contains source entity, target entity, description, keywords, weight and source information
  • Connected Component Data: Contains component index, entity list, filtered results and total component count
All data structures support multi-source aggregation, using delimiters (such as |) to merge multiple source information.
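
A hedged sketch of these data models as Python dataclasses (field names are inferred from the descriptions above and may differ from the actual implementation):

```python
from dataclasses import dataclass, field
import time

GRAPH_FIELD_SEP = "|"  # delimiter for multi-source aggregation

@dataclass
class Chunk:
    tokens: int
    content: str
    chunk_order_index: int
    full_doc_id: str
    file_path: str

@dataclass
class Entity:
    entity_name: str
    entity_type: str
    description: str
    source_id: str  # multiple sources joined with "|"
    created_at: float = field(default_factory=time.time)

@dataclass
class Relationship:
    src_id: str
    tgt_id: str
    description: str
    keywords: str
    weight: float
    source_id: str

def merge_sources(existing: str, new: str) -> str:
    """Aggregate multi-source fields with the '|' delimiter, skipping duplicates."""
    parts = existing.split(GRAPH_FIELD_SEP)
    return existing if new in parts else existing + GRAPH_FIELD_SEP + new

assert merge_sources("chunk-1", "chunk-2") == "chunk-1|chunk-2"
assert merge_sources("chunk-1|chunk-2", "chunk-1") == "chunk-1|chunk-2"
```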

Performance Monitoring and Debugging

1. Performance Metrics

Key Performance Indicators (KPIs):
  • Document Processing Throughput: Documents processed per minute
  • Entity Extraction Accuracy: Quality assessment of extracted entities
  • Connected Component Distribution: Complexity analysis of topological structure
  • LLM Call Efficiency: Average response time and concurrency
  • Storage Write Performance: Database operation latency statistics

2. Debugging Tools

Structured Logging: The system provides complete structured logging functionality, including entity extraction progress tracking, entity merge detail recording, relationship merge status monitoring, etc. Logs record processing progress percentages, entity relationship quantity statistics, summary generation types and other key information.

3. Performance Analysis

Execution Time Statistics: Through performance decorators, key functions undergo execution time statistics, including entity extraction, node edge merging and other core operation duration analysis, facilitating performance optimization and bottleneck identification.

Configuration and Environment

1. Core Configuration Parameters

LightRAG Configuration: The system supports rich configuration parameter tuning, including chunk size, overlap size, LLM concurrency count, similarity thresholds, batch size, summary parameters, embedding token limits, etc. Default configurations are optimized for Chinese environments, supporting flexible adjustment according to actual needs.

2. Storage Configuration

Multi-storage Backend Support:
# Environment variable configuration
GRAPH_INDEX_KV_STORAGE=PGOpsSyncKVStorage          # KV storage
GRAPH_INDEX_VECTOR_STORAGE=PGOpsSyncVectorStorage  # Vector storage
GRAPH_INDEX_GRAPH_STORAGE=Neo4JSyncStorage         # Graph storage

# PostgreSQL configuration
POSTGRES_HOST=127.0.0.1
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres

# Neo4J configuration example
NEO4J_HOST=127.0.0.1
NEO4J_PORT=7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password
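
A minimal, illustrative loader for these variables (the defaults mirror the example values above; the function itself is hypothetical, not part of the ApeRAG API):

```python
import os

def load_storage_config():
    """Read the storage-backend environment variables shown above,
    falling back to the example defaults. Illustrative only."""
    return {
        "kv_storage": os.environ.get("GRAPH_INDEX_KV_STORAGE", "PGOpsSyncKVStorage"),
        "vector_storage": os.environ.get("GRAPH_INDEX_VECTOR_STORAGE", "PGOpsSyncVectorStorage"),
        "graph_storage": os.environ.get("GRAPH_INDEX_GRAPH_STORAGE", "Neo4JSyncStorage"),
        "postgres": {
            "host": os.environ.get("POSTGRES_HOST", "127.0.0.1"),
            "port": int(os.environ.get("POSTGRES_PORT", "5432")),
        },
        "neo4j": {
            "host": os.environ.get("NEO4J_HOST", "127.0.0.1"),
            "port": int(os.environ.get("NEO4J_PORT", "7687")),
        },
    }
```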

Summary

We conducted large-scale refactoring and optimization of the original LightRAG, achieving a high-concurrency knowledge graph construction system truly suitable for production environments:
🔬 Algorithm Deep Dive: For specific algorithm implementation details of entity extraction and merging, continue reading LightRAG Entity Extraction and Merging Mechanism

Core Technical Contributions

  • Complete rewrite to stateless architecture: We completely rewrote LightRAG's core architecture, solving the original version's inability to execute concurrently, with each task using independent instances supporting true multi-tenant isolation
  • Self-developed Concurrent Control model: We designed a fine-grained lock management system, implementing precise concurrency control at entity and relationship levels
  • Connected component concurrency optimization: We designed intelligent concurrency strategies based on graph topology analysis, maximizing parallel processing efficiency
  • Refactored storage layer architecture: We completely rewrote the storage abstraction layer, solving the original version's unreliable storage implementation and inconsistent multi-storage backend implementations
  • End-to-end data flow design: We designed a complete data transformation pipeline, with full-chain optimization from document chunking to multi-storage writing

Performance Improvements

Through these technical improvements, we achieved:
  • 5-10x performance improvement in concurrent processing capability
  • 95%+ reduction in lock competition
  • True multi-tenant concurrent processing support
  • Solving production environment stability issues that the original version couldn't handle
  • Complete workspace-level data isolation, eliminating data conflicts and pollution between different collections
The refactored system transforms the original research prototype into a production-ready, high-performance knowledge graph construction platform that truly meets enterprise-level requirements.

Related Documents

  • 📋 Indexing Architecture Design - Overall indexing architecture
  • 📖 LightRAG Entity Extraction and Merging Mechanism - Core algorithm details
  • 🏗️ Graph Index 创建流程技术文档 - Chinese Version