ApeRAG Document Upload Architecture Design
Overview
This document details the complete architecture design of the document upload module in the ApeRAG project, covering the full pipeline from file upload, temporary storage, document parsing, format conversion to final index construction.
Core Design Philosophy: Adopts a two-phase commit pattern, separating file upload (temporary storage) from document confirmation (formal addition), providing better user experience and resource management capabilities.
System Architecture
Overall Architecture
┌─────────────────────────────────────────────────────────────┐
│ Frontend │
│ (Next.js) │
└────────┬───────────────────────────────────┬────────────────┘
│ │
│ Step 1: Upload │ Step 2: Confirm
│ POST /documents/upload │ POST /documents/confirm
▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ View Layer: aperag/views/collections.py │
│ - HTTP request handling │
│ - JWT authentication │
│ - Parameter validation │
└────────┬───────────────────────────────────┬────────────────┘
│ │
│ document_service.upload_document() │ document_service.confirm_documents()
▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Service Layer: aperag/service/document_service.py │
│ - Business logic orchestration │
│ - File validation (type, size) │
│ - SHA-256 hash deduplication │
│ - Quota checking │
│ - Transaction management │
└────────┬───────────────────────────────────┬────────────────┘
│ │
│ Step 1 │ Step 2
▼ ▼
┌────────────────────────┐ ┌────────────────────────────┐
│ 1. Create Document │ │ 1. Update Document status │
│ status=UPLOADED │ │ UPLOADED → PENDING │
│ 2. Save to ObjectStore│ │ 2. Create DocumentIndex │
│ 3. Calculate hash │ │ 3. Trigger indexing tasks │
└────────┬───────────────┘ └────────┬───────────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ │
│ ┌───────────────┐ ┌──────────────────┐ ┌─────────────┐ │
│ │ PostgreSQL │ │ Object Store │ │ Vector DB │ │
│ │ │ │ │ │ │ │
│ │ - document │ │ - Local/S3 │ │ - Qdrant │ │
│ │ - document_ │ │ - Original files │ │ - Vectors │ │
│ │ index │ │ - Converted files│ │ │ │
│ └───────────────┘ └──────────────────┘ └─────────────┘ │
│ │
│ ┌───────────────┐ ┌──────────────────┐ │
│ │ Elasticsearch │ │ Neo4j/PG │ │
│ │ │ │ │ │
│ │ - Full-text │ │ - Knowledge Graph│ │
│ └───────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────┐
│ Celery Workers │
│ │
│ - Doc parsing │
│ - Format convert │
│ - Content extract│
│ - Doc chunking │
│ - Index building │
└───────────────────┘
Layered Architecture
┌─────────────────────────────────────────────┐
│ View Layer (views/collections.py) │ HTTP handling, auth, validation
└─────────────────┬───────────────────────────┘
│ calls
┌─────────────────▼───────────────────────────┐
│ Service Layer (service/document_service.py)│ Business logic, transaction, permission
└─────────────────┬───────────────────────────┘
│ calls
┌─────────────────▼───────────────────────────┐
│ Repository Layer (db/ops.py, objectstore/) │ Data access abstraction
└─────────────────┬───────────────────────────┘
│ accesses
┌─────────────────▼───────────────────────────┐
│ Storage Layer (PG, S3, Qdrant, ES, Neo4j) │ Data persistence
└─────────────────────────────────────────────┘
Core Process Details
Phase 0: API Interface Definition
The system provides three main interfaces:
-
Upload File (Two-phase mode - Step 1)
- Endpoint:
POST /api/v1/collections/{collection_id}/documents/upload - Function: Upload file to temporary storage, status
UPLOADED - Returns:
document_id,filename,size,status
- Endpoint:
-
Confirm Documents (Two-phase mode - Step 2)
- Endpoint:
POST /api/v1/collections/{collection_id}/documents/confirm - Function: Confirm uploaded documents, trigger index building
- Parameters:
document_idsarray - Returns:
confirmed_count,failed_count,failed_documents
- Endpoint:
-
One-step Upload (Legacy mode, backward compatible)
- Endpoint:
POST /api/v1/collections/{collection_id}/documents - Function: Upload and directly add to knowledge base, status directly to
PENDING - Supports batch upload
- Endpoint:
Phase 1: File Upload and Temporary Storage
1.1 Upload Flow
User selects files
│
▼
Frontend calls upload API
│
▼
View layer validates identity and params
│
▼
Service layer processes business logic:
│
├─► Verify collection exists and active
│
├─► Validate file type and size
│
├─► Read file content
│
├─► Calculate SHA-256 hash
│
└─► Transaction processing:
│
├─► Duplicate detection (by filename + hash)
│ ├─ Exact match: Return existing doc (idempotent)
│ ├─ Same name, different content: Throw conflict error
│ └─ New document: Continue creation
│
├─► Create Document record (status=UPLOADED)
│
├─► Upload to object store
│ └─ Path: user-{user_id}/{collection_id}/{document_id}/original{suffix}
│
└─► Update document metadata (object_path)
1.2 File Validation
Supported File Types:
- Documents:
.pdf,.doc,.docx,.ppt,.pptx,.xls,.xlsx - Text:
.txt,.md,.html,.json,.xml,.yaml,.yml,.csv - Images:
.png,.jpg,.jpeg,.gif,.bmp,.tiff,.tif - Audio:
.mp3,.wav,.m4a - Archives:
.zip,.tar,.gz,.tgz
Size Limits:
- Default: 100 MB (configurable via
MAX_DOCUMENT_SIZEenvironment variable) - Extracted total size: 5 GB (
MAX_EXTRACTED_SIZE)
1.3 Duplicate Detection Mechanism
Uses filename + SHA-256 hash dual detection:
| Scenario | Filename | Hash | System Behavior |
| Exact match | Same | Same | Return existing document (idempotent) |
| Name conflict | Same | Different | Throw DocumentNameConflictException |
| New document | Different | - | Create new document record |
Advantages:
- ✅ Supports idempotent upload: Network retries won't create duplicates
- ✅ Prevents content conflicts: Same name with different content prompts user
- ✅ Saves storage space: Same content stored only once
Phase 2: Temporary Storage Configuration
2.1 Object Storage Types
System supports two object storage backends, switchable via environment variables:
1. Local Storage (Local filesystem)
Use cases:
- Development and testing environments
- Small-scale deployments
- Single-machine deployments
Configuration:
# Development environment
OBJECT_STORE_TYPE=local
OBJECT_STORE_LOCAL_ROOT_DIR=.objects
# Docker environment
OBJECT_STORE_TYPE=local
OBJECT_STORE_LOCAL_ROOT_DIR=/shared/objects
Storage path example:
.objects/
└── user-google-oauth2-123456/
└── col_abc123/
└── doc_xyz789/
├── original.pdf # Original file
├── converted.pdf # Converted PDF
├── processed_content.md # Parsed Markdown
├── chunks/ # Chunked data
│ ├── chunk_0.json
│ └── chunk_1.json
└── images/ # Extracted images
├── page_0.png
└── page_1.png
2. S3 Storage (Compatible with AWS S3/MinIO/OSS, etc.)
Use cases:
- Production environments
- Large-scale deployments
- Distributed deployments
- High availability and disaster recovery needs
Configuration:
OBJECT_STORE_TYPE=s3
OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 # MinIO/S3 address
OBJECT_STORE_S3_REGION=us-east-1 # AWS Region
OBJECT_STORE_S3_ACCESS_KEY=minioadmin # Access Key
OBJECT_STORE_S3_SECRET_KEY=minioadmin # Secret Key
OBJECT_STORE_S3_BUCKET=aperag # Bucket name
OBJECT_STORE_S3_PREFIX_PATH=dev/ # Optional path prefix
OBJECT_STORE_S3_USE_PATH_STYLE=true # Set to true for MinIO
2.2 Object Storage Path Rules
Path Format:
{prefix}/user-{user_id}/{collection_id}/{document_id}/{filename}
Components:
prefix: Optional global prefix (S3 only)user_id: User ID (|replaced with-)collection_id: Collection IDdocument_id: Document IDfilename: Filename (e.g.,original.pdf,page_0.png)
Multi-tenancy Isolation:
- Each user has an independent namespace
- Each collection has an independent storage directory
- Each document has an independent folder
Phase 3: Document Confirmation and Index Building
3.1 Confirmation Flow
User clicks "Save to Collection"
│
▼
Frontend calls confirm API
│
▼
Service layer processes:
│
├─► Validate collection configuration
│
├─► Check Quota (deduct quota at confirmation stage)
│
└─► For each document_id:
│
├─► Verify document status is UPLOADED
│
├─► Update document status: UPLOADED → PENDING
│
├─► Create index records based on collection config:
│ ├─ VECTOR (Vector index, required)
│ ├─ FULLTEXT (Full-text index, required)
│ ├─ GRAPH (Knowledge graph, optional)
│ ├─ SUMMARY (Document summary, optional)
│ └─ VISION (Vision index, optional)
│
└─► Return confirmation result
│
▼
Trigger Celery task: reconcile_document_indexes
│
▼
Background async index building
3.2 Quota Management
Check Timing:
- ❌ Not checked during upload phase (temporary storage doesn't consume quota)
- ✅ Checked during confirmation phase (formal addition consumes quota)
Quota Types:
-
User Global Quota
max_document_count: Total document count limit per user- Default: 1000 (configurable via
MAX_DOCUMENT_COUNT)
-
Per-Collection Quota
max_document_count_per_collection: Document count limit per collection- Excludes
UPLOADEDandDELETEDstatus documents
Quota Exceeded Handling:
- Throws
QuotaExceededException - Returns HTTP 400 error
- Includes current usage and quota limit information
Phase 4: Document Parsing and Format Conversion
4.1 Parser Architecture
System uses a multi-parser chain invocation architecture, where each parser handles specific file types:
DocParser (Main Controller)
│
├─► MinerUParser
│ └─ Function: High-precision PDF parsing (commercial API)
│ └─ Supports: .pdf
│
├─► DocRayParser
│ └─ Function: Document layout analysis and content extraction
│ └─ Supports: .pdf, .docx, .pptx, .xlsx
│
├─► ImageParser
│ └─ Function: Image content recognition (OCR + vision understanding)
│ └─ Supports: .jpg, .png, .gif, .bmp, .tiff
│
├─► AudioParser
│ └─ Function: Audio transcription (Speech-to-Text)
│ └─ Supports: .mp3, .wav, .m4a
│
└─► MarkItDownParser (Fallback)
└─ Function: Universal document to Markdown conversion
└─ Supports: Almost all common formats
4.2 Parser Configuration
Configuration Method: Dynamically controlled via Collection Config
{
"parser_config": {
"use_mineru": false, // Enable MinerU (requires API Token)
"use_doc_ray": false, // Enable DocRay
"use_markitdown": true, // Enable MarkItDown (default)
"mineru_api_token": "xxx" // MinerU API Token (optional)
}
}
Environment Variable Configuration:
USE_MINERU_API=false # Globally enable MinerU
MINERU_API_TOKEN=your_token # MinerU API Token
4.3 Parsing Flow
Celery Worker receives indexing task
│
▼
1. Download original file from object store
│
▼
2. Select Parser based on file extension
│
├─► Try first matching Parser
│ ├─ Success: Return parsing result
│ └─ Failure: FallbackError → Try next Parser
│
└─► Final fallback: MarkItDownParser
│
▼
3. Parsing result (Parts):
│
├─► MarkdownPart: Text content
│ └─ Contains: headings, paragraphs, lists, tables, etc.
│
├─► PdfPart: PDF file
│ └─ For: linearization, page rendering
│
└─► AssetBinPart: Binary resources
└─ Contains: images, embedded files, etc.
│
▼
4. Post-processing:
│
├─► PDF pages to images (required for Vision index)
│ └─ Each page rendered as PNG image
│ └─ Saved to {document_path}/images/page_N.png
│
├─► PDF linearization (speed up browser loading)
│ └─ Use pikepdf to optimize PDF structure
│ └─ Saved to {document_path}/converted.pdf
│
└─► Extract text content (plain text)
└─ Merge all MarkdownPart content
└─ Saved to {document_path}/processed_content.md
│
▼
5. Save to object store
4.4 Format Conversion Examples
Example 1: PDF Document
Input: user_manual.pdf (5 MB)
│
▼
Parser selection: DocRayParser / MarkItDownParser
│
▼
Output Parts:
├─ MarkdownPart: "# User Manual\n\n## Chapter 1\n..."
└─ PdfPart: <original PDF data>
│
▼
Post-processing:
├─ Render 50 pages to images → images/page_0.png ~ page_49.png
├─ Linearize PDF → converted.pdf
└─ Extract text → processed_content.md
Example 2: Image File
Input: screenshot.png (2 MB)
│
▼
Parser selection: ImageParser
│
▼
Output Parts:
├─ MarkdownPart: "[OCR extracted text]"
└─ AssetBinPart: <original image data> (vision_index=true)
│
▼
Post-processing:
└─ Save original image copy → images/file.png
Example 3: Audio File
Input: meeting_record.mp3 (50 MB)
│
▼
Parser selection: AudioParser
│
▼
Output Parts:
└─ MarkdownPart: "[Transcribed meeting content]"
│
▼
Post-processing:
└─ Save transcription text → processed_content.md
Phase 5: Index Building
5.1 Index Types and Functions
| Index Type | Required | Function Description | Storage Location |
| VECTOR | ✅ Required | Vector retrieval, semantic search | Qdrant / Elasticsearch |
| FULLTEXT | ✅ Required | Full-text search, keyword search | Elasticsearch |
| GRAPH | ❌ Optional | Knowledge graph, entity & relation extraction | Neo4j / PostgreSQL |
| SUMMARY | ❌ Optional | Document summary, LLM generated | PostgreSQL (index_data) |
| VISION | ❌ Optional | Vision understanding, image content analysis | Qdrant (vectors) + PG (metadata) |
5.2 Index Building Flow
Celery Worker: reconcile_document_indexes task
│
▼
1. Scan DocumentIndex table, find indexes needing processing
│
├─► PENDING status + observed_version < version
│ └─ Need to create or update index
│
└─► DELETING status
└─ Need to delete index
│
▼
2. Group by document, process one by one
│
▼
3. For each document:
│
├─► parse_document (parse document)
│ ├─ Download original file from object store
│ ├─ Call DocParser to parse
│ └─ Return ParsedDocumentData
│
└─► For each index type:
│
├─► create_index (create/update index)
│ │
│ ├─ VECTOR index:
│ │ ├─ Document chunking
│ │ ├─ Generate vectors using Embedding model
│ │ └─ Write to Qdrant
│ │
│ ├─ FULLTEXT index:
│ │ ├─ Extract plain text content
│ │ ├─ Chunk by paragraph/section
│ │ └─ Write to Elasticsearch
│ │
│ ├─ GRAPH index:
│ │ ├─ Extract entities using LightRAG
│ │ ├─ Extract entity relationships
│ │ └─ Write to Neo4j/PostgreSQL
│ │
│ ├─ SUMMARY index:
│ │ ├─ Generate summary using LLM
│ │ └─ Save to DocumentIndex.index_data
│ │
│ └─ VISION index:
│ ├─ Extract image Assets
│ ├─ Understand image content using Vision LLM
│ ├─ Generate image description vectors
│ └─ Write to Qdrant
│
└─► Update index status
├─ Success: CREATING → ACTIVE
└─ Failure: CREATING → FAILED
│
▼
4. Update document overall status
│
├─ All indexes ACTIVE → Document.status = COMPLETE
├─ Any index FAILED → Document.status = FAILED
└─ Some indexes still processing → Document.status = RUNNING
5.3 Document Chunking
Chunking Strategy:
- Recursive character splitting (RecursiveCharacterTextSplitter)
- Prioritize splitting by natural paragraphs and sections
- Maintain context overlap
Chunking Parameters:
{
"chunk_size": 1000, // Max characters per chunk
"chunk_overlap": 200, // Overlap characters
"separators": ["\n\n", "\n", " ", ""] // Separator priority
}
Chunking Result Storage:
{document_path}/chunks/
├─ chunk_0.json: {"text": "...", "metadata": {...}}
├─ chunk_1.json: {"text": "...", "metadata": {...}}
└─ ...
Database Design
Table 1: document (Document Metadata)
Table Structure:
| Field | Type | Description | Index |
id | String(24) | Document ID, primary key, format: doc{random_id} | PK |
name | String(1024) | Filename | - |
user | String(256) | User ID (supports multiple IDPs) | ✅ Index |
collection_id | String(24) | Collection ID | ✅ Index |
status | Enum | Document status (see table below) | ✅ Index |
size | BigInteger | File size (bytes) | - |
content_hash | String(64) | SHA-256 hash (for deduplication) | ✅ Index |
object_path | Text | Object store path (deprecated, use doc_metadata) | - |
doc_metadata | Text | Document metadata (JSON string) | - |
gmt_created | DateTime(tz) | Creation time (UTC) | - |
gmt_updated | DateTime(tz) | Update time (UTC) | - |
gmt_deleted | DateTime(tz) | Deletion time (soft delete) | ✅ Index |
Unique Constraint:
UNIQUE INDEX uq_document_collection_name_active
ON document (collection_id, name)
WHERE gmt_deleted IS NULL;
- Within the same collection, active document names cannot be duplicated
- Deleted documents are excluded from uniqueness check
Document Status Enum (
DocumentStatus):| Status | Description | When Set | Visibility |
UPLOADED | Uploaded to temporary storage | upload_document API | Frontend file selection UI |
PENDING | Waiting for index building | confirm_documents API | Document list (processing) |
RUNNING | Index building in progress | Celery task starts processing | Document list (processing) |
COMPLETE | All indexes completed | All indexes become ACTIVE | Document list (available) |
FAILED | Index building failed | Any index fails | Document list (failed) |
DELETED | Deleted | delete_document API | Not visible (soft delete) |
EXPIRED | Temporary document expired | Scheduled cleanup task | Not visible |
Document Metadata Example (
doc_metadata JSON field):{
"object_path": "user-xxx/col_xxx/doc_xxx/original.pdf",
"converted_path": "user-xxx/col_xxx/doc_xxx/converted.pdf",
"processed_content_path": "user-xxx/col_xxx/doc_xxx/processed_content.md",
"images": [
"user-xxx/col_xxx/doc_xxx/images/page_0.png",
"user-xxx/col_xxx/doc_xxx/images/page_1.png"
],
"parser_used": "DocRayParser",
"parse_duration_ms": 5420,
"page_count": 50,
"custom_field": "value"
}
Table 2: document_index (Index Status Management)
Table Structure:
| Field | Type | Description | Index |
id | Integer | Auto-increment ID, primary key | PK |
document_id | String(24) | Related document ID | ✅ Index |
index_type | Enum | Index type (see table below) | ✅ Index |
status | Enum | Index status (see table below) | ✅ Index |
version | Integer | Index version number | - |
observed_version | Integer | Processed version number | - |
index_data | Text | Index data (JSON), e.g., summary content | - |
error_message | Text | Error message (on failure) | - |
gmt_created | DateTime(tz) | Creation time | - |
gmt_updated | DateTime(tz) | Update time | - |
gmt_last_reconciled | DateTime(tz) | Last reconciliation time | - |
Unique Constraint:
UNIQUE CONSTRAINT uq_document_index
ON document_index (document_id, index_type);
- Each document has only one record per index type
Index Type Enum (
DocumentIndexType):| Type | Value | Description | External Storage |
VECTOR | "VECTOR" | Vector index | Qdrant / Elasticsearch |
FULLTEXT | "FULLTEXT" | Full-text index | Elasticsearch |
GRAPH | "GRAPH" | Knowledge graph | Neo4j / PostgreSQL |
SUMMARY | "SUMMARY" | Document summary | PostgreSQL (index_data) |
VISION | "VISION" | Vision index | Qdrant + PostgreSQL |
Index Status Enum (
DocumentIndexStatus):| Status | Description | When Set |
PENDING | Waiting for processing | confirm_documents creates index record |
CREATING | Creating | Celery Worker starts processing |
ACTIVE | Ready for use | Index building successful |
DELETING | Marked for deletion | delete_document API |
DELETION_IN_PROGRESS | Deleting | Celery Worker is deleting |
FAILED | Failed | Index building failed |
Version Control Mechanism:
version: Expected index version (incremented on document update)observed_version: Processed version number- When
version > observed_version, triggers index update
Reconciler:
# Query indexes needing processing
SELECT * FROM document_index
WHERE status = 'PENDING'
AND observed_version < version;
# Update after processing
UPDATE document_index
SET status = 'ACTIVE',
observed_version = version,
gmt_last_reconciled = NOW()
WHERE id = ?;
Table Relationship Diagram
┌─────────────────────────────────┐
│ collection │
│ ───────────────────────────── │
│ id (PK) │
│ name │
│ config (JSON) │
│ status │
│ ... │
└────────────┬────────────────────┘
│ 1:N
▼
┌─────────────────────────────────┐
│ document │
│ ───────────────────────────── │
│ id (PK) │
│ collection_id (FK) │◄──── Unique constraint: (collection_id, name)
│ name │
│ user │
│ status (Enum) │
│ size │
│ content_hash (SHA-256) │
│ doc_metadata (JSON) │
│ gmt_created │
│ gmt_deleted │
│ ... │
└────────────┬────────────────────┘
│ 1:N
▼
┌─────────────────────────────────┐
│ document_index │
│ ───────────────────────────── │
│ id (PK) │
│ document_id (FK) │◄──── Unique constraint: (document_id, index_type)
│ index_type (Enum) │
│ status (Enum) │
│ version │
│ observed_version │
│ index_data (JSON) │
│ error_message │
│ gmt_last_reconciled │
│ ... │
└─────────────────────────────────┘
State Machine and Lifecycle
Document State Transitions
┌─────────────────────────────────────────────┐
│ │
│ ▼
[Upload] ──► UPLOADED ──► [Confirm] ──► PENDING ──► RUNNING ──► COMPLETE
│ │
│ ▼
│ FAILED
│ │
│ ▼
└──────► [Delete] ──────────────► DELETED
│
┌───────────────────────────────────┘
│
▼
EXPIRED (Scheduled cleanup of unconfirmed docs)
Key Transitions:
- UPLOADED → PENDING: User clicks "Save to Collection"
- PENDING → RUNNING: Celery Worker starts processing
- RUNNING → COMPLETE: All indexes successful
- RUNNING → FAILED: Any index fails
- Any status → DELETED: User deletes document
Index State Transitions
[Create index record] ──► PENDING ──► CREATING ──► ACTIVE
│
▼
FAILED
│
▼
┌──────────► PENDING (retry)
│
[Delete request] ────────┼──────────► DELETING ──► DELETION_IN_PROGRESS ──► (record deleted)
│
└──────────► (directly delete record, if PENDING/FAILED)
Async Task Scheduling (Celery)
Task Definitions
Main Task:
reconcile_document_indexes- Trigger timing:
- After
confirm_documentsAPI call - Scheduled task (every 30 seconds)
- Manual trigger (admin interface)
- After
- Function: Scan
document_indextable, process indexes needing reconciliation
Sub-tasks:
parse_document_task: Parse document contentcreate_vector_index_task: Create vector indexcreate_fulltext_index_task: Create full-text indexcreate_graph_index_task: Create knowledge graph indexcreate_summary_index_task: Create summary indexcreate_vision_index_task: Create vision index
Task Scheduling Strategy
Concurrency Control:
- Each Worker processes at most N documents simultaneously (default 4)
- Multiple indexes of each document can be built in parallel
- Use Celery's
task_acks_late=Trueto ensure tasks aren't lost
Failure Retry:
- Maximum 3 retries
- Exponential backoff (1 min → 5 min → 15 min)
- Marked as
FAILEDafter 3 failures
Idempotency:
- All tasks support repeated execution
- Use
observed_versionmechanism to avoid duplicate processing - Same input produces same output
Design Features and Advantages
1. Two-Phase Commit Design
Advantages:
- ✅ Better User Experience: Fast upload response, doesn't block user operations
- ✅ Selective Addition: Can selectively confirm partial files after batch upload
- ✅ Reasonable Resource Control: Unconfirmed documents don't build indexes, don't consume quota
- ✅ Failure Recovery Friendly: Temporary documents can be periodically cleaned up without affecting business
Status Isolation:
Temporary status (UPLOADED):
- Not counted in quota
- Doesn't trigger indexing
- Can be automatically cleaned up
Formal status (PENDING/RUNNING/COMPLETE):
- Counted in quota
- Triggers index building
- Won't be automatically cleaned up
2. Idempotency Design
File-Level Idempotency:
- SHA-256 hash deduplication
- Same file uploaded multiple times returns same
document_id - Avoids storage space waste
API-Level Idempotency:
upload_document: Repeated upload returns existing documentconfirm_documents: Repeated confirmation doesn't create duplicate indexesdelete_document: Repeated deletion returns success (soft delete)
3. Multi-Tenancy Isolation
Storage Isolation:
user-{user_A}/... # User A's files
user-{user_B}/... # User B's files
Database Isolation:
- All queries filter by
userfield - Collection-level permission control (
collection.user) - Soft delete support (
gmt_deleted)
4. Flexible Storage Backend
Unified Interface:
AsyncObjectStore:
- put(path, data)
- get(path)
- delete_objects_by_prefix(prefix)
Runtime Switching:
- Switch between Local/S3 via environment variables
- No need to modify business code
- Supports custom storage backends (just implement the interface)
5. Transaction Consistency
Two-Phase Commit for Database + Object Store:
async with transaction:
# 1. Create database record
document = create_document_record()
# 2. Upload to object store
await object_store.put(path, data)
# 3. Update metadata
document.doc_metadata = json.dumps(metadata)
# All operations succeed to commit, any failure rolls back
Failure Handling:
- Database record creation fails: Don't upload file
- File upload fails: Rollback database record
- Metadata update fails: Rollback previous operations
6. Observability
Audit Logging:
@auditdecorator records all document operations- Includes: user, time, operation type, resource ID
Task Tracking:
gmt_last_reconciled: Last processing timeerror_message: Failure reason- Celery task ID: Link log tracing
Monitoring Metrics:
- Document upload rate
- Index building duration
- Failure rate statistics
Performance Optimization
1. Async Processing
Upload Doesn't Block:
- Returns immediately after file upload to object store
- Index building executes asynchronously in Celery
- Frontend gets progress via polling or WebSocket
2. Batch Operations
Batch Confirmation:
confirm_documents(document_ids=[id1, id2, ..., idN])
- Process multiple documents in one transaction
- Batch create index records
- Reduce database round-trips
3. Caching Strategy
Parsing Result Cache:
- Parsed content saved to
processed_content.md - Subsequent index rebuilds can read directly without re-parsing
Chunking Result Cache:
- Chunking results saved to
chunks/directory - Vector index rebuilds can reuse chunking results
4. Parallel Index Building
Multiple Indexes in Parallel:
# VECTOR, FULLTEXT, GRAPH can be built in parallel
await asyncio.gather(
create_vector_index(),
create_fulltext_index(),
create_graph_index()
)
Error Handling
Common Exceptions
| Exception Type | HTTP Status | Trigger Scenario | Handling Suggestion |
ResourceNotFoundException | 404 | Collection/document doesn't exist | Check if ID is correct |
CollectionInactiveException | 400 | Collection not active | Wait for collection initialization |
DocumentNameConflictException | 409 | Same name, different content | Rename file or delete old document |
QuotaExceededException | 429 | Quota exceeded | Upgrade plan or delete old documents |
InvalidFileTypeException | 400 | Unsupported file type | Check supported file type list |
FileSizeTooLargeException | 413 | File too large | Split file or compress |
Exception Propagation
Service Layer throws exception
│
▼
View Layer catches and converts
│
▼
Exception Handler unified handling
│
▼
Return standard JSON response:
{
"error_code": "QUOTA_EXCEEDED",
"message": "Document count limit exceeded",
"details": {
"limit": 1000,
"current": 1000
}
}
Related Files Index
Core Implementation
- View Layer:
aperag/views/collections.py- HTTP interface definition - Service Layer:
aperag/service/document_service.py- Business logic - Database Models:
aperag/db/models.py- Document, DocumentIndex table definitions - Database Operations:
aperag/db/ops.py- CRUD operation encapsulation
Object Storage
- Interface Definition:
aperag/objectstore/base.py- AsyncObjectStore abstract class - Local Implementation:
aperag/objectstore/local.py- Local filesystem storage - S3 Implementation:
aperag/objectstore/s3.py- S3-compatible storage
Document Parsing
- Main Controller:
aperag/docparser/doc_parser.py- DocParser - Parser Implementations:
aperag/docparser/mineru_parser.py- MinerU PDF parsingaperag/docparser/docray_parser.py- DocRay document parsingaperag/docparser/markitdown_parser.py- MarkItDown universal parsingaperag/docparser/image_parser.py- Image OCRaperag/docparser/audio_parser.py- Audio transcription
- Document Processing:
aperag/index/document_parser.py- Parsing flow orchestration
Index Building
- Index Management:
aperag/index/manager.py- DocumentIndexManager - Vector Index:
aperag/index/vector_index.py- VectorIndexer - Full-text Index:
aperag/index/fulltext_index.py- FulltextIndexer - Knowledge Graph:
aperag/index/graph_index.py- GraphIndexer - Document Summary:
aperag/index/summary_index.py- SummaryIndexer - Vision Index:
aperag/index/vision_index.py- VisionIndexer
Task Scheduling
- Task Definitions:
config/celery_tasks.py- Celery task registration - Reconciler:
aperag/tasks/reconciler.py- DocumentIndexReconciler - Document Tasks:
aperag/tasks/document.py- DocumentIndexTask
Frontend Implementation
- Document List:
web/src/app/workspace/collections/[collectionId]/documents/page.tsx - Document Upload:
web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx
Summary
ApeRAG's document upload module adopts a two-phase commit + multi-parser chain invocation + parallel multi-index building architecture design:
Core Features:
- ✅ Two-Phase Commit: Upload (temporary storage) → Confirm (formal addition), providing better user experience
- ✅ SHA-256 Deduplication: Prevents duplicate documents, supports idempotent upload
- ✅ Flexible Storage Backend: Local/S3 configurable switching, unified interface abstraction
- ✅ Multi-Parser Architecture: Supports MinerU, DocRay, MarkItDown and other parsers
- ✅ Automatic Format Conversion: PDF→images, audio→text, images→OCR text
- ✅ Multi-Index Coordination: Five index types: vector, full-text, graph, summary, vision
- ✅ Quota Management: Quota deducted at confirmation stage, reasonable resource control
- ✅ Async Processing: Celery task queue, doesn't block user operations
- ✅ Transaction Consistency: Two-phase commit for database + object store
- ✅ Observability: Audit logs, task tracking, complete error information recording
This design ensures both high performance and scalability, supports complex document processing scenarios (multi-format, multi-language, multi-modal), while maintaining good fault tolerance and user experience.