
What is a Document Family?

A Document Family is Kodexa’s core entity. It represents a single document together with all of its versions, processing history, metadata, and extracted data. Every document uploaded to a store creates a document family.

Accessing Document Families

There are two ways to access documents in Kodexa:

1. Through Stores (File System Style)

Use the Stores API when you know the document’s path:
GET /api/stores/{orgSlug}/{slug}/fs/2024/invoice-001.pdf
Best for: Browsing, organizing, and managing documents by path

2. Direct Access (By ID)

Use the DocumentFamilies API when you have the document family ID:
GET /api/documentFamilies/{id}
Best for: Working with specific documents, processing results, external integrations
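The two access styles correspond to two URL shapes. The following is a minimal sketch that builds each one from the endpoint patterns shown above. The helper names `store_file_url` and `document_family_url` are illustrative and not part of the Kodexa SDK, and the base URL is an assumption borrowed from the examples in this guide:

```python
from urllib.parse import quote

# Assumed base URL, matching the one used in the examples in this guide
BASE_URL = "https://platform.kodexa.com"


def store_file_url(org_slug: str, store_slug: str, path: str) -> str:
    """Build the file-system style URL for a document at a known path."""
    # quote() leaves "/" unescaped by default, so nested folders stay intact
    clean_path = quote(path.lstrip("/"))
    return f"{BASE_URL}/api/stores/{quote(org_slug)}/{quote(store_slug)}/fs/{clean_path}"


def document_family_url(family_id: str) -> str:
    """Build the direct-access URL for a document family ID."""
    return f"{BASE_URL}/api/documentFamilies/{quote(family_id, safe='')}"


print(store_file_url("my-org", "invoices", "2024/invoice-001.pdf"))
print(document_family_url("550e8400-e29b-41d4-a716-446655440000"))
```

Keeping the two builders separate mirrors the guidance above: use the path form when you know where a file lives, and the ID form when a processing result or webhook hands you a UUID.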

When to Use DocumentFamilies API

Use the /api/documentFamilies endpoints when you need to:
  1. Access a specific document by ID - When you have the UUID from processing results or webhooks
  2. Get external data - Retrieve data from external systems associated with the document
  3. Check processing steps - View the complete processing pipeline and transformations
  4. Update document status - Change workflow status (PROCESSING, COMPLETE, FAILED, etc.)
  5. Manage knowledge features - Add or remove knowledge base entries linked to the document
  6. Trigger events - Send document update notifications without modifying content

Key Operations

Get External Data

Documents can store data from external systems (ERP, CRM, databases):
from kodexa import KodexaPlatform

platform = KodexaPlatform(url="https://platform.kodexa.com", api_key="your-api-key")

# Get default external data
external_data = platform.get_document_external_data(
    family_id="550e8400-e29b-41d4-a716-446655440000"
)

# Get specific external data key
erp_data = platform.get_document_external_data(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    key="erp_system"
)

Update External Data

Store references or metadata from external systems:
# Update ERP reference
platform.update_document_external_data(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    data={
        "invoice_id": "INV-2024-001",
        "vendor_id": "V-12345",
        "posted_date": "2024-01-15",
        "status": "approved"
    },
    key="erp_system"
)

# Update CRM reference
platform.update_document_external_data(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    data={
        "opportunity_id": "OPP-789",
        "account_id": "ACC-456"
    },
    key="crm_system"
)
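Before sending external data, it can help to confirm that the payload serializes cleanly. The sketch below assumes the platform stores external data as JSON, which this guide does not state explicitly. The helper `ensure_json_serializable` is hypothetical and uses only the standard library:

```python
import json
from datetime import date


def ensure_json_serializable(data: dict) -> str:
    """Serialize data to JSON, or raise ValueError describing the problem."""
    try:
        return json.dumps(data, sort_keys=True)
    except TypeError as exc:
        # json.dumps raises TypeError for values such as date or set
        raise ValueError(f"External data is not JSON-serializable: {exc}") from exc


# Plain strings and numbers serialize without changes
print(ensure_json_serializable({"invoice_id": "INV-2024-001", "status": "approved"}))

# A raw date object fails, so convert it to a string first
try:
    ensure_json_serializable({"posted_date": date(2024, 1, 15)})
except ValueError as err:
    print(err)
```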

Get Processing Steps

View the complete processing pipeline:
# Get processing steps
steps = platform.get_document_steps(
    family_id="550e8400-e29b-41d4-a716-446655440000"
)

for step in steps:
    print(f"{step.step_type}: {step.status}")
    print(f"  Duration: {step.duration_ms}ms")
    if step.error:
        print(f"  Error: {step.error}")

Update Document Status

Change workflow status:
# Update status to processing
platform.update_document_status(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    status="PROCESSING"
)

# Update to complete
platform.update_document_status(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    status="COMPLETE"
)

# Mark as failed
platform.update_document_status(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    status="FAILED"
)

Add Knowledge Features

Link a document to the knowledge base, or remove an existing link:
# Add knowledge feature
knowledge_feature = platform.add_knowledge_feature(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    feature={
        "type": "vendor",
        "value": "ACME Corporation",
        "confidence": 0.95,
        "source": "extraction"
    }
)

# Remove knowledge feature
platform.remove_knowledge_feature(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    feature=knowledge_feature
)

Touch Document

Trigger event listeners without modifying the document:
# Touch document to trigger event listeners
platform.touch_document_family(
    family_id="550e8400-e29b-41d4-a716-446655440000"
)

External Data Use Cases

External data provides a bridge between Kodexa and your business systems:

ERP Integration

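# Document family ID obtained earlier (for example, from processing results or a webhook)
family_id = "550e8400-e29b-41d4-a716-446655440000"

# Store invoice posting details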
external_data = {
    "invoice_number": "INV-2024-001",
    "gl_account": "1200-5000",
    "cost_center": "CC-100",
    "posted_date": "2024-01-15T10:30:00Z",
    "batch_id": "BATCH-2024-01-15-001"
}

platform.update_document_external_data(
    family_id=family_id,
    data=external_data,
    key="erp_posting"
)
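The examples in this guide write dates both as plain dates ("2024-01-15") and as full timestamps ("2024-01-15T10:30:00Z"). A small normalizer keeps stored values consistent. This is a sketch only: the helper name `to_utc_iso` is hypothetical, and treating values without an offset as UTC is an assumption you may need to change:

```python
from datetime import datetime, timezone


def to_utc_iso(value: str) -> str:
    """Normalize a date or timestamp string to an ISO 8601 UTC timestamp."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: values without an offset are already in UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


external_data = {
    "invoice_number": "INV-2024-001",
    "posted_date": to_utc_iso("2024-01-15T12:30:00+02:00"),
}
print(external_data["posted_date"])
```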

CRM Tracking

# Link document to CRM opportunity
crm_data = {
    "opportunity_id": "OPP-12345",
    "account_id": "ACC-67890",
    "contact_id": "CON-54321",
    "stage": "proposal_sent",
    "probability": 75
}

platform.update_document_external_data(
    family_id=family_id,
    data=crm_data,
    key="crm_link"
)

Workflow State

# Store workflow state
workflow_data = {
    "workflow_id": "WF-001",
    "current_step": "approval",
    "assigned_to": "user@example.com",
    "due_date": "2024-01-20",
    "priority": "high"
}

platform.update_document_external_data(
    family_id=family_id,
    data=workflow_data,
    key="workflow"
)

Processing Steps Explained

Each processing step records one transformation applied to the document, along with its status, duration, and timestamp:
[
  {
    "stepType": "UPLOAD",
    "status": "COMPLETE",
    "durationMs": 150,
    "timestamp": "2024-01-15T10:00:00Z"
  },
  {
    "stepType": "OCR",
    "status": "COMPLETE",
    "durationMs": 2300,
    "timestamp": "2024-01-15T10:00:01Z",
    "metadata": {
      "pages": 3,
      "confidence": 0.98
    }
  },
  {
    "stepType": "EXTRACTION",
    "status": "COMPLETE",
    "durationMs": 1500,
    "timestamp": "2024-01-15T10:00:03Z",
    "metadata": {
      "fieldsExtracted": 15,
      "assistant": "invoice-extractor-v2"
    }
  }
]
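A response in this shape is easy to summarize. The sketch below works on plain dictionaries using the field names from the JSON above (`stepType`, `status`, `durationMs`). The function `summarize_steps` is illustrative and not an SDK call:

```python
import json

# Trimmed copy of the sample response above
steps_json = """
[
  {"stepType": "UPLOAD", "status": "COMPLETE", "durationMs": 150},
  {"stepType": "OCR", "status": "COMPLETE", "durationMs": 2300},
  {"stepType": "EXTRACTION", "status": "COMPLETE", "durationMs": 1500}
]
"""


def summarize_steps(steps: list[dict]) -> dict:
    """Return total duration, the slowest step, and steps that did not complete."""
    total_ms = sum(step.get("durationMs", 0) for step in steps)
    slowest = max(steps, key=lambda step: step.get("durationMs", 0), default=None)
    incomplete = [step["stepType"] for step in steps if step.get("status") != "COMPLETE"]
    return {
        "total_ms": total_ms,
        "slowest": slowest["stepType"] if slowest else None,
        "incomplete": incomplete,
    }


summary = summarize_steps(json.loads(steps_json))
print(summary)  # {'total_ms': 3950, 'slowest': 'OCR', 'incomplete': []}
```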

Document Status Values

Common status values for workflow management:
Status     | Description                            | Use Case
UPLOADED   | Document uploaded, awaiting processing | Initial state
PROCESSING | AI processing in progress              | During extraction
PROCESSED  | Processing complete, data extracted    | Ready for review
REVIEW     | Awaiting human review                  | Quality control
APPROVED   | Reviewed and approved                  | Ready for export
REJECTED   | Rejected during review                 | Needs correction
FAILED     | Processing failed                      | Error handling
ARCHIVED   | Archived for retention                 | Long-term storage
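One way to keep status strings consistent in client code is to define them once. The enum below is a sketch built from the values in the table above. `DocumentStatus` and `parse_status` are illustrative names, not SDK types, and your project may define additional statuses:

```python
from enum import Enum


class DocumentStatus(str, Enum):
    """Status values listed in this guide (illustrative, not an SDK type)."""
    UPLOADED = "UPLOADED"
    PROCESSING = "PROCESSING"
    PROCESSED = "PROCESSED"
    REVIEW = "REVIEW"
    APPROVED = "APPROVED"
    REJECTED = "REJECTED"
    FAILED = "FAILED"
    ARCHIVED = "ARCHIVED"


def parse_status(value: str) -> DocumentStatus:
    """Convert a raw string to a DocumentStatus, rejecting unknown values."""
    try:
        return DocumentStatus(value.strip().upper())
    except ValueError:
        allowed = ", ".join(status.value for status in DocumentStatus)
        raise ValueError(f"Unknown status {value!r}; expected one of: {allowed}") from None


print(parse_status(" review ").value)
```

Because the enum subclasses `str`, a member's `.value` can be passed anywhere a plain status string is expected.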

Best Practices

Use External Data for System Integration

✅ Good: Store external references
external_data = {
    "erp_id": "INV-2024-001",
    "posted": True,
    "post_date": "2024-01-15"
}

❌ Avoid: Duplicating document content
external_data = {
    "vendor": "ACME",  # Already in extracted data
    "amount": "1500"   # Already in extracted data
}
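A quick check can catch this duplication before the data is stored. The sketch below compares an external data payload against a dictionary of extracted values. Both dictionaries are hypothetical inputs you would assemble yourself, and `find_duplicated_fields` is not an SDK function:

```python
def find_duplicated_fields(external_data: dict, extracted_data: dict) -> list[str]:
    """Return keys whose name and value already appear in the extracted data."""
    return sorted(
        key
        for key, value in external_data.items()
        # Compare as strings so that "1500" and 1500 count as the same value
        if key in extracted_data and str(extracted_data[key]) == str(value)
    )


extracted = {"vendor": "ACME", "amount": 1500}
external = {"erp_id": "INV-2024-001", "vendor": "ACME", "amount": "1500"}

print(find_duplicated_fields(external, extracted))  # ['amount', 'vendor']
```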

Choose the Right Access Method

✅ Good: Use store path when browsing
files = platform.list_store_files("my-org/invoices")

✅ Good: Use family ID when processing
data = platform.get_document_external_data(family_id)

❌ Avoid: Using family ID for browsing
# Don't iterate all families just to list documents

Status Workflow

✅ Good: Clear status progression
UPLOADED → PROCESSING → PROCESSED → REVIEW → APPROVED

❌ Avoid: Unclear status values
UPLOADED → DONE → FINISHED → OK
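A clear progression can be enforced in client code with a transition map. In the sketch below, only the main path (UPLOADED through APPROVED) comes from this guide. The remaining transitions, such as retrying after FAILED, are assumptions you should adapt to your own workflow:

```python
# Allowed next statuses for each status. Only the main path is taken from
# this guide; the FAILED, REJECTED, and ARCHIVED rules are assumptions.
ALLOWED_TRANSITIONS = {
    "UPLOADED": {"PROCESSING"},
    "PROCESSING": {"PROCESSED", "FAILED"},
    "PROCESSED": {"REVIEW"},
    "REVIEW": {"APPROVED", "REJECTED"},
    "APPROVED": {"ARCHIVED"},
    "REJECTED": {"PROCESSING"},
    "FAILED": {"PROCESSING"},
    "ARCHIVED": set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from current to new is an allowed transition."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def is_valid_path(statuses: list[str]) -> bool:
    """Return True if every consecutive pair of statuses is allowed."""
    return all(can_transition(a, b) for a, b in zip(statuses, statuses[1:]))


print(is_valid_path(["UPLOADED", "PROCESSING", "PROCESSED", "REVIEW", "APPROVED"]))
print(is_valid_path(["UPLOADED", "DONE", "FINISHED", "OK"]))
```

Calling `can_transition` before each status update lets you reject ad hoc values such as DONE or OK before they reach the platform.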

Next Steps

I