What is a Document Family?

A Document Family is Kodexa’s core entity representing a single document with all its versions, processing history, metadata, and extracted data. Every document uploaded to a store creates a document family.

Accessing Document Families

Access documents directly using the DocumentFamilies API with the document family ID:
GET /api/document-families/{id}
Best for: Working with specific documents, processing results, external integrations, uploading new content
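
As a minimal sketch of the lookup above, the helper below only assembles the request and sends nothing, so it runs offline. The function name `family_request` is hypothetical; the path and the `x-api-key` header are taken from the examples on this page.

```python
def family_request(base_url: str, family_id: str, api_key: str) -> dict:
    """Build keyword arguments for fetching one document family by ID."""
    return {
        "method": "GET",
        "url": f"{base_url.rstrip('/')}/api/document-families/{family_id}",
        "headers": {"x-api-key": api_key},
    }


kwargs = family_request(
    "https://platform.kodexa-enterprise.com",
    "550e8400-e29b-41d4-a716-446655440000",  # example UUID
    "your-api-key-here",
)
print(kwargs["url"])
```

The returned dictionary can be passed to `requests.request(**kwargs)` when you are ready to call the platform.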

When to Use DocumentFamilies API

Use the /api/document-families endpoints when you need to:
  1. Upload new content - Add new versions or documents to an existing family with knowledge features
  2. Access a specific document by ID - When you have the UUID from processing results or webhooks
  3. Get external data - Retrieve data from external systems associated with the document
  4. Check processing steps - View the complete processing pipeline and transformations
  5. Update document status - Change workflow status (PROCESSING, COMPLETE, FAILED, etc.)
  6. Manage knowledge features - Add or remove knowledge base entries linked to the document
  7. Trigger events - Send document update notifications without modifying content

Uploading Content to Document Families

The /api/document-families/{id}/newContent endpoint is the primary way to upload new content to an existing document family. This endpoint supports attaching knowledge features during upload.

Endpoint

POST /api/document-families/{id}/newContent

Form Data Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| document | File | Yes | The Kodexa document file to upload |
| sourceContentObjectId | String | No | ID of the source content object for the transition (defaults to latest) |
| transitionType | String | No | Transition type: DERIVED, REVISED, etc. (defaults to DERIVED) |
| dataStoreRef | String | No | Reference to a data store for extraction (e.g., "org/slug/version") |
| taxonomyRefs | String | No | Comma-separated taxonomy references for extraction |
| documentVersion | String | No | Version string for the new content object |
| actorType | String | No | Actor type for audit trail: USER, API, SYSTEM, ASSISTANT |
| actorId | String | No | Actor ID for audit trail (defaults to current user ID) |
| label | String | No | Label to add to the document family and content object |
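
The optional parameters above can be assembled with a small helper that drops unset fields and joins taxonomy references with commas. This is a sketch under stated assumptions: the helper name `build_new_content_form` is hypothetical, and only `actorType` is validated because it is the only parameter whose values are listed in full.

```python
from typing import List, Optional

# Actor types listed in the parameter table
ACTOR_TYPES = {"USER", "API", "SYSTEM", "ASSISTANT"}


def build_new_content_form(
    transition_type: Optional[str] = None,
    source_content_object_id: Optional[str] = None,
    data_store_ref: Optional[str] = None,
    taxonomy_refs: Optional[List[str]] = None,
    document_version: Optional[str] = None,
    actor_type: Optional[str] = None,
    actor_id: Optional[str] = None,
    label: Optional[str] = None,
) -> dict:
    """Return the non-file form fields for newContent, omitting unset values."""
    if actor_type is not None and actor_type not in ACTOR_TYPES:
        raise ValueError(f"actorType must be one of {sorted(ACTOR_TYPES)}")
    fields = {
        "sourceContentObjectId": source_content_object_id,
        "transitionType": transition_type,
        "dataStoreRef": data_store_ref,
        # The endpoint expects a single comma-separated string
        "taxonomyRefs": ",".join(taxonomy_refs) if taxonomy_refs else None,
        "documentVersion": document_version,
        "actorType": actor_type,
        "actorId": actor_id,
        "label": label,
    }
    return {name: value for name, value in fields.items() if value is not None}


form = build_new_content_form(
    transition_type="DERIVED",
    # The second taxonomy slug is a made-up example
    taxonomy_refs=["my-org/invoice-taxonomy/1.0", "my-org/receipt-taxonomy/1.0"],
    actor_type="API",
    label="reviewed",
)
print(form)
```

The result is suitable as the `data=` argument of `requests.post`, alongside a `files=` entry for the `document` field.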

Example: Upload with cURL

curl -X POST "https://platform.kodexa-enterprise.com/api/document-families/{id}/newContent" \
  -H "x-api-key: your-api-key-here" \
  -F "document=@your-document.kdx" \
  -F "transitionType=DERIVED" \
  -F "label=reviewed"

Example: Upload with Python

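import requests

# family_id and api_key are placeholders: set them to your document family ID and API key
url = f"https://platform.kodexa-enterprise.com/api/document-families/{family_id}/newContent"

data = {
    "transitionType": "DERIVED",
    "label": "processed",
    "actorType": "API"
}

# Open the file in a context manager so the handle is closed after the upload
with open("processed.kdx", "rb") as document_file:
    response = requests.post(
        url,
        headers={"x-api-key": api_key},
        files={"document": ("processed.kdx", document_file, "application/octet-stream")},
        data=data
    )

# Raise on HTTP errors before reading the response body
response.raise_for_status()
content_object = response.json()
print(f"Created content object: {content_object['id']}")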

Example: Upload with Data Extraction

curl -X POST "https://platform.kodexa-enterprise.com/api/document-families/{id}/newContent" \
  -H "x-api-key: your-api-key-here" \
  -F "document=@your-document.kdx" \
  -F "dataStoreRef=my-org/invoice-data/1.0" \
  -F "taxonomyRefs=my-org/invoice-taxonomy/1.0"

Managing Knowledge Features

Knowledge features allow you to attach structured metadata and classification information to document families. Features are linked to both the ContentObject and the DocumentFamily.

Knowledge Feature Structure

{
  "knowledgeFeatureRef": "<feature-type-slug>",
  "properties": {
    "<key>": "<value>"
  }
}
  • knowledgeFeatureRef: The slug of an existing KnowledgeFeatureType in your organization
  • properties: A map of key-value pairs specific to this feature instance
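
A payload with this structure can be built with a few lines of Python. The helper name `knowledge_feature` is hypothetical; the two keys are the ones described above.

```python
import json


def knowledge_feature(feature_type_slug: str, **properties) -> dict:
    """Build a knowledge feature payload for an existing feature type slug."""
    if not feature_type_slug:
        raise ValueError("knowledgeFeatureRef must be a non-empty slug")
    return {
        "knowledgeFeatureRef": feature_type_slug,
        "properties": dict(properties),
    }


payload = knowledge_feature("provider", providerId="provider-123")
print(json.dumps(payload, indent=2))
```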

Example: Provider Feature

To set a provider knowledge feature with a providerId:
{
  "knowledgeFeatureRef": "provider",
  "properties": {
    "providerId": "provider-123"
  }
}

Example: Multiple Features

You can work with multiple knowledge features:
[
  {
    "knowledgeFeatureRef": "provider",
    "properties": {
      "providerId": "provider-123"
    }
  },
  {
    "knowledgeFeatureRef": "document-type",
    "properties": {
      "type": "invoice",
      "confidence": 0.95
    }
  }
]

Add Knowledge Feature

POST /api/document-families/{id}/addKnowledgeFeature
import requests

url = f"https://platform.kodexa-enterprise.com/api/document-families/{family_id}/addKnowledgeFeature"

feature = {
    "knowledgeFeatureRef": "provider",
    "properties": {
        "providerId": "provider-123"
    }
}

response = requests.post(
    url,
    headers={
        "x-api-key": api_key,
        "Content-Type": "application/json"
    },
    json=feature
)

print(f"Added feature: {response.json()}")

Remove Knowledge Feature

POST /api/document-families/{id}/removeKnowledgeFeature
response = requests.post(
    f"https://platform.kodexa-enterprise.com/api/document-families/{family_id}/removeKnowledgeFeature",
    headers={
        "x-api-key": api_key,
        "Content-Type": "application/json"
    },
    json=feature
)

Assess Document for Knowledge

Automatically assess a document family for applicable knowledge features and sets:
POST /api/document-families/{id}/assess
This endpoint:
  • Extracts features from the content object
  • Finds applicable knowledge sets based on organization and store-project relationships
  • Associates new features with the document family
  • Skips documents that are locked

Get Knowledge Items

Retrieve all knowledge items related to a document family through shared knowledge features:
GET /api/document-families/{id}/knowledgeItems

Get Applied Knowledge Sets

Get all knowledge sets that have been applied to a document family:
GET /api/document-families/{id}/appliedKnowledgeSets

Feature Deduplication

Knowledge features are deduplicated based on (featureType, properties):
  • If a feature with the same type slug and identical properties already exists, the existing feature is reused
  • If the properties differ, a new feature is created
  • Features are linked to both the ContentObject and the DocumentFamily
This means that multiple files uploaded with the same provider feature type and providerId share a single KnowledgeFeature record.
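
The deduplication rule can be illustrated with an in-memory sketch. This is not the platform's implementation: the `FeatureRegistry` class and its integer IDs are hypothetical, and the only behavior modeled is reuse on an identical (featureType, properties) pair.

```python
import json


class FeatureRegistry:
    """Toy registry that reuses a feature when type and properties match."""

    def __init__(self):
        self._features = {}

    def get_or_create(self, feature_type: str, properties: dict) -> dict:
        # Sorting keys makes the lookup independent of property order
        key = (feature_type, json.dumps(properties, sort_keys=True))
        if key not in self._features:
            self._features[key] = {
                "id": len(self._features) + 1,
                "featureType": feature_type,
                "properties": dict(properties),
            }
        return self._features[key]


registry = FeatureRegistry()
first = registry.get_or_create("provider", {"providerId": "provider-123"})
second = registry.get_or_create("provider", {"providerId": "provider-123"})
third = registry.get_or_create("provider", {"providerId": "provider-456"})
print(first["id"], second["id"], third["id"])  # 1 1 2
```

The first two calls return the same record, while the third creates a new one because its properties differ.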

Prerequisites for Knowledge Features

Before working with knowledge features:
  1. Ensure the KnowledgeFeatureType exists (e.g., provider type must be created first)
  2. The feature type slug in knowledgeFeatureRef must match exactly
  3. If the feature type doesn’t exist, linking will fail silently with a warning in the logs

Filtering by Knowledge Expression

The list endpoint supports filtering document families by knowledge expressions using boolean logic:
GET /api/document-families?knowledgeExpression={expression}

Expression Types

| Type | Description | Example |
| --- | --- | --- |
| FEATURE | Match documents with a specific feature | {"type":"FEATURE","slug":"document-type-abc123"} |
| AND | Match documents with ALL specified features | {"type":"AND","children":[...]} |
| OR | Match documents with ANY specified features | {"type":"OR","children":[...]} |
| NOT | Match documents WITHOUT a feature | {"type":"NOT","children":[...]} |

Example: Filter by Single Feature

curl -G "https://platform.kodexa-enterprise.com/api/document-families" \
  -H "x-api-key: your-api-key-here" \
  --data-urlencode 'knowledgeExpression={"type":"FEATURE","slug":"provider-abc123"}'

Example: Filter by Multiple Features (AND)

curl -G "https://platform.kodexa-enterprise.com/api/document-families" \
  -H "x-api-key: your-api-key-here" \
  --data-urlencode 'knowledgeExpression={"type":"AND","children":[{"type":"FEATURE","slug":"feature-1"},{"type":"FEATURE","slug":"feature-2"}]}'
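
Hand-writing nested JSON inside a query string is error-prone, so a few builder functions can help. The function names below are hypothetical, and nesting NOT inside AND is an assumption based on the expression types listed above; the sketch builds and decodes the URL locally without calling the API.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def feature(slug: str) -> dict:
    return {"type": "FEATURE", "slug": slug}


def all_of(*children: dict) -> dict:
    return {"type": "AND", "children": list(children)}


def any_of(*children: dict) -> dict:
    return {"type": "OR", "children": list(children)}


def not_(child: dict) -> dict:
    return {"type": "NOT", "children": [child]}


expression = all_of(feature("feature-1"), not_(feature("feature-2")))

# Compact separators keep the encoded query string short
query = urlencode({"knowledgeExpression": json.dumps(expression, separators=(",", ":"))})
url = f"https://platform.kodexa-enterprise.com/api/document-families?{query}"

# Round-trip check: decode the query string back into the expression
decoded = json.loads(parse_qs(urlsplit(url).query)["knowledgeExpression"][0])
print(decoded == expression)  # True
```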

Key Operations

Get External Data

Documents can store data from external systems (ERP, CRM, databases):
from kodexa import KodexaPlatform

platform = KodexaPlatform(url="https://platform.kodexa-enterprise.com", api_key="your-api-key")

# Get default external data
external_data = platform.get_document_external_data(
    family_id="550e8400-e29b-41d4-a716-446655440000"
)

# Get specific external data key
erp_data = platform.get_document_external_data(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    key="erp_system"
)

Update External Data

Store references or metadata from external systems:
# Update ERP reference
platform.update_document_external_data(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    data={
        "invoice_id": "INV-2024-001",
        "vendor_id": "V-12345",
        "posted_date": "2024-01-15",
        "status": "approved"
    },
    key="erp_system"
)

# Update CRM reference
platform.update_document_external_data(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    data={
        "opportunity_id": "OPP-789",
        "account_id": "ACC-456"
    },
    key="crm_system"
)

Get Processing Steps

View the complete processing pipeline:
# Get processing steps
steps = platform.get_document_steps(
    family_id="550e8400-e29b-41d4-a716-446655440000"
)

for step in steps:
    print(f"{step.step_type}: {step.status}")
    print(f"  Duration: {step.duration_ms}ms")
    if step.error:
        print(f"  Error: {step.error}")

Update Document Status

Change workflow status:
# Update status to processing
platform.update_document_status(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    status="PROCESSING"
)

# Update to complete
platform.update_document_status(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    status="COMPLETE"
)

# Mark as failed
platform.update_document_status(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    status="FAILED"
)

Touch Document

Trigger events without changes:
# Touch document to trigger event listeners
platform.touch_document_family(
    family_id="550e8400-e29b-41d4-a716-446655440000"
)

External Data Use Cases

External data provides a bridge between Kodexa and your business systems:

ERP Integration

# Store invoice posting details
external_data = {
    "invoice_number": "INV-2024-001",
    "gl_account": "1200-5000",
    "cost_center": "CC-100",
    "posted_date": "2024-01-15T10:30:00Z",
    "batch_id": "BATCH-2024-01-15-001"
}

platform.update_document_external_data(
    family_id=family_id,
    data=external_data,
    key="erp_posting"
)

CRM Tracking

# Link document to CRM opportunity
crm_data = {
    "opportunity_id": "OPP-12345",
    "account_id": "ACC-67890",
    "contact_id": "CON-54321",
    "stage": "proposal_sent",
    "probability": 75
}

platform.update_document_external_data(
    family_id=family_id,
    data=crm_data,
    key="crm_link"
)

Workflow State

# Store workflow state
workflow_data = {
    "workflow_id": "WF-001",
    "current_step": "approval",
    "assigned_to": "user@example.com",
    "due_date": "2024-01-20",
    "priority": "high"
}

platform.update_document_external_data(
    family_id=family_id,
    data=workflow_data,
    key="workflow"
)

Processing Steps Explained

Processing steps track every transformation:
[
  {
    "stepType": "UPLOAD",
    "status": "COMPLETE",
    "durationMs": 150,
    "timestamp": "2024-01-15T10:00:00Z"
  },
  {
    "stepType": "OCR",
    "status": "COMPLETE",
    "durationMs": 2300,
    "timestamp": "2024-01-15T10:00:01Z",
    "metadata": {
      "pages": 3,
      "confidence": 0.98
    }
  },
  {
    "stepType": "EXTRACTION",
    "status": "COMPLETE",
    "durationMs": 1500,
    "timestamp": "2024-01-15T10:00:03Z",
    "metadata": {
      "fieldsExtracted": 15,
      "assistant": "invoice-extractor-v2"
    }
  }
]
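
A step list in this shape is easy to summarize. The sketch below works on plain dictionaries with the camelCase keys from the sample (metadata omitted for brevity); the helper name `summarize_steps` is hypothetical.

```python
steps = [
    {"stepType": "UPLOAD", "status": "COMPLETE", "durationMs": 150},
    {"stepType": "OCR", "status": "COMPLETE", "durationMs": 2300},
    {"stepType": "EXTRACTION", "status": "COMPLETE", "durationMs": 1500},
]


def summarize_steps(steps: list) -> dict:
    """Total the durations and flag any step that did not complete."""
    total = sum(step.get("durationMs", 0) for step in steps)
    not_complete = [step["stepType"] for step in steps if step["status"] != "COMPLETE"]
    slowest = max(steps, key=lambda s: s.get("durationMs", 0))["stepType"] if steps else None
    return {"totalDurationMs": total, "notComplete": not_complete, "slowestStep": slowest}


summary = summarize_steps(steps)
print(summary)
# {'totalDurationMs': 3950, 'notComplete': [], 'slowestStep': 'OCR'}
```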

Document Status Values

Common status values for workflow management:
| Status | Description | Use Case |
| --- | --- | --- |
| UPLOADED | Document uploaded, awaiting processing | Initial state |
| PROCESSING | AI processing in progress | During extraction |
| PROCESSED | Processing complete, data extracted | Ready for review |
| REVIEW | Awaiting human review | Quality control |
| APPROVED | Reviewed and approved | Ready for export |
| REJECTED | Rejected during review | Needs correction |
| FAILED | Processing failed | Error handling |
| ARCHIVED | Archived for retention | Long-term storage |

Best Practices

Use External Data for System Integration

✅ Good: Store external references
external_data = {
    "erp_id": "INV-2024-001",
    "posted": True,
    "post_date": "2024-01-15"
}

❌ Avoid: Duplicating document content
external_data = {
    "vendor": "ACME",  # Already in extracted data
    "amount": "1500"   # Already in extracted data
}
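
A quick check along these lines can catch duplicated content before it is stored. The helper name `duplicated_keys` and the set of extracted field names are hypothetical; in practice the field names would come from your own taxonomy.

```python
def duplicated_keys(external_data: dict, extracted_fields: set) -> list:
    """Return external-data keys that repeat fields already present in extracted data."""
    return sorted(key for key in external_data if key in extracted_fields)


# Hypothetical field names produced by extraction
extracted_fields = {"vendor", "amount", "invoice_date"}

external_data = {
    "erp_id": "INV-2024-001",
    "posted": True,
    "vendor": "ACME",
    "amount": "1500",
}

overlap = duplicated_keys(external_data, extracted_fields)
print(overlap)  # ['amount', 'vendor']
```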

Choose the Right Access Method

✅ Good: Use store path when browsing
files = platform.list_store_files("my-org/invoices")

✅ Good: Use family ID when processing
data = platform.get_document_external_data(family_id)

❌ Avoid: Using family ID for browsing
# Don't iterate all families just to list documents

Status Workflow

✅ Good: Clear status progression
UPLOADED → PROCESSING → PROCESSED → REVIEW → APPROVED

❌ Avoid: Unclear status values
UPLOADED → DONE → FINISHED → OK
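
One way to keep a progression clear is to check each change against a small transition table before calling the status update. The happy path below follows the recommended progression; the side exits for FAILED, REJECTED, and ARCHIVED are assumptions for illustration, and nothing here reflects rules the platform itself enforces.

```python
HAPPY_PATH = ["UPLOADED", "PROCESSING", "PROCESSED", "REVIEW", "APPROVED"]

# Hypothetical transition table: adjust to match your own workflow
ALLOWED = {
    "UPLOADED": {"PROCESSING"},
    "PROCESSING": {"PROCESSED", "FAILED"},
    "PROCESSED": {"REVIEW"},
    "REVIEW": {"APPROVED", "REJECTED"},
    "APPROVED": {"ARCHIVED"},
    "REJECTED": {"PROCESSING"},
    "FAILED": {"PROCESSING"},
    "ARCHIVED": set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from current to new is allowed by the table."""
    if current not in ALLOWED or new not in ALLOWED:
        raise ValueError(f"Unknown status: {current!r} or {new!r}")
    return new in ALLOWED[current]


print(can_transition("REVIEW", "APPROVED"))    # True
print(can_transition("UPLOADED", "APPROVED"))  # False
```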

Reprocessing Documents

Trigger reprocessing of a document family with specific assistants:
PUT /api/document-families/{id}/reprocess?assistantId={assistantId1}&assistantId={assistantId2}
import requests

response = requests.put(
    f"https://platform.kodexa-enterprise.com/api/document-families/{family_id}/reprocess",
    headers={"x-api-key": api_key},
    params={"assistantId": ["assistant-1", "assistant-2"]}
)
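
The repeated `assistantId` parameter in the endpoint above is what a list value produces when it is encoded with `doseq=True`. A minimal offline sketch using the standard library, with the hypothetical helper name `reprocess_url`:

```python
from urllib.parse import urlencode


def reprocess_url(base_url: str, family_id: str, assistant_ids: list) -> str:
    """Build the reprocess URL with one assistantId parameter per assistant."""
    query = urlencode({"assistantId": list(assistant_ids)}, doseq=True)
    return f"{base_url.rstrip('/')}/api/document-families/{family_id}/reprocess?{query}"


url = reprocess_url(
    "https://platform.kodexa-enterprise.com",
    "550e8400-e29b-41d4-a716-446655440000",  # example UUID
    ["assistant-1", "assistant-2"],
)
print(url)
```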

Exporting Document Families

Export a document family as a .dfm file:
GET /api/document-families/{id}/export
This returns a downloadable file containing the complete document family including all content objects and metadata.

Getting Data Exports

Export data objects from a document family in various formats:
GET /api/document-families/{id}/data?format={format}
Supported formats:
  • json - Standard JSON format
  • csv - Comma-separated values
  • xml - XML format
  • datalake - NDJSON for lakehouse/S3 storage with metadata wrapper
curl "https://platform.kodexa-enterprise.com/api/document-families/{id}/data?format=json&friendlyNames=true" \
  -H "x-api-key: your-api-key-here"
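
Validating the format locally avoids a round trip for a typo. The sketch below builds the export URL from the formats listed above and rejects anything else; the helper name `data_export_url` is hypothetical, and `friendlyNames` is included only because it appears in the cURL example.

```python
from urllib.parse import urlencode

SUPPORTED_FORMATS = ("json", "csv", "xml", "datalake")


def data_export_url(base_url: str, family_id: str, fmt: str = "json",
                    friendly_names: bool = False) -> str:
    """Build the data export URL, rejecting formats not listed as supported."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"format must be one of {SUPPORTED_FORMATS}")
    params = {"format": fmt}
    if friendly_names:
        params["friendlyNames"] = "true"
    return f"{base_url.rstrip('/')}/api/document-families/{family_id}/data?{urlencode(params)}"


url = data_export_url(
    "https://platform.kodexa-enterprise.com",
    "550e8400-e29b-41d4-a716-446655440000",  # example UUID
    fmt="json",
    friendly_names=True,
)
print(url)
```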

Next Steps