What is a Document Family?
A Document Family is Kodexa’s core entity representing a single document with all its versions, processing history, metadata, and extracted data. Every document uploaded to a store creates a document family.
Accessing Document Families
Access documents directly using the DocumentFamilies API with the document family ID:
GET /api/document-families/{id}
Best for: Working with specific documents, processing results, external integrations, uploading new content
When to Use DocumentFamilies API
Use the /api/document-families endpoints when you need to:
- Upload new content - Add new versions or documents to an existing family with knowledge features
- Access a specific document by ID - When you have the UUID from processing results or webhooks
- Get external data - Retrieve data from external systems associated with the document
- Check processing steps - View the complete processing pipeline and transformations
- Update document status - Change workflow status (PROCESSING, COMPLETE, FAILED, etc.)
- Manage knowledge features - Add or remove knowledge base entries linked to the document
- Trigger events - Send document update notifications without modifying content
Uploading Content to Document Families
The /api/document-families/{id}/newContent endpoint is the primary way to upload new content to an existing document family. This endpoint supports attaching knowledge features during upload.
Endpoint
POST /api/document-families/{id}/newContent
| Parameter | Type | Required | Description |
|---|---|---|---|
| document | File | Yes | The Kodexa document file to upload |
| sourceContentObjectId | String | No | ID of the source content object for the transition (defaults to latest) |
| transitionType | String | No | Transition type: DERIVED, REVISED, etc. (defaults to DERIVED) |
| dataStoreRef | String | No | Reference to a data store for extraction (e.g., “org/slug/version”) |
| taxonomyRefs | String | No | Comma-separated taxonomy references for extraction |
| documentVersion | String | No | Version string for the new content object |
| actorType | String | No | Actor type for audit trail: USER, API, SYSTEM, ASSISTANT |
| actorId | String | No | Actor ID for audit trail (defaults to current user ID) |
| label | String | No | Label to add to the document family and content object |
Example: Upload with cURL
curl -X POST "https://platform.kodexa-enterprise.com/api/document-families/{id}/newContent" \
-H "x-api-key: your-api-key-here" \
-F "document=@processed-document.kdx" \
-F "transitionType=DERIVED" \
-F "label=reviewed"
Example: Upload with Python
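import requests

# Placeholder values; substitute your own family ID and API key
family_id = "550e8400-e29b-41d4-a716-446655440000"
api_key = "your-api-key-here"

url = f"https://platform.kodexa-enterprise.com/api/document-families/{family_id}/newContent"
data = {
    "transitionType": "DERIVED",
    "label": "processed",
    "actorType": "API"
}

# Open the file in a context manager so the handle is closed after the upload
with open("processed.kdx", "rb") as fh:
    files = {"document": ("processed.kdx", fh, "application/octet-stream")}
    response = requests.post(url, headers={"x-api-key": api_key}, files=files, data=data)

# Fail loudly on 4xx/5xx responses instead of parsing an error body
response.raise_for_status()
content_object = response.json()
print(f"Created content object: {content_object['id']}")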
Example: Upload with Data Store and Taxonomy References
curl -X POST "https://platform.kodexa-enterprise.com/api/document-families/{id}/newContent" \
-H "x-api-key: your-api-key-here" \
-F "document=@invoice.kdx" \
-F "dataStoreRef=my-org/invoice-data/1.0" \
-F "taxonomyRefs=my-org/invoice-taxonomy/1.0"
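The optional parameters in the table above can be assembled programmatically before an upload. The following is a minimal sketch, not part of any Kodexa SDK: build_new_content_fields is a hypothetical helper that keeps only the fields you actually set and rejects names that are not in the table, so server-side defaults apply to everything omitted.

```python
from typing import Dict, Optional

# Optional form fields accepted by newContent, as listed in the parameter table.
OPTIONAL_FIELDS = (
    "sourceContentObjectId", "transitionType", "dataStoreRef",
    "taxonomyRefs", "documentVersion", "actorType", "actorId", "label",
)


def build_new_content_fields(**params: Optional[str]) -> Dict[str, str]:
    """Return only the form fields that were set (hypothetical helper)."""
    unknown = set(params) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"Unknown newContent parameter(s): {sorted(unknown)}")
    # Drop None values so the server-side defaults apply to omitted fields.
    return {name: value for name, value in params.items() if value is not None}


fields = build_new_content_fields(
    dataStoreRef="my-org/invoice-data/1.0",
    taxonomyRefs="my-org/invoice-taxonomy/1.0",
    label=None,  # not sent
)
print(fields)
```

The resulting dictionary can be passed as the data argument of requests.post alongside the files argument that carries the document.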
Managing Knowledge Features
Knowledge features allow you to attach structured metadata and classification information to document families. Features are linked to both the ContentObject and the DocumentFamily.
Knowledge Feature Structure
{
  "knowledgeFeatureRef": "<feature-type-slug>",
  "properties": {
    "<key>": "<value>"
  }
}
- knowledgeFeatureRef: The slug of an existing KnowledgeFeatureType in your organization
- properties: A map of key-value pairs specific to this feature instance
Example: Provider Feature
To set a provider knowledge feature with a providerId:
{
  "knowledgeFeatureRef": "provider",
  "properties": {
    "providerId": "provider-123"
  }
}
Example: Multiple Features
You can work with multiple knowledge features:
[
  {
    "knowledgeFeatureRef": "provider",
    "properties": {
      "providerId": "provider-123"
    }
  },
  {
    "knowledgeFeatureRef": "document-type",
    "properties": {
      "type": "invoice",
      "confidence": 0.95
    }
  }
]
Add Knowledge Feature
POST /api/document-families/{id}/addKnowledgeFeature
import requests

url = f"https://platform.kodexa-enterprise.com/api/document-families/{family_id}/addKnowledgeFeature"
feature = {
    "knowledgeFeatureRef": "provider",
    "properties": {
        "providerId": "provider-123"
    }
}
# Passing json= serializes the body and sets Content-Type: application/json
response = requests.post(
    url,
    headers={"x-api-key": api_key},
    json=feature
)
response.raise_for_status()
print(f"Added feature: {response.json()}")
Remove Knowledge Feature
POST /api/document-families/{id}/removeKnowledgeFeature
# Reuses requests, family_id, api_key, and feature from the previous example
response = requests.post(
    f"https://platform.kodexa-enterprise.com/api/document-families/{family_id}/removeKnowledgeFeature",
    headers={"x-api-key": api_key},
    json=feature
)
Assess Document for Knowledge
Automatically assess a document family for applicable knowledge features and sets:
POST /api/document-families/{id}/assess
This endpoint:
- Extracts features from the content object
- Finds applicable knowledge sets based on organization and store-project relationships
- Associates new features with the document family
- Skips documents that are locked
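A minimal sketch of calling the assess endpoint follows. It uses the standard library's urllib rather than requests so that the request can be built and inspected without being sent. build_assess_request is a hypothetical helper, and the sketch assumes the endpoint needs no request body, which the text above does not state either way.

```python
import urllib.request

BASE_URL = "https://platform.kodexa-enterprise.com"


def build_assess_request(family_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the assess endpoint."""
    url = f"{BASE_URL}/api/document-families/{family_id}/assess"
    return urllib.request.Request(url, method="POST", headers={"x-api-key": api_key})


request = build_assess_request(
    "550e8400-e29b-41d4-a716-446655440000", "your-api-key-here"
)
print(request.get_method(), request.full_url)

# To send it (requires a reachable platform and a valid key):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```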
Get Knowledge Items
Retrieve all knowledge items related to a document family through shared knowledge features:
GET /api/document-families/{id}/knowledgeItems
Get Applied Knowledge Sets
Get all knowledge sets that have been applied to a document family:
GET /api/document-families/{id}/appliedKnowledgeSets
Feature Deduplication
Knowledge features are deduplicated based on (featureType, properties):
- If a feature with the same type slug and identical properties already exists, the existing feature is reused
- If the properties differ, a new feature is created
- Features are linked to both the ContentObject and the DocumentFamily
This means uploading multiple files with the same provider and providerId will share a single KnowledgeFeature record.
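The deduplication rule can be modeled in a few lines. This is an illustrative in-memory sketch only: FeatureRegistry is a hypothetical class, and serializing properties with sorted keys is an assumption about what "identical properties" means, not a description of the platform's internals.

```python
import json
from typing import Any, Dict, Tuple


class FeatureRegistry:
    """In-memory stand-in for the platform's feature store (illustrative only)."""

    def __init__(self) -> None:
        self._features: Dict[Tuple[str, str], int] = {}

    def get_or_create(self, feature_type: str, properties: Dict[str, Any]) -> int:
        # Serialize with sorted keys so property order does not affect identity.
        key = (feature_type, json.dumps(properties, sort_keys=True))
        if key not in self._features:
            # New (type, properties) pair: allocate the next record ID.
            self._features[key] = len(self._features) + 1
        return self._features[key]


registry = FeatureRegistry()
first = registry.get_or_create("provider", {"providerId": "provider-123"})
second = registry.get_or_create("provider", {"providerId": "provider-123"})
other = registry.get_or_create("provider", {"providerId": "provider-456"})
print(first == second, first == other)  # True False
```

Two uploads with the same provider and providerId resolve to the same record, while a different providerId produces a new one.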
Prerequisites for Knowledge Features
Before working with knowledge features:
- Ensure the KnowledgeFeatureType exists (e.g., provider type must be created first)
- The feature type slug in knowledgeFeatureRef must match exactly
- If the feature type doesn’t exist, linking will fail silently with a warning in the logs
Filtering Document Families
The list endpoint (GET /api/document-families) supports the standard filter parameter using Kodexa filter syntax. For example:
GET /api/document-families?filter=status: 'PROCESSED' and path~ '*invoice*'
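The filter value contains spaces, quotes, and wildcard characters, so it has to be URL-encoded before it is sent. A minimal sketch using only the standard library is shown here; build_filter_url is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://platform.kodexa-enterprise.com"


def build_filter_url(filter_expression: str) -> str:
    """Return the list-endpoint URL with the filter safely URL-encoded."""
    query = urlencode({"filter": filter_expression})
    return f"{BASE_URL}/api/document-families?{query}"


expression = "status: 'PROCESSED' and path~ '*invoice*'"
url = build_filter_url(expression)
print(url)

# Decoding the query string recovers the original expression unchanged.
decoded = parse_qs(urlsplit(url).query)["filter"][0]
print(decoded == expression)  # True
```

HTTP clients such as requests perform the same encoding when the filter is passed through the params argument.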
Knowledge Expression Filtering
In addition to standard filters, document families support filtering by knowledge expressions using boolean logic:
GET /api/document-families?knowledgeExpression={expression}
Expression Types
| Type | Description | Example |
|---|---|---|
| FEATURE | Match documents with a specific feature | {"type":"FEATURE","slug":"document-type-abc123"} |
| AND | Match documents with ALL specified features | {"type":"AND","children":[...]} |
| OR | Match documents with ANY specified features | {"type":"OR","children":[...]} |
| NOT | Match documents WITHOUT a feature | {"type":"NOT","children":[...]} |
Example: Filter by Single Feature
curl -G "https://platform.kodexa-enterprise.com/api/document-families" \
-H "x-api-key: your-api-key-here" \
--data-urlencode 'knowledgeExpression={"type":"FEATURE","slug":"provider-abc123"}'
Example: Filter by Multiple Features (AND)
curl -G "https://platform.kodexa-enterprise.com/api/document-families" \
-H "x-api-key: your-api-key-here" \
--data-urlencode 'knowledgeExpression={"type":"AND","children":[{"type":"FEATURE","slug":"feature-1"},{"type":"FEATURE","slug":"feature-2"}]}'
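Hand-writing nested JSON expressions is error-prone, so small builder functions can help. The sketch below uses hypothetical helper names (feature, all_of, any_of, none_of) and assumes that expressions can be nested inside one another and that NOT takes a single child in its children list.

```python
import json
from typing import Any, Dict

Expression = Dict[str, Any]


def feature(slug: str) -> Expression:
    return {"type": "FEATURE", "slug": slug}


def all_of(*children: Expression) -> Expression:
    return {"type": "AND", "children": list(children)}


def any_of(*children: Expression) -> Expression:
    return {"type": "OR", "children": list(children)}


def none_of(child: Expression) -> Expression:
    # Assumption: NOT wraps exactly one child expression.
    return {"type": "NOT", "children": [child]}


def to_query_value(expression: Expression) -> str:
    """Serialize compactly, matching the style of the cURL examples."""
    return json.dumps(expression, separators=(",", ":"))


both = to_query_value(all_of(feature("feature-1"), feature("feature-2")))
print(both)

# Documents with feature-1 but without feature-2 (nesting is an assumption).
print(to_query_value(all_of(feature("feature-1"), none_of(feature("feature-2")))))
```

The serialized string is the value of the knowledgeExpression query parameter and still needs URL encoding, which curl's --data-urlencode or the params argument of requests handles.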
Key Operations
Get External Data
Documents can store data from external systems (ERP, CRM, databases):
from kodexa import KodexaPlatform
platform = KodexaPlatform(url="https://platform.kodexa-enterprise.com", api_key="your-api-key")
# Get default external data
external_data = platform.get_document_external_data(
    family_id="550e8400-e29b-41d4-a716-446655440000"
)
# Get specific external data key
erp_data = platform.get_document_external_data(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    key="erp_system"
)
Update External Data
Store references or metadata from external systems:
# Update ERP reference
platform.update_document_external_data(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    data={
        "invoice_id": "INV-2024-001",
        "vendor_id": "V-12345",
        "posted_date": "2024-01-15",
        "status": "approved"
    },
    key="erp_system"
)
# Update CRM reference
platform.update_document_external_data(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    data={
        "opportunity_id": "OPP-789",
        "account_id": "ACC-456"
    },
    key="crm_system"
)
Get Processing Steps
View the complete processing pipeline:
# Get processing steps
steps = platform.get_document_steps(
    family_id="550e8400-e29b-41d4-a716-446655440000"
)
for step in steps:
    print(f"{step.step_type}: {step.status}")
    print(f"  Duration: {step.duration_ms}ms")
    if step.error:
        print(f"  Error: {step.error}")
Update Document Status
Change workflow status:
# Update status to processing
platform.update_document_status(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    status="PROCESSING"
)
# Update to complete
platform.update_document_status(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    status="COMPLETE"
)
# Mark as failed
platform.update_document_status(
    family_id="550e8400-e29b-41d4-a716-446655440000",
    status="FAILED"
)
Touch Document
Trigger events without changes:
# Touch document to trigger event listeners
platform.touch_document_family(
    family_id="550e8400-e29b-41d4-a716-446655440000"
)
External Data Use Cases
External data provides a bridge between Kodexa and your business systems:
ERP Integration
# Store invoice posting details
external_data = {
    "invoice_number": "INV-2024-001",
    "gl_account": "1200-5000",
    "cost_center": "CC-100",
    "posted_date": "2024-01-15T10:30:00Z",
    "batch_id": "BATCH-2024-01-15-001"
}
platform.update_document_external_data(
    family_id=family_id,
    data=external_data,
    key="erp_posting"
)
CRM Tracking
# Link document to CRM opportunity
crm_data = {
    "opportunity_id": "OPP-12345",
    "account_id": "ACC-67890",
    "contact_id": "CON-54321",
    "stage": "proposal_sent",
    "probability": 75
}
platform.update_document_external_data(
    family_id=family_id,
    data=crm_data,
    key="crm_link"
)
Workflow State
# Store workflow state
workflow_data = {
    "workflow_id": "WF-001",
    "current_step": "approval",
    "assigned_to": "user@example.com",
    "due_date": "2024-01-20",
    "priority": "high"
}
platform.update_document_external_data(
    family_id=family_id,
    data=workflow_data,
    key="workflow"
)
Processing Steps Explained
Processing steps track every transformation:
[
  {
    "stepType": "UPLOAD",
    "status": "COMPLETE",
    "durationMs": 150,
    "timestamp": "2024-01-15T10:00:00Z"
  },
  {
    "stepType": "OCR",
    "status": "COMPLETE",
    "durationMs": 2300,
    "timestamp": "2024-01-15T10:00:01Z",
    "metadata": {
      "pages": 3,
      "confidence": 0.98
    }
  },
  {
    "stepType": "EXTRACTION",
    "status": "COMPLETE",
    "durationMs": 1500,
    "timestamp": "2024-01-15T10:00:03Z",
    "metadata": {
      "fieldsExtracted": 15,
      "assistant": "invoice-extractor-v2"
    }
  }
]
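A step list in this shape is easy to summarize. The sketch below works on plain JSON matching the example above (trimmed to the fields it uses) and assumes every step carries stepType, status, and durationMs. summarize_steps is a hypothetical helper.

```python
import json
from typing import Any, Dict, List

steps_json = """
[
  {"stepType": "UPLOAD", "status": "COMPLETE", "durationMs": 150},
  {"stepType": "OCR", "status": "COMPLETE", "durationMs": 2300},
  {"stepType": "EXTRACTION", "status": "COMPLETE", "durationMs": 1500}
]
"""


def summarize_steps(steps: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Total the pipeline duration and flag the slowest and unfinished steps."""
    slowest = max(steps, key=lambda step: step["durationMs"])
    return {
        "total_ms": sum(step["durationMs"] for step in steps),
        "slowest": slowest["stepType"],
        "not_complete": [s["stepType"] for s in steps if s["status"] != "COMPLETE"],
    }


summary = summarize_steps(json.loads(steps_json))
print(summary)  # {'total_ms': 3950, 'slowest': 'OCR', 'not_complete': []}
```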
Document Status Values
Common status values for workflow management:
| Status | Description | Use Case |
|---|---|---|
| UPLOADED | Document uploaded, awaiting processing | Initial state |
| PROCESSING | AI processing in progress | During extraction |
| PROCESSED | Processing complete, data extracted | Ready for review |
| REVIEW | Awaiting human review | Quality control |
| APPROVED | Reviewed and approved | Ready for export |
| REJECTED | Rejected during review | Needs correction |
| FAILED | Processing failed | Error handling |
| ARCHIVED | Archived for retention | Long-term storage |
Best Practices
Use External Data for System Integration
✅ Good: Store external references
external_data = {
    "erp_id": "INV-2024-001",
    "posted": True,
    "post_date": "2024-01-15"
}
❌ Avoid: Duplicating document content
external_data = {
    "vendor": "ACME",  # Already in extracted data
    "amount": "1500"   # Already in extracted data
}
Choose the Right Access Method
✅ Good: Use store path when browsing
files = platform.list_store_files("my-org/invoices")
✅ Good: Use family ID when processing
data = platform.get_document_external_data(family_id)
❌ Avoid: Using family ID for browsing
# Don't iterate all families just to list documents
Status Workflow
✅ Good: Clear status progression
UPLOADED → PROCESSING → PROCESSED → REVIEW → APPROVED
❌ Avoid: Unclear status values
UPLOADED → DONE → FINISHED → OK
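One way to keep the progression clear in client code is to check each transition before calling the status update. The transition map below is an assumption built from the status table and the example progression. The source does not say whether the platform enforces any ordering, so treat this as a client-side guard only.

```python
from typing import Dict, Set

# Assumed transitions; adjust to match your own workflow.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "UPLOADED": {"PROCESSING"},
    "PROCESSING": {"PROCESSED", "FAILED"},
    "PROCESSED": {"REVIEW"},
    "REVIEW": {"APPROVED", "REJECTED"},
    "APPROVED": {"ARCHIVED"},
    "REJECTED": {"PROCESSING"},
    "FAILED": {"PROCESSING"},
    "ARCHIVED": set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True when moving from current to new is allowed by the map."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


print(can_transition("UPLOADED", "PROCESSING"))  # True
print(can_transition("UPLOADED", "APPROVED"))    # False
print(can_transition("DONE", "OK"))              # False (unknown status)
```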
Reprocessing Documents
Trigger reprocessing of a document family with specific assistants:
PUT /api/document-families/{id}/reprocess?assistantId={assistantId1}&assistantId={assistantId2}
import requests
response = requests.put(
    f"https://platform.kodexa-enterprise.com/api/document-families/{family_id}/reprocess",
    headers={"x-api-key": api_key},
    params={"assistantId": ["assistant-1", "assistant-2"]}
)
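The endpoint takes assistantId as a repeated query parameter. When a list is passed through params, requests repeats the key once per value. The standard-library equivalent below shows the resulting query string without making a network call.

```python
from urllib.parse import urlencode

# doseq=True expands the list into one key=value pair per element.
query = urlencode({"assistantId": ["assistant-1", "assistant-2"]}, doseq=True)
print(query)  # assistantId=assistant-1&assistantId=assistant-2
```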
Downloading Native Files
Download the original native file (PDF, XLSX, etc.) for a document family:
GET /api/document-families/{id}/native
This returns the original uploaded file with the appropriate Content-Type header and filename. This is useful when you need to retrieve the source document rather than the processed Kodexa document representation.
import requests
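response = requests.get(
    f"https://platform.kodexa-enterprise.com/api/document-families/{family_id}/native",
    headers={"x-api-key": api_key}
)
# Save the native file, taking the name from Content-Disposition when present
disposition = response.headers.get("Content-Disposition", "")
if "filename=" in disposition:
    # Strip the quotes that servers commonly place around the filename
    filename = disposition.split("filename=")[-1].strip().strip('"')
else:
    filename = "document"
with open(filename, "wb") as f:
    f.write(response.content)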
The native file endpoint returns the original uploaded file. If the document was uploaded as a PDF, you get the PDF back. This is distinct from the /export endpoint which returns a .dfm package containing all content objects and metadata.
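Splitting the Content-Disposition header by hand can mishandle quoted names or extra parameters. A slightly more defensive sketch using the standard library's email header parser follows. filename_from_disposition is a hypothetical helper, and the exact header format the platform returns is not specified here.

```python
import os
from email.message import Message
from typing import Optional


def filename_from_disposition(header: Optional[str], default: str = "document") -> str:
    """Extract a safe filename from a Content-Disposition header value."""
    if not header:
        return default
    message = Message()
    message["Content-Disposition"] = header
    # get_filename() reads the filename parameter and removes surrounding quotes.
    filename = message.get_filename()
    # basename() drops any directory components a server might include.
    name = os.path.basename(filename or "")
    return name or default


print(filename_from_disposition('attachment; filename="invoice.pdf"'))  # invoice.pdf
print(filename_from_disposition(None))                                  # document
```

With requests, the header value would come from response.headers.get("Content-Disposition").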
Exporting Document Families
Export a document family as a .dfm file:
GET /api/document-families/{id}/export
This returns a downloadable file containing the complete document family including all content objects and metadata.
Getting Data Exports
Export data objects from a document family in various formats:
GET /api/document-families/{id}/data?format={format}
Supported formats:
- json - Standard JSON format
- csv - Comma-separated values
- xml - XML format
- datalake - NDJSON for lakehouse/S3 storage with metadata wrapper
curl "https://platform.kodexa-enterprise.com/api/document-families/{id}/data?format=json&friendlyNames=true" \
-H "x-api-key: your-api-key-here"
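NDJSON responses, such as the datalake format, carry one JSON object per line rather than a single array, so they are parsed line by line. The sketch below is generic: the sample records are invented for illustration, and the actual fields and metadata wrapper returned by the platform are not documented here.

```python
import json
from typing import Any, Dict, Iterator


def iter_ndjson(text: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed object per non-empty line of NDJSON text."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


# Hypothetical sample; real records will have a different shape.
sample = '{"id": 1, "total": "1500"}\n\n{"id": 2, "total": "320"}\n'
records = list(iter_ndjson(sample))
print(len(records))  # 2
```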