# Processing

The processing module provides classes for tracking document processing history, recording knowledge items, and managing execution pipelines.
## ProcessingStep

A `ProcessingStep` represents a unit of work in a document processing pipeline. Steps form a directed acyclic graph (DAG) with parent-child relationships, enabling you to track the full lineage of how a document was processed.

### Creating Steps
```python
from kodexa_document import ProcessingStep

# Create a processing step
step = ProcessingStep(name="PDF Extraction")

# With metadata
step = ProcessingStep(
    name="Invoice Classification",
    metadata={"model_version": "2.1", "threshold": 0.85},
    presentation_metadata={"icon": "file-invoice", "color": "blue"}
)
```
### Fields

| Field | Type | Description |
|---|---|---|
| `id` | `str` | UUID (auto-generated) |
| `name` | `str` | Step name (required) |
| `start_timestamp` | `datetime` | When the step started |
| `duration` | `int` | Duration in milliseconds |
| `metadata` | `dict` | Arbitrary key-value metadata |
| `presentation_metadata` | `dict` | UI display hints |
| `children` | `List[ProcessingStep]` | Child steps |
| `parents` | `List[ProcessingStep]` | Parent steps |
| `internal_steps` | `List[ProcessingStep]` | Internal sub-steps |
| `knowledge_items` | `List[KnowledgeItem]` | Associated knowledge items |
### Parent-Child Relationships

Build processing hierarchies:
```python
# Create a pipeline
pipeline = ProcessingStep(name="Document Pipeline")

# Add child steps
extraction = ProcessingStep(name="Text Extraction")
pipeline.add_child(extraction)  # Bidirectional link

classification = ProcessingStep(name="Classification")
pipeline.add_child(classification)

tagging = ProcessingStep(name="Entity Tagging")
extraction.add_child(tagging)

# Children know their parents
print(tagging.parents[0].name)  # "Text Extraction"
```
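Because every step holds references in both directions, walking the graph in either direction is straightforward. The sketch below uses a minimal stand-in class (not the actual `ProcessingStep` implementation) to show how a step's full ancestry can be collected, root-first:

```python
# Minimal stand-in for ProcessingStep's bidirectional links (illustration only).
class Step:
    def __init__(self, name):
        self.name = name
        self.parents = []
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        child.parents.append(self)

def lineage(step):
    """Collect all ancestor names, deduplicated, root-first."""
    seen, order = set(), []

    def walk(s):
        for p in s.parents:
            if p.name not in seen:
                walk(p)  # visit the parent's own ancestors first
                seen.add(p.name)
                order.append(p.name)

    walk(step)
    return order

pipeline = Step("Document Pipeline")
extraction = Step("Text Extraction")
pipeline.add_child(extraction)
tagging = Step("Entity Tagging")
extraction.add_child(tagging)

print(lineage(tagging))  # ['Document Pipeline', 'Text Extraction']
```

The same traversal works on real `ProcessingStep` objects, since they expose the same `parents` list.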
### Merging Steps

Combine multiple processing branches:
```python
# Two independent processing steps
ocr_step = ProcessingStep(name="OCR Processing")
nlp_step = ProcessingStep(name="NLP Analysis")

# Merge them into a combined step
merged = ProcessingStep.merge_with(ocr_step, nlp_step)
print(merged.name)          # "Merged Step"
print(len(merged.parents))  # 2
```
### Serialization

Steps serialize to JSON with circular reference handling:
```python
step = ProcessingStep(name="My Step")
child = ProcessingStep(name="Child Step")
step.add_child(child)

# To dict/JSON
step_dict = step.to_dict()
json_str = step.to_json()

# From dict/JSON
restored = ProcessingStep.from_dict(step_dict)
restored = ProcessingStep.from_json(json_str)
```
The `to_dict()` method uses a `seen` set to handle the circular references created by bidirectional parent-child links. The `from_dict()` method uses a `step_cache` to reconstruct these references.
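The seen-set pattern can be illustrated in a few lines of plain Python. This is a sketch of the general technique, not the library's actual implementation: children are serialized recursively, while any step already visited collapses to an id-only reference, so the cycle terminates:

```python
import uuid

# Minimal stand-in step with bidirectional links (illustration only).
class Step:
    def __init__(self, name):
        self.id = str(uuid.uuid4())
        self.name = name
        self.children = []
        self.parents = []

    def add_child(self, child):
        self.children.append(child)
        child.parents.append(self)

def to_dict(step, seen=None):
    """Serialize a step graph; already-seen steps become id-only references."""
    seen = seen if seen is not None else set()
    if step.id in seen:
        return {"id": step.id}  # break the cycle with a reference
    seen.add(step.id)
    return {
        "id": step.id,
        "name": step.name,
        "children": [to_dict(c, seen) for c in step.children],
        "parents": [to_dict(p, seen) for p in step.parents],
    }

root = Step("Pipeline")
child = Step("Extraction")
root.add_child(child)

data = to_dict(root)
# The child's back-reference to the root collapses to an id-only dict.
assert data["children"][0]["parents"][0] == {"id": root.id}
```

Deserialization inverts this with a cache keyed by id, as `from_dict()` does with its `step_cache`: the first time an id appears, a full object is built; later id-only references are resolved against the cache.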
## KnowledgeItem

A `KnowledgeItem` represents a piece of knowledge produced or consumed during processing.
```python
from kodexa_document.processing.processing_step import KnowledgeItem, KnowledgeFeature

item = KnowledgeItem(
    title="Invoice #12345",
    description="Extracted invoice data",
    slug="invoice-12345",
    knowledge_item_type_ref="knowledge-item-type://my-org/invoice",
    properties={"total": 1234.56, "vendor": "Acme Corp"}
)
```
### Fields

| Field | Type | Description |
|---|---|---|
| `id` | `str` | UUID (auto-generated) |
| `title` | `str` | Display title |
| `description` | `str` | Description |
| `slug` | `str` | URL-friendly identifier |
| `sequence_order` | `int` | Ordering within a set |
| `knowledge_item_type_ref` | `str` | Reference to the item type |
| `properties` | `dict` | Arbitrary properties |
| `features` | `List[KnowledgeFeature]` | Associated features |
## KnowledgeFeature

A `KnowledgeFeature` attaches structured metadata to knowledge items.
```python
feature = KnowledgeFeature(
    feature_type_ref="knowledge-feature-type://my-org/confidence-score",
    slug="extraction-confidence",
    properties={"score": 0.95, "model": "invoice-v2"},
    extended_properties={"per_field_scores": {"total": 0.99, "date": 0.91}}
)
item.features.append(feature)
```
### Fields

| Field | Type | Description |
|---|---|---|
| `id` | `str` | UUID (auto-generated) |
| `feature_type_ref` | `str` | Reference to the feature type |
| `slug` | `str` | URL-friendly identifier |
| `active` | `bool` | Whether this feature is active (default: `True`) |
| `properties` | `dict` | Core properties |
| `extended_properties` | `dict` | Additional properties |
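Because each feature carries an `active` flag defaulting to `True`, a common pattern is to filter an item's features down to the active ones before using them. A minimal sketch with a stand-in dataclass (not the actual `KnowledgeFeature` class):

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    """Stand-in for KnowledgeFeature (illustration only)."""
    slug: str
    active: bool = True  # mirrors the documented default
    properties: dict = field(default_factory=dict)

features = [
    Feature("extraction-confidence", properties={"score": 0.95}),
    Feature("legacy-score", active=False),  # superseded, kept for history
]

# Keep only the features that are still active
active = [f for f in features if f.active]
print([f.slug for f in active])  # ['extraction-confidence']
```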
### Attaching to Processing Steps

Knowledge items are associated with processing steps:
```python
step = ProcessingStep(name="Invoice Extraction")

item = KnowledgeItem(
    title="Invoice #12345",
    knowledge_item_type_ref="knowledge-item-type://my-org/invoice"
)

feature = KnowledgeFeature(
    feature_type_ref="knowledge-feature-type://my-org/extraction-result",
    properties={"status": "complete", "field_count": 12}
)

item.features.append(feature)
step.knowledge_items.append(item)
```
## PipelineContext

`PipelineContext` tracks the state of a running execution pipeline. It is primarily used by module developers building custom processing steps.
```python
from kodexa_document.platform.kodexa import PipelineContext

context = PipelineContext(
    execution_id="exec-abc-123",
    content_provider=my_content_provider
)
```
### Fields

| Field | Type | Description |
|---|---|---|
| `execution_id` | `str` | UUID for this execution (auto-generated if not provided) |
| `statistics` | `PipelineStatistics` | Tracks documents processed |
| `output_document` | `Document` | The output document |
| `content_objects` | `list` | Content objects in the pipeline |
| `stop_on_exception` | `bool` | Whether to halt on errors (default: `True`) |
| `current_document` | `Document` | The document currently being processed |
| `document_family` | `DocumentFamily` | The document family context |
| `document_store` | `Store` | The associated document store |
### Status Updates

Report progress during execution:

```python
context.update_status("Extracting page 5 of 20", progress=5, progress_max=20)
```
### Cancellation

Check for user-initiated cancellation:

```python
if context.is_cancelled():
    print("Execution cancelled by user")
    return
```
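A typical processing loop combines both of these: it reports progress as it goes and checks for cancellation between units of work. The sketch below uses a hypothetical stub context standing in for `PipelineContext` (which provides `update_status()` and `is_cancelled()` as shown above) so the pattern can run on its own:

```python
class StubContext:
    """Minimal stand-in for PipelineContext (illustration only)."""
    def __init__(self, cancel_after=None):
        self._cancel_after = cancel_after  # simulate a user cancelling mid-run
        self.updates = []

    def update_status(self, message, progress=None, progress_max=None):
        self.updates.append((message, progress, progress_max))

    def is_cancelled(self):
        return (self._cancel_after is not None
                and len(self.updates) >= self._cancel_after)

def process_pages(pages, context):
    processed = []
    for i, page in enumerate(pages, start=1):
        if context.is_cancelled():
            break  # stop cleanly on user-initiated cancellation
        context.update_status(f"Extracting page {i} of {len(pages)}",
                              progress=i, progress_max=len(pages))
        processed.append(page.upper())  # placeholder for real work
    return processed

ctx = StubContext(cancel_after=2)
result = process_pages(["a", "b", "c", "d"], ctx)
print(len(result))  # 2 -- cancelled after two status updates
```

Checking `is_cancelled()` once per unit of work keeps cancellation responsive without polling inside tight inner loops.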
## RemoteStep

`RemoteStep` wraps a reference to a module on the platform and can process documents remotely:
```python
from kodexa_document.platform.client import RemoteStep

step = RemoteStep(
    ref="my-org/fast-pdf-model:1.0.0",
    step_type="ACTION",
    options={"dpi": 300, "language": "en"}
)

# Process a document through the remote module
result_doc = step.process(document, context)
```
## Complete Example
```python
from datetime import datetime
from kodexa_document import ProcessingStep
from kodexa_document.processing.processing_step import KnowledgeItem, KnowledgeFeature

# Build a processing pipeline history
pipeline = ProcessingStep(
    name="Invoice Processing Pipeline",
    start_timestamp=datetime.now(),
    metadata={"version": "3.0"}
)

# Stage 1: PDF extraction
extraction = ProcessingStep(
    name="PDF Extraction",
    metadata={"model": "fast-pdf-model", "pages": 3}
)
pipeline.add_child(extraction)

# Stage 2: Classification
classification = ProcessingStep(
    name="Document Classification",
    metadata={"result": "invoice", "confidence": 0.98}
)
pipeline.add_child(classification)

# Stage 3: Data extraction (depends on both previous steps)
data_extraction = ProcessingStep(
    name="Data Extraction",
    metadata={"fields_extracted": 12}
)
extraction.add_child(data_extraction)
classification.add_child(data_extraction)

# Attach a knowledge item to the data extraction step
item = KnowledgeItem(
    title="Invoice #12345",
    knowledge_item_type_ref="knowledge-item-type://acme/invoice",
    properties={"vendor": "Acme Corp", "total": 1234.56}
)
item.features.append(KnowledgeFeature(
    feature_type_ref="knowledge-feature-type://acme/extraction-score",
    properties={"overall_confidence": 0.95}
))
data_extraction.knowledge_items.append(item)

# Serialize the full pipeline
pipeline_json = pipeline.to_json()

# Deserialize later
restored = ProcessingStep.from_json(pipeline_json)
print(f"Pipeline: {restored.name}")
print(f"Steps: {len(restored.children)}")
```