# Processing

> Track Kodexa document processing steps, knowledge items, and execution pipelines from Python using the processing module of the Python SDK.

The processing module provides classes for tracking document processing history, recording knowledge items, and managing execution pipelines.

## ProcessingStep

A `ProcessingStep` represents a unit of work in a document processing pipeline. Steps form a DAG (directed acyclic graph) with parent-child relationships, enabling you to track the full lineage of how a document was processed.

### Creating Steps

```python theme={null}
from kodexa_document import ProcessingStep

# Create a processing step
step = ProcessingStep(name="PDF Extraction")

# With metadata
step = ProcessingStep(
    name="Invoice Classification",
    metadata={"model_version": "2.1", "threshold": 0.85},
    presentation_metadata={"icon": "file-invoice", "color": "blue"}
)
```

### Fields

| Field                   | Type                   | Description                  |
| ----------------------- | ---------------------- | ---------------------------- |
| `id`                    | `str`                  | UUID (auto-generated)        |
| `name`                  | `str`                  | Step name (required)         |
| `start_timestamp`       | `datetime`             | When the step started        |
| `duration`              | `int`                  | Duration in milliseconds     |
| `metadata`              | `dict`                 | Arbitrary key-value metadata |
| `presentation_metadata` | `dict`                 | UI display hints             |
| `children`              | `List[ProcessingStep]` | Child steps                  |
| `parents`               | `List[ProcessingStep]` | Parent steps                 |
| `internal_steps`        | `List[ProcessingStep]` | Internal sub-steps           |
| `knowledge_items`       | `List[KnowledgeItem]`  | Associated knowledge items   |
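
Because `duration` is an integer number of milliseconds, a measured elapsed time has to be converted before it is stored. The following is a minimal sketch, assuming you time the work yourself; `timed_ms` is a hypothetical helper and not part of the SDK.

```python
import time
from datetime import datetime


def timed_ms(work):
    """Run work() and return (result, start_timestamp, duration_ms).

    Hypothetical helper, not part of the SDK: it only produces values
    shaped like the start_timestamp and duration fields in the table.
    """
    start_timestamp = datetime.now()
    started = time.perf_counter()
    result = work()
    # perf_counter() returns fractional seconds; duration is listed as int milliseconds
    duration_ms = int((time.perf_counter() - started) * 1000)
    return result, start_timestamp, duration_ms


result, start_timestamp, duration_ms = timed_ms(lambda: sum(range(1000)))
```

The timestamp can be passed as `start_timestamp` when constructing a step. Whether `duration` is also accepted by the constructor is not stated here, so setting it as an attribute afterwards may be the safer assumption.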

### Parent-Child Relationships

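Use `add_child()` to link steps into a hierarchy. The link is bidirectional: the child is added to the parent's `children` and the parent to the child's `parents`: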

```python theme={null}
# Create a pipeline
pipeline = ProcessingStep(name="Document Pipeline")

# Add child steps
extraction = ProcessingStep(name="Text Extraction")
pipeline.add_child(extraction)  # Bidirectional link

classification = ProcessingStep(name="Classification")
pipeline.add_child(classification)

tagging = ProcessingStep(name="Entity Tagging")
extraction.add_child(tagging)

# Children know their parents
print(tagging.parents[0].name)  # "Text Extraction"
```
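
Since every step exposes `parents`, the lineage of a step can be recovered by walking upward. The sketch below relies only on the `name` and `parents` attributes from the fields table; `ancestors` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for `ProcessingStep` so the snippet runs without the SDK.

```python
from types import SimpleNamespace


def ancestors(step):
    """Return the names of all ancestors of step, nearest first, without duplicates."""
    names, seen, queue = [], set(), list(step.parents)
    while queue:
        current = queue.pop(0)
        if id(current) in seen:  # a step in a DAG can be reached by more than one path
            continue
        seen.add(id(current))
        names.append(current.name)
        queue.extend(current.parents)
    return names


# Stand-ins mirroring part of the hierarchy built above
pipeline = SimpleNamespace(name="Document Pipeline", parents=[])
extraction = SimpleNamespace(name="Text Extraction", parents=[pipeline])
tagging = SimpleNamespace(name="Entity Tagging", parents=[extraction])

print(ancestors(tagging))  # ['Text Extraction', 'Document Pipeline']
```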

### Merging Steps

Use `ProcessingStep.merge_with()` to join independent branches into a single step that has the merged steps as its parents:

```python theme={null}
# Two independent processing steps
ocr_step = ProcessingStep(name="OCR Processing")
nlp_step = ProcessingStep(name="NLP Analysis")

# Merge them into a combined step
merged = ProcessingStep.merge_with(ocr_step, nlp_step)
print(merged.name)  # "Merged Step"
print(len(merged.parents))  # 2
```

### Serialization

Steps serialize to JSON with circular reference handling:

```python theme={null}
step = ProcessingStep(name="My Step")
child = ProcessingStep(name="Child Step")
step.add_child(child)

# To dict/JSON
step_dict = step.to_dict()
json_str = step.to_json()

# From dict/JSON
restored = ProcessingStep.from_dict(step_dict)
restored = ProcessingStep.from_json(json_str)
```

The `to_dict()` method uses a `seen` set to handle circular references from bidirectional parent-child links. The `from_dict()` method uses a `step_cache` to reconstruct these references.
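
To illustrate the idea, the sketch below serializes a bidirectionally linked node with a `seen` set and rebuilds the links with a cache keyed by id. This is a simplified stand-in and not the SDK's implementation: the `Node` class and its dictionary layout are assumptions made for the example.

```python
import uuid


class Node:
    """Stand-in for a step with bidirectional parent/child links."""

    def __init__(self, name, node_id=None):
        self.id = node_id or str(uuid.uuid4())
        self.name = name
        self.children = []
        self.parents = []

    def add_child(self, child):
        self.children.append(child)
        child.parents.append(self)

    def to_dict(self, seen=None):
        seen = set() if seen is None else seen
        if self.id in seen:
            # Already emitted on another path: write a reference, not a second copy
            return {"id": self.id}
        seen.add(self.id)
        return {
            "id": self.id,
            "name": self.name,
            # Parents are implied by nesting, so only children are serialized
            "children": [child.to_dict(seen) for child in self.children],
        }

    @classmethod
    def from_dict(cls, data, cache=None):
        cache = {} if cache is None else cache
        if data["id"] in cache:
            # Reference to a node that was already rebuilt: reuse the same object
            return cache[data["id"]]
        node = cls(data["name"], node_id=data["id"])
        cache[node.id] = node
        for child_data in data.get("children", []):
            node.add_child(cls.from_dict(child_data, cache))
        return node


# A diamond: "Data Extraction" has two parents
root, a, b = Node("Pipeline"), Node("Extraction"), Node("Classification")
shared = Node("Data Extraction")
root.add_child(a)
root.add_child(b)
a.add_child(shared)
b.add_child(shared)

restored = Node.from_dict(root.to_dict())
shared_copy = restored.children[0].children[0]
print(shared_copy is restored.children[1].children[0])  # True
print([p.name for p in shared_copy.parents])  # ['Extraction', 'Classification']
```

Without the `seen` set, the shared node would be written out twice. Without the cache, it would come back as two separate objects, each with a single parent.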

## KnowledgeItem

A `KnowledgeItem` represents a piece of knowledge produced or consumed during processing.

```python theme={null}
from kodexa_document.processing.processing_step import KnowledgeItem, KnowledgeFeature

item = KnowledgeItem(
    title="Invoice #12345",
    description="Extracted invoice data",
    slug="invoice-12345",
    knowledge_item_type_ref="knowledge-item-type://my-org/invoice",
    properties={"total": 1234.56, "vendor": "Acme Corp"}
)
```

### Fields

| Field                     | Type                     | Description                |
| ------------------------- | ------------------------ | -------------------------- |
| `id`                      | `str`                    | UUID (auto-generated)      |
| `title`                   | `str`                    | Display title              |
| `description`             | `str`                    | Description                |
| `slug`                    | `str`                    | URL-friendly identifier    |
| `sequence_order`          | `int`                    | Ordering within a set      |
| `knowledge_item_type_ref` | `str`                    | Reference to the item type |
| `properties`              | `dict`                   | Arbitrary properties       |
| `features`                | `List[KnowledgeFeature]` | Associated features        |

## KnowledgeFeature

A `KnowledgeFeature` attaches structured metadata to knowledge items.

```python theme={null}
feature = KnowledgeFeature(
    feature_type_ref="knowledge-feature-type://my-org/confidence-score",
    slug="extraction-confidence",
    properties={"score": 0.95, "model": "invoice-v2"},
    extended_properties={"per_field_scores": {"total": 0.99, "date": 0.91}}
)

item.features.append(feature)
```

### Fields

| Field                 | Type   | Description                                      |
| --------------------- | ------ | ------------------------------------------------ |
| `id`                  | `str`  | UUID (auto-generated)                            |
| `feature_type_ref`    | `str`  | Reference to the feature type                    |
| `slug`                | `str`  | URL-friendly identifier                          |
| `active`              | `bool` | Whether this feature is active (default: `True`) |
| `properties`          | `dict` | Core properties                                  |
| `extended_properties` | `dict` | Additional properties                            |
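
When reading features back, the `active` flag and `feature_type_ref` are the obvious fields to filter on. The following is a minimal sketch that relies only on the fields listed above. `active_features` is a hypothetical helper, the second type reference is made up for the example, and `SimpleNamespace` objects stand in for `KnowledgeItem` and `KnowledgeFeature` so the snippet runs without the SDK.

```python
from types import SimpleNamespace


def active_features(item, feature_type_ref=None):
    """Return the item's active features, optionally limited to one feature type."""
    return [
        f for f in item.features
        if f.active and (feature_type_ref is None or f.feature_type_ref == feature_type_ref)
    ]


score_ref = "knowledge-feature-type://my-org/confidence-score"
other_ref = "knowledge-feature-type://my-org/other"  # hypothetical type reference

item = SimpleNamespace(features=[
    SimpleNamespace(feature_type_ref=score_ref, slug="current", active=True),
    SimpleNamespace(feature_type_ref=score_ref, slug="superseded", active=False),
    SimpleNamespace(feature_type_ref=other_ref, slug="other", active=True),
])

print([f.slug for f in active_features(item, score_ref)])  # ['current']
```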

## Attaching to Processing Steps

Attach knowledge items to a step by appending them to its `knowledge_items` list:

```python theme={null}
step = ProcessingStep(name="Invoice Extraction")

item = KnowledgeItem(
    title="Invoice #12345",
    knowledge_item_type_ref="knowledge-item-type://my-org/invoice"
)

feature = KnowledgeFeature(
    feature_type_ref="knowledge-feature-type://my-org/extraction-result",
    properties={"status": "complete", "field_count": 12}
)
item.features.append(feature)

step.knowledge_items.append(item)
```

## PipelineContext

`PipelineContext` tracks the state of a running execution pipeline. It is primarily used by module developers building custom processing steps.

```python theme={null}
from kodexa_document.platform.kodexa import PipelineContext

# my_content_provider is a placeholder for a content provider you have already created
context = PipelineContext(
    execution_id="exec-abc-123",
    content_provider=my_content_provider
)
```

### Fields

| Field               | Type                 | Description                                              |
| ------------------- | -------------------- | -------------------------------------------------------- |
| `execution_id`      | `str`                | UUID for this execution (auto-generated if not provided) |
| `statistics`        | `PipelineStatistics` | Tracks documents processed                               |
| `output_document`   | `Document`           | The output document                                      |
| `content_objects`   | `list`               | Content objects in the pipeline                          |
| `stop_on_exception` | `bool`               | Whether to halt on errors (default: `True`)              |
| `current_document`  | `Document`           | The document currently being processed                   |
| `document_family`   | `DocumentFamily`     | The document family context                              |
| `document_store`    | `Store`              | The associated document store                            |

### Status Updates

Report progress during execution:

```python theme={null}
context.update_status("Extracting page 5 of 20", progress=5, progress_max=20)
```

### Cancellation

Check for user-initiated cancellation:

```python theme={null}
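def run_step(context):  # hypothetical function wrapping your step logic
    # Stop early if the user has cancelled the execution
    if context.is_cancelled():
        print("Execution cancelled by user")
        return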
```
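
In a long-running step the two calls can be combined: check for cancellation before each unit of work and report progress after it. The following is a minimal sketch, assuming only the `update_status` and `is_cancelled` calls shown above. `process_pages` and `StubContext` are hypothetical; the stub stands in for a real `PipelineContext` so the snippet runs on its own.

```python
class StubContext:
    """Hypothetical stand-in for PipelineContext; cancels after a set number of checks."""

    def __init__(self, cancel_after=None):
        self.cancel_after = cancel_after
        self.checks = 0
        self.messages = []

    def is_cancelled(self):
        self.checks += 1
        return self.cancel_after is not None and self.checks > self.cancel_after

    def update_status(self, message, progress=None, progress_max=None):
        self.messages.append((message, progress, progress_max))


def process_pages(pages, context):
    """Process pages one at a time and return the number completed."""
    total = len(pages)
    for done, page in enumerate(pages, start=1):
        if context.is_cancelled():
            return done - 1  # stop before starting the next page
        # ... per-page work would go here ...
        context.update_status(
            f"Extracting page {done} of {total}", progress=done, progress_max=total
        )
    return total


context = StubContext(cancel_after=2)
completed = process_pages(["p1", "p2", "p3", "p4"], context)
print(completed)  # 2
```

Checking once per page keeps the response to a cancellation request bounded by the time it takes to process a single page.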

## RemoteStep

`RemoteStep` wraps a reference to a module on the platform and can process documents remotely:

```python theme={null}
from kodexa_document.platform.client import RemoteStep

step = RemoteStep(
    ref="my-org/fast-pdf-model:1.0.0",
    step_type="ACTION",
    options={"dpi": 300, "language": "en"}
)

# Process a document through the remote module
# (document and context are assumed to already exist in your code)
result_doc = step.process(document, context)
```

## Complete Example

```python theme={null}
from datetime import datetime
from kodexa_document import ProcessingStep
from kodexa_document.processing.processing_step import KnowledgeItem, KnowledgeFeature

# Build a processing pipeline history
pipeline = ProcessingStep(
    name="Invoice Processing Pipeline",
    start_timestamp=datetime.now(),
    metadata={"version": "3.0"}
)

# Stage 1: PDF extraction
extraction = ProcessingStep(
    name="PDF Extraction",
    metadata={"model": "fast-pdf-model", "pages": 3}
)
pipeline.add_child(extraction)

# Stage 2: Classification
classification = ProcessingStep(
    name="Document Classification",
    metadata={"result": "invoice", "confidence": 0.98}
)
pipeline.add_child(classification)

# Stage 3: Data extraction (depends on both previous steps)
data_extraction = ProcessingStep(
    name="Data Extraction",
    metadata={"fields_extracted": 12}
)
extraction.add_child(data_extraction)
classification.add_child(data_extraction)

# Attach knowledge item to the data extraction step
item = KnowledgeItem(
    title="Invoice #12345",
    knowledge_item_type_ref="knowledge-item-type://acme/invoice",
    properties={"vendor": "Acme Corp", "total": 1234.56}
)

item.features.append(KnowledgeFeature(
    feature_type_ref="knowledge-feature-type://acme/extraction-score",
    properties={"overall_confidence": 0.95}
))

data_extraction.knowledge_items.append(item)

# Serialize the full pipeline
pipeline_json = pipeline.to_json()

# Deserialize later
restored = ProcessingStep.from_json(pipeline_json)
print(f"Pipeline: {restored.name}")
print(f"Top-level steps: {len(restored.children)}")
```
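
Note that `len(restored.children)` counts only direct children, and in this pipeline `Data Extraction` is reachable through two parents. To visit every step exactly once, a traversal has to remember what it has already seen. The following is a minimal sketch relying only on the `id`, `children`, and `knowledge_items` fields. `walk_steps` and `make` are hypothetical helpers, and the `SimpleNamespace` objects are stand-ins so the snippet runs without the SDK.

```python
from types import SimpleNamespace


def walk_steps(root):
    """Yield each step reachable from root exactly once, depth-first."""
    seen, stack = set(), [root]
    while stack:
        step = stack.pop()
        if step.id in seen:  # reached again through a second parent
            continue
        seen.add(step.id)
        yield step
        stack.extend(reversed(step.children))


def make(step_id, name, items=()):
    """Build a stand-in step with the attributes the traversal needs."""
    return SimpleNamespace(id=step_id, name=name, children=[], knowledge_items=list(items))


# Stand-ins mirroring the pipeline above
pipeline = make("1", "Invoice Processing Pipeline")
extraction = make("2", "PDF Extraction")
classification = make("3", "Document Classification")
data_extraction = make("4", "Data Extraction", items=["Invoice #12345"])
pipeline.children = [extraction, classification]
extraction.children = [data_extraction]
classification.children = [data_extraction]

names = [step.name for step in walk_steps(pipeline)]
items = [item for step in walk_steps(pipeline) for item in step.knowledge_items]
print(len(names))  # 4
print(items)  # ['Invoice #12345']
```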
