Getting Started

This guide covers the essential operations for working with Kodexa Documents: creating, loading, manipulating, and saving documents.

Creating Documents

Empty Document

Create a new document and build its structure:

from kodexa_document import Document

# Always use context managers for automatic cleanup
with Document() as doc:
    # Create the root node
    root = doc.create_node("document", "My Document")
    doc.content_node = root

    # Add child nodes
    section = doc.create_node("section", "Introduction", parent=root)
    para1 = doc.create_node("paragraph", "This is the first paragraph.", parent=section)
    para2 = doc.create_node("paragraph", "This is the second paragraph.", parent=section)

    print(f"Created document with {len(root.get_children())} sections")

From Text

Split text into paragraph nodes on a separator:

text = """First paragraph of content.
Second paragraph with more details.
Third paragraph to conclude."""

with Document.from_text(text, separator="\n") as doc:
    paragraphs = doc.select("//paragraph")
    print(f"Created {len(paragraphs)} paragraphs from text")
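
As a rough mental model, and assuming `from_text` simply splits the input on `separator` and creates one paragraph per piece, the split itself can be sketched in plain Python. `split_paragraphs` is a hypothetical helper, not part of kodexa_document, and dropping empty pieces is an assumption rather than documented behavior.

```python
def split_paragraphs(text: str, separator: str = "\n") -> list:
    """Split text on separator, dropping pieces that are empty after stripping."""
    # Assumption: blank pieces are skipped; the real library may keep them.
    return [piece.strip() for piece in text.split(separator) if piece.strip()]


text = """First paragraph of content.
Second paragraph with more details.
Third paragraph to conclude."""

pieces = split_paragraphs(text)
print(f"{len(pieces)} pieces")
```

Running the helper on your own input before calling `from_text` is a quick way to check that the chosen separator yields the paragraph count you expect.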

With Metadata

Initialize documents with metadata:

with Document(metadata={
    "title": "Invoice Analysis",
    "author": "Processing System",
    "created": "2024-01-15"
}) as doc:
    # Access metadata
    title = doc.get_metadata("title")
    print(f"Document title: {title}")

Loading Documents

From KDDB File

Load an existing document, either detached (an in-memory copy) or attached for in-place editing:

# Load into memory for fast processing (creates a copy)
with Document.from_kddb("document.kddb", detached=True) as doc:
    print(f"Loaded document: {doc.uuid}")
    nodes = doc.select("//*")
    print(f"Total nodes: {len(nodes)}")

# Load for in-place editing (modifies original file)
with Document.from_kddb("document.kddb", detached=False) as doc:
    # Changes are saved to the original file
    doc.set_metadata("last_accessed", "2024-01-15")
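
Because `detached=False` writes changes back to the original file, a cautious workflow might take a backup before opening the file for editing. The sketch below uses only the standard library; `backup_file` is a hypothetical helper, and the file it copies is a placeholder created for the demonstration, not a real KDDB.

```python
import shutil
import tempfile
from pathlib import Path


def backup_file(path: Path) -> Path:
    """Copy path to a sibling file with an extra .bak suffix and return the copy."""
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # copy2 also preserves timestamps
    return backup


# Demonstrate with a placeholder file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "document.kddb"
    original.write_bytes(b"placeholder bytes")

    backup = backup_file(original)
    original.write_bytes(b"edited in place")  # simulate an in-place edit

    backup_contents = backup.read_bytes()
    backup_name = backup.name
```

The backup keeps the pre-edit bytes even after the original is overwritten, so an unwanted in-place change can be rolled back by copying it over the original.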

From Bytes

Pass raw KDDB bytes, such as an API response body or a downloaded file, to from_kddb:

import requests

# Example: Load from an API response
response = requests.get("https://api.example.com/documents/123")
response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error body
kddb_bytes = response.content

with Document.from_kddb(kddb_bytes) as doc:
    print(f"Loaded document from API: {doc.uuid}")

From JSON

Load from JSON representation:

json_data = '{"uuid": "...", "metadata": {"title": "Test"}}'
with Document.from_json(json_data) as doc:
    print(f"Loaded from JSON: {doc.uuid}")

Working with Content Nodes

Navigation

Traverse the document tree:

with Document.from_kddb("document.kddb") as doc:
    root = doc.content_node

    # Get all children
    children = root.get_children()

    # Navigate relationships
    for child in children:
        parent = child.get_parent()       # Back to root
        siblings = child.get_siblings()   # Other children
        next_node = child.next_node()     # Next sibling
        depth = child.get_depth()         # Depth in tree

        print(f"Node type: {child.type}, depth: {depth}")
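
To make the relationships concrete, here is a toy tree in plain Python. `ToyNode` is a hypothetical stand-in, not the library's node class, and the sketch assumes the root sits at depth 0 and that siblings exclude the node itself; neither is confirmed for kodexa_document.

```python
class ToyNode:
    """Minimal tree node used only to illustrate parent/sibling/depth semantics."""

    def __init__(self, node_type, parent=None):
        self.node_type = node_type
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def siblings(self):
        # Other children of the same parent; a root has none
        if self.parent is None:
            return []
        return [node for node in self.parent.children if node is not self]

    def depth(self):
        # Count the hops from this node up to the root
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth


root = ToyNode("document")
section = ToyNode("section", parent=root)
para1 = ToyNode("paragraph", parent=section)
para2 = ToyNode("paragraph", parent=section)

print(para1.depth(), [n.node_type for n in para1.siblings()])
```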

Content Access

Read and modify node content:

with Document() as doc:
    root = doc.create_node("document")
    doc.content_node = root

    para = doc.create_node("paragraph", "Initial content", parent=root)

    # Read content
    print(f"Content: {para.content}")

    # Update content
    para.content = "Updated content"

    # Multi-part content
    para.set_content_parts(["Part 1", "Part 2", "Part 3"])
    parts = para.get_content_parts()

    # Get all content from node and descendants
    all_text = root.get_all_content(separator=" ")
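
One plausible reading of "all content from node and descendants" is a depth-first walk that joins each non-empty piece of content with the separator. The sketch below shows that idea on nested dictionaries; `gather_content` is a hypothetical helper, and the traversal order used by the library itself is an assumption.

```python
def gather_content(node: dict, separator: str = " ") -> str:
    """Join a node's content with its descendants' content, depth-first."""
    parts = []

    def visit(current):
        if current.get("content"):
            parts.append(current["content"])
        for child in current.get("children", []):
            visit(child)

    visit(node)
    return separator.join(parts)


# A root with no content of its own, one nested section, and one paragraph
tree = {
    "content": None,
    "children": [
        {"content": "Intro", "children": [{"content": "First.", "children": []}]},
        {"content": "Second.", "children": []},
    ],
}

combined = gather_content(tree)
print(combined)
```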

Querying with Selectors

Use XPath-like selectors to find nodes:

with Document.from_text("Para 1\nPara 2\nPara 3", separator="\n") as doc:
    # Select all nodes of a type
    all_paragraphs = doc.select("//paragraph")

    # Select first match only
    first_para = doc.select_first("//paragraph")

    # Filter by content
    matching = doc.select("//paragraph[contains(@content, 'Para 2')]")

    # Select tagged nodes
    tagged = doc.select("//*[@tag='important']")

    # Select with variables
    variables = {"search_term": "Para 1"}
    results = doc.select("//paragraph[contains(@content, $search_term)]", variables)

    print(f"Found {len(all_paragraphs)} paragraphs")

Common Selector Patterns

Selector	Description
`//*`	All nodes
`//paragraph`	All paragraphs
`//section/paragraph`	Direct child paragraphs of sections
`//paragraph[1]`	First paragraph
`//*[@tag='important']`	Nodes with ‘important’ tag
`//paragraph[contains(@content, 'text')]`	Paragraphs containing ‘text’
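
Several of these patterns follow standard XPath, so their general behavior can be explored with Python's built-in `xml.etree.ElementTree` on a small XML tree. This is an analogy only: ElementTree is not Kodexa's selector engine, it needs a leading `.` on the path, and it does not support `contains()`. In standard XPath, `paragraph[1]` matches every paragraph that is the first paragraph child of its parent, so it can return more than one node; whether Kodexa's selectors behave the same way is not confirmed here.

```python
import xml.etree.ElementTree as ET

# A tiny XML stand-in for a document tree (not a Kodexa document)
xml_source = """
<document>
  <section>
    <paragraph tag="important">A</paragraph>
    <paragraph>B</paragraph>
  </section>
  <section>
    <paragraph>C</paragraph>
  </section>
</document>
""".strip()

root = ET.fromstring(xml_source)


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [element.text for element in root.findall(path)]


all_paragraphs = texts(".//paragraph")           # every paragraph
direct_children = texts(".//section/paragraph")  # paragraphs directly under a section
first_in_parent = texts(".//paragraph[1]")       # first paragraph within each parent
important = texts(".//*[@tag='important']")      # any element with tag="important"

print(all_paragraphs, first_in_parent, important)
```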

Adding Features

Attach typed name/value features to nodes:

with Document() as doc:
    root = doc.create_node("document")
    doc.content_node = root
    para = doc.create_node("paragraph", "Styled text", parent=root)

    # Add features (type, name, value)
    para.add_feature("style", "font-family", "Arial")
    para.add_feature("style", "font-size", "12pt")
    para.add_feature("analysis", "word-count", 2)
    para.add_feature("position", "bbox", {"x": 100, "y": 200, "w": 300, "h": 50})

    # Retrieve features
    font = para.get_feature("style", "font-family")
    if font:
        print(f"Font: {font.get_value()}")

    # Get all features of a type
    style_features = para.get_features_of_type("style")

    # Get all features
    all_features = para.get_features()
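
Features are addressed by a type and a name together. A toy store in plain Python shows how that two-level addressing separates a single lookup from a per-type listing. `ToyFeatureStore` is hypothetical, and overwriting on a repeated (type, name) pair is an assumption, not documented library behavior.

```python
from collections import defaultdict


class ToyFeatureStore:
    """Illustrative two-level mapping: feature type -> feature name -> value."""

    def __init__(self):
        self._by_type = defaultdict(dict)

    def add(self, feature_type, name, value):
        # Assumption: adding the same (type, name) again replaces the value
        self._by_type[feature_type][name] = value

    def get(self, feature_type, name):
        # Use .get on the outer mapping so lookups never create empty entries
        return self._by_type.get(feature_type, {}).get(name)

    def of_type(self, feature_type):
        return dict(self._by_type.get(feature_type, {}))


store = ToyFeatureStore()
store.add("style", "font-family", "Arial")
store.add("style", "font-size", "12pt")
store.add("analysis", "word-count", 2)

print(store.get("style", "font-family"), store.of_type("style"))
```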

Adding Tags

Annotate nodes with tags:

with Document() as doc:
    root = doc.create_node("document")
    doc.content_node = root
    para = doc.create_node("paragraph", "Important invoice total: $1,234.56", parent=root)

    # Simple tag
    para.tag("important")

    # Tag with confidence and value
    para.tag("invoice-total", confidence=0.95, value="$1,234.56")

    # Check for tags
    if para.has_tag("important"):
        print("This paragraph is marked as important")

    # Get tag details
    tag = para.get_tag("invoice-total")
    if tag:
        confidence = tag.get("Confidence")
        value = tag.get("Value")
        print(f"Invoice total: {value} (confidence: {confidence})")

    # List all tags
    all_tags = para.get_tags()
    print(f"Tags: {all_tags}")
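
Confidence values are typically used to decide which tags to trust. The sketch below filters tag details by a threshold, using the "Confidence" key seen in the example above. The mapping of tag name to a details dictionary is a hypothetical shape chosen for illustration, `confident_tags` is not a library function, and the sample data is invented.

```python
def confident_tags(tag_details: dict, threshold: float = 0.9) -> list:
    """Return the names of tags whose confidence meets the threshold, sorted."""
    kept = []
    for name, details in tag_details.items():
        confidence = details.get("Confidence")
        # Treat a missing confidence as "not confident" rather than guessing
        if confidence is not None and confidence >= threshold:
            kept.append(name)
    return sorted(kept)


# Invented sample data in the assumed shape
tag_details = {
    "important": {},
    "invoice-total": {"Confidence": 0.95, "Value": "$1,234.56"},
    "vendor": {"Confidence": 0.6, "Value": "Acme Corp"},
}

trusted = confident_tags(tag_details)
print(trusted)
```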

Saving Documents

To KDDB File

Save to the native format:

with Document() as doc:
    root = doc.create_node("document", "Content to save")
    doc.content_node = root

    # Save to file
    doc.save("output.kddb")

To Bytes

Export for API responses:

with Document() as doc:
    root = doc.create_node("document", "API response")
    doc.content_node = root

    # Get as bytes
    kddb_bytes = doc.to_kddb()

    # Send in API response
    # return Response(content=kddb_bytes, media_type="application/octet-stream")

To JSON

Export for debugging or interoperability:

with Document() as doc:
    root = doc.create_node("document", "Debug output")
    doc.content_node = root

    # Pretty-printed JSON
    json_str = doc.to_json(indent=2)
    print(json_str)

    # As dictionary
    doc_dict = doc.to_dict()
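
When JSON output is used for debugging, sorting the keys makes two exports easier to compare line by line. The sketch below applies the standard `json` module to a plain dictionary standing in for the result of `to_dict()`; the sample dictionary and the `stable_json` helper are illustrative assumptions.

```python
import json


def stable_json(data: dict) -> str:
    """Serialize with sorted keys so equal dictionaries produce identical text."""
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False)


# Placeholder dictionary; a real document export will have more fields
sample = {"uuid": "example-uuid", "metadata": {"title": "Test", "author": "A"}}

print(stable_json(sample))
```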

Document Metadata

Setting Metadata

with Document() as doc:
    # Set individual values
    doc.set_metadata("title", "My Document")
    doc.set_metadata("author", "John Doe")
    doc.set_metadata("tags", ["invoice", "2024", "processed"])
    doc.set_metadata("config", {"threshold": 0.8, "model": "v2"})

    # Access all metadata
    metadata = doc.metadata

Labels

Categorize documents:

with Document() as doc:
    # Add labels
    doc.add_label("invoice")
    doc.add_label("financial")
    doc.add_label("q1-2024")

    # Get all labels
    labels = doc.labels
    print(f"Document labels: {labels}")

Error Handling

Handle common errors gracefully:

from kodexa_document import Document
from kodexa_document.errors import DocumentError, DocumentNotFoundError

try:
    with Document.from_kddb("missing.kddb") as doc:
        nodes = doc.select("//paragraph")

except DocumentNotFoundError:
    print("Document file not found")

except DocumentError as e:
    print(f"Document error: {e}")

except RuntimeError as e:
    print(f"Runtime error (possibly closed document): {e}")
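
The order of the `except` clauses matters whenever one exception type derives from another: Python uses the first clause that matches, so the more specific type has to come first. The toy hierarchy below demonstrates this. It assumes, without confirmation from the library, that the not-found error subclasses the general document error; the `Toy*` classes are stand-ins.

```python
class ToyDocumentError(Exception):
    """Stand-in for a general document error."""


class ToyNotFoundError(ToyDocumentError):
    """Stand-in for a more specific 'file not found' error."""


def classify(exc: Exception) -> str:
    """Raise exc and report which except clause handled it."""
    try:
        raise exc
    except ToyNotFoundError:
        return "not found"
    except ToyDocumentError:
        return "document error"
    except RuntimeError:
        return "runtime error"


print(classify(ToyNotFoundError("missing.kddb")))
```

If the two document clauses were swapped, the general clause would also catch the not-found case and the specific handler would never run.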

Complete Example

Here’s a full workflow combining the concepts:

from kodexa_document import Document

def process_document():
    # Create a new document
    with Document() as doc:
        # Set document metadata
        doc.set_metadata("title", "Invoice Processing Result")
        doc.set_metadata("processor", "kodexa-document-example")
        doc.add_label("invoice")

        # Build document structure
        root = doc.create_node("document", "Invoice #12345")
        doc.content_node = root

        # Add header section
        header = doc.create_node("section", "Header", parent=root)
        doc.create_node("paragraph", "Vendor: Acme Corp", parent=header)
        doc.create_node("paragraph", "Date: 2024-01-15", parent=header)

        # Add line items
        items = doc.create_node("section", "Line Items", parent=root)

        for i, (desc, amount) in enumerate([
            ("Widget A", 100.00),
            ("Widget B", 250.00),
            ("Service Fee", 50.00)
        ]):
            item = doc.create_node("paragraph", f"{desc}: ${amount:.2f}", parent=items)
            item.add_feature("line-item", "amount", amount)
            item.add_feature("line-item", "index", i)
            item.tag("line-item", value=str(amount))

        # Add total
        total = doc.create_node("paragraph", "Total: $400.00", parent=root)
        total.tag("invoice-total", confidence=1.0, value="400.00")
        total.add_feature("summary", "calculated", True)

        # Query the document
        line_items = doc.select("//*[@tag='line-item']")
        print(f"Found {len(line_items)} line items")

        total_node = doc.select_first("//*[@tag='invoice-total']")
        if total_node:
            print(f"Invoice total: {total_node.content}")

        # Save the result
        doc.save("processed_invoice.kddb")
        print("Document saved successfully")

if __name__ == "__main__":
    process_document()
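
The example hard-codes the total as $400.00. In a real pipeline it is usually safer to compute the total from the line items and compare it with the extracted value. A minimal sketch using `decimal.Decimal`, which avoids floating-point rounding on currency amounts, with the same three line items as above:

```python
from decimal import Decimal

line_items = [
    ("Widget A", "100.00"),
    ("Widget B", "250.00"),
    ("Service Fee", "50.00"),
]

# Sum as Decimal, starting from Decimal zero so the result stays a Decimal
total = sum((Decimal(amount) for _, amount in line_items), Decimal("0"))
formatted = f"Total: ${total:.2f}"

print(formatted)
```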

Working with Accessors

Once a document has been processed and contains extracted data, you can use accessors to work with data objects and attributes programmatically.

Data Objects

from kodexa_document import Document, DataObjectAccessor, DataObjectInput

with Document.from_kddb("processed.kddb") as doc:
    accessor = DataObjectAccessor(doc)

    # List all data objects
    all_objects = accessor.get_all()

    # Get root-level data objects (no parent)
    roots = accessor.get_roots()

    # Get children of a specific group
    children = accessor.get_children(parent_group_uuid="some-uuid")

    # Create a new data object
    new_obj = accessor.create(DataObjectInput(
        taxonomy_ref="taxonomy://my-org/invoice",
        path="/invoice"
    ))

    # Look up by UUID
    obj = accessor.get_by_uuid("abc-123")

Data Attributes

from kodexa_document import Document, DataAttributeAccessor, DataAttributeInput

with Document.from_kddb("processed.kddb") as doc:
    accessor = DataAttributeAccessor(doc)

    # Get attributes for a data object
    attrs = accessor.get_for_data_object(data_object_id=1)

    # Create an attribute
    attr = accessor.create(
        data_object_id=1,
        input=DataAttributeInput(
            tag="invoice/total",
            value="1234.56",
            confidence=0.95
        )
    )

    # Update an attribute
    accessor.update(
        attr_id=1,
        input=DataAttributeInput(value="1300.00")
    )

Audit Trail

from kodexa_document import Document, AuditAccessor

with Document.from_kddb("processed.kddb") as doc:
    audit = AuditAccessor(doc)

    # List all revisions
    revisions = audit.list_revisions()

    # Get details for a specific revision
    details = audit.get_revision_details(revision_id=1)

    # View history for a specific data object
    history = audit.get_data_object_history(data_object_id=1)

Next Steps

Platform Models

Auto-generated Pydantic models from the OpenAPI spec

Platform Client

Connect to the Kodexa API

Extraction

Extract structured data using taxonomies

Processing

Track processing steps and knowledge items
