Kodexa is a document processing platform that lets developers work with documents in a structured way. This guide walks through the basics of creating, saving, loading, and working with Kodexa Documents.

Creating a Document

You can create documents in several ways using the Kodexa Document SDK:
from kodexa_document import Document

# Create an empty document
doc = Document()

# Create from text
doc = Document.from_text("Some content")

# Create from JSON (json_string is a str you already hold containing a document's JSON)
doc = Document.from_json(json_string)

Saving a Document

You can save a Kodexa Document to a file or a store. Documents are saved in the KDDB format (a SQLite database):
doc.to_kddb('my-document.kddb')
doc.close()
By convention, we use the .kddb extension for Kodexa Document Database files.
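
Because a KDDB file is a SQLite database, it can be opened with generic SQLite tooling for a quick look inside. The sketch below uses only Python's standard sqlite3 module to list the tables in such a file. The helper name list_tables is hypothetical, and nothing is assumed about which tables Kodexa actually creates.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(path):
    """Return the sorted names of all tables in the SQLite file at `path`."""
    # Build a file: URI and open it read-only so inspection cannot modify the file.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Calling list_tables('my-document.kddb') after the document has been saved and closed should return whatever table names the file contains.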

Loading a Document

To load a previously saved Kodexa Document:
another_document = Document.from_kddb('my-document.kddb')
another_document.close()

Detached Documents

Sometimes you may want to make changes to a document without affecting the original file. In Python, you can load the document in detached mode:
detached_document = Document.from_kddb('my-document.kddb', detached=True)

Anatomy of a Kodexa Document

The Kodexa Document Model provides a flexible way to represent structured and unstructured documents. At its core is a Document object that contains metadata and a hierarchical tree of ContentNodes, each of which can have features and tags attached. Let’s explore the key components of the model.

Core Components

Document Structure

A Kodexa Document consists of:
  1. Document Metadata: Flexible dictionary-based metadata about the document
  2. Content Node Tree: Hierarchical structure of content nodes
  3. Source Metadata: Information about the document’s origin (filename, MIME type, checksum)
  4. Native Documents: Embedded binary files (original PDFs, images, etc.)
  5. Data Objects & Attributes: Structured extracted data organized by taxonomy
  6. Audit Trail: Change history and revision tracking

Content Nodes

ContentNodes are the building blocks of the document structure. Each ContentNode represents a logical section of content and has the following key attributes:
  • node_type: Identifies the type of node (e.g., ‘page’, ‘line’, ‘word’, ‘cell’)
  • content: The actual text content of the node
  • features: List of attached features (metadata)
  • tags: Annotations linking content to extracted data
  • children: Child nodes in the hierarchy
  • id: Unique numeric identifier
  • index: Position among siblings
  • virtual: Whether the node is a virtual/synthesized node
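
To make the hierarchy concrete, here is a minimal sketch of a node tree in plain Python. SketchNode and outline are illustrative stand-ins invented for this example, not the SDK's ContentNode, and only a few of the attributes listed above (node_type, content, children, index) are modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchNode:
    """Illustrative stand-in for a content node; not the SDK's ContentNode."""
    node_type: str
    content: Optional[str] = None
    children: List["SketchNode"] = field(default_factory=list)
    index: int = 0  # position among siblings

    def add_child(self, child):
        # The child's index is its position in the parent's child list.
        child.index = len(self.children)
        self.children.append(child)
        return child


def outline(node, depth=0):
    """Return one indented line per node, walking the tree depth-first."""
    label = node.node_type if node.content is None else f"{node.node_type}: {node.content}"
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


root = SketchNode("document")
page = root.add_child(SketchNode("page"))
page.add_child(SketchNode("line", "A line of text"))
page.add_child(SketchNode("line", "Another line"))
print("\n".join(outline(root)))
```

The printed outline shows the document node at the top, the page indented beneath it, and the two lines indented beneath the page.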

Features

Features are flexible metadata containers attached to ContentNodes. Each feature has:
  • feature_type: Category of the feature (e.g., ‘tag’, ‘spatial’)
  • name: Identifier for the feature
  • value: The feature’s data (always stored as an array)
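
One way to picture this is a small store keyed by feature type and name, with every value normalised to a list. The class below is a hypothetical illustration of that idea in plain Python, not the SDK's implementation; the "spatial"/"bbox" pair is used purely as sample data.

```python
class SketchFeatures:
    """Illustrative feature store keyed by (feature_type, name); not the SDK's class."""

    def __init__(self):
        self._features = {}

    def add(self, feature_type, name, value):
        # Normalise to a list so stored values always have the same shape.
        stored = list(value) if isinstance(value, (list, tuple)) else [value]
        self._features[(feature_type, name)] = stored

    def get(self, feature_type, name):
        # Returns None when no such feature has been added.
        return self._features.get((feature_type, name))

    def of_type(self, feature_type):
        # Map of name -> value for every feature of the given type.
        return {n: v for (t, n), v in self._features.items() if t == feature_type}


features = SketchFeatures()
features.add("tag", "paragraph", "body")
features.add("spatial", "bbox", [10, 20, 100, 200])
print(features.get("tag", "paragraph"))  # ['body']
```

Wrapping scalar values on the way in means readers of a feature never have to check whether they received a single value or a list.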

Working with Documents

Creating Document Structure

from kodexa_document import Document

doc = Document()

# Create a root node
root = doc.create_node(node_type="document")
doc.content_node = root

# Add child nodes
page = doc.create_node(node_type="page", content="Page content")
root.add_child(page)

line = doc.create_node(node_type="line", content="A line of text")
page.add_child(line)

Working with Features

# `node` is any ContentNode, for example the `line` node created above
# Add a feature
node.add_feature("tag", "paragraph", "body")

# Add spatial information (bounding box)
node.set_bbox([10, 20, 100, 200])

# Get feature value
value = node.get_feature_value("tag", "paragraph")

# Check if feature exists
has_it = node.has_feature("tag", "paragraph")

# Get all features
features = node.get_features()

# Get features by type
tag_features = node.get_features_of_type("tag")

# Remove a feature
node.remove_feature("tag", "paragraph")

Working with Tags

Tags are annotations on nodes that link content to extracted data:
# Apply a tag
node.tag("company_name", confidence=0.95, value="Acme Corp")

# Check if node has a tag
if node.has_tag("company_name"):
    tags = node.get_tags()
    for tag in tags:
        print(f"Tag: {tag.value} (confidence: {tag.confidence})")

# Remove a tag
node.remove_tag("company_name")

Node Navigation and Selection

The document model provides powerful ways to navigate and select nodes:
  1. Direct Navigation (each method is listed as snake_case / camelCase):
    • get_children() / getChildren(): Get immediate child nodes
    • get_parent() / getParent(): Get parent node
    • next_node() / nextNode(): Get next sibling
    • previous_node() / previousNode(): Get previous sibling
    • get_child(index) / getChild(index): Get child by index
  2. Selector-based Navigation:
# Select all nodes of type 'page'
pages = document.select("//page")

# Select nodes with specific tags
tagged = document.select("//*[hasTag('paragraph')]")

# Select first match only
first_page = document.select_first("//page")

Data Objects and Attributes

Documents can contain structured extracted data organized by taxonomy:
# Get all data objects
objects = doc.data_objects.get_all()

# Get attributes for an object
attrs = doc.data_attributes.get_for_data_object(obj_id)

Document Metadata

# Access metadata
doc.metadata["schema_version"] = "1.0"
print(doc.metadata)

# Source metadata
print(doc.source)

Working with Document Content

Kodexa uses a powerful selector syntax to find and manipulate content within documents. Selectors work similarly to CSS selectors or XPath, allowing you to build queries that can be executed on a document instance.

Basic Selector Example

To find all content nodes matching a regex:
nodes = document.select('//*[contentRegex("Name")]')
This returns a list of the matching content nodes.

Selector Syntax

The selector syntax is composed of several parts:
  1. Axis & Node Type: Defines how to navigate the tree structure.
  2. Predicate: Further filters the selected nodes based on conditions.

Axis Examples

  • //: Current node and all of its descendants
  • /: Root node
  • .: Current Node (or root if from the document)
  • ./line/.: All nodes of type line under the current node
  • parent::line: Any node in the parent structure of this node that is of node type line

Predicate Functions

Predicates can use various functions, such as:
  • contentRegex: Matches content against a regular expression
  • typeRegex: Matches node type name against a regular expression
  • hasTag: Checks if a node has a specific tag
  • hasFeature: Checks if a node has a specific feature
  • content: Returns the content of the node
  • uuid: Returns the UUID of the node
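
As a rough mental model of how an axis, a node type, and a predicate combine, the sketch below evaluates something like //line or //*[contentRegex("Name")] over a tree of plain dictionaries. It is a toy written for this guide, not the Kodexa selector engine. The dictionary layout and the function names are hypothetical, and it assumes the regular expression may match anywhere in the content.

```python
import re


def descendants(node):
    """Yield `node` and every node beneath it, depth-first."""
    yield node
    for child in node.get("children", []):
        yield from descendants(child)


def select(root, node_type="*", content_regex=None):
    """Toy counterpart of //node_type[contentRegex(...)] over dict-based nodes."""
    pattern = re.compile(content_regex) if content_regex else None
    matches = []
    for node in descendants(root):
        # Node-type test: "*" accepts any type.
        if node_type != "*" and node["type"] != node_type:
            continue
        # Predicate test (assumption: a match anywhere in the content counts).
        if pattern and not pattern.search(node.get("content") or ""):
            continue
        matches.append(node)
    return matches


tree = {
    "type": "page",
    "children": [
        {"type": "line", "content": "Name: Ada Lovelace"},
        {"type": "line", "content": "Occupation: Analyst"},
    ],
}

print(len(select(tree, "line")))  # 2
print([n["content"] for n in select(tree, content_regex="Name")])  # ['Name: Ada Lovelace']
```

The node-type test narrows the candidates first, and the predicate then filters what remains, which mirrors the two-part structure of the selector syntax described above.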

Operators

Operators can be used to combine functions:
  • |: Union the results of two sides
  • =: Test that two sides are equal
  • and: Boolean AND operation
  • or: Boolean OR operation

Pipeline Selectors

Kodexa also supports “pipeline” selectors, allowing you to chain multiple selectors:
document.select('//word stream //*[hasTag("ORG")] stream * [hasTag("PERSON")]')
This example streams all nodes of type word, then keeps only those with the “ORG” tag, and finally keeps only those that also have the “PERSON” tag.
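
The staged filtering described here can be pictured as a chain of Python generators, each consuming the output of the previous stage. This is only an analogy under that reading of the text: the dictionary nodes and the helper names of_type and with_tag are hypothetical, and the real selector engine may evaluate pipelines differently.

```python
def of_type(nodes, node_type):
    """Lazily pass through only nodes of the given type."""
    return (n for n in nodes if n["type"] == node_type)


def with_tag(nodes, tag):
    """Lazily pass through only nodes carrying the given tag."""
    return (n for n in nodes if tag in n["tags"])


nodes = [
    {"type": "word", "content": "Acme", "tags": {"ORG"}},
    {"type": "word", "content": "Jordan", "tags": {"ORG", "PERSON"}},
    {"type": "line", "content": "Acme Jordan", "tags": {"ORG", "PERSON"}},
    {"type": "word", "content": "the", "tags": set()},
]

# Stage 1: words only. Stage 2: tagged ORG. Stage 3: also tagged PERSON.
pipeline = with_tag(with_tag(of_type(nodes, "word"), "ORG"), "PERSON")
result = [n["content"] for n in pipeline]
print(result)  # ['Jordan']
```

Each stage sees only what the previous stage let through, so the line node is dropped at the first stage even though it carries both tags.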

Best Practices

  1. Node Types: Use consistent node types throughout your document to make selection and processing easier
  2. Features:
    • Use features to add metadata rather than modifying node content
    • Keep feature names consistent across your application
    • Use appropriate feature types for different kinds of metadata
  3. Content Structure:
    • Maintain a logical hierarchy that reflects the document’s structure
    • Use indexes appropriately to maintain node order
    • Consider using virtual nodes for sparse content
  4. Performance:
    • Use selectors efficiently
    • Batch operations when possible
    • Use KDDB format for large documents
  5. Resource Management:
    • Always close documents when done (Python: doc.close(), TypeScript: await doc.dispose())
    • Use context managers in Python: with Document() as doc:
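
The resource management advice comes down to guaranteeing that close() runs even when processing fails. The sketch below shows that guarantee with a stand-in class and the standard library's contextlib.closing. SketchResource is hypothetical and merely plays the role of any object that exposes close().

```python
from contextlib import closing


class SketchResource:
    """Stand-in for anything that exposes close(); not a Kodexa class."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


resource = SketchResource()
try:
    # closing() calls resource.close() on exit, whether or not the body raises.
    with closing(resource):
        raise ValueError("simulated failure while processing")
except ValueError:
    pass

print(resource.closed)  # True
```

The same pattern applies to any object with a close() method, so cleanup does not depend on remembering a finally block.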

Error Handling

The document model includes error handling through exceptions:
from kodexa_document import Document, ContentException

doc = None
try:
    doc = Document.from_kddb("my-document.kddb")
    # ... work with document
except Exception as e:
    print(f"Error: {e}")
finally:
    # Close only if loading succeeded; otherwise doc is still None
    if doc is not None:
        doc.close()

Conclusion

Kodexa Documents provide a powerful way to work with structured content. By understanding how to create, save, load, and query documents using selectors, you can efficiently process and analyze complex document structures in your applications.