Working with a Document

Kodexa is a document processing platform that lets developers work with documents in a structured way. This guide walks you through creating, saving, loading, and querying Kodexa Documents.

Creating a Document

The first step in working with Kodexa is typically creating a new document. Let’s assume we have a document in PDF format:

from kodexa import Document

my_document = Document.from_file('example.pdf')

This step creates an empty document with a reference to the PDF file. At this point, the document hasn’t been parsed, but metadata has been added to allow Kodexa to understand where to find the original document.

Saving a Document

You can save a Kodexa Document to a file or a store. By default, documents are saved in the Kodexa format, which is a SQLite database:

my_document.to_kddb('my-document.kddb')
my_document.close()

By convention, we use the .kddb extension for Kodexa Document Database files.
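
Because the Kodexa format is a SQLite database, a general check for SQLite files should also apply to .kddb files: every SQLite 3 database begins with the 16-byte header string "SQLite format 3" followed by a NUL byte. The sketch below uses only the Python standard library. The helper name looks_like_sqlite is hypothetical and not part of the Kodexa API, and the example builds a throwaway database as a stand-in rather than assuming anything about the tables Kodexa writes.

```python
import os
import sqlite3
import tempfile

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite 3 file


def looks_like_sqlite(path):
    """Return True if the file at `path` starts with the SQLite 3 header."""
    with open(path, "rb") as handle:
        return handle.read(16) == SQLITE_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a real .kddb file: an ordinary SQLite database.
    db_path = os.path.join(tmp, "example.kddb")
    connection = sqlite3.connect(db_path)
    connection.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY)")
    connection.commit()
    connection.close()
    is_sqlite = looks_like_sqlite(db_path)

    # A plain text file does not carry the header.
    text_path = os.path.join(tmp, "notes.txt")
    with open(text_path, "w") as handle:
        handle.write("plain text, not a database")
    is_text_sqlite = looks_like_sqlite(text_path)

print(is_sqlite, is_text_sqlite)
```

A check like this can be useful before handing an unknown file to Document.from_kddb, since it fails fast on files that were saved in some other format.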

Loading a Document

To load a previously saved Kodexa Document:

another_document = Document.from_kddb('my-document.kddb')
another_document.close()

Detached Documents

Sometimes you may want to make changes to a document without affecting the original file. For this, you can load the document in detached mode:

detached_document = Document.from_kddb('my-document.kddb', detached=True)

Anatomy of a Kodexa Document

The Kodexa Document Model represents both structured and unstructured documents. At its core is a Document object that contains metadata and a hierarchical tree of ContentNodes, each of which can have features attached to it. Let’s explore the key components of the model.

Core Components

Document Structure

A Kodexa Document consists of:

Document Metadata: Flexible dictionary-based metadata about the document
Content Node Tree: Hierarchical structure of content nodes
Source Metadata: Information about the document’s origin
Labels: Document-level labels

Content Nodes

ContentNodes are the building blocks of the document structure. Each ContentNode represents a logical section of content and has the following key attributes:

node_type: Identifies the type of node (e.g., ‘page’, ‘line’, ‘cell’)
content: The actual content of the node
features: List of attached features (metadata, tags, etc.)
children: Child nodes in the hierarchy
uuid: Unique identifier
index: Position among siblings
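
To make the attribute list concrete, here is a minimal toy model of a node tree. ToyNode is a hypothetical stand-in written for this guide, not Kodexa's ContentNode class; it only mirrors the attribute names listed above and assumes that a child's index is its zero-based position among its siblings.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from uuid import uuid4


@dataclass
class ToyNode:
    """Illustrative stand-in for a content node (not Kodexa's ContentNode)."""
    node_type: str
    content: Optional[str] = None
    features: list = field(default_factory=list)
    children: List["ToyNode"] = field(default_factory=list)
    uuid: str = field(default_factory=lambda: str(uuid4()))
    index: int = 0

    def add_child(self, child):
        # Assumption: index is the zero-based position among siblings.
        child.index = len(self.children)
        self.children.append(child)
        return child


def walk(node, depth=0):
    """Yield (depth, node) pairs in document order."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


root = ToyNode("root")
page = root.add_child(ToyNode("page"))
page.add_child(ToyNode("line", content="Invoice 42"))
page.add_child(ToyNode("line", content="Total: 10.00"))

outline = [(depth, node.node_type, node.index) for depth, node in walk(root)]
print(outline)
```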

Features

Features are flexible metadata containers attached to ContentNodes. They come in different types, such as tag and spatial features. Each feature has:

feature_type: Category of the feature (e.g., ‘tag’, ‘spatial’)
name: Identifier for the feature
value: The feature’s data
single: Boolean indicating if it’s a single value or collection
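
One way to picture the single flag is a feature that always stores its data as a list and unwraps it when single is true. The sketch below is an assumption made for illustration: ToyFeature, its get_value method, the list-based storage, and the feature name "bbox" are all hypothetical and may not match how Kodexa stores features internally.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ToyFeature:
    """Illustrative only; the storage layout is an assumption."""
    feature_type: str
    name: str
    value: List[Any]      # assumption: values are always held in a list
    single: bool = True

    def get_value(self):
        # A single-valued feature unwraps its only element;
        # a collection feature returns the whole list.
        return self.value[0] if self.single else self.value


tag = ToyFeature("tag", "paragraph", ["body"])
bbox = ToyFeature("spatial", "bbox", [[10, 20, 100, 200]])
aliases = ToyFeature("tag", "alias", ["header", "title"], single=False)

# Features are naturally keyed by (feature_type, name).
by_key = {(f.feature_type, f.name): f for f in (tag, bbox, aliases)}
print(by_key[("tag", "paragraph")].get_value())
```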

Working with Documents

Creating Documents

# Create a new document
doc = Document()

# Create from text
doc = Document.from_text("Some content")

# Create from file
doc = Document.from_file("path/to/file")

Adding Content

# Create a root node
root = doc.create_node(node_type="root")
doc.content_node = root

# Add child nodes
page = doc.create_node(node_type="page", content="Page content")
root.add_child(page)

Working with Features

# Add a tag feature
node.add_feature("tag", "paragraph", "body")

# Add spatial information
node.set_bbox([10, 20, 100, 200])

# Get feature value
value = node.get_feature_value("tag", "paragraph")

Node Navigation and Selection

The document model provides several ways to navigate and select nodes:

Direct Navigation:
- get_children(): Get immediate child nodes
- get_parent(): Get parent node
- next_node(): Get next sibling
- previous_node(): Get previous sibling
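
The four navigation methods can be sketched on a toy tree that keeps a parent pointer and a sibling index on every node. This ToyNode class is hypothetical and only reuses the method names listed above; the real methods may accept additional arguments or behave differently at the edges of the tree.

```python
class ToyNode:
    """Hypothetical node with just enough state to show navigation."""

    def __init__(self, node_type, content=None):
        self.node_type = node_type
        self.content = content
        self.parent = None
        self.children = []
        self.index = 0

    def add_child(self, child):
        child.parent = self
        child.index = len(self.children)
        self.children.append(child)
        return child

    def get_children(self):
        return list(self.children)

    def get_parent(self):
        return self.parent

    def next_node(self):
        # Next sibling, or None for the last child or the root.
        if self.parent is None:
            return None
        siblings = self.parent.children
        position = self.index + 1
        return siblings[position] if position < len(siblings) else None

    def previous_node(self):
        # Previous sibling, or None for the first child or the root.
        if self.parent is None or self.index == 0:
            return None
        return self.parent.children[self.index - 1]


page = ToyNode("page")
lines = [page.add_child(ToyNode("line", text)) for text in ("first", "second", "third")]
middle = lines[1]
print(middle.previous_node().content, middle.next_node().content)
```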

Selector-based Navigation:

# Select all nodes of type 'page'
pages = document.select("//page")

# Select nodes with specific tags
tagged = document.select("//*[hasTag('paragraph')]")

Best Practices

Node Types: Use consistent node types throughout your document to make selection and processing easier
Features:
- Use features to add metadata rather than modifying node content
- Keep feature names consistent across your application
- Use appropriate feature types for different kinds of metadata
Content Structure:
- Maintain a logical hierarchy that reflects the document’s structure
- Use indexes appropriately to maintain node order
- Consider using virtual nodes for sparse content
Performance:
- Use selectors efficiently
- Batch operations when possible
- Consider using KDDB format for large documents

Error Handling

The document model lets you record processing errors against a document through the ContentException class:

# Add an exception to the document
doc.add_exception(ContentException(
    exception_type="validation",
    message="Invalid content structure",
    severity="ERROR"
))

Metadata

This is a dictionary containing metadata about the document, such as the source, title, and author:

print(my_document.metadata)

SourceMetadata

This contains metadata about the source document and works with connectors to allow you to access the original source document:

print(my_document.source)

Working with Document Content

Kodexa uses a powerful selector syntax to find and manipulate content within documents.

Selectors work similarly to CSS selectors or XPath, allowing you to build queries that can be executed on a document instance.

Basic Selector Example

To find all content nodes whose content matches the regular expression “Name”:

nodes = document.select('//*[contentRegex("Name")]')

This returns an iterator of the matching content nodes.

Selector Syntax

The selector syntax is composed of several parts:

Axis & Node Type: Defines how to navigate the tree structure.
Predicate: Further filters the selected nodes based on conditions.

Axis Examples

//: Current node and all children
/: Root node
.: Current node (or root if from the document)
./line/.: All nodes of type line under the current node
parent::line: Any node in the parent structure of this node that is of node type line
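
Since selectors are described as working similarly to XPath, the difference between the // axis and a path such as ./line can be previewed with the XPath subset in Python's standard library. This is only an analogy on plain XML: it does not run the Kodexa selector engine, and Kodexa's semantics may differ in the details.

```python
import xml.etree.ElementTree as ET

xml_text = """
<page>
  <line>top-level line</line>
  <table>
    <line>nested line</line>
  </table>
</page>
"""
page = ET.fromstring(xml_text)

# './/line' searches every level beneath the current element.
all_lines = [element.text for element in page.findall(".//line")]

# './line' only looks at direct children of the current element.
child_lines = [element.text for element in page.findall("./line")]

print(all_lines)
print(child_lines)
```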

Predicate Functions

Predicates can use various functions, such as:

contentRegex: Matches content against a regular expression
typeRegex: Matches node type name against a regular expression
hasTag: Checks if a node has a specific tag
hasFeature: Checks if a node has a specific feature
content: Returns the content of the node
uuid: Returns the UUID of the node

Operators

Operators can be used to combine functions:

|: Union the results of two sides
=: Test that two sides are equal
and: Boolean AND operation
or: Boolean OR operation
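
The predicate functions and operators can be modelled as small Python functions over a flat list of node records. Everything here is a toy: the record layout, the helper names, and the decision to de-duplicate the union in first-seen order are assumptions made for illustration, and whether contentRegex anchors its pattern is not specified in this guide (the sample data gives the same answer either way).

```python
import re

# Hypothetical node records standing in for content nodes.
nodes = [
    {"uuid": "a1", "type": "line", "content": "Name: Ada Lovelace", "tags": {"PERSON"}},
    {"uuid": "b2", "type": "line", "content": "Company: Analytical Engines", "tags": {"ORG"}},
    {"uuid": "c3", "type": "cell", "content": "Name", "tags": set()},
]


def content_regex(pattern):
    compiled = re.compile(pattern)
    return lambda node: compiled.search(node["content"]) is not None


def has_tag(tag):
    return lambda node: tag in node["tags"]


def select(predicate):
    return [node["uuid"] for node in nodes if predicate(node)]


def union(left, right):
    # '|': everything from either side, duplicates dropped, first-seen order.
    return list(dict.fromkeys(left + right))


is_name = content_regex("Name")
named = select(is_name)
named_people = select(lambda n: is_name(n) and has_tag("PERSON")(n))   # and
named_or_org = select(lambda n: is_name(n) or has_tag("ORG")(n))       # or
people_or_orgs = union(select(has_tag("PERSON")), select(has_tag("ORG")))  # |
cells = select(lambda n: n["type"] == "cell")                          # =
print(named, named_people, named_or_org, people_or_orgs, cells)
```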

Pipeline Selectors

Kodexa also supports “pipeline” selectors, allowing you to chain multiple selectors:

document.select('//word stream //*[hasTag("ORG")] stream * [hasTag("PERSON")]')

This example streams all nodes of type word, then filters those with the “ORG” tag, and finally filters those with the “PERSON” tag.
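
The filtering behaviour described above resembles a chain of Python generators, where each stage lazily consumes the output of the previous one. The sketch below follows that description on a flat list of hypothetical node records; the helper names are invented for illustration, and it does not model how the real selector engine evaluates each stage.

```python
def of_type(node_type):
    def stage(stream):
        return (node for node in stream if node["type"] == node_type)
    return stage


def with_tag(tag):
    def stage(stream):
        return (node for node in stream if tag in node["tags"])
    return stage


def pipeline(nodes, *stages):
    """Feed nodes through each stage in order, lazily."""
    stream = iter(nodes)
    for stage in stages:
        stream = stage(stream)
    return stream


records = [
    {"type": "word", "content": "Acme", "tags": {"ORG"}},
    {"type": "word", "content": "Jordan", "tags": {"ORG", "PERSON"}},
    {"type": "word", "content": "Paris", "tags": {"LOC"}},
    {"type": "line", "content": "Jordan at Acme", "tags": {"ORG", "PERSON"}},
]

matches = pipeline(records, of_type("word"), with_tag("ORG"), with_tag("PERSON"))
result = [node["content"] for node in matches]
print(result)
```

Only nodes that pass every stage survive, so the order of the stages affects how much work later stages do but not which nodes match in this simple filtering model.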

Conclusion

Kodexa Documents provide a powerful way to work with structured content. By understanding how to create, save, load, and query documents using selectors, you can efficiently process and analyze complex document structures in your applications.
