Creating a Document
The first step in working with Kodexa is typically creating a new document. Let’s assume we have a document in PDF format:
Saving a Document
You can save a Kodexa Document to a file or a store. By default, documents are saved in the Kodexa format, which is a SQLite database. The .kddb extension is used for Kodexa Document Database files.
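Because the Kodexa format is a SQLite database, a saved file can in principle be inspected with any SQLite client. The following is a minimal sketch using only Python's standard sqlite3 module. It does not use the Kodexa API, and the table name placeholder_nodes is a hypothetical stand-in created purely for the demonstration; the real schema of a .kddb file is not assumed here.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_tables(path):
    """Return the names of all tables in the SQLite file at `path`."""
    with closing(sqlite3.connect(path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Build a stand-in SQLite file with a .kddb extension for the demonstration.
# The table below is hypothetical and is NOT the real Kodexa schema.
workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "example.kddb")
with closing(sqlite3.connect(demo_path)) as conn:
    conn.execute(
        "CREATE TABLE placeholder_nodes (id INTEGER PRIMARY KEY, content TEXT)"
    )
    conn.commit()

tables = list_tables(demo_path)
print(tables)
```

Pointing list_tables at a real .kddb file should list whatever tables that file actually contains.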
Loading a Document
To load a previously saved Kodexa Document:
Detached Documents
Sometimes you may want to make changes to a document without affecting the original file. For this, you can load the document in detached mode:
Anatomy of a Kodexa Document
The Kodexa Document Model provides a flexible and powerful way to represent structured and unstructured documents. At its core, it consists of a Document object that contains metadata and a hierarchical tree of ContentNodes, each of which can have features attached to it. Let’s explore the key components of the model.
Core Components
Document Structure
A Kodexa Document consists of:
- Document Metadata: Flexible dictionary-based metadata about the document
- Content Node Tree: Hierarchical structure of content nodes
- Source Metadata: Information about the document’s origin
- Labels: Document-level labels
Content Nodes
ContentNodes are the building blocks of the document structure. Each ContentNode represents a logical section of content and has the following key attributes:
- node_type: Identifies the type of node (e.g., ‘page’, ‘line’, ‘cell’)
- content: The actual content of the node
- features: List of attached features (metadata, tags, etc.)
- children: Child nodes in the hierarchy
- uuid: Unique identifier
- index: Position among siblings
Features
Features are flexible metadata containers attached to ContentNodes. They come in different types. Each feature has:
- feature_type: Category of the feature (e.g., ‘tag’, ‘spatial’)
- name: Identifier for the feature
- value: The feature’s data
- single: Boolean indicating if it’s a single value or collection
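To make the relationship between documents, content nodes, and features concrete, here is a minimal sketch of the structure described above using plain Python dataclasses. The classes ToyDocument, ToyNode, and ToyFeature are hypothetical illustrations whose field names mirror the attributes listed in this section; they are not the Kodexa classes, and the add_child helper is an assumption made for the example.

```python
import uuid as uuid_lib
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToyFeature:
    feature_type: str    # category, e.g. "tag" or "spatial"
    name: str            # identifier for the feature
    value: Any           # the feature's data
    single: bool = True  # single value (True) or collection (False)


@dataclass
class ToyNode:
    node_type: str
    content: Optional[str] = None
    index: int = 0
    features: List[ToyFeature] = field(default_factory=list)
    children: List["ToyNode"] = field(default_factory=list)
    uuid: str = field(default_factory=lambda: str(uuid_lib.uuid4()))

    def add_child(self, child: "ToyNode") -> "ToyNode":
        """Append a child and set its index to its position among siblings."""
        child.index = len(self.children)
        self.children.append(child)
        return child


@dataclass
class ToyDocument:
    metadata: Dict[str, Any] = field(default_factory=dict)
    source: Dict[str, Any] = field(default_factory=dict)
    labels: List[str] = field(default_factory=list)
    content_node: Optional[ToyNode] = None


# Build a one-page document with two lines and attach a tag feature to one line.
page = ToyNode(node_type="page")
first = page.add_child(ToyNode(node_type="line", content="Invoice 42"))
second = page.add_child(ToyNode(node_type="line", content="Total: 10.00"))
second.features.append(ToyFeature("tag", "total", {"confidence": 0.9}))

doc = ToyDocument(
    metadata={"title": "Example"}, labels=["invoice"], content_node=page
)
print([(c.index, c.node_type, c.content) for c in doc.content_node.children])
```

Keeping the annotation in a feature rather than editing the node's content leaves the original text untouched, which is the pattern the best practices in this guide recommend.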
Working with Documents
Creating Documents
Adding Content
Working with Features
Node Navigation and Selection
The document model provides powerful ways to navigate and select nodes:
- Direct Navigation:
- get_children(): Get immediate child nodes
- get_parent(): Get parent node
- next_node(): Get next sibling
- previous_node(): Get previous sibling
- Selector-based Navigation:
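The direct navigation methods listed above can be illustrated with a small stand-alone tree. This is a sketch, not the Kodexa implementation: TreeNode is a hypothetical class that borrows the method names from the list, and the real methods may behave differently (for example, around missing or virtual nodes).

```python
class TreeNode:
    """Hypothetical toy node exposing the four direct navigation methods."""

    def __init__(self, node_type, content=None):
        self.node_type = node_type
        self.content = content
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def get_children(self):
        # Return a copy so callers cannot mutate the tree by accident.
        return list(self.children)

    def get_parent(self):
        return self.parent

    def _sibling(self, offset):
        # The root has no parent and therefore no siblings.
        if self.parent is None:
            return None
        siblings = self.parent.children
        position = siblings.index(self) + offset
        if 0 <= position < len(siblings):
            return siblings[position]
        return None

    def next_node(self):
        return self._sibling(1)

    def previous_node(self):
        return self._sibling(-1)


page = TreeNode("page")
line_a = page.add_child(TreeNode("line", "first line"))
line_b = page.add_child(TreeNode("line", "second line"))

print(line_a.next_node().content)
print(line_b.get_parent().node_type)
```

In this toy version, asking for a sibling past either end of the list returns None rather than raising an error.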
Best Practices
- Node Types: Use consistent node types throughout your document to make selection and processing easier
- Features:
- Use features to add metadata rather than modifying node content
- Keep feature names consistent across your application
- Use appropriate feature types for different kinds of metadata
- Content Structure:
- Maintain a logical hierarchy that reflects the document’s structure
- Use indexes appropriately to maintain node order
- Consider using virtual nodes for sparse content
- Performance:
- Use selectors efficiently
- Batch operations when possible
- Consider using KDDB format for large documents
Error Handling
The document model includes robust error handling through the ContentException class:
Metadata
This is a dictionary containing metadata about the document, such as the source, title, author, etc.:
SourceMetadata
This contains metadata about the source document and works with connectors to allow you to access the original source document:
Working with Document Content
Kodexa uses a powerful selector syntax to find and manipulate content within documents.
Basic Selector Example
To find all content nodes with the value “Name”:
Selector Syntax
The selector syntax is composed of several parts:
- Axis & Node Type: Defines how to navigate the tree structure.
- Predicate: Further filters the selected nodes based on conditions.
Axis Examples
- //: Current node and all children
- /: Root node
- .: Current node (or root if from the document)
- ./line/.: All nodes of type line under the current node
- parent::line: Any node in the parent structure of this node that is of node type line
Predicate Functions
Predicates can use various functions, such as:
- contentRegex: Matches content against a regular expression
- typeRegex: Matches node type name against a regular expression
- hasTag: Checks if a node has a specific tag
- hasFeature: Checks if a node has a specific feature
- content: Returns the content of the node
- uuid: Returns the UUID of the node
Operators
Operators can be used to combine functions:
- |: Union the results of two sides
- =: Test that two sides are equal
- and: Boolean AND operation
- or: Boolean OR operation
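To show how an axis, a node type, predicates, and operators fit together, the sketch below mimics a few of them in plain Python. It is not the Kodexa selector engine and does not parse selector strings. Node, select, content_regex, has_tag, both, either, and union are hypothetical helpers written for this example, and the toy content_regex uses re.search, which may not match how the real contentRegex anchors its pattern.

```python
import re


class Node:
    """Hypothetical toy node with a type, content, tags, and children."""

    def __init__(self, node_type, content=None, tags=(), children=()):
        self.node_type = node_type
        self.content = content
        self.tags = set(tags)
        self.children = list(children)


def walk(node):
    """Yield the node and all of its descendants (the idea behind `//`)."""
    yield node
    for child in node.children:
        yield from walk(child)


def content_regex(pattern):
    compiled = re.compile(pattern)
    return lambda n: n.content is not None and compiled.search(n.content) is not None


def has_tag(name):
    return lambda n: name in n.tags


def both(left, right):    # stands in for the `and` operator
    return lambda n: left(n) and right(n)


def either(left, right):  # stands in for the `or` operator
    return lambda n: left(n) or right(n)


def select(root, node_type="*", predicate=None):
    """Roughly `//node_type[predicate]`, evaluated from `root`."""
    return [
        n for n in walk(root)
        if (node_type == "*" or n.node_type == node_type)
        and (predicate is None or predicate(n))
    ]


def union(left, right):   # stands in for the `|` operator
    seen = {id(n) for n in left}
    return list(left) + [n for n in right if id(n) not in seen]


page = Node("page", children=[
    Node("line", "Name: Alice", tags=["person"]),
    Node("line", "Age: 30"),
    Node("line", "Name: Bob"),
])

names = select(page, "line", content_regex(r"^Name"))
tagged_names = select(page, "line", both(content_regex(r"^Name"), has_tag("person")))
combined = union(select(page, "line", has_tag("person")),
                 select(page, "line", content_regex(r"Age")))
print([n.content for n in names])
```

Representing each predicate as a function of one node keeps the `and`, `or`, and `|` combinators trivial to compose, which is the main point of the sketch.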