Working with a Document

Kodexa is a document processing platform that lets developers work with documents as structured content. This guide walks through the basics of creating, saving, loading, and querying Kodexa Documents.

Creating a Document

The first step in working with Kodexa is typically creating a new document. Let's assume we have a document in PDF format:

from kodexa import Document

my_document = Document.from_file('example.pdf')

This step creates an empty document with a reference to the PDF file. At this point the document has not been parsed; it only carries the metadata Kodexa needs to locate the original file.

Saving a Document

You can save a Kodexa Document to a file or a store. By default, documents are saved in the Kodexa format, which is a SQLite database:

my_document.to_kddb('my-document.kddb')
my_document.close()

By convention, we use the .kddb extension for Kodexa Document Database files.
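Because a .kddb file is a SQLite database, general-purpose SQLite tooling can open it. The sketch below uses Python's standard sqlite3 module to list whatever tables a file contains. The helper name list_tables is illustrative only, and no particular Kodexa table layout is assumed.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import List


def list_tables(path: str) -> List[str]:
    """Return the names of all tables in the SQLite file at `path`, sorted."""
    # Build a file: URI and open it read-only so that inspecting
    # the file cannot modify it.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Calling list_tables('my-document.kddb') on a saved document would show which tables that file contains, without going through the Kodexa API.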

Loading a Document

To load a previously saved Kodexa Document:

another_document = Document.from_kddb('my-document.kddb')
another_document.close()

Detached Documents

Sometimes you may want to make changes to a document without affecting the original file. For this, you can load the document in detached mode:

detached_document = Document.from_kddb('my-document.kddb', detached=True)
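One way to picture detached mode is as working on a private copy of the file. The sketch below shows that idea with the standard library only. It is an illustration under that assumption, not Kodexa's implementation, and detached_copy is a hypothetical helper.

```python
import shutil
import tempfile
from pathlib import Path


def detached_copy(source: str) -> Path:
    """Copy `source` to a fresh temporary file and return the copy's path."""
    # delete=False keeps the temporary file on disk after the handle closes.
    with tempfile.NamedTemporaryFile(
        suffix=Path(source).suffix, delete=False
    ) as tmp:
        copy_path = Path(tmp.name)
    # Overwrite the empty temporary file with the contents of the original.
    shutil.copyfile(source, copy_path)
    return copy_path
```

Any change written to the returned path leaves the original file untouched, which is the property the detached flag is described as providing.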

Anatomy of a Kodexa Document

A Kodexa Document manages several key pieces of information:

Metadata

This is a dictionary of metadata about the document, such as its source, title, and author:

print(my_document.metadata)

SourceMetadata

This contains metadata about the source document and works with connectors to allow you to access the original source document:

print(my_document.source)

Working with Document Content

Kodexa uses a powerful selector syntax to find and manipulate content within documents.

Selectors work similarly to CSS selectors or XPath, allowing you to build queries that can be executed on a document instance.

Basic Selector Example

To find all content nodes whose content matches the regular expression "Name":

nodes = my_document.select('//*[contentRegex("Name")]')

This returns an iterator of the matching content nodes.
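To make the idea of a regex predicate over a tree concrete, here is a toy model in plain Python. Node, walk, and select_by_content are hypothetical names, not Kodexa classes, and the sketch assumes re.search semantics (a match anywhere in the content). Whether contentRegex anchors its match is not stated here.

```python
import re
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Toy stand-in for a content node: a type, optional text, children."""
    node_type: str
    content: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield `node` and then every descendant, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def select_by_content(root: Node, pattern: str) -> List[Node]:
    """Return nodes whose content matches `pattern` anywhere (assumption)."""
    regex = re.compile(pattern)
    return [
        n for n in walk(root)
        if n.content is not None and regex.search(n.content)
    ]


page = Node("page", children=[
    Node("line", children=[Node("word", "Name"), Node("word", "Alice")]),
    Node("line", children=[Node("word", "FirstName")]),
])
matches = select_by_content(page, "Name")
```

In this toy tree, matches holds the two word nodes whose text contains "Name"; the page and line nodes have no content of their own and are skipped.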

Selector Syntax

The selector syntax is composed of two parts:

  1. Axis & Node Type: Defines how to navigate the tree structure.
  2. Predicate: Further filters the selected nodes based on conditions.

Axis Examples

  • //: Current node and all of its descendants
  • /: Root node
  • .: Current Node (or root if from the document)
  • ./line/.: All nodes of type line under the current node
  • parent::line: Any node in the parent structure of this node that is of node type line
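Since selectors are described as similar to XPath, the difference between a child step and a descendant step can be tried out with the XPath subset in Python's standard xml.etree.ElementTree module. This is an analogy only: the XML below is made up, and ElementTree's path rules are not Kodexa's selector grammar.

```python
import xml.etree.ElementTree as ET

# Made-up tree: a page with one direct line and one line nested in a block.
root = ET.fromstring(
    "<page>"
    "<line><word>Name</word></line>"
    "<block><line><word>Alice</word></line></block>"
    "</page>"
)

direct_lines = root.findall("./line")   # child step: direct children only
all_lines = root.findall(".//line")     # descendant step: any depth

# Paths are evaluated relative to the element they are called on.
nested_line = all_lines[1]
words_in_nested = nested_line.findall("./word")

print(len(direct_lines), len(all_lines), words_in_nested[0].text)
# prints: 1 2 Alice
```

The same pattern of "start here, then step down by node type" is what the axis examples above express for a Kodexa content tree.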

Predicate Functions

Predicates can use various functions, such as:

  • contentRegex: Matches content against a regular expression
  • typeRegex: Matches node type name against a regular expression
  • hasTag: Checks if a node has a specific tag
  • hasFeature: Checks if a node has a specific feature
  • content: Returns the content of the node
  • uuid: Returns the UUID of the node

Operators

Operators can be used to combine functions:

  • |: Union the results of two sides
  • =: Test that two sides are equal
  • and: Boolean AND operation
  • or: Boolean OR operation
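The way predicate functions and operators compose can be modeled with ordinary Python callables. Everything in this sketch (TaggedNode, has_tag, type_regex, both, either, union) is a hypothetical stand-in written for illustration. It mirrors the listed names but is not Kodexa's implementation.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


@dataclass
class TaggedNode:
    """Toy node with a type name and a set of tags (not a Kodexa class)."""
    node_type: str
    tags: Set[str] = field(default_factory=set)


Predicate = Callable[[TaggedNode], bool]


def has_tag(tag: str) -> Predicate:
    """Predicate that is true when the node carries `tag`."""
    return lambda node: tag in node.tags


def type_regex(pattern: str) -> Predicate:
    """Predicate that is true when the node type matches `pattern`."""
    regex = re.compile(pattern)
    return lambda node: regex.search(node.node_type) is not None


def both(left: Predicate, right: Predicate) -> Predicate:
    """Model of `and`: true only when both predicates hold."""
    return lambda node: left(node) and right(node)


def either(left: Predicate, right: Predicate) -> Predicate:
    """Model of `or`: true when at least one predicate holds."""
    return lambda node: left(node) or right(node)


def union(left: Iterable[TaggedNode], right: Iterable[TaggedNode]) -> List[TaggedNode]:
    """Model of `|`: merge two results, first-seen order, no repeats."""
    seen = set()
    merged = []
    for node in [*left, *right]:
        if id(node) not in seen:  # compare by identity, nodes are unhashable
            seen.add(id(node))
            merged.append(node)
    return merged


nodes = [
    TaggedNode("word", {"ORG"}),
    TaggedNode("word", {"PERSON"}),
    TaggedNode("line", {"ORG", "PERSON"}),
]
org_words = [n for n in nodes if both(type_regex("^word$"), has_tag("ORG"))(n)]
tagged = [n for n in nodes if either(has_tag("ORG"), has_tag("PERSON"))(n)]
merged = union(org_words, tagged)
```

Here org_words contains only the first node, tagged contains all three, and merged contains all three without repeating the first.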

Pipeline Selectors

Kodexa also supports "pipeline" selectors, allowing you to chain multiple selectors:

my_document.select('//word stream //*[hasTag("ORG")] stream * [hasTag("PERSON")]')

This example streams all nodes of type word, keeps those that have the "ORG" tag, and then, from that result, keeps those that also have the "PERSON" tag.
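The staged behavior can be pictured as a chain of generator filters, where each stage consumes the output of the previous one. This is a simplified model with hypothetical names (Word, keep_tagged, pipeline). It captures only the filtering aspect and ignores any tree navigation a real selector stage may perform.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, Set


@dataclass
class Word:
    """Toy word node with text and a set of tags (not a Kodexa class)."""
    text: str
    tags: Set[str] = field(default_factory=set)


Stage = Callable[[Iterable[Word]], Iterator[Word]]


def keep_tagged(tag: str) -> Stage:
    """Build a stage that lazily passes through only words tagged `tag`."""
    def stage(stream: Iterable[Word]) -> Iterator[Word]:
        return (word for word in stream if tag in word.tags)
    return stage


def pipeline(*stages: Stage) -> Stage:
    """Chain stages so each one receives the previous stage's output."""
    def run(stream: Iterable[Word]) -> Iterator[Word]:
        for stage in stages:
            stream = stage(stream)
        return iter(stream)
    return run


words = [
    Word("Acme", {"ORG"}),
    Word("Jordan", {"ORG", "PERSON"}),
    Word("the"),
]
org_then_person = pipeline(keep_tagged("ORG"), keep_tagged("PERSON"))
print([w.text for w in org_then_person(words)])
# prints: ['Jordan']
```

Because every stage is a generator, nothing is evaluated until the final result is consumed, which is one reason to stream results through stages instead of building a full list at each step.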

Conclusion

Kodexa Documents provide a powerful way to work with structured content. By understanding how to create, save, load, and query documents using selectors, you can efficiently process and analyze complex document structures in your applications.
