What is a Kodexa Document?

Traditional document formats like PDF, DOCX, and images are built for humans to read, not for machines to process. When you need to extract structured data from invoices, annotate contracts with tags, or build hierarchical models from flat content, you need a format designed for that work. A Kodexa Document is a rich, queryable document format that bridges the gap between raw content and structured data.

The KDDB Format

Every Kodexa Document is stored as a KDDB file — a SQLite database with a well-defined schema of 40+ tables. Think of it as “a document that is also a database.” This approach gives you:

Rich querying — XPath-like selectors to find any content
Transactional updates — safe concurrent modifications with audit trails
Efficient storage — zstd-compressed content for documents with thousands of pages
Cross-platform access — the same document works in Python, TypeScript/WASM, and Go
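
Because a KDDB file is a SQLite database, a generic SQLite client should be able to inspect it. The following is a minimal sketch using only Python's standard `sqlite3` module, not the Kodexa SDK. It assumes the file opens as plain SQLite without extra extensions. `list_tables` and `make_stand_in` are hypothetical helpers, and the stand-in file exists only so the sketch runs without a real document; its columns are invented.

```python
import sqlite3
from pathlib import Path


def list_tables(kddb_path: str) -> list[str]:
    """Return the names of all tables in a SQLite file, sorted alphabetically."""
    # Open read-only so that inspection can never modify the document.
    uri = Path(kddb_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def make_stand_in(path: str) -> None:
    """Create a tiny SQLite file standing in for a real .kddb file.

    The column definitions are invented for illustration only.
    """
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE kddb_metadata (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE kddb_content_nodes (id INTEGER PRIMARY KEY)")
    conn.commit()
    conn.close()
```

Calling `list_tables("invoice.kddb")` on a real document (the file name here is a placeholder) would return whatever tables that file actually contains.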

Three Layers of a Document

Every Kodexa Document has three conceptual layers that work together:

| Layer | Purpose | Think of it as… |
| --- | --- | --- |
| Metadata | Document-level properties: UUID, version, labels, source | The document’s “passport” |
| Content Nodes | Hierarchical tree of the document’s content | The document’s “DOM tree” |
| Data Objects | Structured data extracted from the content | The document’s “spreadsheet” |

These layers are independent but connected. Content nodes hold the raw text and spatial positions; data objects hold the semantic meaning extracted from that content.

Content Nodes — The Document Tree

Content nodes form a tree that represents the document structure, much like an HTML DOM tree represents a web page.

document
  page (page 1)
    content-area
      line: “INVOICE #12345”
      line: “ACME Corp”
    content-area
      line: “Widget A $1,234.00”
        word: “Widget”
        word: “A”
        word: “$1,234.00”
      line: “Widget B $567.89”
  page (page 2)
    …

Each content node carries:

| Property | Description |
| --- | --- |
| Type | What kind of node: `page`, `content-area`, `line`, `word`, etc. |
| Content | The text content (computed from content parts) |
| Features | Key-value metadata in `type:name` format |
| Tags | Annotations that mark content for extraction |
| Bounding Box | Spatial position on the page (x, y, width, height) |
| Children | Child nodes in the tree |

A node’s content is never stored directly on the node itself. It’s computed from the content parts table, which stores the actual text segments. This allows multi-part content (e.g., a line with mixed formatting) without duplication.
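
The idea of computed content can be sketched with a small in-memory model. `ContentNode` below is a hypothetical stand-in, not the SDK class, and joining parts by plain concatenation is an assumption made for illustration. The point it shows is that the text lives in the parts and the node derives its content from them on each access.

```python
from dataclasses import dataclass, field


@dataclass
class ContentNode:
    """Hypothetical stand-in for a content node (not the Kodexa SDK class)."""

    node_type: str
    # Ordered text segments; in a KDDB file these live in a separate table.
    parts: list[str] = field(default_factory=list)
    children: list["ContentNode"] = field(default_factory=list)

    @property
    def content(self) -> str:
        # Derived on every access, so there is no second copy to keep in sync.
        # Assumption: parts carry their own spacing and are simply concatenated.
        return "".join(self.parts)


line = ContentNode("line", parts=["INVOICE ", "#12345"])
print(line.content)
```

Because `content` is a property rather than a stored field, changing `parts` is immediately reflected in the node's text.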

Tags and Features — Annotating Content

Tags and features are how Kodexa annotates content nodes with meaning. Features are simple key-value metadata on a node:

spatial:bbox → [0.5, 1.2, 3.4, 1.5]
format:font → "Arial"
format:bold → true

Tags are richer annotations that drive the extraction pipeline:

Tag: "invoice_number"  → confidence: 0.98, value: "12345"
Tag: "line_item"       → index: 0, group_uuid: "abc-123"
Tag: "line_item"       → index: 1, group_uuid: "def-456"

Tags support indexing for repeating elements. When a model identifies multiple line items in an invoice, each one gets its own index, allowing the extraction engine to group related tags together.
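
Grouping by index can be sketched as follows. The `Tag` record and its field names are hypothetical and chosen for illustration; the real tag structure carries more information (confidence, group UUID) than this sketch models. Tags that share an index are collected into one record per repeating element.

```python
from collections import defaultdict
from typing import NamedTuple


class Tag(NamedTuple):
    """Hypothetical tag record; field names are illustrative, not the SDK's."""

    name: str
    index: int
    value: str


def group_by_index(tags: list[Tag]) -> dict[int, dict[str, str]]:
    """Collect tags that share an index into one record per repeating element."""
    groups: dict[int, dict[str, str]] = defaultdict(dict)
    for tag in tags:
        groups[tag.index][tag.name] = tag.value
    # Return a plain dict ordered by index.
    return dict(sorted(groups.items()))


tags = [
    Tag("description", 0, "Widget A"),
    Tag("amount", 0, "$1,234.00"),
    Tag("description", 1, "Widget B"),
    Tag("amount", 1, "$567.89"),
]
line_items = group_by_index(tags)
```

Here `line_items[0]` holds the description and amount of the first line item, and `line_items[1]` those of the second.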

Data Objects — Structured Extraction Results

Data objects represent the semantic meaning extracted from the document. They form their own hierarchy, independent of the content node tree. Each data object has:

Taxonomy Reference — Points to the schema definition (e.g., acme/invoice:1.0.0)
Path — Hierarchical path (e.g., /invoice/line_item/amount)
Attributes — Typed values (string, decimal, boolean, date) with confidence scores
Children — Nested data objects for repeating or complex structures

Data objects are created by the extraction engine, which reads tags from content nodes and builds the structured output according to a taxonomy (data definition).
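
The shape of a data object can be sketched with two small dataclasses. `DataObject`, `Attribute`, and `parse_amount` are hypothetical names, not the SDK's. The `/invoice/line_item` path and the 0.9 confidence values are illustrative assumptions; the taxonomy reference and the 0.98 confidence are taken from the examples above.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import Union

# The typed values named above: string, decimal, boolean, date.
AttributeValue = Union[str, Decimal, bool, date]


@dataclass
class Attribute:
    name: str
    value: AttributeValue
    confidence: float


@dataclass
class DataObject:
    taxonomy_ref: str
    path: str
    attributes: list[Attribute] = field(default_factory=list)
    children: list["DataObject"] = field(default_factory=list)

    def get(self, name: str) -> AttributeValue:
        """Return the value of the first attribute with the given name."""
        for attribute in self.attributes:
            if attribute.name == name:
                return attribute.value
        raise KeyError(name)


def parse_amount(text: str) -> Decimal:
    """Turn a string such as '$1,234.00' into an exact Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


invoice = DataObject(
    "acme/invoice:1.0.0",
    "/invoice",
    attributes=[Attribute("invoice_number", "12345", 0.98)],
)
for description, amount in [("Widget A", "$1,234.00"), ("Widget B", "$567.89")]:
    invoice.children.append(
        DataObject(
            "acme/invoice:1.0.0",
            "/invoice/line_item",
            attributes=[
                Attribute("description", description, 0.9),
                Attribute("amount", parse_amount(amount), 0.9),
            ],
        )
    )

total = sum((child.get("amount") for child in invoice.children), Decimal("0"))
```

Using `Decimal` rather than `float` for amounts keeps sums exact, which matters when extracted totals are validated against the document.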

How Documents Flow Through the Platform

The key stages are:

Upload — Raw documents (PDF, DOCX, images) are uploaded and stored in object storage
Process — AI models parse the document, creating the content node tree with spatial data
Tag — Models annotate content nodes with tags identifying what each piece of content means
Extract — The extraction engine reads tags and builds structured data objects
Review — Users view documents in the UI, review extracted data, and correct errors
Export — Extracted data is exported as JSON, CSV, or pushed to downstream systems

Querying Documents

Kodexa Documents support an XPath-like selector language for navigating and querying the content tree:

# Find all lines
doc.select("//line")

# Find nodes tagged as 'amount'
doc.select("//*[hasTag('amount')]")

# Find lines containing 'Total'
doc.select("//line[contains(@content, 'Total')]")

# Find the first page
doc.select("/page[0]")

This selector language works identically across Python, TypeScript, and Go.
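
To make the meaning of a selector such as `//line` or `//*[hasTag('amount')]` concrete, the sketch below walks a small in-memory tree. It is a toy model, not the Kodexa selector engine: `Node`, `select_by_type`, and `select_by_tag` are hypothetical names, and only the "any descendant" search is modeled.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """Toy content node used only for this illustration."""

    node_type: str
    content: str = ""
    tags: set[str] = field(default_factory=set)
    children: list["Node"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Yield the node and everything beneath it, depth-first in document order."""
    yield node
    for child in node.children:
        yield from descendants(child)


def select_by_type(root: Node, node_type: str) -> list[Node]:
    """Rough equivalent of '//<node_type>'."""
    return [n for n in descendants(root) if n.node_type == node_type]


def select_by_tag(root: Node, tag: str) -> list[Node]:
    """Rough equivalent of "//*[hasTag('<tag>')]"."""
    return [n for n in descendants(root) if tag in n.tags]


widget_a = Node("line", "Widget A $1,234.00", children=[
    Node("word", "Widget"),
    Node("word", "A"),
    Node("word", "$1,234.00", tags={"amount"}),
])
root = Node("document", children=[
    Node("page", children=[
        Node("content-area", children=[
            Node("line", "INVOICE #12345"),
            Node("line", "ACME Corp"),
        ]),
        Node("content-area", children=[
            widget_a,
            Node("line", "Widget B $567.89"),
        ]),
    ]),
])

lines = select_by_type(root, "line")
amounts = select_by_tag(root, "amount")
```

On this tree, `lines` contains the four line nodes in document order and `amounts` contains the single tagged word.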

Key Database Tables

Under the hood, the .kddb file contains these core tables:

| Table | Purpose |
| --- | --- |
| `kddb_metadata` | Document-level metadata (JSON) |
| `kddb_content_nodes` | The hierarchical node tree |
| `kddb_content_node_parts` | Actual text content storage |
| `kddb_content_node_features` | Key-value metadata on nodes |
| `kddb_content_node_tags` | Tag annotations with grouping |
| `kddb_data_objects` | Extracted structured data |
| `kddb_data_attributes` | Typed values within data objects |
| `kddb_taxonomies` | Schema definitions for extraction |
| `kddb_data_exceptions` | Validation errors and exceptions |
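
A quick way to get a feel for a document is to count the rows in each core table. The sketch below uses only the standard `sqlite3` module and relies on the table names listed above, not on any column names, since the schema is not described here. `count_rows` is a hypothetical helper, and it assumes the connection points at a file that opens as plain SQLite.

```python
import sqlite3
from typing import Optional

CORE_TABLES = [
    "kddb_metadata",
    "kddb_content_nodes",
    "kddb_content_node_parts",
    "kddb_content_node_features",
    "kddb_content_node_tags",
    "kddb_data_objects",
    "kddb_data_attributes",
    "kddb_taxonomies",
    "kddb_data_exceptions",
]


def count_rows(
    conn: sqlite3.Connection, tables: list[str]
) -> dict[str, Optional[int]]:
    """Return the row count per table, or None for tables that are absent."""
    existing = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    counts: dict[str, Optional[int]] = {}
    for table in tables:
        if table in existing:
            # The name was just read from sqlite_master, so quoting it is safe.
            counts[table] = conn.execute(
                f'SELECT COUNT(*) FROM "{table}"'
            ).fetchone()[0]
        else:
            counts[table] = None
    return counts
```

With a connection opened on a document, `count_rows(conn, CORE_TABLES)` gives a rough size profile: how many nodes, tags, and data objects the document holds.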

What’s Next?

Getting Started with Python

Set up the Python SDK and start working with documents programmatically.

SDK Reference

Full SDK documentation for Python and TypeScript.

Data Definitions

Learn how to define taxonomies that drive document extraction.

CLI Document Commands

Inspect and manipulate documents from the command line.
