The KDDB Format
Every Kodexa Document is stored as a KDDB file — a SQLite database with a well-defined schema of 40+ tables. Think of it as “a document that is also a database.” This approach gives you:
- Rich querying — XPath-like selectors to find any content
- Transactional updates — safe concurrent modifications with audit trails
- Efficient storage — zstd-compressed content for documents with thousands of pages
- Cross-platform access — the same document works in Python, TypeScript/WASM, and Go
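Because a KDDB file is a SQLite database, generic SQLite tooling should be able to open it. The sketch below uses Python's standard `sqlite3` module against an in-memory stand-in rather than a real `.kddb` file. The table names come from this page, but the single `id` column is a placeholder assumption, not the real schema.

```python
import sqlite3

# Stand-in for a real .kddb file: an in-memory SQLite database holding a few
# of the table names listed on this page. The column layout is a placeholder.
conn = sqlite3.connect(":memory:")
for name in ("kddb_metadata", "kddb_content_nodes", "kddb_data_objects"):
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY)")


def list_tables(connection):
    """Return the names of all tables in the database, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


tables = list_tables(conn)
print(tables)
```

To try the same query on an actual document, replace `":memory:"` with the path to a `.kddb` file and drop the `CREATE TABLE` loop.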
Three Layers of a Document
Every Kodexa Document has three conceptual layers that work together:

| Layer | Purpose | Think of it as… |
|---|---|---|
| Metadata | Document-level properties — UUID, version, labels, source | The document’s “passport” |
| Content Nodes | Hierarchical tree of the document’s content | The document’s “DOM tree” |
| Data Objects | Structured data extracted from the content | The document’s “spreadsheet” |
Content Nodes — The Document Tree
Content nodes form a tree that represents the document structure, much like an HTML DOM tree represents a web page.

    document
      page (page 1)
        content-area
          line: “INVOICE #12345”
          line: “ACME Corp”
        content-area
          line: “Widget A $1,234.00”
            word: “Widget”
            word: “A”
            word: “$1,234.00”
          line: “Widget B $567.89”
      page (page 2)
        …

| Property | Description |
|---|---|
| Type | What kind of node — page, content-area, line, word, etc. |
| Content | The text content (computed from content parts) |
| Features | Key-value metadata in type:name format |
| Tags | Annotations that mark content for extraction |
| Bounding Box | Spatial position on the page (x, y, width, height) |
| Children | Child nodes in the tree |
A node’s content is never stored directly on the node itself. It’s computed from the content parts table, which stores the actual text segments. This allows multi-part content (e.g., a line with mixed formatting) without duplication.
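The node properties above can be modelled in a few lines of Python. This is a minimal in-memory sketch, not the Kodexa SDK: the `ContentNode` class, its field names, the `spatial:column` feature, and the `line_item` tag are all hypothetical, and joining parts with a single space is an assumption about how content is assembled.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class ContentNode:
    """Hypothetical stand-in for a content node; not the SDK class."""
    node_type: str
    parts: List[str] = field(default_factory=list)            # stored text segments
    features: Dict[str, object] = field(default_factory=dict)  # keys in "type:name" form
    tags: List[str] = field(default_factory=list)
    bbox: Optional[Tuple[float, float, float, float]] = None   # x, y, width, height
    children: List["ContentNode"] = field(default_factory=list)

    @property
    def content(self) -> str:
        # Content is computed from the parts, never stored on the node itself.
        # Joining with a space is an assumption made for this sketch.
        return " ".join(self.parts)

    def walk(self) -> Iterator["ContentNode"]:
        """Yield this node and all descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Rebuild one line of the example tree shown earlier.
words = [ContentNode("word", parts=[w]) for w in ("Widget", "A", "$1,234.00")]
line = ContentNode(
    "line",
    parts=["Widget A", "$1,234.00"],   # two parts, e.g. mixed formatting
    features={"spatial:column": 1},    # hypothetical feature
    tags=["line_item"],                # hypothetical tag
    children=words,
)
document = ContentNode(
    "document",
    children=[ContentNode("page", children=[ContentNode("content-area", children=[line])])],
)

print(line.content)
print([node.node_type for node in document.walk()])
```

Keeping `content` as a computed property mirrors the design described above: the text lives in one place (the parts), so a node with several parts never holds a second copy of it.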
Tags and Features — Annotating Content
Tags and features are how Kodexa annotates content nodes with meaning. Features are simple key-value metadata on a node, written in type:name format. Tags are annotations that mark content for extraction.
Data Objects — Structured Extraction Results
Data objects represent the semantic meaning extracted from the document. They form their own hierarchy, independent of the content node tree. Each data object has:
- Taxonomy Reference — Points to the schema definition (e.g., acme/invoice:1.0.0)
- Path — Hierarchical path (e.g., /invoice/line_item/amount)
- Attributes — Typed values (string, decimal, boolean, date) with confidence scores
- Children — Nested data objects for repeating or complex structures
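The four parts of a data object listed above map naturally onto a small recursive structure. The sketch below is illustrative only: `DataObject`, `DataAttribute`, and their field names are hypothetical, the confidence values are invented for the example, and treating `amount` as an attribute of a `/invoice/line_item` object is one possible reading of the path shown above.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import List, Union

# The typed values named above: string, decimal, boolean, date.
AttributeValue = Union[str, Decimal, bool, date]


@dataclass
class DataAttribute:
    """Hypothetical typed value with a confidence score."""
    name: str
    value: AttributeValue
    confidence: float


@dataclass
class DataObject:
    """Hypothetical stand-in for a data object; not the SDK class."""
    taxonomy_ref: str
    path: str
    attributes: List[DataAttribute] = field(default_factory=list)
    children: List["DataObject"] = field(default_factory=list)

    def find(self, path: str) -> List["DataObject"]:
        """Return every object in this subtree whose path matches exactly."""
        found = [self] if self.path == path else []
        for child in self.children:
            found.extend(child.find(path))
        return found


TAXONOMY = "acme/invoice:1.0.0"
invoice = DataObject(
    TAXONOMY,
    "/invoice",
    children=[
        # Repeating structure: one child object per invoice line.
        DataObject(TAXONOMY, "/invoice/line_item",
                   [DataAttribute("amount", Decimal("1234.00"), 0.98)]),
        DataObject(TAXONOMY, "/invoice/line_item",
                   [DataAttribute("amount", Decimal("567.89"), 0.91)]),
    ],
)

line_items = invoice.find("/invoice/line_item")
total = sum(
    (attr.value for item in line_items for attr in item.attributes if attr.name == "amount"),
    Decimal("0"),
)
print(len(line_items), total)
```

Using `Decimal` rather than `float` for monetary attributes avoids binary rounding surprises when amounts are summed.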
How Documents Flow Through the Platform
The key stages are:
- Upload — Raw documents (PDF, DOCX, images) are uploaded and stored in object storage
- Process — AI models parse the document, creating the content node tree with spatial data
- Tag — Models annotate content nodes with tags identifying what each piece of content means
- Extract — The extraction engine reads tags and builds structured data objects
- Review — Users view documents in the UI, review extracted data, and correct errors
- Export — Extracted data is exported as JSON, CSV, or pushed to downstream systems
Querying Documents
Kodexa Documents support an XPath-like selector language for navigating and querying the content tree.
Key Database Tables
Under the hood, the .kddb file contains these core tables:
| Table | Purpose |
|---|---|
| kddb_metadata | Document-level metadata (JSON) |
| kddb_content_nodes | The hierarchical node tree |
| kddb_content_node_parts | Actual text content storage |
| kddb_content_node_features | Key-value metadata on nodes |
| kddb_content_node_tags | Tag annotations with grouping |
| kddb_data_objects | Extracted structured data |
| kddb_data_attributes | Typed values within data objects |
| kddb_taxonomies | Schema definitions for extraction |
| kddb_data_exceptions | Validation errors and exceptions |
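To show how these tables might relate, here is a sketch that joins nodes, parts, and tags with plain SQL through Python's `sqlite3` module. Only the table names are taken from the list above. Every column (`node_id`, `position`, `tag_name`, and so on) is an assumption made for illustration, and the real schema may differ, so inspect an actual file before relying on any of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified columns; only the table names come from this page.
conn.executescript("""
CREATE TABLE kddb_content_nodes (id INTEGER PRIMARY KEY, parent_id INTEGER, node_type TEXT);
CREATE TABLE kddb_content_node_parts (node_id INTEGER, position INTEGER, content TEXT);
CREATE TABLE kddb_content_node_tags (node_id INTEGER, tag_name TEXT);

INSERT INTO kddb_content_nodes VALUES (1, NULL, 'line'), (2, NULL, 'line'), (3, NULL, 'line');
INSERT INTO kddb_content_node_parts VALUES
    (1, 0, 'ACME Corp'),
    (2, 0, 'Widget A'), (2, 1, '$1,234.00'),
    (3, 0, 'Widget B'), (3, 1, '$567.89');
INSERT INTO kddb_content_node_tags VALUES (1, 'vendor'), (2, 'line_item'), (3, 'line_item');
""")


def tagged_content(connection, tag_name):
    """Map node id -> reassembled text for every node carrying the given tag."""
    rows = connection.execute(
        """
        SELECT n.id, p.content
        FROM kddb_content_node_tags AS t
        JOIN kddb_content_nodes AS n ON n.id = t.node_id
        JOIN kddb_content_node_parts AS p ON p.node_id = n.id
        WHERE t.tag_name = ?
        ORDER BY n.id, p.position
        """,
        (tag_name,),
    )
    grouped = {}
    for node_id, part in rows:
        grouped.setdefault(node_id, []).append(part)
    # Parts arrive ordered by position, so joining preserves reading order.
    return {node_id: " ".join(parts) for node_id, parts in grouped.items()}


line_items = tagged_content(conn, "line_item")
print(line_items)
```

The parts are grouped in Python rather than with SQL string aggregation so that their order is controlled explicitly by the `ORDER BY` clause.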
What’s Next?
Getting Started with Python
Set up the Python SDK and start working with documents programmatically.
SDK Reference
Full SDK documentation for Python and TypeScript.
Data Definitions
Learn how to define taxonomies that drive document extraction.
CLI Document Commands
Inspect and manipulate documents from the command line.
