Traditional document formats like PDF, DOCX, and images are built for humans to read — not for machines to process. When you need to extract structured data from invoices, annotate contracts with tags, or build hierarchical models from flat content, you need something more powerful. Kodexa Document solves this. It’s a rich, queryable document format that bridges the gap between raw content and structured data.

The KDDB Format

Every Kodexa Document is stored as a KDDB file — a SQLite database with a well-defined schema of 40+ tables. Think of it as “a document that is also a database.” This approach gives you:
  • Rich querying — XPath-like selectors to find any content
  • Transactional updates — safe concurrent modifications with audit trails
  • Efficient storage — zstd-compressed content for documents with thousands of pages
  • Cross-platform access — the same document works in Python, TypeScript/WASM, and Go
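
Because a KDDB file is a SQLite database, any SQLite client should be able to open one for inspection. The sketch below uses Python's standard `sqlite3` module to list the tables in a database. To stay self-contained it builds a small in-memory stand-in rather than opening a real file; the two table names come from the table list later on this page, but the column layouts are assumptions made for illustration only.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in an open SQLite database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in for sqlite3.connect("some-document.kddb"): an in-memory database
# with two table names borrowed from this page. The columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kddb_metadata (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute(
    "CREATE TABLE kddb_content_nodes "
    "(id INTEGER PRIMARY KEY, parent_id INTEGER, node_type TEXT)"
)

tables = list_tables(conn)
print(tables)
```

Pointing `sqlite3.connect` at an actual `.kddb` path instead of `:memory:` would run the same query against a real document.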

Three Layers of a Document

Every Kodexa Document has three conceptual layers that work together:
Layer | Purpose | Think of it as…
Metadata | Document-level properties: UUID, version, labels, source | The document’s “passport”
Content Nodes | Hierarchical tree of the document’s content | The document’s “DOM tree”
Data Objects | Structured data extracted from the content | The document’s “spreadsheet”
These layers are independent but connected. Content nodes hold the raw text and spatial positions; data objects hold the semantic meaning extracted from that content.

Content Nodes — The Document Tree

Content nodes form a tree that represents the document structure, much like an HTML DOM tree represents a web page.
document
  page (page 1)
    content-area
      line: “INVOICE #12345”
      line: “ACME Corp”
    content-area
      line: “Widget A $1,234.00”
        word: “Widget”
        word: “A”
        word: “$1,234.00”
      line: “Widget B $567.89”
  page (page 2)
Each content node carries:
Property | Description
Type | What kind of node: page, content-area, line, word, etc.
Content | The text content (computed from content parts)
Features | Key-value metadata in type:name format
Tags | Annotations that mark content for extraction
Bounding Box | Spatial position on the page (x, y, width, height)
Children | Child nodes in the tree
A node’s content is never stored directly on the node itself. It’s computed from the content parts table, which stores the actual text segments. This allows multi-part content (e.g., a line with mixed formatting) without duplication.
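
To make the idea of computed content concrete, here is a minimal in-memory sketch of a node whose text is assembled from its parts on demand. The `ContentNode` class, its field names, and the `walk` helper are hypothetical and do not represent the Kodexa SDK; they only model the relationship described above.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class ContentNode:
    """Toy model of a content node (not the Kodexa SDK class)."""

    node_type: str
    parts: list[str] = field(default_factory=list)
    children: list["ContentNode"] = field(default_factory=list)

    @property
    def content(self) -> str:
        # Content is derived from the parts each time; it is never stored.
        return "".join(self.parts)

    def walk(self) -> Iterator["ContentNode"]:
        """Yield this node and all of its descendants, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


# A line made of two text segments, e.g. a description and an amount.
line = ContentNode("line", parts=["Widget A ", "$1,234.00"])
page = ContentNode("page", children=[ContentNode("content-area", children=[line])])

node_types = [node.node_type for node in page.walk()]
print(line.content)
print(node_types)
```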

Tags and Features — Annotating Content

Tags and features are how Kodexa annotates content nodes with meaning. Features are simple key-value metadata on a node:
spatial:bbox → [0.5, 1.2, 3.4, 1.5]
format:font → "Arial"
format:bold → true
Tags are richer annotations that drive the extraction pipeline:
Tag: "invoice_number"  → confidence: 0.98, value: "12345"
Tag: "line_item"       → index: 0, group_uuid: "abc-123"
Tag: "line_item"       → index: 1, group_uuid: "def-456"
Tags support indexing for repeating elements. When a model identifies multiple line items in an invoice, each one gets its own index, allowing the extraction engine to group related tags together.
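
The grouping step can be sketched as follows. This is an illustration of index-based grouping under stated assumptions, not the extraction engine's actual logic: the `TagInstance` class, the tag names `description` and `amount`, and the `group_by_index` function are all invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TagInstance:
    """Hypothetical flattened view of one tag applied to one node."""

    name: str
    value: str
    index: int


def group_by_index(tags: list[TagInstance]) -> list[dict[str, str]]:
    """Collect tags that share an index into one record per index."""
    groups: dict[int, dict[str, str]] = defaultdict(dict)
    for tag in tags:
        groups[tag.index][tag.name] = tag.value
    # Return the records in ascending index order.
    return [groups[i] for i in sorted(groups)]


tags = [
    TagInstance("description", "Widget A", 0),
    TagInstance("amount", "$1,234.00", 0),
    TagInstance("description", "Widget B", 1),
    TagInstance("amount", "$567.89", 1),
]

line_items = group_by_index(tags)
print(line_items)
```

Each resulting dictionary corresponds to one repeated element, so the two invoice lines stay separate even though they use the same tag names.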

Data Objects — Structured Extraction Results

Data objects represent the semantic meaning extracted from the document. They form their own hierarchy, independent of the content node tree. Each data object has:
  • Taxonomy Reference — Points to the schema definition (e.g., acme/invoice:1.0.0)
  • Path — Hierarchical path (e.g., /invoice/line_item/amount)
  • Attributes — Typed values (string, decimal, boolean, date) with confidence scores
  • Children — Nested data objects for repeating or complex structures
Data objects are created by the extraction engine, which reads tags from content nodes and builds the structured output according to a taxonomy (data definition).
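
As a rough picture of that hierarchy, the sketch below models data objects and attributes as plain dataclasses and walks the tree to find attributes whose confidence falls below a threshold. The class names, field names, and the `low_confidence` helper are assumptions made for illustration; only the taxonomy reference and path examples are taken from this page.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Attribute:
    """Hypothetical typed value with a confidence score."""

    name: str
    value: object
    confidence: float


@dataclass
class DataObject:
    """Hypothetical data object; not the Kodexa SDK class."""

    taxonomy_ref: str
    path: str
    attributes: list[Attribute] = field(default_factory=list)
    children: list["DataObject"] = field(default_factory=list)


def low_confidence(obj: DataObject, threshold: float) -> list[tuple[str, float]]:
    """Return (attribute path, confidence) pairs below the threshold."""
    flagged = [
        (f"{obj.path}/{attr.name}", attr.confidence)
        for attr in obj.attributes
        if attr.confidence < threshold
    ]
    for child in obj.children:
        flagged.extend(low_confidence(child, threshold))
    return flagged


ref = "acme/invoice:1.0.0"
invoice = DataObject(
    ref,
    "/invoice",
    attributes=[Attribute("invoice_number", "12345", 0.98)],
    children=[
        DataObject(ref, "/invoice/line_item", [Attribute("amount", Decimal("1234.00"), 0.91)]),
        DataObject(ref, "/invoice/line_item", [Attribute("amount", Decimal("567.89"), 0.62)]),
    ],
)

flagged = low_confidence(invoice, threshold=0.9)
print(flagged)
```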

How Documents Flow Through the Platform

The key stages are:
  1. Upload — Raw documents (PDF, DOCX, images) are uploaded and stored in object storage
  2. Process — AI models parse the document, creating the content node tree with spatial data
  3. Tag — Models annotate content nodes with tags identifying what each piece of content means
  4. Extract — The extraction engine reads tags and builds structured data objects
  5. Review — Users view documents in the UI, review extracted data, and correct errors
  6. Export — Extracted data is exported as JSON, CSV, or pushed to downstream systems

Querying Documents

Kodexa Documents support an XPath-like selector language for navigating and querying the content tree:
# Find all lines
doc.select("//line")

# Find nodes tagged as 'amount'
doc.select("//*[hasTag('amount')]")

# Find lines containing 'Total'
doc.select("//line[contains(@content, 'Total')]")

# Find the first page
doc.select("/page[0]")
This selector language works identically across Python, TypeScript, and Go.
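
For readers unfamiliar with XPath-style queries, the following toy code shows what three of the selectors above mean in terms of a tree walk. It is not the Kodexa selector engine and does not parse selector strings; the `Node` class and `select_all` function are hypothetical stand-ins that reproduce the same descendant search with ordinary Python.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator


@dataclass
class Node:
    """Toy content node used only for this illustration."""

    node_type: str
    content: str = ""
    tags: set[str] = field(default_factory=set)
    children: list["Node"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Yield every node below the given node, depth-first."""
    for child in node.children:
        yield child
        yield from descendants(child)


def select_all(
    root: Node,
    node_type: str = "*",
    predicate: Callable[[Node], bool] = lambda n: True,
) -> list[Node]:
    """Rough equivalent of //node_type[predicate] evaluated from the root."""
    return [
        n
        for n in descendants(root)
        if node_type in ("*", n.node_type) and predicate(n)
    ]


doc = Node("document", children=[
    Node("page", children=[
        Node("line", "Widget A $1,234.00", children=[
            Node("word", "$1,234.00", tags={"amount"}),
        ]),
        Node("line", "Total $1,801.89"),
    ]),
])

all_lines = select_all(doc, "line")                            # like //line
tagged = select_all(doc, "*", lambda n: "amount" in n.tags)    # like //*[hasTag('amount')]
totals = select_all(doc, "line", lambda n: "Total" in n.content)
print(len(all_lines), tagged[0].content, totals[0].content)
```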

Key Database Tables

Under the hood, the .kddb file contains these core tables:
Table | Purpose
kddb_metadata | Document-level metadata (JSON)
kddb_content_nodes | The hierarchical node tree
kddb_content_node_parts | Actual text content storage
kddb_content_node_features | Key-value metadata on nodes
kddb_content_node_tags | Tag annotations with grouping
kddb_data_objects | Extracted structured data
kddb_data_attributes | Typed values within data objects
kddb_taxonomies | Schema definitions for extraction
kddb_data_exceptions | Validation errors and exceptions
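
To show how the node and parts tables might relate, the sketch below joins them to rebuild each line's text. Only the two table names are taken from the list above. Every column name (`id`, `parent_id`, `node_type`, `content_node_id`, `pos`, `content`) is an assumption, so check the real schema before adapting the query.

```python
import sqlite3
from collections import defaultdict

# In-memory stand-in for a .kddb file, using hypothetical column layouts.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE kddb_content_nodes (
        id INTEGER PRIMARY KEY, parent_id INTEGER, node_type TEXT);
    CREATE TABLE kddb_content_node_parts (
        content_node_id INTEGER, pos INTEGER, content TEXT);
    INSERT INTO kddb_content_nodes VALUES
        (1, NULL, 'page'), (2, 1, 'line'), (3, 1, 'line');
    INSERT INTO kddb_content_node_parts VALUES
        (2, 0, 'INVOICE '), (2, 1, '#12345'), (3, 0, 'ACME Corp');
    """
)


def node_contents(conn: sqlite3.Connection, node_type: str) -> dict[int, str]:
    """Rebuild the text of every node of one type from its ordered parts."""
    rows = conn.execute(
        """
        SELECT n.id, p.content
        FROM kddb_content_nodes AS n
        JOIN kddb_content_node_parts AS p ON p.content_node_id = n.id
        WHERE n.node_type = ?
        ORDER BY n.id, p.pos
        """,
        (node_type,),
    )
    contents: dict[int, str] = defaultdict(str)
    for node_id, part in rows:
        contents[node_id] += part  # parts arrive in position order
    return dict(contents)


lines = node_contents(conn, "line")
print(lines)
```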
