Document Structure Deep Dive

This page takes a deeper look at how a Kodexa Document represents a real-world document internally. If you haven’t read the overview, start there.

From PDF to Content Tree

When a document is processed, the AI model reads the raw content and builds a content node tree that preserves the document’s structure. Here’s how a simple invoice maps to a content tree:

Original Document

ACME Corp

Invoice #12345

Item	Amount
Widget A	$1,234.00
Widget B	$567.89

Total: $1,801.89

Payment due in 30 days

Content Tree

document

page (index: 0)

content-area

line: “ACME Corp”

line: “Invoice #12345”

content-area

line: “Item Amount”

line: “Widget A $1,234.00”

line: “Widget B $567.89”

content-area

line: “Total: $1,801.89”

line: “Payment due in 30 days”
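
To make the nesting explicit, here is a minimal sketch that rebuilds the tree above in plain Python. The `Node` class and the `area` helper are hypothetical stand-ins invented for illustration; they are not the Kodexa classes, which may look quite different.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a content node (not the Kodexa class)."""
    node_type: str
    content: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Node"]]:
        """Yield (depth, node) pairs in document order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def area(*lines: str) -> Node:
    """Build a content-area holding one line node per string."""
    return Node("content-area", children=[Node("line", text) for text in lines])


# The invoice from the example: one page with three content areas.
invoice_tree = Node("document", children=[
    Node("page", children=[
        area("ACME Corp", "Invoice #12345"),
        area("Item Amount", "Widget A $1,234.00", "Widget B $567.89"),
        area("Total: $1,801.89", "Payment due in 30 days"),
    ]),
])

# Collect the text of every line node, in reading order.
line_texts = [n.content for _, n in invoice_tree.walk() if n.node_type == "line"]

if __name__ == "__main__":
    # Print the tree with two spaces of indentation per level.
    for depth, node in invoice_tree.walk():
        label = node.node_type
        if node.content is not None:
            label += f': "{node.content}"'
        print("  " * depth + label)
```

Printing with indentation shows the parent-child relationships that the flat listing above only implies.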

Node Types

The type field on each content node describes what kind of content it represents. Common types include:

Type	Description	Typical Children
`document`	Root node (always exactly one)	`page`
`page`	A single page of the document	`content-area`
`content-area`	A region of content on a page	`line`
`line`	Single line of text	`word` (optional)
`word`	Individual word	—

Node types are not fixed — models can create any type they need. The types above are conventions used by Kodexa’s built-in document processing models.
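
Because the types are conventions rather than a fixed schema, code that inspects a tree should tolerate unknown types. The sketch below is one possible approach, not part of Kodexa: it encodes the table above in a hypothetical `TYPICAL_CHILDREN` mapping and reports only the places where a conventional parent has an unexpected child, leaving custom types alone. Trees are represented as plain `(type, children)` tuples purely for illustration.

```python
from typing import Dict, List, Set, Tuple

Tree = Tuple[str, list]  # (node type, list of child trees); illustrative only

# The "Typical Children" column from the table above.
TYPICAL_CHILDREN: Dict[str, Set[str]] = {
    "document": {"page"},
    "page": {"content-area"},
    "content-area": {"line"},
    "line": {"word"},
    "word": set(),
}


def atypical_edges(tree: Tree) -> List[Tuple[str, str]]:
    """Return (parent type, child type) pairs that depart from the conventions.

    Parents with a custom type are never reported, since models may
    create any type they need.
    """
    node_type, children = tree
    expected = TYPICAL_CHILDREN.get(node_type)  # None for custom types
    found: List[Tuple[str, str]] = []
    for child in children:
        if expected is not None and child[0] not in expected:
            found.append((node_type, child[0]))
        found.extend(atypical_edges(child))
    return found


conventional = ("document", [("page", [("content-area", [("line", [])])])])
with_custom = ("document", [("page", [("table", [("row", [])])])])
```

Here `table` and `row` are made-up custom types: the check flags `page -> table` as unconventional but does not complain about what `table` contains.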

Spatial Data — Bounding Boxes

Every content node can carry a bounding box that describes its physical position on the page. A bounding box has four values, [x, y, width, height], measured from the top-left corner of the page. This spatial data lets the UI highlight content and lets models reason about spatial relationships. It supports:

Visual highlighting in the document viewer
Spatial queries like “find all nodes in the top-right quadrant”
Layout reconstruction from OCR’d content
Confidence visualization by overlaying tag colors on the source document
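
As an illustration of the spatial query mentioned above, the following sketch tests whether a box falls in the top-right quadrant of a page. The `BBox` class and `in_top_right_quadrant` function are hypothetical, and two assumptions are made that the text does not state: y grows downward from the top-left origin, and a box belongs to the quadrant that contains its center.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BBox:
    """Hypothetical [x, y, width, height] box measured from the page's top-left."""
    x: float
    y: float
    width: float
    height: float

    @property
    def center(self) -> Tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def in_top_right_quadrant(box: BBox, page_width: float, page_height: float) -> bool:
    """True when the box's center lies in the top-right quarter of the page.

    Assumes y increases downward, so "top" means a small y value.
    """
    cx, cy = box.center
    return cx >= page_width / 2 and cy <= page_height / 2


# Example page size in arbitrary units (assumed, not taken from the docs).
PAGE_W, PAGE_H = 612.0, 792.0
header_right = BBox(400, 50, 100, 20)   # near the top, right of center
header_left = BBox(50, 50, 100, 20)     # near the top, left of center
footer_right = BBox(400, 700, 100, 20)  # near the bottom, right of center
```

Using the center point is a design choice; a stricter query could require the whole box to sit inside the quadrant.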

Content Parts — How Text is Stored

A node’s text content is not stored directly on the node. Instead, it’s stored as content parts — separate text segments that are assembled to produce the final content. This design supports:

Mixed formatting — parts can have different font styles
Efficient updates — change one word without rewriting the whole line
Compression — content parts are stored as zstd-compressed BLOBs
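
The idea of assembling text from parts can be sketched as follows. This is a simplified model, not the Kodexa storage format: the `Part` class, the `bold` flag, and the single-space separator are all assumptions made for illustration, and compression is left out entirely.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Part:
    """Hypothetical content part: a text segment with its own formatting."""
    text: str
    bold: bool = False


def assemble(parts: List[Part], separator: str = " ") -> str:
    """Join the parts into the node's final text (separator is an assumption)."""
    return separator.join(part.text for part in parts)


def update_part(parts: List[Part], index: int, new_text: str) -> List[Part]:
    """Return a new list with one part's text changed; other parts are reused."""
    updated = list(parts)
    updated[index] = replace(parts[index], text=new_text)
    return updated


line = [Part("Total:", bold=True), Part("$1,801.89")]
corrected = update_part(line, 1, "$1,800.00")
```

Only the second part is replaced in `corrected`; the bold "Total:" part is the same object as before, which mirrors the "change one word without rewriting the whole line" benefit.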

Tag Groups — Handling Repeating Data

When a document contains repeating elements (line items, addresses), tags use indexing and group UUIDs to keep related items together. For an invoice with two line items, the extraction engine uses these groups to produce correctly structured data objects: each line item becomes its own data object with the right attributes.
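
A rough sketch of this grouping, using the two line items from the example invoice. The `Tag` fields and the `group_tags` function are hypothetical; in particular, it is an assumption that each tag carries both an index and a group UUID, and the real tag model may differ.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Tag:
    """Hypothetical tag record; field names are illustrative."""
    path: str        # e.g. "line_item/amount"
    value: str       # text of the tagged node
    index: int       # position of the repeating item
    group_uuid: str  # shared by all tags of one item


def group_tags(tags: List[Tag]) -> List[Dict[str, str]]:
    """Collect tags into one attribute dict per group, ordered by index."""
    groups: Dict[tuple, Dict[str, str]] = defaultdict(dict)
    for tag in tags:
        attribute = tag.path.rsplit("/", 1)[-1]  # "line_item/amount" -> "amount"
        groups[(tag.index, tag.group_uuid)][attribute] = tag.value
    return [groups[key] for key in sorted(groups)]


first, second = str(uuid.uuid4()), str(uuid.uuid4())
tags = [
    Tag("line_item/description", "Widget A", 0, first),
    Tag("line_item/description", "Widget B", 1, second),
    Tag("line_item/amount", "$1,234.00", 0, first),
    Tag("line_item/amount", "$567.89", 1, second),
]
line_items = group_tags(tags)
```

Even though the tags arrive interleaved, each description ends up paired with the amount from the same line item.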

The Extraction Pipeline

The journey from raw content to structured data follows these steps:

Parse

The document model processes the raw file and creates the content node tree with text, spatial data, and structural relationships.

Tag

AI models analyze the content and apply tags to nodes, marking what each piece of content represents (e.g., “this line is an amount”, “this line is an invoice number”).

Group

Tags with indices are grouped together. All tags with index=0 for a given path form one group, index=1 forms another, etc.

Extract

The extraction engine reads the taxonomy (data definition), walks the tagged content, and builds structured data objects with typed attributes.

Validate

Formulas, validation rules, and business logic run against the extracted data. Exceptions are created for any failures.

Review

Users review the extracted data in the Kodexa UI, correct any errors, and approve the results. Changes are tracked via the delta/audit system.
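
To give a feel for the Validate step, here is a minimal sketch of one rule: the invoice total should equal the sum of the line item amounts. The rule, the `validate_total` function, and the dictionary layout are all hypothetical; Kodexa's actual formula and exception mechanisms are not shown here.

```python
from decimal import Decimal
from typing import Dict, List


def parse_currency(text: str) -> Decimal:
    """Convert a string such as "$1,234.00" to a Decimal (US-style format assumed)."""
    return Decimal(text.replace("$", "").replace(",", ""))


def validate_total(invoice: Dict) -> List[str]:
    """Return a list of exception messages; an empty list means the rule passed."""
    exceptions: List[str] = []
    line_sum = sum(
        (parse_currency(item["amount"]) for item in invoice["line_items"]),
        Decimal("0"),
    )
    total = parse_currency(invoice["total"])
    if line_sum != total:
        exceptions.append(f"total {total} does not match line item sum {line_sum}")
    return exceptions


extracted = {
    "total": "$1,801.89",
    "line_items": [{"amount": "$1,234.00"}, {"amount": "$567.89"}],
}
mismatched = dict(extracted, total="$1,800.00")
```

`Decimal` is used instead of `float` so that currency arithmetic stays exact.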

Data Definitions (Taxonomies)

A data definition (taxonomy) is the schema that tells the extraction engine what to look for and how to structure the output. It defines:

Taxons — The fields to extract (e.g., invoice_number, line_item/amount)
Data types — String, integer, decimal, boolean, date, currency
Hierarchy — Parent-child relationships between fields
Group behavior — Whether a field repeats (like line items) or is singular

# Example: Invoice data definition
name: invoice
taxons:
  - name: invoice_number
    type: string
  - name: invoice_date
    type: date
  - name: total
    type: currency
  - name: line_item
    group: true
    children:
      - name: description
        type: string
      - name: quantity
        type: integer
      - name: amount
        type: currency

The data definition drives both the tagging models (which tags to apply) and the extraction engine (how to build data objects from those tags).
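
The hierarchy in the example can be flattened into slash-separated paths such as line_item/amount. The sketch below mirrors the YAML above as a Python dictionary and walks it; the `taxon_paths` helper is hypothetical and only illustrates how paths and types might be derived from the definition.

```python
from typing import Dict, Iterator, List, Tuple

# The invoice data definition from above, written as a Python dict.
INVOICE_DEFINITION: Dict = {
    "name": "invoice",
    "taxons": [
        {"name": "invoice_number", "type": "string"},
        {"name": "invoice_date", "type": "date"},
        {"name": "total", "type": "currency"},
        {
            "name": "line_item",
            "group": True,
            "children": [
                {"name": "description", "type": "string"},
                {"name": "quantity", "type": "integer"},
                {"name": "amount", "type": "currency"},
            ],
        },
    ],
}


def taxon_paths(taxons: List[Dict], prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, type) for every leaf taxon, joining levels with "/"."""
    for taxon in taxons:
        path = f"{prefix}{taxon['name']}"
        if "children" in taxon:
            yield from taxon_paths(taxon["children"], prefix=f"{path}/")
        else:
            yield path, taxon["type"]


paths = list(taxon_paths(INVOICE_DEFINITION["taxons"]))
repeating = [t["name"] for t in INVOICE_DEFINITION["taxons"] if t.get("group")]
```

The resulting paths match the tag paths used during tagging, and `repeating` lists the taxons whose tags need to be grouped.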

Learn More About Data Definitions

See the full guide on building data definitions for your document types.

Working with Documents Programmatically

The examples below use Python, TypeScript, and the kdx CLI, in that order.

from kodexa_document import Document

# Load a KDDB file
with Document.from_kddb("invoice.kddb") as doc:
    # Navigate the content tree
    root = doc.content_node
    pages = root.get_children()

    # Query with selectors
    amounts = doc.select("//*[hasTag('amount')]")
    for node in amounts:
        print(f"Amount: {node.content}")

    # Access extracted data
    for data_obj in doc.get_data_objects():
        print(f"Path: {data_obj.path}")
        for attr in data_obj.get_attributes():
            print(f"  {attr.name}: {attr.value}")

import { Kodexa } from '@kodexa-ai/document-wasm-ts';

await Kodexa.init();

// Load a KDDB file (kddbBytes holds the file contents, loaded elsewhere)
const doc = await Kodexa.fromBlob(kddbBytes);

// Navigate the content tree
const root = await doc.getRoot();
const pages = await root.getChildren();

// Query with selectors
const amounts = await doc.select("//*[hasTag('amount')]");
for (const node of amounts) {
    console.log(`Amount: ${await node.getContent()}`);
}

# View document structure
kdx document structure invoice.kddb

# View extracted data
kdx document data invoice.kddb

# Query content nodes
kdx document query invoice.kddb "//*[hasTag('amount')]"

# View tags
kdx document tags invoice.kddb
