This page takes a deeper look at how a Kodexa Document represents a real-world document internally. If you haven’t read the overview, start there first.

From PDF to Content Tree

When a document is processed, the AI model reads the raw content and builds a content node tree that preserves the document’s structure. Here’s how a simple invoice maps to a content tree:

Original Document

ACME Corp
Invoice #12345
Item        Amount
Widget A    $1,234.00
Widget B    $567.89
Total: $1,801.89
Payment due in 30 days

Content Tree

document
  page (index: 0)
    content-area
      line: “ACME Corp”
      line: “Invoice #12345”
    content-area
      line: “Item Amount”
      line: “Widget A $1,234.00”
      line: “Widget B $567.89”
    content-area
      line: “Total: $1,801.89”
      line: “Payment due in 30 days”
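
To make the mapping concrete, the tree above can be sketched as a small recursive data structure. This is an illustration only: the `Node` class, its fields, and the `area` helper are hypothetical stand-ins, not the Kodexa classes.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a content node (not the Kodexa class)."""
    type: str
    content: Optional[str] = None
    index: int = 0
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        """Yield this node and all of its descendants in document order."""
        yield self
        for child in self.children:
            yield from child.walk()


def area(*texts: str) -> Node:
    """Build a content-area whose children are line nodes."""
    lines = [Node("line", text, i) for i, text in enumerate(texts)]
    return Node("content-area", children=lines)


tree = Node("document", children=[
    Node("page", index=0, children=[
        area("ACME Corp", "Invoice #12345"),
        area("Item Amount", "Widget A $1,234.00", "Widget B $567.89"),
        area("Total: $1,801.89", "Payment due in 30 days"),
    ]),
])

lines = [n.content for n in tree.walk() if n.type == "line"]
print(len(lines))  # 7
print(lines[0])    # ACME Corp
```

A depth-first walk visits nodes in reading order, which is why the line texts come back in the same sequence as the original invoice.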

Node Types

The type field on each content node describes what kind of content it represents. Common types include:
Type          Description                       Typical Children
document      Root node (always exactly one)    page
page          A single page of the document     content-area
content-area  A region of content on a page     line
line          Single line of text               word (optional)
word          Individual word
Node types are not fixed — models can create any type they need. The types above are conventions used by Kodexa’s built-in document processing models.

Spatial Data — Bounding Boxes

Every content node can carry a bounding box that describes its physical position on the page. This is critical for document understanding, enabling the UI to highlight content and allowing models to reason about spatial relationships. A bounding box has four values: [x, y, width, height] measured from the top-left corner of the page. This allows:
  • Visual highlighting in the document viewer
  • Spatial queries like “find all nodes in the top-right quadrant”
  • Layout reconstruction from OCR’d content
  • Confidence visualization by overlaying tag colors on the source document
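
As a rough illustration of a spatial query, the sketch below tests whether a box falls in the top-right quadrant of a page. The `BBox` type, the `in_top_right` helper, and the coordinates are all hypothetical, and it assumes that y grows downward from the top-left origin.

```python
from typing import NamedTuple, Tuple


class BBox(NamedTuple):
    """[x, y, width, height], measured from the top-left corner of the page."""
    x: float
    y: float
    width: float
    height: float

    @property
    def center(self) -> Tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def in_top_right(box: BBox, page_width: float, page_height: float) -> bool:
    """True when the center of the box lies in the top-right quadrant.

    Assumption: with a top-left origin, y grows downward, so "top" means small y.
    """
    cx, cy = box.center
    return cx >= page_width / 2 and cy < page_height / 2


# Illustrative coordinates on a 612 x 792 page (the values are made up).
boxes = {
    "ACME Corp": BBox(40, 30, 120, 14),
    "Invoice #12345": BBox(430, 30, 140, 14),
    "Total: $1,801.89": BBox(430, 600, 140, 14),
}

hits = [text for text, box in boxes.items() if in_top_right(box, 612, 792)]
print(hits)  # ['Invoice #12345']
```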

Content Parts — How Text is Stored

A node’s text content is not stored directly on the node. Instead, it’s stored as content parts — separate text segments that are assembled to produce the final content. This design supports:
  • Mixed formatting — parts can have different font styles
  • Efficient updates — change one word without rewriting the whole line
  • Compression — content parts are stored as zstd-compressed BLOBs
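
A minimal sketch of the idea, assuming parts are plain strings joined with single spaces. The `PartNode` class is hypothetical, and the compressed storage layer is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PartNode:
    """Hypothetical node whose text lives in separate content parts."""
    parts: List[str] = field(default_factory=list)

    @property
    def content(self) -> str:
        # Assumption: parts are joined with single spaces to form the content.
        return " ".join(self.parts)


line = PartNode(parts=["Widget", "A", "$1,234.00"])
print(line.content)  # Widget A $1,234.00

# Change one part; the other parts are left untouched.
line.parts[1] = "B"
print(line.content)  # Widget B $1,234.00
```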

Tag Groups — Handling Repeating Data

When a document contains repeating elements (line items, addresses), tags use indexing and group UUIDs to keep related items together. Consider an invoice with two line items: the tags on each line item share an index and a group UUID, which separates them from the tags on the other item. The extraction engine uses these groups to produce correctly structured data objects, so each line item becomes its own data object with the right attributes.
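
The two-line-item case can be sketched as follows. The `Tag` fields and the `group_tags` helper are hypothetical names chosen for illustration, not the Kodexa tag model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List
from uuid import uuid4


@dataclass(frozen=True)
class Tag:
    """Hypothetical tag record: what was tagged, and which group it belongs to."""
    path: str
    value: str
    index: int
    group_uuid: str


# One UUID per line item; every tag on that item carries it.
item_a, item_b = str(uuid4()), str(uuid4())
tags = [
    Tag("line_item/description", "Widget A", 0, item_a),
    Tag("line_item/amount", "$1,234.00", 0, item_a),
    Tag("line_item/description", "Widget B", 1, item_b),
    Tag("line_item/amount", "$567.89", 1, item_b),
]


def group_tags(tags: List[Tag]) -> List[Dict[str, str]]:
    """Collect tags that share a group UUID into one record per group."""
    groups: Dict[str, Dict[str, str]] = defaultdict(dict)
    for tag in tags:
        field_name = tag.path.split("/")[-1]
        groups[tag.group_uuid][field_name] = tag.value
    return list(groups.values())


line_items = group_tags(tags)
print(line_items[0])  # {'description': 'Widget A', 'amount': '$1,234.00'}
print(line_items[1])  # {'description': 'Widget B', 'amount': '$567.89'}
```

Without the group key, the four tags would be a flat list with no way to tell which amount belongs to which description.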

The Extraction Pipeline

The journey from raw content to structured data follows these steps:
1. Parse

The document model processes the raw file and creates the content node tree with text, spatial data, and structural relationships.

2. Tag

AI models analyze the content and apply tags to nodes, marking what each piece of content represents (e.g., “this line is an amount”, “this line is an invoice number”).

3. Group

Tags with indices are grouped together. All tags with index=0 for a given path form one group, index=1 forms another, etc.

4. Extract

The extraction engine reads the taxonomy (data definition), walks the tagged content, and builds structured data objects with typed attributes.

5. Validate

Formulas, validation rules, and business logic run against the extracted data. Exceptions are created for any failures.

6. Review

Users review the extracted data in the Kodexa UI, correct any errors, and approve the results. Changes are tracked via the delta/audit system.
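
The Group, Extract, and Validate steps can be sketched end to end on the sample invoice. Everything here is a simplified assumption: tagged content is reduced to (path, index, text) triples, and `parse_currency`, `extract`, and `validate` are hypothetical helpers rather than Kodexa functions.

```python
from decimal import Decimal
from typing import Dict, List, Optional, Tuple

# (path, index, text) triples standing in for tagged content nodes.
tagged: List[Tuple[str, int, str]] = [
    ("total", 0, "$1,801.89"),
    ("line_item/amount", 0, "$1,234.00"),
    ("line_item/amount", 1, "$567.89"),
]


def parse_currency(text: str) -> Decimal:
    """Turn '$1,234.00' into a typed Decimal value."""
    return Decimal(text.replace("$", "").replace(",", ""))


def extract(tagged: List[Tuple[str, int, str]]) -> dict:
    """Group line-item tags by index and build typed attributes."""
    total: Optional[Decimal] = None
    items: Dict[int, Dict[str, Decimal]] = {}
    for path, index, text in tagged:
        if path == "total":
            total = parse_currency(text)
        elif path.startswith("line_item/"):
            attribute = path.split("/", 1)[1]
            items.setdefault(index, {})[attribute] = parse_currency(text)
    return {"total": total, "line_items": [items[i] for i in sorted(items)]}


def validate(data: dict) -> List[str]:
    """Return one exception message per failed rule (empty when all pass)."""
    exceptions = []
    line_sum = sum((item["amount"] for item in data["line_items"]), Decimal("0"))
    if line_sum != data["total"]:
        exceptions.append(f"line items sum to {line_sum}, but total is {data['total']}")
    return exceptions


data = extract(tagged)
print(data["total"])   # 1801.89
print(validate(data))  # []
```

Using `Decimal` rather than `float` keeps the sum check exact, which matters for a rule that compares currency totals for equality.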

Data Definitions (Taxonomies)

A data definition (taxonomy) is the schema that tells the extraction engine what to look for and how to structure the output. It defines:
  • Taxons — The fields to extract (e.g., invoice_number, line_item/amount)
  • Data types — String, decimal, boolean, date, currency
  • Hierarchy — Parent-child relationships between fields
  • Group behavior — Whether a field repeats (like line items) or is singular
# Example: Invoice data definition
name: invoice
taxons:
  - name: invoice_number
    type: string
  - name: invoice_date
    type: date
  - name: total
    type: currency
  - name: line_item
    group: true
    children:
      - name: description
        type: string
      - name: quantity
        type: integer
      - name: amount
        type: currency
The data definition drives both the tagging models (which tags to apply) and the extraction engine (how to build data objects from those tags).
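
To show how the hierarchy yields paths such as line_item/amount, the sketch below mirrors the example definition as a plain Python dict and flattens it. The `taxon_paths` helper is hypothetical, and the dict is written by hand rather than loaded from YAML.

```python
from typing import Dict, Iterator, List, Tuple

# The example definition above, mirrored as a plain dict for illustration.
definition = {
    "name": "invoice",
    "taxons": [
        {"name": "invoice_number", "type": "string"},
        {"name": "invoice_date", "type": "date"},
        {"name": "total", "type": "currency"},
        {
            "name": "line_item",
            "group": True,
            "children": [
                {"name": "description", "type": "string"},
                {"name": "quantity", "type": "integer"},
                {"name": "amount", "type": "currency"},
            ],
        },
    ],
}


def taxon_paths(taxons: List[dict], prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, type) for every typed taxon, joining parents with '/'."""
    for taxon in taxons:
        path = f"{prefix}/{taxon['name']}" if prefix else taxon["name"]
        if "type" in taxon:
            yield path, taxon["type"]
        yield from taxon_paths(taxon.get("children", []), path)


paths: Dict[str, str] = dict(taxon_paths(definition["taxons"]))
print(paths["line_item/amount"])  # currency
print(sorted(p for p in paths if p.startswith("line_item/")))
# ['line_item/amount', 'line_item/description', 'line_item/quantity']
```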

Learn More About Data Definitions

See the full guide on building data definitions for your document types.

Working with Documents Programmatically

from kodexa_document import Document

# Load a KDDB file
with Document.from_kddb("invoice.kddb") as doc:
    # Navigate the content tree
    root = doc.content_node
    pages = root.get_children()

    # Query with selectors
    amounts = doc.select("//*[hasTag('amount')]")
    for node in amounts:
        print(f"Amount: {node.content}")

    # Access extracted data
    for data_obj in doc.get_data_objects():
        print(f"Path: {data_obj.path}")
        for attr in data_obj.get_attributes():
            print(f"  {attr.name}: {attr.value}")