From PDF to Content Tree
When a document is processed, the AI model reads the raw content and builds a content node tree that preserves the document’s structure. Here’s how a simple invoice maps to a content tree:
Original Document
ACME Corp
Invoice #12345
| Item | Amount |
|---|---|
| Widget A | $1,234.00 |
| Widget B | $567.89 |
Total: $1,801.89
Payment due in 30 days
Content Tree
document
  page (index: 0)
    content-area
      line: “ACME Corp”
      line: “Invoice #12345”
    content-area
      line: “Item Amount”
      line: “Widget A $1,234.00”
      line: “Widget B $567.89”
    content-area
      line: “Total: $1,801.89”
      line: “Payment due in 30 days”
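The tree above can be modeled with a small recursive structure. The following is a minimal sketch for illustration only: `ContentNode`, `outline`, and `build_invoice_tree` are hypothetical names chosen here and are not the Kodexa SDK's API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentNode:
    """Hypothetical stand-in for a content node (not the Kodexa SDK class)."""
    type: str
    content: Optional[str] = None
    index: int = 0
    children: List["ContentNode"] = field(default_factory=list)

    def add(self, child: "ContentNode") -> "ContentNode":
        # Record the child's position among its siblings, then attach it.
        child.index = len(self.children)
        self.children.append(child)
        return child


def outline(node: ContentNode, depth: int = 0) -> List[str]:
    """Render the tree as indented lines, one line per node."""
    label = node.type if node.content is None else f'{node.type}: "{node.content}"'
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


def build_invoice_tree() -> ContentNode:
    """Build the document > page > content-area > line tree for the sample invoice."""
    document = ContentNode("document")
    page = document.add(ContentNode("page"))
    areas = [
        ["ACME Corp", "Invoice #12345"],
        ["Item Amount", "Widget A $1,234.00", "Widget B $567.89"],
        ["Total: $1,801.89", "Payment due in 30 days"],
    ]
    for texts in areas:
        area = page.add(ContentNode("content-area"))
        for text in texts:
            area.add(ContentNode("line", content=text))
    return document


tree = build_invoice_tree()
print("\n".join(outline(tree)))
```

Keeping each node's `index` relative to its siblings is one simple way to preserve reading order when the tree is walked later.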
Node Types
The type field on each content node describes what kind of content it represents. Common types include:
| Type | Description | Typical Children |
|---|---|---|
| document | Root node (always exactly one) | page |
| page | A single page of the document | content-area |
| content-area | A region of content on a page | line |
| line | Single line of text | word (optional) |
| word | Individual word | none |
Node types are not fixed — models can create any type they need. The types above are conventions used by Kodexa’s built-in document processing models.
Spatial Data — Bounding Boxes
Every content node can carry a bounding box that describes its physical position on the page. This is critical for document understanding, enabling the UI to highlight content and allowing models to reason about spatial relationships. A bounding box has four values, [x, y, width, height], measured from the top-left corner of the page. This allows:
- Visual highlighting in the document viewer
- Spatial queries like “find all nodes in the top-right quadrant”
- Layout reconstruction from OCR’d content
- Confidence visualization by overlaying tag colors on the source document
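As an illustration of a spatial query, the sketch below tests whether a box's center falls in the top-right quadrant of a page. `BBox` and `in_top_right_quadrant` are hypothetical helpers, and the page size and coordinates are invented sample values; the units are assumed to match whatever the page dimensions use.

```python
from typing import NamedTuple, Tuple


class BBox(NamedTuple):
    """[x, y, width, height], measured from the top-left corner of the page."""
    x: float
    y: float
    width: float
    height: float

    def center(self) -> Tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def in_top_right_quadrant(box: BBox, page_width: float, page_height: float) -> bool:
    # With a top-left origin, "top" means a small y and "right" means a large x.
    cx, cy = box.center()
    return cx >= page_width / 2 and cy < page_height / 2


# Invented sample layout for the invoice lines (illustrative values only).
page_width, page_height = 612.0, 792.0
boxes = {
    "ACME Corp": BBox(72, 40, 120, 14),
    "Invoice #12345": BBox(400, 40, 140, 14),
    "Total: $1,801.89": BBox(400, 600, 140, 14),
}

top_right = [
    text for text, box in boxes.items()
    if in_top_right_quadrant(box, page_width, page_height)
]
print(top_right)
```

Testing the center point rather than the full box is a design choice: it gives every node exactly one quadrant, even when a box straddles a boundary.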
Content Parts — How Text is Stored
A node’s text content is not stored directly on the node. Instead, it’s stored as content parts: separate text segments that are assembled to produce the final content. This design supports:
- Mixed formatting: parts can have different font styles
- Efficient updates: change one word without rewriting the whole line
- Compression: content parts are stored as zstd-compressed BLOBs
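A rough sketch of the idea, assuming parts are joined with single spaces (the real assembly rule is not specified here). `ContentPart` and `assemble` are hypothetical names, and compression is left out because it belongs to the storage layer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContentPart:
    """Hypothetical text segment with its own formatting."""
    text: str
    font_style: str = "regular"


def assemble(parts: List[ContentPart]) -> str:
    # Assumption: parts are joined with single spaces to form the node's content.
    return " ".join(part.text for part in parts)


parts = [ContentPart("Total:", font_style="bold"), ContentPart("$1,801.89")]
before = assemble(parts)

# Efficient update: replace one part without touching the others.
parts[1] = ContentPart("$1,810.89")
after = assemble(parts)

print(before)
print(after)
```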
Tag Groups — Handling Repeating Data
When a document contains repeating elements (line items, addresses), tags use indexing and group UUIDs to keep related items together. Consider an invoice with two line items: the tags on each line item share an index and a group UUID, which keeps that item's name and amount together and separate from the other item's. The extraction engine uses these groups to produce correctly structured data objects, so each line item becomes its own data object with the right attributes.
The Extraction Pipeline
The journey from raw content to structured data follows these steps:
Parse
The document model processes the raw file and creates the content node tree with text, spatial data, and structural relationships.
Tag
AI models analyze the content and apply tags to nodes, marking what each piece of content represents (e.g., “this line is an amount”, “this line is an invoice number”).
Group
Tags with indices are grouped together. All tags with index=0 for a given path form one group, index=1 forms another, and so on.
Extract
The extraction engine reads the taxonomy (data definition), walks the tagged content, and builds structured data objects with typed attributes.
Validate
Formulas, validation rules, and business logic run against the extracted data. Exceptions are created for any failures.
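The Group, Extract, and Validate steps above can be sketched on the sample invoice. This is a simplified illustration under stated assumptions: `Tag`, `group_tags`, `extract_line_items`, and `validate` are hypothetical names, groups are keyed by parent path and index only (group UUIDs are omitted), and the single validation rule is an example rather than a built-in one.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, List, Tuple


@dataclass
class Tag:
    """Hypothetical tag applied to a content node."""
    path: str   # taxonomy path, e.g. "line_item/amount"
    value: str  # text of the tagged node
    index: int  # which repeat of the group this tag belongs to


def group_tags(tags: List[Tag]) -> Dict[Tuple[str, int], Dict[str, str]]:
    """Group step: tags sharing a parent path and index form one group."""
    groups: Dict[Tuple[str, int], Dict[str, str]] = defaultdict(dict)
    for tag in tags:
        parent, _, name = tag.path.rpartition("/")
        groups[(parent, tag.index)][name] = tag.value
    return dict(groups)


def to_decimal(text: str) -> Decimal:
    return Decimal(text.replace("$", "").replace(",", ""))


def extract_line_items(groups: Dict[Tuple[str, int], Dict[str, str]]) -> List[dict]:
    """Extract step: each line_item group becomes one typed data object."""
    items = []
    for parent, index in sorted(groups):
        if parent == "line_item":
            fields = groups[(parent, index)]
            items.append({"name": fields["name"], "amount": to_decimal(fields["amount"])})
    return items


def validate(items: List[dict], total: Decimal) -> List[str]:
    """Validate step: return one exception message per failed rule."""
    exceptions = []
    line_sum = sum((item["amount"] for item in items), Decimal("0"))
    if line_sum != total:
        exceptions.append(f"Line items sum to {line_sum}, but total is {total}")
    return exceptions


tags = [
    Tag("line_item/name", "Widget A", 0),
    Tag("line_item/amount", "$1,234.00", 0),
    Tag("line_item/name", "Widget B", 1),
    Tag("line_item/amount", "$567.89", 1),
    Tag("total", "$1,801.89", 0),
]

groups = group_tags(tags)
items = extract_line_items(groups)
exceptions = validate(items, to_decimal(groups[("", 0)]["total"]))
print(items)
print(exceptions)
```

Using `Decimal` instead of `float` keeps the sum check exact for currency values.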
Data Definitions (Taxonomies)
A data definition (taxonomy) is the schema that tells the extraction engine what to look for and how to structure the output. It defines:
- Taxons: The fields to extract (e.g., invoice_number, line_item/amount)
- Data types: String, decimal, boolean, date, currency
- Hierarchy: Parent-child relationships between fields
- Group behavior: Whether a field repeats (like line items) or is singular
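To make the four aspects concrete, here is a minimal sketch of a taxonomy as nested Python objects. The `Taxon` class and its field names are assumptions for illustration, not the Kodexa data definition format, and `line_item/name` is an invented field added alongside the two paths mentioned above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Taxon:
    """Hypothetical taxon: one field in a data definition."""
    name: str
    data_type: str = "string"  # string, decimal, boolean, date, or currency
    group: bool = False        # True if the field repeats (like line items)
    children: List["Taxon"] = field(default_factory=list)


def leaf_paths(taxon: Taxon, prefix: str = "") -> List[str]:
    """List the slash-separated paths of all extractable (leaf) fields."""
    path = f"{prefix}/{taxon.name}" if prefix else taxon.name
    if not taxon.children:
        return [path]
    result = []
    for child in taxon.children:
        result.extend(leaf_paths(child, path))
    return result


taxonomy = [
    Taxon("invoice_number"),
    Taxon("line_item", group=True, children=[
        Taxon("name"),  # invented field for illustration
        Taxon("amount", data_type="currency"),
    ]),
]

all_paths = [path for taxon in taxonomy for path in leaf_paths(taxon)]
print(all_paths)
```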
Learn More About Data Definitions
See the full guide on building data definitions for your document types.
Working with Documents Programmatically
- Python
- TypeScript
- CLI
