# Document Structure Deep Dive

> Deep dive into Kodexa Document structure: content nodes, spatial data, page mapping, and how the document tree represents real-world documents in KDDB.

This page takes a deeper look at how a Kodexa Document represents a real-world document internally. If you haven't read the [overview](/guides/kodexa-document/index), start there first.

## From PDF to Content Tree

When a document is processed, the AI model reads the raw content and builds a **content node tree** that preserves the document's structure. Here's how a simple invoice maps to a content tree:

**Original Document**

> ACME Corp
>
> Invoice #12345
>
> | Item     | Amount     |
> | -------- | ---------- |
> | Widget A | \$1,234.00 |
> | Widget B | \$567.89   |
>
> Total: \$1,801.89
>
> Payment due in 30 days

**Content Tree**

* `document`
  * `page` (index: 0)
    * `content-area`
      * `line`: "ACME Corp"
      * `line`: "Invoice #12345"
    * `content-area`
      * `line`: "Item Amount"
      * `line`: "Widget A \$1,234.00"
      * `line`: "Widget B \$567.89"
    * `content-area`
      * `line`: "Total: \$1,801.89"
      * `line`: "Payment due in 30 days"
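To make the nesting concrete, the tree above can be mimicked in a few lines of plain Python. This is only an illustrative sketch: `Node`, `walk`, and `line` are hypothetical names and are not the Kodexa API, which is shown at the end of this page.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a content node (not the Kodexa class)."""
    type: str
    content: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) pairs depth-first, parents before children."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


def line(text: str) -> Node:
    """Convenience constructor for a `line` node."""
    return Node("line", text)


# The invoice example: one page holding three content areas.
invoice = Node("document", children=[
    Node("page", children=[
        Node("content-area", children=[line("ACME Corp"), line("Invoice #12345")]),
        Node("content-area", children=[
            line("Item Amount"),
            line("Widget A $1,234.00"),
            line("Widget B $567.89"),
        ]),
        Node("content-area", children=[
            line("Total: $1,801.89"),
            line("Payment due in 30 days"),
        ]),
    ]),
])

# Collect the text of every line node in document order.
lines = [n.content for _, n in walk(invoice) if n.type == "line"]

# Render an indented outline, two spaces per level.
outline = "\n".join(
    "  " * depth + n.type + (f': "{n.content}"' if n.content else "")
    for depth, n in walk(invoice)
)
print(outline)
```

Because the traversal visits parents before children, the collected lines come back in reading order, which is the property the tree structure is meant to preserve.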

## Node Types

The `type` field on each content node describes what kind of content it represents. Common types include:

| Type           | Description                    | Typical Children  |
| -------------- | ------------------------------ | ----------------- |
| `document`     | Root node (always exactly one) | `page`            |
| `page`         | A single page of the document  | `content-area`    |
| `content-area` | A region of content on a page  | `line`            |
| `line`         | Single line of text            | `word` (optional) |
| `word`         | Individual word                | (none)            |

Node types are not fixed: models can create any type they need. The types above are conventions used by Kodexa's built-in document processing models.

## Spatial Data: Bounding Boxes

Every content node can carry a **bounding box** that describes its physical position on the page. This is critical for document understanding, enabling the UI to highlight content and allowing models to reason about spatial relationships.

```mermaid
graph TB
    subgraph page["Page Layout (coordinates in inches)"]
        direction TB
        CA1["content-area<br/>bbox: [0.5, 0.5, 7.5, 1.2]"]
        CA2["content-area<br/>bbox: [0.5, 1.8, 7.5, 4.0]"]
        CA3["content-area<br/>bbox: [0.5, 4.5, 7.5, 5.2]"]
    end
```

A bounding box has four values, `[x, y, width, height]`, measured from the top-left corner of the page. This allows:

* **Visual highlighting** in the document viewer
* **Spatial queries** like "find all nodes in the top-right quadrant"
* **Layout reconstruction** from OCR'd content
* **Confidence visualization** by overlaying tag colors on the source document

## Content Parts: How Text is Stored

A node's text content is not stored directly on the node. Instead, it is stored as **content parts**: separate text segments that are assembled to produce the final content.
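As a rough illustration of how parts assemble into content, here is a short, self-contained Python sketch. The `LineNode` class and its `parts` list are hypothetical names chosen for this example; they are not the KDDB storage format or the Kodexa API, and the library may join parts differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineNode:
    """Hypothetical node that keeps its text as an ordered list of parts."""
    parts: List[str] = field(default_factory=list)

    @property
    def content(self) -> str:
        # Assumption: parts are concatenated in order, with no separator.
        return "".join(self.parts)


node = LineNode(parts=["Total: ", "$1,801.89"])
original = node.content

# Replace only the second part; the first part is left untouched.
node.parts[1] = "$1,900.00"
updated = node.content

print(original)
print(updated)
```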
```mermaid
graph LR
    NODE["line node"] --> CP1["part 0: 'Total: '"]
    NODE --> CP2["part 1: '$1,801.89'"]
    CP1 & CP2 -->|assembled| CONTENT["content: 'Total: $1,801.89'"]
```

This design supports:

* **Mixed formatting**: parts can have different font styles
* **Efficient updates**: change one word without rewriting the whole line
* **Compression**: content parts are stored as zstd-compressed BLOBs

## Tag Groups: Handling Repeating Data

When a document contains repeating elements (line items, addresses), tags use **indexing** and **group UUIDs** to keep related items together.

Consider an invoice with two line items:

```mermaid
graph TB
    subgraph g0["Group: line_item (index 0)"]
        N1["'Widget A' → tag: line_item/description"]
        N2["'$1,234.00' → tag: line_item/amount"]
    end
    subgraph g1["Group: line_item (index 1)"]
        N3["'Widget B' → tag: line_item/description"]
        N4["'$567.89' → tag: line_item/amount"]
    end
    subgraph result["Extraction Result"]
        DO0["/invoice/line_item[0]<br/>description: Widget A<br/>amount: 1234.00"]
        DO1["/invoice/line_item[1]<br/>description: Widget B<br/>amount: 567.89"]
    end
    g0 -.->|extract| DO0
    g1 -.->|extract| DO1
```

The extraction engine uses these groups to produce correctly structured data objects: each line item becomes its own data object with the right attributes.

## The Extraction Pipeline

The journey from raw content to structured data follows these steps:

```mermaid
graph LR
    A["1. Parse<br/>Build content tree"] --> B["2. Tag<br/>AI annotates nodes"]
    B --> C["3. Group<br/>Cluster tags by index"]
    C --> D["4. Extract<br/>Build data objects from tags"]
    D --> E["5. Validate<br/>Run formulas & rules"]
    E --> F["6. Review<br/>Human correction"]
```

1. **Parse.** The document model processes the raw file and creates the content node tree with text, spatial data, and structural relationships.
2. **Tag.** AI models analyze the content and apply tags to nodes, marking what each piece of content represents (e.g., "this line is an amount", "this line is an invoice number").
3. **Group.** Tags with indices are grouped together. All tags with `index=0` for a given path form one group, `index=1` forms another, etc.
4. **Extract.** The extraction engine reads the taxonomy (data definition), walks the tagged content, and builds structured data objects with typed attributes.
5. **Validate.** Formulas, validation rules, and business logic run against the extracted data. Exceptions are created for any failures.
6. **Review.** Users review the extracted data in the Kodexa UI, correct any errors, and approve the results. Changes are tracked via the delta/audit system.

## Data Definitions (Taxonomies)

A **data definition** (taxonomy) is the schema that tells the extraction engine what to look for and how to structure the output. It defines:

* **Taxons**: the fields to extract (e.g., `invoice_number`, `line_item/amount`)
* **Data types**: string, decimal, boolean, date, currency
* **Hierarchy**: parent-child relationships between fields
* **Group behavior**: whether a field repeats (like line items) or is singular

```yaml
# Example: Invoice data definition
name: invoice
taxons:
  - name: invoice_number
    type: string
  - name: invoice_date
    type: date
  - name: total
    type: currency
  - name: line_item
    group: true
    children:
      - name: description
        type: string
      - name: quantity
        type: integer
      - name: amount
        type: currency
```

The data definition drives both the tagging models (which tags to apply) and the extraction engine (how to build data objects from those tags).

See the full guide on building data definitions for your document types.
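To connect the data definition back to the tag groups described earlier, the grouping and extraction steps can be sketched in plain Python. This is a simplified illustration, not the Kodexa extraction engine: the `Tag` record and the `group_tags` helper are hypothetical, values stay as raw strings rather than typed attributes, and only one repeating group is handled.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Tag(NamedTuple):
    """Hypothetical flat record describing one tagged node."""
    path: str   # tag path, e.g. "line_item/amount"
    index: int  # which repetition of the group the tag belongs to
    value: str  # text of the tagged node


def group_tags(tags: List[Tag], group: str) -> List[Dict[str, str]]:
    """Cluster the tags under `group` by index, one dict per index."""
    prefix = group + "/"
    by_index: Dict[int, Dict[str, str]] = defaultdict(dict)
    for tag in tags:
        if tag.path.startswith(prefix):
            # Strip the group prefix so "line_item/amount" becomes "amount".
            by_index[tag.index][tag.path[len(prefix):]] = tag.value
    return [by_index[i] for i in sorted(by_index)]


tags = [
    Tag("line_item/description", 0, "Widget A"),
    Tag("line_item/amount", 0, "$1,234.00"),
    Tag("line_item/description", 1, "Widget B"),
    Tag("line_item/amount", 1, "$567.89"),
    Tag("invoice_number", 0, "12345"),  # outside the group, so skipped here
]

line_items = group_tags(tags, "line_item")
for i, item in enumerate(line_items):
    print(f"/invoice/line_item[{i}] {item}")
```

Keying on the index is what keeps "Widget A" paired with its own amount instead of with the amount from the next row.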
## Working with Documents Programmatically

```python
from kodexa_document import Document

# Load a KDDB file
with Document.from_kddb("invoice.kddb") as doc:
    # Navigate the content tree
    root = doc.content_node
    pages = root.get_children()

    # Query with selectors
    amounts = doc.select("//*[hasTag('amount')]")
    for node in amounts:
        print(f"Amount: {node.content}")

    # Access extracted data
    for data_obj in doc.get_data_objects():
        print(f"Path: {data_obj.path}")
        for attr in data_obj.get_attributes():
            print(f"  {attr.name}: {attr.value}")
```

```typescript
import { Kodexa } from '@kodexa-ai/document-wasm-ts';

await Kodexa.init();

// Load a KDDB file
// (kddbBytes is assumed to already hold the KDDB file contents)
const doc = await Kodexa.fromBlob(kddbBytes);

// Navigate the content tree
const root = await doc.getRoot();
const pages = await root.getChildren();

// Query with selectors
const amounts = await doc.select("//*[hasTag('amount')]");
for (const node of amounts) {
  console.log(`Amount: ${await node.getContent()}`);
}
```

```bash
# View document structure
kdx document structure invoice.kddb

# View extracted data
kdx document data invoice.kddb

# Query content nodes
kdx document query invoice.kddb "//*[hasTag('amount')]"

# View tags
kdx document tags invoice.kddb
```