# Document Structure Deep Dive

> Deep dive into Kodexa Document structure: content nodes, spatial data, page mapping, and how the document tree represents real-world documents in KDDB.

This page takes a deeper look at how a Kodexa Document represents a real-world document internally. If you haven't read the [overview](/guides/kodexa-document/index), start there.

## From PDF to Content Tree

When a document is processed, the AI model reads the raw content and builds a **content node tree** that preserves the document's structure.

Here's how a simple invoice maps to a content tree:

<div style={{ display: 'grid', gridTemplateColumns: '1fr 1fr', gap: '1.5rem', alignItems: 'start' }}>
  <div>
    <p style={{ fontWeight: 600, marginBottom: '0.5rem' }}>Original Document</p>

    <div style={{ border: '1px solid var(--border)', borderRadius: '0.5rem', padding: '1.5rem', fontSize: '0.875rem', lineHeight: '1.6' }}>
      <div style={{ fontWeight: 700, fontSize: '1rem' }}>ACME Corp</div>
      <div style={{ color: 'var(--muted-foreground)' }}>Invoice #12345</div>

      <div style={{ margin: '1rem 0' }}>
        <table style={{ width: '100%', borderCollapse: 'collapse', fontSize: '0.875rem' }}>
          <thead>
            <tr style={{ borderBottom: '2px solid var(--border)' }}>
              <th style={{ textAlign: 'left', padding: '0.375rem 0.75rem' }}>Item</th>
              <th style={{ textAlign: 'right', padding: '0.375rem 0.75rem' }}>Amount</th>
            </tr>
          </thead>

          <tbody>
            <tr style={{ borderBottom: '1px solid var(--border)' }}>
              <td style={{ padding: '0.375rem 0.75rem' }}>Widget A</td>
              <td style={{ textAlign: 'right', padding: '0.375rem 0.75rem' }}>\$1,234.00</td>
            </tr>

            <tr>
              <td style={{ padding: '0.375rem 0.75rem' }}>Widget B</td>
              <td style={{ textAlign: 'right', padding: '0.375rem 0.75rem' }}>\$567.89</td>
            </tr>
          </tbody>
        </table>
      </div>

      <div style={{ fontWeight: 600 }}>Total: \$1,801.89</div>
      <div style={{ color: 'var(--muted-foreground)' }}>Payment due in 30 days</div>
    </div>
  </div>

  <div>
    <p style={{ fontWeight: 600, marginBottom: '0.5rem' }}>Content Tree</p>

    <div style={{ border: '1px solid var(--border)', borderRadius: '0.5rem', padding: '1rem 1.25rem', fontSize: '0.8125rem', lineHeight: '1.75' }}>
      <div style={{ fontWeight: 600, color: 'var(--primary)' }}>document</div>

      <div style={{ paddingLeft: '1.25rem', borderLeft: '1px solid var(--border)', marginLeft: '0.5rem' }}>
        <div style={{ fontWeight: 600, color: 'var(--primary)' }}>page <span style={{ fontWeight: 400, color: 'var(--muted-foreground)' }}>(index: 0)</span></div>

        <div style={{ paddingLeft: '1.25rem', borderLeft: '1px solid var(--border)', marginLeft: '0.5rem' }}>
          <div style={{ fontWeight: 600, color: 'var(--primary)' }}>content-area</div>

          <div style={{ paddingLeft: '1.25rem', borderLeft: '1px solid var(--border)', marginLeft: '0.5rem' }}>
            <div>line: <span style={{ color: '#10b981' }}>"ACME Corp"</span></div>
            <div>line: <span style={{ color: '#10b981' }}>"Invoice #12345"</span></div>
          </div>

          <div style={{ fontWeight: 600, color: 'var(--primary)', marginTop: '0.25rem' }}>content-area</div>

          <div style={{ paddingLeft: '1.25rem', borderLeft: '1px solid var(--border)', marginLeft: '0.5rem' }}>
            <div>line: <span style={{ color: '#10b981' }}>"Item  Amount"</span></div>
            <div>line: <span style={{ color: '#10b981' }}>"Widget A  \$1,234.00"</span></div>
            <div>line: <span style={{ color: '#10b981' }}>"Widget B  \$567.89"</span></div>
          </div>

          <div style={{ fontWeight: 600, color: 'var(--primary)', marginTop: '0.25rem' }}>content-area</div>

          <div style={{ paddingLeft: '1.25rem', borderLeft: '1px solid var(--border)', marginLeft: '0.5rem' }}>
            <div>line: <span style={{ color: '#10b981' }}>"Total: \$1,801.89"</span></div>
            <div>line: <span style={{ color: '#10b981' }}>"Payment due in 30 days"</span></div>
          </div>
        </div>
      </div>
    </div>
  </div>
</div>
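
To make the shape of the tree concrete, here is a minimal sketch in plain Python. It does not use the Kodexa SDK: the `Node` class and the `area` and `outline` helpers are hypothetical stand-ins invented for illustration, and only the node types and text come from the figure above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a content node: a type, optional text, children."""
    type: str
    content: Optional[str] = None
    children: list["Node"] = field(default_factory=list)


def area(*lines: str) -> Node:
    """Build a content-area whose children are line nodes."""
    return Node("content-area", children=[Node("line", text) for text in lines])


# The invoice from the figure: document -> page -> content-area -> line
tree = Node("document", children=[
    Node("page", children=[
        area("ACME Corp", "Invoice #12345"),
        area("Item  Amount", "Widget A  $1,234.00", "Widget B  $567.89"),
        area("Total: $1,801.89", "Payment due in 30 days"),
    ]),
])


def outline(node: Node, depth: int = 0) -> list[str]:
    """Render the tree depth-first as an indented outline, one row per node."""
    label = node.type if node.content is None else f'{node.type}: "{node.content}"'
    rows = ["  " * depth + label]
    for child in node.children:
        rows.extend(outline(child, depth + 1))
    return rows


print("\n".join(outline(tree)))
```

The printed outline should match the Content Tree panel: one `document`, one `page`, three `content-area` nodes, and seven `line` nodes.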

## Node Types

The `type` field on each content node describes what kind of content it represents. Common types include:

| Type           | Description                    | Typical Children  |
| -------------- | ------------------------------ | ----------------- |
| `document`     | Root node (always exactly one) | `page`            |
| `page`         | A single page of the document  | `content-area`    |
| `content-area` | A region of content on a page  | `line`            |
| `line`         | Single line of text            | `word` (optional) |
| `word`         | Individual word                | —                 |

<Info>
  Node types are not fixed — models can create any type they need. The types above are conventions used by Kodexa's built-in document processing models.
</Info>

## Spatial Data — Bounding Boxes

Every content node can carry a **bounding box** that describes its physical position on the page. This is critical for document understanding, enabling the UI to highlight content and allowing models to reason about spatial relationships.

```mermaid theme={null}
graph TB
    subgraph page["Page Layout (coordinates in inches)"]
        direction TB
        CA1["content-area
        bbox: [0.5, 0.5, 7.5, 1.2]"]
        CA2["content-area
        bbox: [0.5, 1.8, 7.5, 4.0]"]
        CA3["content-area
        bbox: [0.5, 4.5, 7.5, 5.2]"]
    end
```

A bounding box has four values: `[x, y, width, height]` measured from the top-left corner of the page. This allows:

* **Visual highlighting** in the document viewer
* **Spatial queries** like "find all nodes in the top-right quadrant"
* **Layout reconstruction** from OCR'd content
* **Confidence visualization** by overlaying tag colors on the source document
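
As an illustration of a spatial query, the sketch below finds boxes whose center falls in the top-right quadrant of a page. It takes the `[x, y, width, height]` layout and top-left origin described above at face value. The `BBox` class, the helper function, the page size, and the sample boxes are all hypothetical and are not part of the Kodexa API.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """[x, y, width, height], measured from the top-left corner of the page."""
    x: float
    y: float
    width: float
    height: float

    @property
    def center(self) -> tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def in_top_right_quadrant(box: BBox, page_width: float, page_height: float) -> bool:
    """True when the box's center lies in the top-right quarter of the page.

    With a top-left origin, "top" means a small y and "right" means a large x.
    """
    cx, cy = box.center
    return cx >= page_width / 2 and cy <= page_height / 2


# Hypothetical boxes on a US Letter page (8.5 x 11 inches)
PAGE_W, PAGE_H = 8.5, 11.0
boxes = {
    "logo": BBox(0.5, 0.5, 2.0, 0.7),
    "invoice number": BBox(5.5, 0.5, 2.5, 0.4),
    "total": BBox(5.5, 9.0, 2.5, 0.4),
}

hits = [name for name, box in boxes.items() if in_top_right_quadrant(box, PAGE_W, PAGE_H)]
print(hits)  # ['invoice number']
```

Testing the center point is only one possible rule. A stricter query could require the whole box to sit inside the quadrant.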

## Content Parts — How Text is Stored

A node's text content is not stored directly on the node. Instead, it's stored as **content parts** — separate text segments that are assembled to produce the final content.

```mermaid theme={null}
graph LR
    NODE["line node"] --> CP1["part 0: 'Total: '"]
    NODE --> CP2["part 1: '$1,801.89'"]
    CP1 & CP2 -->|assembled| CONTENT["content: 'Total: $1,801.89'"]
```

This design supports:

* **Mixed formatting** — parts can have different font styles
* **Efficient updates** — change one word without rewriting the whole line
* **Compression** — content parts are stored as zstd-compressed BLOBs
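
The assembly step can be sketched in a few lines of plain Python. `LineNode` is a hypothetical class invented for this example, not a Kodexa type, and the sketch leaves out compression and per-part formatting entirely.

```python
from dataclasses import dataclass, field


@dataclass
class LineNode:
    """Hypothetical line node that keeps its text as an ordered list of parts."""
    parts: list[str] = field(default_factory=list)

    @property
    def content(self) -> str:
        # The visible content is the parts joined in order
        return "".join(self.parts)


line = LineNode(parts=["Total: ", "$1,801.89"])
print(line.content)  # Total: $1,801.89

# Correct the amount by replacing one part; the label part is untouched
line.parts[1] = "$1,810.89"
print(line.content)  # Total: $1,810.89
```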

## Tag Groups — Handling Repeating Data

When a document contains repeating elements (line items, addresses), tags use **indexing** and **group UUIDs** to keep related items together.

Consider an invoice with two line items:

```mermaid theme={null}
graph TB
    subgraph g0["Group: line_item (index 0)"]
        N1["'Widget A' → tag: line_item/description"]
        N2["'$1,234.00' → tag: line_item/amount"]
    end

    subgraph g1["Group: line_item (index 1)"]
        N3["'Widget B' → tag: line_item/description"]
        N4["'$567.89' → tag: line_item/amount"]
    end

    subgraph result["Extraction Result"]
        DO0["/invoice/line_item[0]
        description: Widget A
        amount: 1234.00"]
        DO1["/invoice/line_item[1]
        description: Widget B
        amount: 567.89"]
    end

    g0 -.->|extract| DO0
    g1 -.->|extract| DO1
```

The extraction engine uses these groups to produce correctly structured data objects — each line item becomes its own data object with the right attributes.
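
A rough sketch of that grouping logic follows, assuming each tag can be flattened to a path, an index, and the tagged text. The `Tag` tuple and both helper functions are hypothetical. The sketch clusters on the index alone and ignores group UUIDs, so it is a simplification rather than a description of the real engine.

```python
from collections import defaultdict
from decimal import Decimal
from typing import NamedTuple


class Tag(NamedTuple):
    """Hypothetical flattened tag: where it points, which group, and the text."""
    path: str    # e.g. "line_item/amount"
    index: int   # which repetition of the group this tag belongs to
    value: str   # content of the tagged node


tags = [
    Tag("line_item/description", 0, "Widget A"),
    Tag("line_item/amount", 0, "$1,234.00"),
    Tag("line_item/description", 1, "Widget B"),
    Tag("line_item/amount", 1, "$567.89"),
]


def parse_amount(text: str) -> Decimal:
    """Strip the currency symbol and thousands separators."""
    return Decimal(text.replace("$", "").replace(",", ""))


def build_data_objects(tags: list[Tag]) -> dict[str, dict]:
    """Cluster tags by (group, index) and emit one data object per cluster."""
    groups: dict[tuple[str, int], dict] = defaultdict(dict)
    for tag in tags:
        group, attribute = tag.path.split("/", 1)
        value = parse_amount(tag.value) if attribute == "amount" else tag.value
        groups[(group, tag.index)][attribute] = value
    return {
        f"/invoice/{group}[{index}]": groups[(group, index)]
        for group, index in sorted(groups)
    }


for path, attrs in build_data_objects(tags).items():
    print(path, attrs)
```

Each of the two printed entries corresponds to one box in the "Extraction Result" part of the diagram.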

## The Extraction Pipeline

The journey from raw content to structured data follows these steps:

```mermaid theme={null}
graph LR
    A["1. Parse
    Build content tree"] --> B["2. Tag
    AI annotates nodes"]
    B --> C["3. Group
    Cluster tags by index"]
    C --> D["4. Extract
    Build data objects from tags"]
    D --> E["5. Validate
    Run formulas & rules"]
    E --> F["6. Review
    Human correction"]
```

<Steps>
  <Step title="Parse">
    The document model processes the raw file and creates the content node tree with text, spatial data, and structural relationships.
  </Step>

  <Step title="Tag">
    AI models analyze the content and apply tags to nodes, marking what each piece of content represents (e.g., "this line is an amount", "this line is an invoice number").
  </Step>

  <Step title="Group">
    Tags that carry an index are grouped together: for a given path, all tags with `index=0` form one group, all tags with `index=1` form another, and so on.
  </Step>

  <Step title="Extract">
    The extraction engine reads the taxonomy (data definition), walks the tagged content, and builds structured data objects with typed attributes.
  </Step>

  <Step title="Validate">
    Formulas, validation rules, and business logic run against the extracted data. Exceptions are created for any failures.
  </Step>

  <Step title="Review">
    Users review the extracted data in the Kodexa UI, correct any errors, and approve the results. Changes are tracked via the delta/audit system.
  </Step>
</Steps>
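
To give a feel for the Validate step, here is one rule written as a plain Python sketch: the invoice total must equal the sum of the line item amounts. The dictionary layout and the `validate_total` function are assumptions made for illustration. Real Kodexa formulas and validation rules are configured differently and are not shown here.

```python
from decimal import Decimal

# Hypothetical extracted data, using the figures from the invoice example
invoice = {
    "total": Decimal("1801.89"),
    "line_items": [
        {"description": "Widget A", "amount": Decimal("1234.00")},
        {"description": "Widget B", "amount": Decimal("567.89")},
    ],
}


def validate_total(invoice: dict) -> list[str]:
    """Return one exception message per failed rule (empty list means valid)."""
    exceptions = []
    line_sum = sum((item["amount"] for item in invoice["line_items"]), Decimal("0"))
    if line_sum != invoice["total"]:
        exceptions.append(
            f"total {invoice['total']} does not match line item sum {line_sum}"
        )
    return exceptions


print(validate_total(invoice))  # [] because 1234.00 + 567.89 = 1801.89
```

`Decimal` is used instead of `float` so that currency arithmetic stays exact.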

## Data Definitions (Taxonomies)

A **data definition** (taxonomy) is the schema that tells the extraction engine what to look for and how to structure the output. It defines:

* **Taxons** — The fields to extract (e.g., `invoice_number`, `line_item/amount`)
* **Data types** — String, integer, decimal, boolean, date, currency
* **Hierarchy** — Parent-child relationships between fields
* **Group behavior** — Whether a field repeats (like line items) or is singular

```yaml theme={null}
# Example: Invoice data definition
name: invoice
taxons:
  - name: invoice_number
    type: string
  - name: invoice_date
    type: date
  - name: total
    type: currency
  - name: line_item
    group: true
    children:
      - name: description
        type: string
      - name: quantity
        type: integer
      - name: amount
        type: currency
```

The data definition drives both the tagging models (which tags to apply) and the extraction engine (how to build data objects from those tags).
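
One way to see how the hierarchy turns into tag paths such as `line_item/amount` is to walk the definition recursively. The sketch below mirrors the YAML example as a plain Python dict so that it needs no YAML parser. The `leaf_paths` helper is hypothetical and is not part of any Kodexa library.

```python
# Mirrors the YAML example above as a plain dict
definition = {
    "name": "invoice",
    "taxons": [
        {"name": "invoice_number", "type": "string"},
        {"name": "invoice_date", "type": "date"},
        {"name": "total", "type": "currency"},
        {
            "name": "line_item",
            "group": True,
            "children": [
                {"name": "description", "type": "string"},
                {"name": "quantity", "type": "integer"},
                {"name": "amount", "type": "currency"},
            ],
        },
    ],
}


def leaf_paths(taxons: list[dict], prefix: str = "") -> dict[str, str]:
    """Map each leaf taxon path (e.g. "line_item/amount") to its declared type."""
    paths = {}
    for taxon in taxons:
        path = f"{prefix}{taxon['name']}"
        if "children" in taxon:
            # Group taxons contribute a path prefix for their children
            paths.update(leaf_paths(taxon["children"], prefix=f"{path}/"))
        else:
            paths[path] = taxon["type"]
    return paths


for path, data_type in leaf_paths(definition["taxons"]).items():
    print(f"{path}: {data_type}")
```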

<Card title="Learn More About Data Definitions" icon="sitemap" href="/guides/data-definitions/index">
  See the full guide on building data definitions for your document types.
</Card>

## Working with Documents Programmatically

<Tabs>
  <Tab title="Python">
    ```python theme={null}
    from kodexa_document import Document

    # Load a KDDB file
    with Document.from_kddb("invoice.kddb") as doc:
        # Navigate the content tree
        root = doc.content_node
        pages = root.get_children()

        # Query with selectors
        amounts = doc.select("//*[hasTag('amount')]")
        for node in amounts:
            print(f"Amount: {node.content}")

        # Access extracted data
        for data_obj in doc.get_data_objects():
            print(f"Path: {data_obj.path}")
            for attr in data_obj.get_attributes():
                print(f"  {attr.name}: {attr.value}")
    ```
  </Tab>

  <Tab title="TypeScript">
    ```typescript theme={null}
    import { Kodexa } from '@kodexa-ai/document-wasm-ts';

    await Kodexa.init();

    // Load a KDDB file (kddbBytes is assumed to already hold the file's contents)
    const doc = await Kodexa.fromBlob(kddbBytes);

    // Navigate the content tree
    const root = await doc.getRoot();
    const pages = await root.getChildren();

    // Query with selectors
    const amounts = await doc.select("//*[hasTag('amount')]");
    for (const node of amounts) {
        console.log(`Amount: ${await node.getContent()}`);
    }
    ```
  </Tab>

  <Tab title="CLI">
    ```bash theme={null}
    # View document structure
    kdx document structure invoice.kddb

    # View extracted data
    kdx document data invoice.kddb

    # Query content nodes
    kdx document query invoice.kddb "//*[hasTag('amount')]"

    # View tags
    kdx document tags invoice.kddb
    ```
  </Tab>
</Tabs>
