# What is a Kodexa Document?

> Understand the KDDB document format at the heart of the Kodexa Platform, including content nodes, mixins, features, tags, and structured data layers.

Traditional document formats like PDF, DOCX, and images are built for humans to read, not for machines to process. When you need to extract structured data from invoices, annotate contracts with tags, or build hierarchical models from flat content, you need a format that software can query and update directly.

The **Kodexa Document** solves this. It's a rich, queryable document format that bridges the gap between raw content and structured data.

## The KDDB Format

Every Kodexa Document is stored as a **KDDB file** — a SQLite database with a well-defined schema of 40+ tables. Think of it as "a document that is also a database."

```mermaid theme={null}
graph LR
    PDF[PDF / DOCX / Image] -->|processed by| MODEL[AI Model]
    MODEL -->|creates| KDDB[".kddb file
    (SQLite database)"]
    KDDB -->|queried by| APP[Applications]
    KDDB -->|viewed in| UI[Kodexa UI]
    KDDB -->|exported as| JSON[JSON / CSV]
```

This approach gives you:

* **Rich querying** — XPath-like selectors to find any content
* **Transactional updates** — safe concurrent modifications with audit trails
* **Efficient storage** — zstd-compressed content for documents with thousands of pages
* **Cross-platform access** — the same document works in Python, TypeScript/WASM, and Go
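
Because a `.kddb` file is a SQLite database, a generic SQLite client should be able to inspect it. The following is a minimal sketch using Python's standard `sqlite3` module, assuming the file opens as a plain SQLite database with no extra extensions required. The `list_tables` helper is hypothetical and not part of any Kodexa SDK.

```python
import sqlite3
from pathlib import Path


def list_tables(kddb_path: str) -> list[str]:
    """Return the table names found in a SQLite file, opened read-only."""
    # Build a file: URI and request read-only mode so the document
    # cannot be modified by accident while it is being inspected.
    uri = Path(kddb_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()
```

Calling `list_tables` with the path to a document should return the schema's table names in alphabetical order. Direct SQL access like this is best treated as a read-only inspection aid; writes are safer through an SDK.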

## Three Layers of a Document

Every Kodexa Document has three conceptual layers that work together:

```mermaid theme={null}
graph TB
    subgraph doc["Kodexa Document (.kddb)"]
        direction TB
        META["<b>Metadata Layer</b>
        UUID, version, labels, source info"]

        subgraph content["Content Node Layer"]
            ROOT[document] --> P1[page 1]
            ROOT --> P2[page 2]
            P1 --> CA1[content-area]
            P1 --> CA2[content-area]
            CA1 --> L1[line]
            CA2 --> L2[line]
        end

        subgraph data["Data Object Layer"]
            INV["/invoice"] --> LI1["/line_item"]
            INV --> LI2["/line_item"]
            LI1 --> AMT["amount: $1,234"]
            LI1 --> DESC["description: Widget A"]
        end
    end

    style META fill:#3b82f6,color:#fff
    style content fill:#10b981,color:#fff
    style data fill:#f59e0b,color:#fff
```

| Layer             | Purpose                                                   | Think of it as...            |
| ----------------- | --------------------------------------------------------- | ---------------------------- |
| **Metadata**      | Document-level properties — UUID, version, labels, source | The document's "passport"    |
| **Content Nodes** | Hierarchical tree of the document's content               | The document's "DOM tree"    |
| **Data Objects**  | Structured data extracted from the content                | The document's "spreadsheet" |

These layers are independent but connected. Content nodes hold the raw text and spatial positions; data objects hold the semantic meaning extracted from that content.

## Content Nodes — The Document Tree

Content nodes form a tree that represents the document structure, much like an HTML DOM tree represents a web page.

<div style={{ border: '1px solid var(--border)', borderRadius: '0.5rem', padding: '1rem 1.25rem', fontSize: '0.8125rem', lineHeight: '1.75', maxWidth: '36rem' }}>
  <div style={{ fontWeight: 600, color: 'var(--primary)' }}>document</div>

  <div style={{ paddingLeft: '1.25rem', borderLeft: '1px solid var(--border)', marginLeft: '0.5rem' }}>
    <div style={{ fontWeight: 600, color: 'var(--primary)' }}>page <span style={{ fontWeight: 400, color: 'var(--muted-foreground)' }}>(page 1)</span></div>

    <div style={{ paddingLeft: '1.25rem', borderLeft: '1px solid var(--border)', marginLeft: '0.5rem' }}>
      <div style={{ fontWeight: 600, color: 'var(--primary)' }}>content-area</div>

      <div style={{ paddingLeft: '1.25rem', borderLeft: '1px solid var(--border)', marginLeft: '0.5rem' }}>
        <div>line: <span style={{ color: '#10b981' }}>"INVOICE #12345"</span></div>
        <div>line: <span style={{ color: '#10b981' }}>"ACME Corp"</span></div>
      </div>

      <div style={{ fontWeight: 600, color: 'var(--primary)', marginTop: '0.25rem' }}>content-area</div>

      <div style={{ paddingLeft: '1.25rem', borderLeft: '1px solid var(--border)', marginLeft: '0.5rem' }}>
        <div>line: <span style={{ color: '#10b981' }}>"Widget A  \$1,234.00"</span></div>

        <div style={{ paddingLeft: '1.25rem', borderLeft: '1px solid var(--border)', marginLeft: '0.5rem' }}>
          <span>word: <span style={{ color: '#10b981' }}>"Widget"</span></span><br />
          <span>word: <span style={{ color: '#10b981' }}>"A"</span></span><br />
          <span>word: <span style={{ color: '#10b981' }}>"\$1,234.00"</span></span>
        </div>

        <div>line: <span style={{ color: '#10b981' }}>"Widget B  \$567.89"</span></div>
      </div>
    </div>

    <div style={{ fontWeight: 600, color: 'var(--primary)', marginTop: '0.25rem' }}>page <span style={{ fontWeight: 400, color: 'var(--muted-foreground)' }}>(page 2)</span></div>
    <div style={{ paddingLeft: '1.25rem', borderLeft: '1px solid var(--border)', marginLeft: '0.5rem', color: 'var(--muted-foreground)' }}>...</div>
  </div>
</div>

Each content node carries:

| Property         | Description                                                      |
| ---------------- | ---------------------------------------------------------------- |
| **Type**         | What kind of node — `page`, `content-area`, `line`, `word`, etc. |
| **Content**      | The text content (computed from content parts)                   |
| **Features**     | Key-value metadata in `type:name` format                         |
| **Tags**         | Annotations that mark content for extraction                     |
| **Bounding Box** | Spatial position on the page (x, y, width, height)               |
| **Children**     | Child nodes in the tree                                          |

<Info>
  A node's content is never stored directly on the node itself. It's computed from the **content parts** table, which stores the actual text segments. This allows multi-part content (e.g., a line with mixed formatting) without duplication.
</Info>
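
The idea of content being derived from parts rather than stored on the node can be sketched with a small data class. This is an illustrative stand-in, not the SDK's node class: the field names and the choice to join parts with a single space are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ContentNode:
    """Illustrative stand-in for a content node (not the SDK class)."""

    node_type: str
    parts: list[str] = field(default_factory=list)        # text segments
    features: dict[str, object] = field(default_factory=dict)
    tags: list[str] = field(default_factory=list)
    children: list["ContentNode"] = field(default_factory=list)

    @property
    def content(self) -> str:
        # Content is computed from the parts on demand and never stored.
        # Joining with a space is an assumption made for this sketch.
        return " ".join(self.parts)


line = ContentNode("line", parts=["Widget A", "$1,234.00"])
print(line.content)
```

Because `content` is a computed property, changing a part changes the node's text without a second copy to keep in sync.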

## Tags and Features — Annotating Content

**Tags** and **features** are how Kodexa annotates content nodes with meaning.

**Features** are simple key-value metadata on a node:

```
spatial:bbox → [0.5, 1.2, 3.4, 1.5]
format:font → "Arial"
format:bold → true
```
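
As a rough illustration of the `type:name` convention, the sketch below splits feature keys and collects all features of one type. Both helper functions are hypothetical and only mirror the key format shown above.

```python
def split_feature_key(key: str) -> tuple[str, str]:
    """Split a 'type:name' feature key into (type, name)."""
    feature_type, sep, name = key.partition(":")
    if not (sep and feature_type and name):
        raise ValueError(f"expected 'type:name', got {key!r}")
    return feature_type, name


def features_of_type(features: dict[str, object], feature_type: str) -> dict[str, object]:
    """Return the features of one type, keyed by their short name."""
    result = {}
    for key, value in features.items():
        key_type, name = split_feature_key(key)
        if key_type == feature_type:
            result[name] = value
    return result


features = {
    "spatial:bbox": [0.5, 1.2, 3.4, 1.5],
    "format:font": "Arial",
    "format:bold": True,
}
print(features_of_type(features, "format"))
```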

**Tags** are richer annotations that drive the extraction pipeline:

```
Tag: "invoice_number"  → confidence: 0.98, value: "12345"
Tag: "line_item"       → index: 0, group_uuid: "abc-123"
Tag: "line_item"       → index: 1, group_uuid: "def-456"
```

Tags support **indexing** for repeating elements. When a model identifies multiple line items in an invoice, each one gets its own index, allowing the extraction engine to group related tags together.

```mermaid theme={null}
graph TB
    subgraph tagged["Tagged Content Nodes"]
        N1["'INVOICE #12345'
        tag: invoice_number (0.98)"]
        N2["'Widget A'
        tag: line_item/description (index 0)"]
        N3["'$1,234.00'
        tag: line_item/amount (index 0)"]
        N4["'Widget B'
        tag: line_item/description (index 1)"]
        N5["'$567.89'
        tag: line_item/amount (index 1)"]
    end

    subgraph extracted["Extracted Data Objects"]
        INV["/invoice
        number: 12345"]
        LI0["/invoice/line_item [0]
        description: Widget A
        amount: $1,234.00"]
        LI1["/invoice/line_item [1]
        description: Widget B
        amount: $567.89"]
    end

    N1 -.->|extraction| INV
    N2 -.-> LI0
    N3 -.-> LI0
    N4 -.-> LI1
    N5 -.-> LI1
```
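
The grouping shown in the diagram can be approximated in a few lines of Python. This is a simplified sketch of index-based grouping, not the extraction engine itself: the tuple layout and the `group_line_items` function are assumptions made for illustration.

```python
from collections import defaultdict

# (tag path, index, text) triples mirroring the diagram above
tagged = [
    ("invoice_number", None, "12345"),
    ("line_item/description", 0, "Widget A"),
    ("line_item/amount", 0, "$1,234.00"),
    ("line_item/description", 1, "Widget B"),
    ("line_item/amount", 1, "$567.89"),
]


def group_line_items(tagged_nodes):
    """Collect line_item tags that share an index into one record each."""
    items = defaultdict(dict)
    for path, index, text in tagged_nodes:
        parent, _, attribute = path.partition("/")
        if parent == "line_item" and index is not None:
            items[index][attribute] = text
    # Return the records ordered by their tag index.
    return [items[i] for i in sorted(items)]


print(group_line_items(tagged))
```

Tags with the same index end up in the same record, which is why the description and amount of each widget stay together.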

## Data Objects — Structured Extraction Results

Data objects represent the **semantic meaning** extracted from the document. They form their own hierarchy, independent of the content node tree.

Each data object has:

* **Taxonomy Reference** — Points to the schema definition (e.g., `acme/invoice`)
* **Path** — Hierarchical path (e.g., `/invoice/line_item/amount`)
* **Attributes** — Typed values (string, decimal, boolean, date) with confidence scores
* **Children** — Nested data objects for repeating or complex structures

Data objects are created by the **extraction engine**, which reads tags from content nodes and builds the structured output according to a taxonomy (data definition).
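
To make the shape of a data object concrete, here is a minimal sketch of the four properties listed above as Python data classes. The class and field names are assumptions chosen for readability and do not reflect the SDK's actual types.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Attribute:
    name: str
    value: object
    confidence: Optional[float] = None  # None when no score is known


@dataclass
class DataObject:
    taxonomy_ref: str   # e.g. "acme/invoice"
    path: str           # e.g. "/invoice/line_item"
    attributes: list[Attribute] = field(default_factory=list)
    children: list["DataObject"] = field(default_factory=list)

    def walk(self) -> Iterator["DataObject"]:
        """Yield this object, then every descendant, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


invoice = DataObject(
    "acme/invoice",
    "/invoice",
    attributes=[Attribute("number", "12345", confidence=0.98)],
    children=[
        DataObject("acme/invoice", "/invoice/line_item",
                   [Attribute("description", "Widget A"), Attribute("amount", "1234.00")]),
        DataObject("acme/invoice", "/invoice/line_item",
                   [Attribute("description", "Widget B"), Attribute("amount", "567.89")]),
    ],
)
print([obj.path for obj in invoice.walk()])
```

The hierarchy here follows the taxonomy (invoice, then line items), not the page layout, which is what keeps it independent of the content node tree.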

## How Documents Flow Through the Platform

```mermaid theme={null}
sequenceDiagram
    participant User
    participant UI as Kodexa UI
    participant API as API Server
    participant S3 as Object Storage
    participant Sched as Scheduler
    participant Model as AI Model

    User->>UI: Upload PDF
    UI->>API: POST document
    API->>S3: Store original file

    Note over Sched: Processing pipeline begins
    Sched->>S3: Download file
    Sched->>Model: Process document
    Model->>Model: Create content tree
    Model->>Model: Tag content nodes
    Model-->>Sched: Return .kddb
    Sched->>Sched: Run extraction engine
    Note over Sched: Tags → Data Objects
    Sched->>S3: Store .kddb

    User->>UI: Open document
    UI->>S3: Load .kddb via WASM
    UI->>UI: Render content tree
    UI->>UI: Show extracted data
    User->>UI: Review & correct
    UI->>API: Save changes (delta)
```

The key stages are:

1. **Upload** — Raw documents (PDF, DOCX, images) are uploaded and stored in object storage
2. **Process** — AI models parse the document, creating the content node tree with spatial data
3. **Tag** — Models annotate content nodes with tags identifying what each piece of content means
4. **Extract** — The extraction engine reads tags and builds structured data objects
5. **Review** — Users view documents in the UI, review extracted data, and correct errors
6. **Export** — Extracted data is exported as JSON, CSV, or pushed to downstream systems
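
As a small illustration of the export stage, the sketch below serializes already-extracted line items to JSON and CSV with Python's standard library. The record layout is an assumption for the example; the platform's own export formats may differ.

```python
import csv
import io
import json

# Hypothetical extracted records, one dict per line item
line_items = [
    {"description": "Widget A", "amount": "1234.00"},
    {"description": "Widget B", "amount": "567.89"},
]


def to_json(rows: list[dict]) -> str:
    """Serialize the records as a JSON array."""
    return json.dumps(rows, indent=2)


def to_csv(rows: list[dict]) -> str:
    """Serialize the records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["description", "amount"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(line_items))
```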

## Querying Documents

Kodexa Documents support an **XPath-like selector language** for navigating and querying the content tree:

```python theme={null}
# `doc` is assumed to be an already-loaded Kodexa Document

# Find all lines
doc.select("//line")

# Find nodes tagged as 'amount'
doc.select("//*[hasTag('amount')]")

# Find lines containing 'Total'
doc.select("//line[contains(@content, 'Total')]")

# Find the first page
doc.select("/page[0]")
```

This selector language works identically across Python, TypeScript, and Go.
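
For readers unfamiliar with XPath-style selectors, the sketch below shows what a descendant query such as `//line` expresses conceptually: visit every node under the root and keep the ones that match. It is a toy traversal over a hypothetical `Node` class, not the Kodexa selector engine.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, Optional


@dataclass
class Node:
    node_type: str
    content: str = ""
    tags: set[str] = field(default_factory=set)
    children: list["Node"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Yield every node below `node` in document order (depth-first)."""
    for child in node.children:
        yield child
        yield from descendants(child)


def find(root: Node,
         node_type: Optional[str] = None,
         predicate: Optional[Callable[[Node], bool]] = None) -> list[Node]:
    """Return descendants matching an optional type and optional predicate."""
    return [
        n for n in descendants(root)
        if (node_type is None or n.node_type == node_type)
        and (predicate is None or predicate(n))
    ]


root = Node("document", children=[
    Node("page", children=[
        Node("content-area", children=[
            Node("line", "INVOICE #12345"),
            Node("line", "Widget A 1,234.00", tags={"amount"}),
            Node("line", "Total due"),
        ]),
    ]),
])

all_lines = find(root, "line")                                 # like //line
tagged = find(root, predicate=lambda n: "amount" in n.tags)    # like //*[hasTag('amount')]
totals = find(root, "line", lambda n: "Total" in n.content)    # like //line[contains(@content, 'Total')]
```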

## Key Database Tables

Under the hood, the `.kddb` file contains these core tables:

| Table                        | Purpose                           |
| ---------------------------- | --------------------------------- |
| `kddb_metadata`              | Document-level metadata (JSON)    |
| `kddb_content_nodes`         | The hierarchical node tree        |
| `kddb_content_node_parts`    | Actual text content storage       |
| `kddb_content_node_features` | Key-value metadata on nodes       |
| `kddb_content_node_tags`     | Tag annotations with grouping     |
| `kddb_data_objects`          | Extracted structured data         |
| `kddb_data_attributes`       | Typed values within data objects  |
| `kddb_taxonomies`            | Schema definitions for extraction |
| `kddb_data_exceptions`       | Validation errors and exceptions  |
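
A quick way to get a feel for a document's size is to count the rows in each of these tables. The sketch below assumes only the table names listed above and makes no assumptions about their columns; `count_rows` is a hypothetical helper, and tables missing from a given file are skipped.

```python
import sqlite3

CORE_TABLES = [
    "kddb_metadata",
    "kddb_content_nodes",
    "kddb_content_node_parts",
    "kddb_content_node_features",
    "kddb_content_node_tags",
    "kddb_data_objects",
    "kddb_data_attributes",
    "kddb_taxonomies",
    "kddb_data_exceptions",
]


def count_rows(conn: sqlite3.Connection) -> dict[str, int]:
    """Return row counts for the core tables present in the database."""
    existing = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    counts = {}
    for table in CORE_TABLES:
        if table in existing:
            # Table names come from the fixed list above, never from user
            # input, so interpolating them into the SQL is safe here.
            row = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
            counts[table] = row[0]
    return counts
```

Pass in a connection opened on a `.kddb` file, for example `count_rows(sqlite3.connect("invoice.kddb"))`, where the file name is a placeholder.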

## What's Next?

<CardGroup cols={2}>
  <Card title="Document Structure Deep Dive" icon="sitemap" href="/guides/kodexa-document/structure">
    Content nodes, spatial data, and how the tree maps to real documents.
  </Card>

  <Card title="Content Structures & Mixins" icon="layer-group" href="/guides/kodexa-document/content-structures">
    How mixins adapt the document tree for spatial, markdown, email, and other content types.
  </Card>

  <Card title="Python SDK" icon="python" href="/sdk/python/index">
    Use the Python package with the same KDDB document model.
  </Card>

  <Card title="SDK Reference" icon="code" href="/sdk/index">
    Full SDK documentation for Python and TypeScript.
  </Card>

  <Card title="Data Definitions" icon="sitemap" href="/guides/data-definitions/index">
    Learn how to define taxonomies that drive document extraction.
  </Card>

  <Card title="CLI Document Commands" icon="terminal" href="/guides/kdx-cli/document/overview">
    Inspect and manipulate documents from the command line.
  </Card>
</CardGroup>
