# Content Structures & Mixins

> How Kodexa uses mixins and content structures in KDDB to handle spatial documents, markdown, email, and other content types in a unified node tree.

Not all content is a PDF. The Kodexa Document format (KDDB) uses **mixins** to adapt the same underlying content node tree to different kinds of content — spatial documents with bounding boxes, editable markdown content, email messages, and more.

This guide explains the content structure system, the available mixins, and how the UI, selectors, and extraction pipeline work with each one.

## What is a Mixin?

A **mixin** is a label on a document that tells the platform how to interpret its content node tree. It determines:

* **Which viewer/editor** the UI renders for the document
* **What node types** are expected in the tree
* **What features** are available on content nodes
* **How the content was produced** (spatial parsing, markdown conversion, email ingestion, etc.)

Mixins are set on the document's `mixins` field and are composable — a document can use multiple mixins when one builds on another.

```mermaid
graph LR
    subgraph mixins["Available Mixins"]
        S["spatial"]
        T["text"]
        M["markdown"]
        E["email"]
        W["workbook"]
    end

    E -->|"composes"| M

    S -->|renders| SV["Spatial Viewer"]
    T -->|renders| TV["Text Viewer"]
    M -->|renders| BE["Block Editor"]
    E -->|renders| EV["Email Viewer + Block Editor"]
    W -->|renders| WV["Spreadsheet Viewer"]
```
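
As a rough illustration of how composition could drive viewer selection, the sketch below models the diagram with two plain lookup tables. The names `VIEWERS`, `COMPOSES`, `expand_mixins`, and `viewer_for` are hypothetical and are not part of the Kodexa SDK; the only facts taken from this page are the mixin names, the viewer names, and that `email` composes `markdown`.

```python
from __future__ import annotations

# Hypothetical lookup tables mirroring the diagram above; not a Kodexa API.
VIEWERS = {
    "spatial": "Spatial Viewer",
    "text": "Text Viewer",
    "markdown": "Block Editor",
    "email": "Email Viewer + Block Editor",
    "workbook": "Spreadsheet Viewer",
}

# A mixin may build on others; on this page only email composes markdown.
COMPOSES = {"email": ["markdown"]}


def expand_mixins(mixins: list[str]) -> list[str]:
    """Return the given mixins plus any mixins they compose, without duplicates."""
    expanded: list[str] = []
    for mixin in mixins:
        for name in [mixin, *COMPOSES.get(mixin, [])]:
            if name not in expanded:
                expanded.append(name)
    return expanded


def viewer_for(mixins: list[str]) -> str:
    """Pick the viewer for the first recognised mixin in the list."""
    for mixin in mixins:
        if mixin in VIEWERS:
            return VIEWERS[mixin]
    raise ValueError(f"no viewer registered for mixins {mixins!r}")


print(expand_mixins(["email"]))
print(viewer_for(["email", "markdown"]))
```

Listing the most specific mixin first is an assumption of this sketch; it lets `email` win over the `markdown` mixin it builds on.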

## Content Structure Overview

| Mixin      | Root Node  | Tree Shape                                 | UI Component                       | Use Cases                                         |
| ---------- | ---------- | ------------------------------------------ | ---------------------------------- | ------------------------------------------------- |
| `spatial`  | `document` | Deep — page → content-area → line → word   | Spatial viewer with bounding boxes | PDFs, scans, images, PowerPoints                  |
| `text`     | `document` | Shallow — text nodes                       | Text viewer                        | Plain text files                                  |
| `markdown` | `document` | Shallow — block-level nodes                | Block editor                       | News articles, rich-text content, general writing |
| `email`    | `email`    | Shallow — block-level nodes                | Email header panel + block editor  | Email messages                                    |
| `workbook` | `workbook` | Medium — workbook → worksheet → row → cell | Spreadsheet viewer                 | Excel files, spreadsheets                         |

All content structures share the same underlying KDDB format, the same selector language, and the same extraction pipeline. The mixin simply changes how the content node tree is organized and how the UI presents it.

***

## Spatial Content Structure

The **spatial** mixin is the original Kodexa content structure, designed for documents where physical layout matters — PDFs, scanned images, PowerPoints, and similar formats.

### Tree structure

```
document
  └── page (index: 0)
      ├── content-area
      │   ├── line: "ACME Corp"
      │   │   ├── word: "ACME"
      │   │   └── word: "Corp"
      │   └── line: "Invoice #12345"
      │       ├── word: "Invoice"
      │       └── word: "#12345"
      └── content-area
          ├── line: "Widget A  $1,234.00"
          └── line: "Widget B  $567.89"
```

### Key characteristics

* **Deep tree** — `document` → `page` → `content-area` → `line` → `word`
* **Bounding boxes** — Every node carries spatial coordinates `[x, y, width, height]` describing its position on the page
* **Page-based** — Content is organized by physical pages
* **Word-level granularity** — Individual words are nodes, enabling precise tagging and spatial queries
* **Read-only in UI** — Users view the spatial layout and tag content, but don't edit the text directly

### Spatial features

Content nodes in a spatial document carry spatial features:

```
spatial:bbox  → [0.5, 1.2, 3.4, 1.5]    # Position on page
spatial:rotate → 90                       # Rotation in degrees
format:font   → "Arial"                  # Font information
format:bold   → true                     # Text formatting
```
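
To make the bounding-box feature concrete, here is a minimal sketch that wraps a `spatial:bbox` value, assuming the `[x, y, width, height]` layout described above. The `BBox` class and its methods are hypothetical helpers written for this page, not SDK types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Box in the [x, y, width, height] layout described above (assumed)."""

    x: float
    y: float
    width: float
    height: float

    @classmethod
    def from_feature(cls, value):
        """Build a box from a four-element feature value."""
        if len(value) != 4:
            raise ValueError("expected [x, y, width, height]")
        return cls(*(float(v) for v in value))

    @property
    def x_max(self) -> float:
        return self.x + self.width

    @property
    def y_max(self) -> float:
        return self.y + self.height

    def overlaps(self, other: "BBox") -> bool:
        """True when the two boxes share some area."""
        return (
            self.x < other.x_max
            and other.x < self.x_max
            and self.y < other.y_max
            and other.y < self.y_max
        )


word = BBox.from_feature([0.5, 1.2, 3.4, 1.5])
region = BBox.from_feature([0.0, 0.0, 1.0, 2.0])
print(word.overlaps(region))
```

An overlap test like this is the kind of spatial query that word-level bounding boxes make possible, for example finding every word inside a table region.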

### Example selectors

```
//page                              # All pages
//line                              # All lines across all pages
//word[contains(@content, 'Total')] # Words containing 'Total'
//page[0]//line                     # All lines on the first page
//*[hasTag('invoice_number')]       # Nodes tagged as invoice number
```

***

## Text Content Structure

The **text** mixin is the simplest structure, used for plain text content without spatial information or rich formatting.

### Tree structure

```
document
  └── (text content nodes)
```

### Key characteristics

* **Shallow tree** — Minimal hierarchy
* **No bounding boxes** — No spatial positioning
* **No formatting** — Plain text only
* **View-only in UI** — Rendered as plain text

***

## Markdown Content Structure

The **markdown** mixin represents rich-text content as a tree of **block-level** content nodes. Each block is an independently editable region in the UI, and inline formatting (bold, italic, links, inline code) is stored as markdown syntax within the block's content.

### Why block-level?

A full markdown AST would create nodes for every bold span, link, and inline code segment. This is unnecessarily complex for editing — every keystroke in a bold word would require tree restructuring. Instead, the markdown mixin uses block-level nodes only:

* **Block nodes** — `heading`, `paragraph`, `list`, `code-block`, etc. — are ContentNodes in the tree
* **Inline formatting** — `**bold**`, `*italic*`, `[links](url)`, `` `code` `` — stays as markdown syntax within each block's content string

This keeps the tree manageable, makes editing straightforward, and still allows selectors to query at the block level.
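
The block-level idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the platform's converter: it only recognises headings and paragraphs, and the `Block` class and `to_blocks` function are hypothetical names. The point is that inline syntax such as `**bold**` is left untouched inside the block's content string.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Block:
    node_type: str
    content: str
    features: dict = field(default_factory=dict)


# One to six leading '#' characters followed by the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def to_blocks(markdown: str) -> list:
    """Split markdown into block-level nodes, leaving inline syntax as-is."""
    blocks = []
    # Blank lines separate blocks in this simplified model.
    for chunk in re.split(r"\n\s*\n", markdown.strip()):
        text = chunk.strip()
        if not text:
            continue
        match = HEADING.match(text)
        if match:
            level = len(match.group(1))
            blocks.append(Block("heading", match.group(2), {"markdown:level": level}))
        else:
            blocks.append(Block("paragraph", text))
    return blocks


for block in to_blocks("# Market Update\n\nStocks saw **significant gains** today."):
    print(block.node_type, repr(block.content), block.features)
```

Because the bold span never becomes its own node, editing a word inside it changes only one content string rather than the shape of the tree.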

### Tree structure

```
document
  ├── heading (level: 1, "Breaking News: Market Update")
  ├── paragraph ("The stock market saw **significant gains** today...")
  ├── heading (level: 2, "Key Highlights")
  ├── list (ordered: false)
  │   ├── list-item ("S&P 500 up 2.3%")
  │   ├── list-item ("Tech sector leads gains")
  │   └── list-item ("Bond yields decline")
  ├── blockquote ("Analysts expect continued growth...")
  ├── image (src: "chart.png", alt: "Market chart")
  ├── code-block (language: "json", "{ \"sp500\": 5234.12 }")
  ├── table
  │   ├── row
  │   │   ├── cell ("Index")
  │   │   └── cell ("Change")
  │   └── row
  │       ├── cell ("S&P 500")
  │       └── cell ("+2.3%")
  └── horizontal-rule
```

### Block node types

<Tabs>
  <Tab title="Text Blocks">
    | Node Type    | Features               | Content              |
    | ------------ | ---------------------- | -------------------- |
    | `heading`    | `markdown:level` (1-6) | Inline markdown text |
    | `paragraph`  | —                      | Inline markdown text |
    | `blockquote` | —                      | Inline markdown text |

    These blocks contain text with optional inline markdown formatting. The content is stored in `content_parts` and rendered as rich text in the editor.

    ```
    heading content:    "Breaking **News**: Market Update"
    paragraph content:  "The [S&P 500](https://...) hit a new high."
    blockquote content: "Analysts expect *continued* growth."
    ```
  </Tab>

  <Tab title="Container Blocks">
    | Node Type   | Features                            | Content                                |
    | ----------- | ----------------------------------- | -------------------------------------- |
    | `list`      | `markdown:ordered` (bool)           | Empty — children are `list-item` nodes |
    | `list-item` | `markdown:checked` (bool, optional) | Inline markdown text                   |
    | `table`     | —                                   | Empty — uses `row` → `cell` children   |

    Container blocks have children that hold the actual content. Lists contain `list-item` nodes; tables use the existing `row` → `cell` node types from the spatial structure.

    The optional `markdown:checked` feature on `list-item` supports task list items (`- [x] Done`, `- [ ] Todo`).
  </Tab>

  <Tab title="Media & Structural Blocks">
    | Node Type         | Features                       | Content       |
    | ----------------- | ------------------------------ | ------------- |
    | `code-block`      | `markdown:language` (string)   | Raw code text |
    | `image`           | `markdown:src`, `markdown:alt` | Empty         |
    | `horizontal-rule` | —                              | Empty         |

    Code blocks store raw text (no inline markdown processing) with an optional language identifier for syntax highlighting. Images reference external URLs via features. Horizontal rules are simple dividers with no content.
  </Tab>
</Tabs>

### Key characteristics

* **Shallow tree** — `document` → blocks, with nesting only for lists and tables
* **No bounding boxes** — Content is not spatially positioned
* **Editable in UI** — Block-based editor where each ContentNode is an editable block
* **Inline markdown** — Rich formatting preserved as markdown syntax within block content

### Example selectors

```
//heading                                                      # All headings
//heading[hasFeatureValue('markdown', 'level', '1')]           # All h1 headings
//paragraph                                                    # All paragraphs
//list-item                                                    # All list items
//blockquote                                                   # All blockquotes
//code-block[hasFeatureValue('markdown', 'language', 'python')] # Python code blocks
//table/row/cell                                               # All table cells
```

***

## Email Content Structure

The **email** mixin composes the `markdown` mixin to represent email messages. Email-specific metadata (from, to, subject, date) is stored in document metadata, while the email body uses the same block-level markdown nodes.

### Tree structure

```
email (root)
  ├── paragraph ("Hello team,")
  ├── paragraph ("Here are the Q4 results:")
  ├── list (ordered: false)
  │   ├── list-item ("Revenue: **$12.4M** (+15%)")
  │   └── list-item ("Operating margin: **23%**")
  ├── blockquote ("From the previous quarterly report...")
  └── paragraph ("Best regards,\nPhil")
```

### Document metadata

Email headers are stored as document-level metadata, not as content nodes:

| Field       | Type      | Description                     |
| ----------- | --------- | ------------------------------- |
| `from`      | string    | Sender email address            |
| `to`        | string\[] | Recipient addresses             |
| `cc`        | string\[] | CC addresses                    |
| `bcc`       | string\[] | BCC addresses                   |
| `subject`   | string    | Email subject line              |
| `date`      | datetime  | Send date/time                  |
| `messageId` | string    | RFC 2822 Message-ID             |
| `inReplyTo` | string    | Parent message ID for threading |
| `threadId`  | string    | Conversation thread identifier  |
| `headers`   | object    | Additional raw email headers    |

### Attachments

Email attachments are **not** embedded in the email KDDB. Each attachment becomes its own document family in the store with the appropriate mixin:

* A PDF attachment → spatial mixin with its own KDDB
* A text file attachment → text mixin
* A forwarded email → email mixin

The parent email's metadata links to attachment document families. This keeps KDDBs focused and allows each attachment to go through its own processing pipeline.
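
The routing described above could look something like the following. Keying the decision on MIME type is an assumption made for this sketch, as are the names `MIXIN_BY_MIME` and `mixin_for_attachment`; the page itself only states which mixin each kind of attachment ends up with.

```python
from typing import Optional

# Assumed MIME-type routing for the three attachment kinds named above.
MIXIN_BY_MIME = {
    "application/pdf": "spatial",
    "text/plain": "text",
    "message/rfc822": "email",  # a forwarded email attached as a message
}


def mixin_for_attachment(mime_type: str) -> Optional[str]:
    """Return the mixin for an attachment, or None when the type is not mapped."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    base = mime_type.split(";", 1)[0].strip().lower()
    return MIXIN_BY_MIME.get(base)


for mime in ("application/pdf", "text/plain; charset=utf-8", "message/rfc822"):
    print(mime, "->", mixin_for_attachment(mime))
```

Returning `None` for unmapped types leaves the caller free to decide whether to skip the attachment or fall back to a default.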

### Key characteristics

* **Root node is `email`** (not `document`) — distinguishes email from general markdown
* **Composable** — The `email` mixin includes the `markdown` mixin for the body
* **Headers in metadata** — Email-specific data lives in document metadata, not the tree
* **Same block editor** — The body uses the same block editor as the `markdown` mixin
* **Same selectors** — Query the body with the same selector syntax

### Example selectors

```
//email/paragraph    # Body paragraphs
//email//list-item   # All list items in the email body
//email/blockquote   # Quoted text (often from replies)
```

***

## Workbook Content Structure

The **workbook** mixin represents spreadsheet content — Excel files and similar tabular formats. The content node tree mirrors the workbook's structure: sheets contain rows, rows contain cells.

### Tree structure

```
workbook (root)
  ├── worksheet ("Income Statement")
  │   ├── row
  │   │   ├── cell: "Revenue" (ref: A1)
  │   │   ├── cell: "Q1" (ref: B1)
  │   │   └── cell: "Q2" (ref: C1)
  │   └── row
  │       ├── cell: "Product A" (ref: A2)
  │       ├── cell: "1,234.00" (ref: B2)
  │       └── cell: "1,456.00" (ref: C2)
  └── worksheet ("Balance Sheet")
      └── row
          └── cell: "Assets" (ref: A1)
```

### Cell features

Content nodes in a workbook document carry cell-specific features:

```
workbook:ref     → "B2"                    # Cell reference (column letter + row number)
workbook:sheet   → "Income Statement"      # Parent worksheet name
workbook:formula → "=SUM(B2:B10)"          # Original formula (if cell contains one)
workbook:merge   → "A1:D1"                 # Merged cell range (on top-left cell only)
```
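
Cell references and merge ranges use the familiar spreadsheet notation, so they can be decoded with a little arithmetic. The helpers below (`column_index`, `parse_ref`, `cells_in_range`) are hypothetical utilities for illustration and assume upper-case references without sheet prefixes or `$` anchors.

```python
import re

# Column letters followed by a 1-based row number, e.g. "B2" or "AA10".
CELL_REF = re.compile(r"^([A-Z]+)([1-9][0-9]*)$")


def column_index(letters: str) -> int:
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_ref(ref: str) -> tuple:
    """Split a reference such as "B2" into (column, row), both 1-based."""
    match = CELL_REF.match(ref)
    if not match:
        raise ValueError(f"not a simple cell reference: {ref!r}")
    return column_index(match.group(1)), int(match.group(2))


def cells_in_range(merge: str) -> int:
    """Count the cells covered by a range such as "A1:D1"."""
    start, end = merge.split(":")
    col1, row1 = parse_ref(start)
    col2, row2 = parse_ref(end)
    return (abs(col2 - col1) + 1) * (abs(row2 - row1) + 1)


print(parse_ref("B2"))
print(cells_in_range("A1:D1"))
```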

### Key characteristics

* **Medium-depth tree** — `workbook` → `worksheet` → `row` → `cell`
* **No bounding boxes** — Cells are addressed by reference, not spatial coordinates
* **Cell references** — Every cell carries a `workbook:ref` feature mapping it to its Excel address
* **Formula preservation** — Formulas are stored alongside calculated values
* **Read-only in UI** — Users view the spreadsheet and tag cells for extraction, but don't edit values
* **Sheet tabs** — Multiple worksheets render as tabs, similar to Excel

### Example selectors

```
//worksheet                                                    # All worksheets
//cell                                                         # All cells across all sheets
//cell[hasFeatureValue('workbook', 'ref', 'B2')]              # Cell at reference B2
//cell[contains(@content, 'Revenue')]                          # Cells containing 'Revenue'
//worksheet[contains(@content, 'Income')]//cell               # All cells in sheets with 'Income' in the name
//*[hasTag('revenue/total')]                                   # Nodes tagged as revenue total
```

***

## The Block Editor

The `markdown` and `email` mixins share a **block-based editor** in the Kodexa UI. Each content node is rendered as an independently editable block, similar to Notion or Google Docs.

### How editing maps to the KDDB

Every user action in the editor corresponds directly to a KDDB content node operation:

| User Action                                   | KDDB Operation                                             |
| --------------------------------------------- | ---------------------------------------------------------- |
| Edit text in a block                          | Update `content_parts` on the ContentNode                  |
| Reorder blocks (drag & drop)                  | Update `index` on affected ContentNodes                    |
| Change block type (e.g., paragraph → heading) | Update `node_type`, add/remove features                    |
| Delete a block                                | Remove ContentNode from tree                               |
| Add a new block                               | Create new ContentNode at the target index                 |
| Split a block (press Enter)                   | Split `content_parts`, create new ContentNode after current |
| Merge blocks (Backspace at start)             | Merge content into previous node, remove current           |
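
The split and merge rows in the table can be sketched over a plain list, where a block's position in the list stands in for its `index`. The `Block` class and both functions are hypothetical, and the choice to make the new block a `paragraph` after a split is an assumption of this example rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass
class Block:
    node_type: str
    content: str


def split_block(blocks, index, offset):
    """Split blocks[index] at a character offset; the tail becomes a new block after it."""
    current = blocks[index]
    head = Block(current.node_type, current.content[:offset])
    # Assumption: the newly created block starts life as a paragraph.
    tail = Block("paragraph", current.content[offset:])
    return blocks[:index] + [head, tail] + blocks[index + 1:]


def merge_with_previous(blocks, index):
    """Append blocks[index] to the previous block and remove it from the list."""
    if index == 0:
        return list(blocks)  # nothing to merge into
    previous, current = blocks[index - 1], blocks[index]
    merged = Block(previous.node_type, previous.content + current.content)
    return blocks[:index - 1] + [merged] + blocks[index + 1:]


doc = [Block("heading", "Key Highlights"), Block("paragraph", "Hello world")]
after_split = split_block(doc, 1, 5)
print(after_split)
print(merge_with_previous(after_split, 2))
```

Both functions return new lists instead of mutating their input, which keeps each editor action easy to undo in a sketch like this.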

### Block type selection

Users can change a block's type using a toolbar or `/` command. Compatible conversions include:

* `paragraph` ↔ `heading` ↔ `blockquote`
* `paragraph` → `list` (wraps in a list with one list-item)
* `paragraph` → `code-block`
* Any text block → `horizontal-rule` (clears content)

***

## Extraction Across All Content Structures

The extraction pipeline works the same way regardless of mixin. Tags are applied to content nodes, grouped by index, and converted to data objects:

```mermaid
graph LR
    A["Content Nodes
    (any mixin)"] -->|"AI tags nodes"| B["Tagged Nodes"]
    B -->|"group by index"| C["Tag Groups"]
    C -->|"build from taxonomy"| D["Data Objects"]
```

This means you can extract structured data from markdown and email content using the same data definitions, tagging models, and extraction engine that work with spatial documents. For example:

* **News articles** — Extract entities, topics, dates, and quotes from markdown content
* **Emails** — Extract action items, deadlines, and referenced documents from email bodies
* **Reports** — Extract metrics, summaries, and key findings from rich-text documents
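
The "group by index" step in the diagram can be illustrated with a short sketch. `TaggedNode` and `build_data_objects` are hypothetical names, the tag paths are invented for the example, and the taxonomy-driven shaping of the final data objects is left out entirely; each group simply becomes a dictionary keyed by tag.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaggedNode:
    tag: str      # illustrative tag path, e.g. "action_item/owner"
    index: int    # tag group index shared by nodes that belong together
    content: str


def build_data_objects(nodes):
    """Group tagged nodes by index and turn each group into one dict."""
    groups = defaultdict(dict)
    for node in nodes:
        groups[node.index][node.tag] = node.content
    # Emit groups in index order so the output is stable.
    return [groups[i] for i in sorted(groups)]


tagged = [
    TaggedNode("action_item/owner", 0, "Phil"),
    TaggedNode("action_item/deadline", 0, "Friday"),
    TaggedNode("action_item/owner", 1, "Dana"),
    TaggedNode("action_item/deadline", 1, "Monday"),
]
for obj in build_data_objects(tagged):
    print(obj)
```

Nothing in the grouping depends on node type or mixin, which is why the same step applies equally to a `word` in a PDF, a `paragraph` in an email, or a `cell` in a workbook.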

***

## Choosing a Content Structure

<AccordionGroup>
  <Accordion title="Use spatial when...">
    You're processing PDFs, scanned documents, images, or PowerPoints where **physical layout matters**. The spatial mixin preserves bounding boxes, page structure, and word-level positioning — essential for document understanding, table extraction, and content that needs to be visually overlaid on the original document.
  </Accordion>

  <Accordion title="Use text when...">
    You have plain text content with no formatting or layout requirements. This is the simplest structure and is appropriate for log files, raw text exports, or content that will be processed purely for its text.
  </Accordion>

  <Accordion title="Use markdown when...">
    You have rich-text content that users need to **view and edit** — news articles, reports, knowledge base entries, or any content that benefits from structured blocks (headings, lists, code, tables) with inline formatting. The block editor makes this content interactive.
  </Accordion>

  <Accordion title="Use email when...">
    You're ingesting email messages. The email mixin gives you structured metadata (from, to, subject, date, threading) plus a markdown body that users can view and edit. Attachments become their own document families with appropriate mixins.
  </Accordion>

  <Accordion title="Use workbook when...">
    You're processing Excel files, spreadsheets, or tabular data where **cell structure matters**. The workbook mixin preserves cell references, formulas, merged cells, and worksheet organization — essential for financial data extraction, tabular analysis, and content that needs to be visually mapped to a spreadsheet grid.
  </Accordion>
</AccordionGroup>

## What's Next?

<CardGroup cols={2}>
  <Card title="Document Structure Deep Dive" icon="sitemap" href="/guides/kodexa-document/structure">
    Detailed look at content nodes, spatial data, and how the tree maps to real documents.
  </Card>

  <Card title="SDK Reference" icon="code" href="/sdk/index">
    Work with documents programmatically in Python and TypeScript.
  </Card>

  <Card title="Data Definitions" icon="list-tree" href="/guides/data-definitions/index">
    Define taxonomies to extract structured data from any content structure.
  </Card>

  <Card title="Selectors" icon="magnifying-glass" href="/sdk/selectors">
    Query content nodes using the XPath-like selector language.
  </Card>
</CardGroup>
