# What is a Kodexa Document?

> Understand the KDDB document format at the heart of the Kodexa Platform, including content nodes, mixins, features, tags, and structured data layers.

Traditional document formats like PDF, DOCX, and images are built for humans to read, not for machines to process. When you need to extract structured data from invoices, annotate contracts with tags, or build hierarchical models from flat content, you need something more powerful.

**Kodexa Document** solves this. It's a rich, queryable document format that bridges the gap between raw content and structured data.

## The KDDB Format

Every Kodexa Document is stored as a **KDDB file**: a SQLite database with a well-defined schema of 40+ tables. Think of it as "a document that is also a database."

```mermaid theme={null}
graph LR
    PDF[PDF / DOCX / Image] -->|processed by| MODEL[AI Model]
    MODEL -->|creates| KDDB[".kddb file (SQLite database)"]
    KDDB -->|queried by| APP[Applications]
    KDDB -->|viewed in| UI[Kodexa UI]
    KDDB -->|exported as| JSON[JSON / CSV]
```

This approach gives you:

* **Rich querying**: XPath-like selectors to find any content
* **Transactional updates**: safe concurrent modifications with audit trails
* **Efficient storage**: zstd-compressed content for documents with thousands of pages
* **Cross-platform access**: the same document works in Python, TypeScript/WASM, and Go

## Three Layers of a Document

Every Kodexa Document has three conceptual layers that work together:

```mermaid theme={null}
graph TB
    subgraph doc["Kodexa Document (.kddb)"]
        direction TB
        META["Metadata Layer<br/>UUID, version, labels, source info"]
        subgraph content["Content Node Layer"]
            ROOT[document] --> P1[page 1]
            ROOT --> P2[page 2]
            P1 --> CA1[content-area]
            P1 --> CA2[content-area]
            CA1 --> L1[line]
            CA2 --> L2[line]
        end
        subgraph data["Data Object Layer"]
            INV["/invoice"] --> LI1["/line_item"]
            INV --> LI2["/line_item"]
            LI1 --> AMT["amount: $1,234"]
            LI1 --> DESC["description: Widget A"]
        end
    end
    style META fill:#3b82f6,color:#fff
    style content fill:#10b981,color:#fff
    style data fill:#f59e0b,color:#fff
```

| Layer             | Purpose                                                  | Think of it as...            |
| ----------------- | -------------------------------------------------------- | ---------------------------- |
| **Metadata**      | Document-level properties: UUID, version, labels, source | The document's "passport"    |
| **Content Nodes** | Hierarchical tree of the document's content              | The document's "DOM tree"    |
| **Data Objects**  | Structured data extracted from the content               | The document's "spreadsheet" |

These layers are independent but connected. Content nodes hold the raw text and spatial positions; data objects hold the semantic meaning extracted from that content.

## Content Nodes: The Document Tree

Content nodes form a tree that represents the document structure, much like an HTML DOM tree represents a web page.

* document
  * page (page 1)
    * content-area
      * line: "INVOICE #12345"
      * line: "ACME Corp"
    * content-area
      * line: "Widget A \$1,234.00"
        * word: "Widget"
        * word: "A"
        * word: "\$1,234.00"
      * line: "Widget B \$567.89"
  * page (page 2)
    * ...
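
The tree above can be modelled in a few lines of Python. This is an illustrative sketch only, not the Kodexa SDK: the `Node` class, its fields, and the `walk` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a content node (not the Kodexa SDK class)."""
    node_type: str
    content: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


# Rebuild part of the invoice tree shown above.
tree = Node("document", children=[
    Node("page", children=[
        Node("content-area", children=[
            Node("line", "INVOICE #12345"),
            Node("line", "ACME Corp"),
        ]),
        Node("content-area", children=[
            Node("line", "Widget A $1,234.00", children=[
                Node("word", "Widget"),
                Node("word", "A"),
                Node("word", "$1,234.00"),
            ]),
            Node("line", "Widget B $567.89"),
        ]),
    ]),
    Node("page"),
])

# Collect the text of every line node, in document order.
line_texts = [n.content for n in walk(tree) if n.node_type == "line"]
print(line_texts)
```

A depth-first walk visits nodes in reading order, which is why the collected lines come out in the same sequence as they appear on the page.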

Each content node carries:

| Property         | Description                                                     |
| ---------------- | --------------------------------------------------------------- |
| **Type**         | What kind of node: `page`, `content-area`, `line`, `word`, etc. |
| **Content**      | The text content (computed from content parts)                  |
| **Features**     | Key-value metadata in `type:name` format                        |
| **Tags**         | Annotations that mark content for extraction                    |
| **Bounding Box** | Spatial position on the page (x, y, width, height)              |
| **Children**     | Child nodes in the tree                                         |

A node's content is never stored directly on the node itself. It's computed from the **content parts** table, which stores the actual text segments. This allows multi-part content (e.g., a line with mixed formatting) without duplication.

## Tags and Features: Annotating Content

**Tags** and **features** are how Kodexa annotates content nodes with meaning.

**Features** are simple key-value metadata on a node:

```
spatial:bbox → [0.5, 1.2, 3.4, 1.5]
format:font  → "Arial"
format:bold  → true
```

**Tags** are richer annotations that drive the extraction pipeline:

```
Tag: "invoice_number" → confidence: 0.98, value: "12345"
Tag: "line_item"      → index: 0, group_uuid: "abc-123"
Tag: "line_item"      → index: 1, group_uuid: "def-456"
```

Tags support **indexing** for repeating elements. When a model identifies multiple line items in an invoice, each one gets its own index, allowing the extraction engine to group related tags together.

```mermaid theme={null}
graph TB
    subgraph tagged["Tagged Content Nodes"]
        N1["'INVOICE #12345'<br/>tag: invoice_number (0.98)"]
        N2["'Widget A'<br/>tag: line_item/description (index 0)"]
        N3["'$1,234.00'<br/>tag: line_item/amount (index 0)"]
        N4["'Widget B'<br/>tag: line_item/description (index 1)"]
        N5["'$567.89'<br/>tag: line_item/amount (index 1)"]
    end
    subgraph extracted["Extracted Data Objects"]
        INV["/invoice<br/>number: 12345"]
        LI0["/invoice/line_item [0]<br/>description: Widget A<br/>amount: $1,234.00"]
        LI1["/invoice/line_item [1]<br/>description: Widget B<br/>amount: $567.89"]
    end
    N1 -.->|extraction| INV
    N2 -.-> LI0
    N3 -.-> LI0
    N4 -.-> LI1
    N5 -.-> LI1
```

## Data Objects: Structured Extraction Results

Data objects represent the **semantic meaning** extracted from the document. They form their own hierarchy, independent of the content node tree.

Each data object has:

* **Taxonomy Reference**: Points to the schema definition (e.g., `acme/invoice`)
* **Path**: Hierarchical path (e.g., `/invoice/line_item/amount`)
* **Attributes**: Typed values (string, decimal, boolean, date) with confidence scores
* **Children**: Nested data objects for repeating or complex structures

Data objects are created by the **extraction engine**, which reads tags from content nodes and builds the structured output according to a taxonomy (data definition).

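
To make the tag-to-data-object step concrete, here is a minimal sketch that groups the tagged values from the invoice example by index. It is not the Kodexa extraction engine: the `TaggedNode` tuple, the `build_invoice` function, and the dictionary output shape are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class TaggedNode(NamedTuple):
    """Hypothetical flattened view of a tagged content node."""
    value: str
    tag: str                      # e.g. "invoice_number" or "line_item/amount"
    index: Optional[int] = None   # set only for repeating elements


def build_invoice(nodes: List[TaggedNode]) -> Dict:
    """Group tagged nodes into one invoice with indexed line items."""
    invoice: Dict = {"path": "/invoice", "attributes": {}, "line_items": []}
    items: Dict[int, Dict[str, str]] = defaultdict(dict)
    for node in nodes:
        if node.tag.startswith("line_item/"):
            # Tags that share an index belong to the same line item.
            attribute = node.tag.split("/", 1)[1]
            items[node.index][attribute] = node.value
        else:
            invoice["attributes"][node.tag] = node.value
    for index in sorted(items):
        invoice["line_items"].append(
            {"path": "/invoice/line_item", "index": index, "attributes": items[index]}
        )
    return invoice


nodes = [
    TaggedNode("12345", "invoice_number"),
    TaggedNode("Widget A", "line_item/description", 0),
    TaggedNode("$1,234.00", "line_item/amount", 0),
    TaggedNode("Widget B", "line_item/description", 1),
    TaggedNode("$567.89", "line_item/amount", 1),
]
invoice = build_invoice(nodes)
print(invoice["line_items"][1]["attributes"])
```

The index is what keeps "Widget B" paired with "$567.89" rather than with the amount from the first line item; without it, the five tagged values would have no grouping information.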
## How Documents Flow Through the Platform

```mermaid theme={null}
sequenceDiagram
    participant User
    participant UI as Kodexa UI
    participant API as API Server
    participant S3 as Object Storage
    participant Sched as Scheduler
    participant Model as AI Model

    User->>UI: Upload PDF
    UI->>API: POST document
    API->>S3: Store original file
    Note over Sched: Processing pipeline begins
    Sched->>S3: Download file
    Sched->>Model: Process document
    Model->>Model: Create content tree
    Model->>Model: Tag content nodes
    Model-->>Sched: Return .kddb
    Sched->>Sched: Run extraction engine
    Note over Sched: Tags → Data Objects
    Sched->>S3: Store .kddb
    User->>UI: Open document
    UI->>S3: Load .kddb via WASM
    UI->>UI: Render content tree
    UI->>UI: Show extracted data
    User->>UI: Review & correct
    UI->>API: Save changes (delta)
```

The key stages are:

1. **Upload**: Raw documents (PDF, DOCX, images) are uploaded and stored in object storage
2. **Process**: AI models parse the document, creating the content node tree with spatial data
3. **Tag**: Models annotate content nodes with tags identifying what each piece of content means
4. **Extract**: The extraction engine reads tags and builds structured data objects
5. **Review**: Users view documents in the UI, review extracted data, and correct errors
6. **Export**: Extracted data is exported as JSON, CSV, or pushed to downstream systems

## Querying Documents

Kodexa Documents support an **XPath-like selector language** for navigating and querying the content tree:

```python theme={null}
# `doc` is assumed to be an already-loaded Kodexa Document

# Find all lines
doc.select("//line")

# Find nodes tagged as 'amount'
doc.select("//*[hasTag('amount')]")

# Find lines containing 'Total'
doc.select("//line[contains(@content, 'Total')]")

# Find the first page
doc.select("/page[0]")
```

This selector language works identically across Python, TypeScript, and Go.

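
Selectors are the document-level way to query content. Because a `.kddb` file is a SQLite database, its tables can also be inspected with any SQLite client. The sketch below uses Python's standard `sqlite3` module; the `list_tables` helper is a hypothetical name, and the in-memory database with invented column layouts is only a stand-in so that the example runs without a real file.

```python
import sqlite3
from typing import List


def list_tables(conn: sqlite3.Connection) -> List[str]:
    """Return the names of all tables in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database; the column layouts here are invented for the example
# and do not reflect the real KDDB schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kddb_metadata (data TEXT)")
conn.execute("CREATE TABLE kddb_content_nodes (id INTEGER PRIMARY KEY)")
tables = list_tables(conn)
print(tables)
conn.close()
```

For a real document, the same helper would be called on a connection opened against the `.kddb` file path instead of `":memory:"`.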
## Key Database Tables

Under the hood, the `.kddb` file contains these core tables:

| Table                        | Purpose                           |
| ---------------------------- | --------------------------------- |
| `kddb_metadata`              | Document-level metadata (JSON)    |
| `kddb_content_nodes`         | The hierarchical node tree        |
| `kddb_content_node_parts`    | Actual text content storage       |
| `kddb_content_node_features` | Key-value metadata on nodes       |
| `kddb_content_node_tags`     | Tag annotations with grouping     |
| `kddb_data_objects`          | Extracted structured data         |
| `kddb_data_attributes`       | Typed values within data objects  |
| `kddb_taxonomies`            | Schema definitions for extraction |
| `kddb_data_exceptions`       | Validation errors and exceptions  |

## What's Next?

* Content nodes, spatial data, and how the tree maps to real documents.
* How mixins adapt the document tree for spatial, markdown, email, and other content types.
* Use the Python package with the same KDDB document model.
* Full SDK documentation for Python and TypeScript.
* Learn how to define taxonomies that drive document extraction.
* Inspect and manipulate documents from the command line.