At the heart of Kodexa is the concept of a Document. A document represents the content, metadata, and features associated with any type of information. Rather than storing the information as plain text, Kodexa stores it in KDDB (Kodexa Document Database) format: a SQLite-based structure containing a hierarchical tree of content nodes, metadata, tags, features, and extracted data.

Document Structure

A Kodexa Document consists of these core components:
  • Content Node Tree: A hierarchical tree of nodes representing the document’s structure (pages, paragraphs, lines, words, tables, cells, etc.)
  • Metadata: Flexible key-value pairs for document-level information
  • Source Metadata: Information about the document’s origin (filename, MIME type, checksum)
  • Native Documents: Embedded binary files (the original PDF, images, etc.)
  • Data Objects & Attributes: Structured extracted data organized by taxonomy
  • Tags: Annotations on content nodes linking them to extracted data
  • Audit Trail: Change history tracking
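
To make the idea of a content node tree concrete, here is a minimal sketch in plain Python. The `Node` class, its fields, and the tag name `greeting/target` are hypothetical stand-ins chosen for illustration; they are not the Kodexa SDK's classes or schema. The sketch only shows how a page, line, and word hierarchy can be walked to collect text and find tagged nodes.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a content node; not the Kodexa SDK class."""
    node_type: str                      # e.g. "page", "line", "word"
    content: Optional[str] = None       # text carried by this node, if any
    children: List["Node"] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        """Yield this node and all descendants in depth-first order."""
        yield self
        for child in self.children:
            yield from child.walk()


# Build a tiny page -> line -> word hierarchy.
hello = Node("word", "Hello,")
world = Node("word", "World!", tags=["greeting/target"])
line = Node("line", children=[hello, world])
root = Node("page", children=[line])

# Collect the text of every word node, and find the nodes that carry a tag.
words = [n.content for n in root.walk() if n.node_type == "word"]
tagged = [n.content for n in root.walk() if n.tags]
print(" ".join(words))
print(tagged)
```

In this model a tag lives on the node it annotates, which mirrors the description above of tags linking content nodes to extracted data.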

Creating Documents

Documents can be created using the SDK, which is available for Python and TypeScript. In Python:
from kodexa_document import Document

# Create an empty document
doc = Document()

# Create from text content
doc = Document.from_text("Hello, World!")

# Load from a KDDB file
doc = Document.from_kddb("my-document.kddb")

# Load from JSON
doc = Document.from_json(json_string)

Accessing Original Source Content

Kodexa documents can embed the original source files (PDFs, images, Word documents) as native documents within the KDDB. This allows you to access the raw file data at any point during processing. You can use the get_source utility to retrieve the first embedded native document as an in-memory byte stream:
from kodexa_document.utils import get_source

# Get the original file data as BytesIO
source_bytes = get_source(doc)
Alternatively, you can access native documents directly through the accessor:
# List all embedded files
native_docs = doc.native_documents.get_all()

# Get file data by ID (skipped when the document has no embedded files)
if native_docs:
    data = doc.native_documents.get_data(native_docs[0]["id"])
This capability is particularly useful for tasks like OCR processing or extracting content directly from the original file format at any stage of the processing pipeline.
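
As an illustration of working with the raw source data, the following sketch inspects a byte stream: it guesses the file type from well-known leading byte signatures and computes a digest. It uses a placeholder `BytesIO` so it runs without the SDK. The helper `describe_source` is hypothetical, and SHA-256 is an assumption, since the checksum algorithm Kodexa records in source metadata is not stated here.

```python
import hashlib
from io import BytesIO

# Well-known leading byte signatures for a few common formats.
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}


def describe_source(stream: BytesIO) -> dict:
    """Return size, SHA-256 digest, and a guessed MIME type for a byte stream."""
    data = stream.getvalue()
    guessed = next(
        (mime for magic, mime in SIGNATURES.items() if data.startswith(magic)),
        "application/octet-stream",  # fallback when no signature matches
    )
    return {
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "mime_type": guessed,
    }


# Placeholder standing in for the stream returned by get_source(doc).
source_bytes = BytesIO(b"%PDF-1.7\n% placeholder bytes, not a real PDF\n")
info = describe_source(source_bytes)
print(info["mime_type"], info["size"])
```

A check like this can help decide which downstream step (for example OCR for images versus text extraction for PDFs) should handle the original file.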

Saving Documents

Documents are saved in KDDB format (SQLite) for efficient storage and retrieval:
# Save to KDDB file
doc.to_kddb("my-document.kddb")
doc.close()
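
Because KDDB is described as SQLite-based, a saved file can in principle be inspected with Python's standard `sqlite3` module. The sketch below assumes the file is a plain, unencrypted SQLite database. It builds a stand-in file so that it runs without the SDK; the table names `example_nodes` and `example_metadata` are hypothetical and do not reflect the real KDDB schema.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_tables(path: str) -> list:
    """Return the names of the tables in a SQLite file, sorted alphabetically."""
    with closing(sqlite3.connect(path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Stand-in file so the sketch runs without the SDK; the table names are
# hypothetical and do not reflect the real KDDB schema.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.kddb")
with closing(sqlite3.connect(demo_path)) as conn:
    conn.execute("CREATE TABLE example_nodes (id INTEGER PRIMARY KEY, content TEXT)")
    conn.execute("CREATE TABLE example_metadata (key TEXT, value TEXT)")
    conn.commit()

tables = list_tables(demo_path)
print(tables)
```

To try this on a real document, pass the path given to `doc.to_kddb(...)` to `list_tables` after calling `doc.close()`. Treat the result as read-only exploration; changes to a document should go through the SDK.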

Next Steps