Document Structure
A Kodexa Document consists of these core components:
- Content Node Tree: A hierarchical tree of nodes representing the document’s structure (pages, paragraphs, lines, words, tables, cells, etc.)
- Metadata: Flexible key-value pairs for document-level information
- Source Metadata: Information about the document’s origin (filename, MIME type, checksum)
- Native Documents: Embedded binary files (the original PDF, images, etc.)
- Data Objects & Attributes: Structured extracted data organized by taxonomy
- Tags: Annotations on content nodes linking them to extracted data
- Audit Trail: Change history tracking
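To make the relationship between these components concrete, here is a minimal sketch of the structure using plain Python dataclasses. It is an illustrative model only, not the Kodexa SDK: the class names (`ContentNode`, `SourceMetadata`, `SketchDocument`), their fields, and the sample values are all assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class ContentNode:
    """Hypothetical node in the content tree (page, paragraph, line, word, ...)."""
    node_type: str
    content: Optional[str] = None
    children: List["ContentNode"] = field(default_factory=list)
    # Tags annotate a node and link it to extracted data.
    tags: List[str] = field(default_factory=list)

    def walk(self) -> Iterator["ContentNode"]:
        """Yield this node, then every descendant, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class SourceMetadata:
    """Hypothetical record of where the document came from."""
    filename: str
    mime_type: str
    checksum: str


@dataclass
class SketchDocument:
    """Hypothetical container tying the core components together."""
    content_node: ContentNode
    metadata: Dict[str, Any] = field(default_factory=dict)
    source: Optional[SourceMetadata] = None
    native_documents: List[bytes] = field(default_factory=list)
    data_objects: List[Dict[str, Any]] = field(default_factory=list)
    audit_trail: List[str] = field(default_factory=list)


def build_example() -> SketchDocument:
    """Build a tiny one-page document with two tagged words (made-up values)."""
    word1 = ContentNode("word", "Invoice", tags=["document_type"])
    word2 = ContentNode("word", "12345", tags=["invoice_number"])
    line = ContentNode("line", children=[word1, word2])
    page = ContentNode("page", children=[line])
    root = ContentNode("document", children=[page])
    return SketchDocument(
        content_node=root,
        metadata={"language": "en"},
        source=SourceMetadata("invoice.pdf", "application/pdf", "abc123"),
        data_objects=[{"path": "invoice_number", "value": "12345"}],
        audit_trail=["created"],
    )


if __name__ == "__main__":
    doc = build_example()
    # Collect the text of every node that carries at least one tag.
    print([node.content for node in doc.content_node.walk() if node.tags])
```

In this sketch the tags live on the nodes while the extracted values live in `data_objects`, which reflects the split described above between annotations on content and structured data organized by taxonomy.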
Creating Documents
Documents can be created using the SDK in Python or TypeScript.
Accessing Original Source Content
Kodexa documents can embed the original source files (PDFs, images, Word documents) as native documents within the KDDB. This allows you to access the raw file data at any point during processing. You can use the get_source utility to retrieve the first embedded native document as bytes.
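The idea of keeping the original file's bytes inside the same store as the processed content can be sketched with Python's standard `sqlite3` module. This is not the Kodexa SDK and not the real KDDB schema: the `native_documents` table, its columns, and the helper names `create_store`, `embed_native_document`, and `first_native_document` are all hypothetical, chosen only to show what "retrieve the first embedded native document as bytes" might look like in a SQLite-backed file.

```python
import sqlite3
from typing import Optional


def create_store(conn: sqlite3.Connection) -> None:
    """Create a hypothetical table for embedded files (not the KDDB schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS native_documents ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "filename TEXT NOT NULL, "
        "data BLOB NOT NULL)"
    )
    conn.commit()


def embed_native_document(conn: sqlite3.Connection, filename: str, data: bytes) -> None:
    """Store the raw bytes of an original file alongside its filename."""
    conn.execute(
        "INSERT INTO native_documents (filename, data) VALUES (?, ?)",
        (filename, data),
    )
    conn.commit()


def first_native_document(conn: sqlite3.Connection) -> Optional[bytes]:
    """Return the bytes of the earliest embedded file, or None if there is none."""
    row = conn.execute(
        "SELECT data FROM native_documents ORDER BY id LIMIT 1"
    ).fetchone()
    return bytes(row[0]) if row else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_store(conn)
    embed_native_document(conn, "invoice.pdf", b"%PDF-1.7 example bytes")
    raw = first_native_document(conn)
    print(len(raw) if raw is not None else "no native document")
```

Because the bytes come back unchanged, downstream steps can re-parse the original file or hand it to another tool without needing access to the location it was first loaded from.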
Saving Documents
Documents are saved in KDDB format (SQLite) for efficient storage and retrieval.
Next Steps
- Working with a Document - Learn how to navigate and manipulate document content
- SDK Getting Started - Detailed guide with code examples for both Python and TypeScript
- Content Nodes - Deep dive into the node hierarchy
- Document Tagging - Learn about tagging and annotation
