Data Definitions describe the structured information Kodexa should extract, validate, review, and use in downstream Activity Plans. They turn a business document into a clear data model: the fields you care about, the groups those fields belong to, the types those fields should normalize into, and the rules that determine whether the extracted data is ready to use.
In some APIs, SDKs, and configuration files, Data Definitions still appear under the historical terms taxonomy and taxon. In user-facing documentation, read Data Definition for the overall model and Data Element for each field or group inside it.

What Is a Data Definition?

A Data Definition is a hierarchy of data elements. Each element is one of the following:
  • A piece of data to extract, validate, calculate, or review
  • A group that organizes related data elements
  • A repeating group, such as invoice line items or contract parties
This hierarchy becomes the shared model used by extraction, validation, review forms, Activity steps, and downstream systems.

Example Structure

Invoice (Data Definition)
├── Invoice Number (data element)
├── Invoice Date (data element)
├── Vendor (group)
│   ├── Name (data element)
│   ├── Address (data element)
│   └── Tax ID (data element)
└── Line Items (repeating group)
    ├── Description (data element)
    ├── Quantity (data element)
    ├── Unit Price (data element)
    └── Total (calculated element)
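
One way to picture this hierarchy in code is as nested dictionaries. The sketch below is plain Python, not the Kodexa SDK: the keys `name`, `kind`, and `children` and the helper `element_paths` are hypothetical names chosen only to mirror the tree above.

```python
# Hypothetical in-memory picture of the Invoice example; not a Kodexa API.
invoice = {
    "name": "Invoice",
    "kind": "data definition",
    "children": [
        {"name": "Invoice Number", "kind": "data element"},
        {"name": "Invoice Date", "kind": "data element"},
        {"name": "Vendor", "kind": "group", "children": [
            {"name": "Name", "kind": "data element"},
            {"name": "Address", "kind": "data element"},
            {"name": "Tax ID", "kind": "data element"},
        ]},
        {"name": "Line Items", "kind": "repeating group", "children": [
            {"name": "Description", "kind": "data element"},
            {"name": "Quantity", "kind": "data element"},
            {"name": "Unit Price", "kind": "data element"},
            {"name": "Total", "kind": "calculated element"},
        ]},
    ],
}


def element_paths(node, prefix=""):
    """Yield (path, kind) for the node and all of its descendants, depth first."""
    path = f"{prefix}/{node['name']}" if prefix else node["name"]
    yield path, node["kind"]
    for child in node.get("children", []):
        yield from element_paths(child, path)


paths = list(element_paths(invoice))
for path, kind in paths:
    print(f"{path} ({kind})")
```

Each printed path shows where an element sits in the model, which is the same position extraction, review forms, and downstream systems would refer to.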

Data Definition Roles

Most business users think about Data Definitions as the final data they want from a document. Kodexa also uses Data Definition structures during processing so modules, Activity steps, and model outputs can share the same vocabulary.

Content Data Definition

Defines the business-level data extracted from documents. This is the main model used for final output, review, validation, and downstream integrations.
Use for: Business data extraction, final output structure

Processing Data Definition

Supports intermediate processing work. These structures can be provided by modules or Activity Plan steps and become available when those resources are bound into a project.
Use for: Intermediate labels, routing signals, AI model support

Module Data Definition

Comes from modules used for training or inference. These structures become available when you add a module to a project or reference it from an Activity Plan.
Use for: ML module training, module-specific labels

Key Concepts

Data Elements

In configuration, data elements are written under the API field taxons. Each element can be a simple field, a group, or a repeating group.
Simple Data Element:
taxons:
  - name: invoice_number
    label: Invoice Number
    taxonType: STRING
Group Container:
taxons:
  - name: vendor
    label: Vendor Information
    group: true
    children:
      - name: name
      - name: address
Repeating Group:
taxons:
  - name: line_items
    label: Line Items
    group: true
    allowsMultipleEntries: true
    children:
      - name: description
      - name: quantity
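
The three shapes above differ only in the `group` and `allowsMultipleEntries` flags. As a rough illustration, the following plain-Python sketch mirrors the snippets as dictionaries and classifies each element from those flags. The helper `element_kind` is hypothetical, and treating a missing flag as false is an assumption made for this sketch rather than documented Kodexa behavior.

```python
# Plain-Python mirror of the three YAML snippets; not loaded through the Kodexa SDK.
taxons = [
    {"name": "invoice_number", "label": "Invoice Number", "taxonType": "STRING"},
    {"name": "vendor", "label": "Vendor Information", "group": True,
     "children": [{"name": "name"}, {"name": "address"}]},
    {"name": "line_items", "label": "Line Items", "group": True,
     "allowsMultipleEntries": True,
     "children": [{"name": "description"}, {"name": "quantity"}]},
]


def element_kind(taxon):
    """Classify an element using only the flags shown in the snippets.

    Assumption: a flag that is absent is treated as False.
    """
    if not taxon.get("group", False):
        return "simple"
    if taxon.get("allowsMultipleEntries", False):
        return "repeating group"
    return "group"


kinds = {t["name"]: element_kind(t) for t in taxons}
print(kinds)
```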

Hierarchy and Relationships

Data Definitions use parent-child relationships to organize data:
  • Root elements: Top-level fields or groups
  • Child elements: Fields nested inside a parent group
  • Sibling elements: Fields at the same level in the model
This structure helps Kodexa:
  • Organize related data logically
  • Mirror the way information appears in documents
  • Improve extraction and review accuracy
  • Produce output that downstream systems can understand
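
To make the root, child, and sibling terms concrete, here is a small sketch over a definition held as Python dictionaries. The data layout follows the `taxons` snippets on this page; the helper `find_siblings` and the element name `tax_id` are hypothetical and used only for illustration.

```python
# Hypothetical definition mirroring the `taxons` layout used on this page.
taxons = [
    {"name": "invoice_number"},
    {"name": "vendor", "group": True, "children": [
        {"name": "name"}, {"name": "address"}, {"name": "tax_id"},
    ]},
    {"name": "line_items", "group": True, "allowsMultipleEntries": True, "children": [
        {"name": "description"}, {"name": "quantity"},
    ]},
]


def find_siblings(elements, target):
    """Return the names at the same level as `target`, or None if it is not found."""
    names = [e["name"] for e in elements]
    if target in names:
        return [n for n in names if n != target]
    for element in elements:
        found = find_siblings(element.get("children", []), target)
        if found is not None:
            return found
    return None


# Root elements are simply the top-level entries.
roots = [t["name"] for t in taxons]
print(roots)
print(find_siblings(taxons, "address"))
```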

Data Definition Lifecycle

1. Design Phase

Define the model from the business problem:
  • What documents are involved?
  • What data must be extracted?
  • Which fields repeat?
  • Which values need review or validation?
  • Which downstream systems will consume the output?

2. Configuration Phase

Set properties for each data element:
  • Data type
  • Value source
  • Semantic definition
  • Validation rules
  • Conditional formatting
  • Event-based scripts

3. Training and Testing Phase

Use the Data Definition to:
  • Label representative documents
  • Train or evaluate extraction models
  • Refine semantic definitions
  • Test validation and review behavior

4. Production Phase

Use the Data Definition in live workflows to:
  • Extract structured data from new documents
  • Validate extracted values
  • Present data to reviewers through Data Forms
  • Feed Activity Plan steps and downstream integrations
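
As a rough illustration of how a downstream consumer might sanity-check extracted output against the model, the sketch below compares a record with a definition. Everything here is an assumption made for the example: the record format (nested dictionaries, with a list for repeating groups), the helper `check_shape`, and the rules it applies. It is not Kodexa's validation mechanism.

```python
# Hypothetical definition and records; this is not how Kodexa validates data.
definition = [
    {"name": "invoice_number"},
    {"name": "vendor", "group": True,
     "children": [{"name": "name"}, {"name": "address"}]},
    {"name": "line_items", "group": True, "allowsMultipleEntries": True,
     "children": [{"name": "description"}, {"name": "quantity"}]},
]


def check_shape(taxons, record, path=""):
    """Return a list of places where `record` does not match the definition."""
    problems = []
    known = {t["name"]: t for t in taxons}
    # Any key that the definition does not name is reported.
    for key in record:
        if key not in known:
            problems.append(f"{path}{key}: not in the definition")
    # Groups must be dicts; repeating groups must be lists of dicts.
    for name, taxon in known.items():
        if name not in record or not taxon.get("group"):
            continue
        value = record[name]
        children = taxon.get("children", [])
        if taxon.get("allowsMultipleEntries"):
            if not isinstance(value, list):
                problems.append(f"{path}{name}: expected a list of entries")
                continue
            for i, entry in enumerate(value):
                if isinstance(entry, dict):
                    problems += check_shape(children, entry, f"{path}{name}[{i}].")
                else:
                    problems.append(f"{path}{name}[{i}]: expected a group of fields")
        elif isinstance(value, dict):
            problems += check_shape(children, value, f"{path}{name}.")
        else:
            problems.append(f"{path}{name}: expected a group of fields")
    return problems


good = {
    "invoice_number": "INV-1",
    "vendor": {"name": "Acme", "address": "1 Main St"},
    "line_items": [{"description": "Widget", "quantity": 2}],
}
bad = {
    "invoice_number": "INV-1",
    "vendor": {"name": "Acme", "phone": "555"},
    "line_items": {"description": "Widget"},
}
print(check_shape(definition, good))
print(check_shape(definition, bad))
```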

Common Use Cases

Invoice Processing

taxons:
  - name: header
    group: true
    children:
      - name: invoice_number
      - name: invoice_date
      - name: due_date

  - name: vendor
    group: true
    children:
      - name: name
      - name: address

  - name: line_items
    group: true
    allowsMultipleEntries: true
    children:
      - name: description
      - name: quantity
      - name: unit_price
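
For programmatic access, a model like this maps naturally onto nested data classes. The sketch below is written by hand and is not output from Kodexa tooling. The class names, the choice of `Optional[str]` for every field, and the decision to model `line_items` as a list (treating it as a repeating group) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Header:
    invoice_number: Optional[str] = None
    invoice_date: Optional[str] = None
    due_date: Optional[str] = None


@dataclass
class Vendor:
    name: Optional[str] = None
    address: Optional[str] = None


@dataclass
class LineItem:
    description: Optional[str] = None
    quantity: Optional[str] = None
    unit_price: Optional[str] = None


@dataclass
class Invoice:
    header: Header = field(default_factory=Header)
    vendor: Vendor = field(default_factory=Vendor)
    # Assumption: line_items repeats, so it is modeled as a list of entries.
    line_items: List[LineItem] = field(default_factory=list)


invoice = Invoice(
    header=Header(invoice_number="INV-001"),
    line_items=[LineItem(description="Widget", quantity="2", unit_price="9.50")],
)
print(invoice.header.invoice_number, len(invoice.line_items))
```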

Contract Metadata

taxons:
  - name: contract_type
    taxonType: SELECTION

  - name: parties
    group: true
    children:
      - name: party_a
      - name: party_b

  - name: key_dates
    group: true
    children:
      - name: effective_date
      - name: expiration_date

Form Data

taxons:
  - name: applicant
    group: true
    children:
      - name: full_name
      - name: email
      - name: phone

  - name: application_details
    group: true
    children:
      - name: application_type
      - name: submission_date

Best Practices

Create Data Definitions that can be reused across similar document types:
  • Use business names that make sense across teams
  • Factor out common structures
  • Keep repeated document patterns consistent
Start with the fields that drive the workflow:
  • Begin with core data elements
  • Add groups only where they improve clarity
  • Introduce validation incrementally
Align the hierarchy with how users understand the document:
  • Match visual organization where it helps
  • Follow natural reading order
  • Group related information
Choose clear, stable names:
  • Use business terminology
  • Be specific and unambiguous
  • Follow consistent naming conventions
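
A naming convention is easier to keep when it is checked automatically. The sketch below assumes a snake_case convention, inferred from names such as `invoice_number` in the examples on this page, and also flags duplicate names among siblings. The helper `naming_problems` and both rules are hypothetical, not rules that Kodexa enforces.

```python
import re

# Assumed convention: lowercase words separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def naming_problems(taxons, path=""):
    """Return naming issues found in a definition held as nested dictionaries."""
    problems = []
    seen = set()
    for taxon in taxons:
        name = taxon["name"]
        full = f"{path}{name}"
        if not SNAKE_CASE.match(name):
            problems.append(f"{full}: not snake_case")
        if name in seen:
            problems.append(f"{full}: duplicate name among siblings")
        seen.add(name)
        # Check nested elements with the parent name as a prefix.
        problems += naming_problems(taxon.get("children", []), f"{full}.")
    return problems


example = [
    {"name": "invoice_number"},
    {"name": "Vendor", "children": [{"name": "name"}, {"name": "name"}]},
]
print(naming_problems(example))
```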

Learn More

Data Definition Structure

Configure data elements, groups, value sources, and extraction behavior

Data Types Reference

Detailed information about available data types and normalization

Building Data Classes

Generate Python data classes from Data Definitions for programmatic access

Data Definitions Overview

Overall Data Definition concepts and patterns

Next Steps

1. Review the structure guide: read Data Definition Structure for detailed configuration instructions.
2. Explore examples: check out Data Definition examples for common document types.
3. Try it in a project: create your first Data Definition in a Kodexa project and test it with sample documents.
4. Iterate and refine: use extraction results to improve semantic definitions, validation rules, and review behavior.