Overview

Data definitions are the foundation of data extraction in Kodexa. They define the hierarchical structure of data elements you want to extract from documents, along with their types, validation rules, and extraction logic.

What is a Data Definition?

A Data Definition is a hierarchical structure of data elements. In configuration and API payloads, those elements are represented as taxons. A Data Definition defines:
  • What data to extract from documents
  • Where the data comes from (document content, metadata, formulas)
  • How to validate and format the data
  • What type of data it is (string, date, currency, etc.)

Key Concepts

Data Definition

Top-level container defining the complete data structure for extraction

Data Element

Individual field or group within a Data Definition

Data Group

Organizational container that groups related data elements without storing data itself

Value Path

Defines where the data element gets its value from (document, metadata, formula, etc.)

Data Definition Structure

Top-Level Configuration

Every data definition has these core properties:
slug: invoice-data
name: Invoice Data Extraction
description: Extract structured data from invoices
taxonomyType: CONTENT
enabled: true
taxons:
  - name: vendor_information
    label: Vendor Information
    # ... data element configuration

Data Definition Properties

| Property | Type | Default | Description |
|---|---|---|---|
| slug | string | - | Unique identifier for the data definition |
| name | string | - | Display name |
| description | string | - | Description of the data definition’s purpose |
| taxonomyType | enum | CONTENT | Type of data definition (typically CONTENT) |
| enabled | boolean | true | Whether the data definition is active |
| externalDataTaxonomyRefs | string[] | [] | References to external data definitions |
| taxons | Taxon[] | [] | Array of root-level data elements |
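
To make the defaults in the table concrete, here is a minimal Python sketch that models the top-level properties as a dataclass. The class names `DataDefinition` and `Taxon` are hypothetical and are not part of any Kodexa SDK. The empty-string default for `description` is also an assumption, since the table lists no default for it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Taxon:
    """Hypothetical stand-in for a data element; only two fields are modeled."""
    name: str
    label: str


@dataclass
class DataDefinition:
    """Hypothetical model of the top-level properties listed in the table."""
    slug: str
    name: str
    description: str = ""            # assumption: the table gives no default
    taxonomyType: str = "CONTENT"    # default taken from the table
    enabled: bool = True             # default taken from the table
    externalDataTaxonomyRefs: List[str] = field(default_factory=list)
    taxons: List[Taxon] = field(default_factory=list)


definition = DataDefinition(slug="invoice-data", name="Invoice Data Extraction")
definition.taxons.append(Taxon(name="vendor_information", label="Vendor Information"))
```

Only `slug` and `name` are passed explicitly here; every other property falls back to the default shown in the table.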

Data Element Configuration

Data elements are the individual fields and groups within a Data Definition. In configuration, each one is a taxon with extensive options organized into several categories.

Basic Properties

Every data element has these basic properties; only name and label are required:
id: "auto-generated-uuid"
name: vendor_name
label: Vendor Name
description: The name of the vendor or supplier
enabled: true
color: "#4F46E5"
  • name (string, required): Internal identifier (alphanumeric, hyphens, underscores only)
  • label (string, required): Human-readable display name
  • description (string): Detailed explanation of what this data element represents
  • enabled (boolean, default: true): Whether this data element is active (disabled elements cascade to children)
  • color (string): Hex color code for UI display (auto-generated if not specified)
  • generateName (boolean, default: true): Auto-generate the internal name from the label
  • externalName (string): Name used when publishing to external systems (auto-generated from label if not specified)

Data Source (Value Path)

The valuePath determines where the data element gets its value from:
VALUE_OR_ALL_CONTENT: Extracts data directly from document content using AI/ML models or pattern matching.
When to use: Standard document extraction (invoices, contracts, forms)
Configuration:
valuePath: VALUE_OR_ALL_CONTENT
semanticDefinition: "Extract the vendor's business name as it appears on the invoice"
Features:
  • Uses semantic definition as extraction prompt
  • Can leverage document structure and layout
  • Supports AI-assisted extraction
METADATA: Pulls data from document metadata (filename, creation date, owner, etc.).
When to use: Document properties, system fields, audit trail
Configuration:
valuePath: METADATA
metadataValue: FILENAME  # or CREATED_DATETIME, OWNER_NAME, etc.
Available metadata values:
  • FILENAME - Document filename
  • TRANSACTION_UUID - Unique transaction identifier
  • CREATED_DATETIME - Document creation timestamp
  • DOCUMENT_LABELS - Applied labels
  • OWNER_NAME - Document owner
  • DOCUMENT_STATUS - Processing status
  • PAGE_NUMBER - Current page number
FORMULA: Calculates values using formulas that reference other data elements.
When to use: Computed fields, calculations, aggregations
Configuration:
valuePath: FORMULA
semanticDefinition: |
  sum({line_items/amount})
Features:
  • Reference other data elements with {field_name} or {group/field_name}
  • Built-in functions such as sum, average, if, isblank, and datemath
  • Conditional logic support
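
As a rough illustration of how a reference such as {line_items/amount} can yield one value per repeating group instance, the following Python sketch resolves a slash-separated path against nested data. The `resolve` function and the sample `invoice` dictionary are hypothetical; the formula engine's actual resolution rules are not described here, so treat this as a mental model only.

```python
from typing import Any, Dict, List


def resolve(data: Dict[str, Any], path: str) -> List[Any]:
    """Collect every value reachable by a 'group/field' style path.

    Lists are treated as repeating group instances, so a path through a
    repeating group yields one value per instance.
    """
    current: List[Any] = [data]
    for part in path.split("/"):
        next_level: List[Any] = []
        for node in current:
            # Fan out over repeating group instances.
            instances = node if isinstance(node, list) else [node]
            for instance in instances:
                if isinstance(instance, dict) and part in instance:
                    next_level.append(instance[part])
        current = next_level
    # Flatten one final level in case the last step was itself a list.
    flat: List[Any] = []
    for value in current:
        flat.extend(value if isinstance(value, list) else [value])
    return flat


# Hypothetical extracted data with a repeating line_items group.
invoice = {
    "total": 15.5,
    "line_items": [{"amount": 10.0}, {"amount": 5.5}],
}
line_amounts = resolve(invoice, "line_items/amount")
```

With this model, sum({line_items/amount}) corresponds to `sum(line_amounts)`, while a plain reference such as {total} resolves to a single value.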
REVIEW: Generates review templates using Jinja2 templating.
When to use: Human review interfaces, validation checklists
Configuration:
valuePath: REVIEW
semanticDefinition: |
  ## Review Checklist

  - [ ] Vendor name matches PO: {{ vendor_name }}
  - [ ] Total amount is correct: {{ total_amount }}
  - [ ] All line items present: {{ line_items|length }} items
Placeholder for derived values (less common, use FORMULA instead).

Data Types

The taxonType defines how the data should be treated and validated:
taxonType: STRING
typeFeatures:
  longText: true           # Multi-line text field
  maxTextRows: 10          # Maximum rows for display
  markdown: true           # Enable markdown formatting
  expected: true           # Field is expected to be present
  stringExtract: '\d'      # Keep only matching characters (regex)
  stringReplace: '[-\s]'   # Remove matching characters (regex)
Use for: Names, addresses, descriptions, any text content
Use stringExtract and stringReplace to automatically clean extracted values. See String Filters below.

Data Groups and Hierarchies

Groups organize related data elements and can represent repeating structures:
name: line_items
label: Line Items
group: true                    # This is a group, not a value
children:
  - name: description
    label: Description
    taxonType: STRING

  - name: quantity
    label: Quantity
    taxonType: NUMBER

  - name: unit_price
    label: Unit Price
    taxonType: CURRENCY

  - name: total
    label: Total
    taxonType: CURRENCY
    valuePath: FORMULA
    semanticDefinition: "{quantity} * {unit_price}"

Group Configuration

  • group (boolean, default: false): Mark as a group (container for other data elements)
  • children (Taxon[]): Array of child data elements nested under this group
  • cardinality (object): Define how many instances of this group can exist:
cardinality:
  min: 1      # Minimum required instances
  max: 100    # Maximum allowed instances
  • naturalKeys (object[]): Define unique identifiers for group instances:
naturalKeys:
  - taxonRef: "invoice_number"
  - taxonRef: "line_number"
  • eventSubscriptions (TaxonEventSubscription[]): Attach reactive JavaScript scripts to a group data element. Event subscriptions can derive values, enforce business rules, call Service Bridges, create data exceptions, or emit follow-up events when modeled data changes.
eventSubscriptions:
  - name: derive-total
    on: "changed:dataAttribute:(quantity|unit_price)"
    script: |
      if (!currentObject) return;
      var qty = currentObject.getFirstAttributeValue("quantity");
      var price = currentObject.getFirstAttributeValue("unit_price");
      if (qty && price) {
        currentObject.setAttribute("line_total", qty * price);
      }
For the full runtime guide, including the JavaScript objects available to scripts, see Event-Based Scripting.
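
The cardinality and naturalKeys options above can be read as two simple checks on the instances of a group. The Python sketch below shows one way to express them. The function names and the sample `line_items` data are hypothetical, and how the platform actually enforces or reports these constraints is an assumption not covered by this page.

```python
from typing import Any, Dict, List, Tuple


def check_cardinality(instances: List[Dict[str, Any]], minimum: int, maximum: int) -> bool:
    """True when the number of group instances lies within [minimum, maximum]."""
    return minimum <= len(instances) <= maximum


def find_duplicate_keys(instances: List[Dict[str, Any]],
                        key_fields: List[str]) -> List[Tuple[Any, ...]]:
    """Return each natural-key tuple that occurs on more than one instance."""
    seen = set()
    duplicates: List[Tuple[Any, ...]] = []
    for instance in instances:
        key = tuple(instance.get(name) for name in key_fields)
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates


# Hypothetical group instances; the third repeats the key of the first.
line_items = [
    {"invoice_number": "INV-1", "line_number": 1},
    {"invoice_number": "INV-1", "line_number": 2},
    {"invoice_number": "INV-1", "line_number": 1},
]
within_bounds = check_cardinality(line_items, minimum=1, maximum=100)
duplicates = find_duplicate_keys(line_items, ["invoice_number", "line_number"])
```

Here the group satisfies min: 1 and max: 100, but the natural key (invoice_number, line_number) is not unique across instances.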

Validation Rules

Define business rules and data quality checks on the data element they apply to:
validationRules:
  - name: "Total matches sum of line items"
    description: "Ensure calculated total matches the invoice total"
    disabled: false
    conditional: false
    ruleFormula: |
      abs({total_amount} - sum({line_items/total})) < 0.01
    messageFormula: |
      concat(
        "Total mismatch: Invoice shows ",
        {total_amount},
        " but line items sum to ",
        sum({line_items/total})
      )
    detailFormula: |
      "Check line item totals and invoice-level charges."
    overridable: true
    exceptionId: TOTAL_MISMATCH
    supportArticleId: "9117988"

  - name: "Due date after invoice date"
    conditional: true
    conditionalFormula: "!isblank({due_date}) && !isblank({invoice_date})"
    ruleFormula: |
      isafterdate({due_date}, {invoice_date}) || {due_date} = {invoice_date}
    messageFormula: |
      "Due date must be on or after invoice date"
    overridable: false

Validation Rule Properties

| Property | Type | Description |
|---|---|---|
| name | string | Rule name |
| description | string | Detailed explanation |
| disabled | boolean | Temporarily disable this rule |
| conditional | boolean | Only apply if condition is true |
| conditionalFormula | string | Formula determining if rule applies |
| ruleFormula | string | Formula that must be true (false = validation failure) |
| messageFormula | string | Formula generating the error message |
| detailFormula | string | Formula generating additional details |
| overridable | boolean | Can users override this validation? |
| exceptionId | string | Unique exception identifier |
| supportArticleId | string | Link to help documentation |
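
The properties in the table suggest a natural evaluation order: skip disabled rules, skip conditional rules whose condition is false, then fail when ruleFormula is false. The Python sketch below walks through that order, with plain callables standing in for the formulas. The `Rule` class and `evaluate` function are hypothetical, and the ordering itself is an assumption inferred from the property descriptions rather than a documented guarantee.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

Values = Dict[str, Any]


@dataclass
class Rule:
    """Hypothetical in-memory form of a validation rule."""
    name: str
    rule: Callable[[Values], bool]        # stands in for ruleFormula
    message: Callable[[Values], str]      # stands in for messageFormula
    disabled: bool = False
    conditional: bool = False
    condition: Optional[Callable[[Values], bool]] = None  # conditionalFormula
    overridable: bool = False


def evaluate(rule: Rule, values: Values) -> Optional[str]:
    """Return the failure message, or None when the rule passes or is skipped."""
    if rule.disabled:
        return None
    if rule.conditional and rule.condition is not None and not rule.condition(values):
        return None
    if rule.rule(values):
        return None
    return rule.message(values)


total_check = Rule(
    name="Total matches sum of line items",
    rule=lambda v: abs(v["total_amount"] - sum(v["line_totals"])) < 0.01,
    message=lambda v: (
        f"Total mismatch: Invoice shows {v['total_amount']} "
        f"but line items sum to {sum(v['line_totals'])}"
    ),
    overridable=True,
)

passing = evaluate(total_check, {"total_amount": 100.0, "line_totals": [60.0, 40.0]})
failing = evaluate(total_check, {"total_amount": 100.0, "line_totals": [60.0, 30.0]})
```

`passing` is None because the totals agree, while `failing` carries the generated message for the mismatched invoice.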

Validation and Conditional Formatting

Read the complete guide for rule placement, exception lifecycle, conditional formatting schema, and the formula language.

Conditional Formatting

Apply visual formatting based on data values:
conditionalFormats:
  - type: backgroundColor
    condition: "isbeforedate({due_date}, datemath('today')) && {status} != 'PAID'"
    properties:
      color: "#FEE2E2"

  - type: textColor
    condition: "isbeforedate({due_date}, datemath('today')) && {status} != 'PAID'"
    properties:
      color: "#991B1B"

  - type: icon
    condition: "{total_amount} > 10000"
    properties:
      icon: alert-circle-outline
      color: "#92400E"

Classification Features

Help AI/ML models understand and classify content:
semanticDefinition provides guidance for AI extraction:
semanticDefinition: |
  The vendor's legal business name as registered with tax authorities.
  Look for names near "Bill To", "Vendor", or "From" sections.
  Should be a proper business name, not an individual's name.
Best practices:
  • Be specific about what to look for
  • Describe location hints
  • Clarify edge cases
  • Provide examples if helpful
additionContexts helps with record-based chunking and classification:
additionContexts:
  - type: RECORD_DEFINITION
    context: |
      Each line item represents a product or service being billed.
      Line items typically appear in a table format.

  - type: RECORD_START_MARKER
    context: "Item #"

  - type: RECORD_END_MARKER
    context: "Subtotal"
Context types:
  • RECORD_DEFINITION - Describes the record structure
  • RECORD_START_MARKER - Text indicating record start
  • RECORD_END_MARKER - Text indicating record end
  • RECORD_SECTION_STARTER_MARKER - Section start marker
  • RECORD_SECTION_END_MARKER - Section end marker
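
To illustrate the roles of the start and end markers, here is a deliberately naive Python sketch that splits lines of text into records using literal marker strings. The `split_records` function and the sample lines are hypothetical. The real chunking is performed by the platform and its models, so this only shows the idea of the markers, not the actual algorithm.

```python
from typing import List


def split_records(lines: List[str], start_marker: str, end_marker: str) -> List[List[str]]:
    """Group lines into records that begin at start_marker and stop at end_marker."""
    records: List[List[str]] = []
    current: List[str] = []
    in_record = False
    for line in lines:
        if end_marker in line:
            # The end marker closes the record region; keep what we have.
            if current:
                records.append(current)
            current, in_record = [], False
        elif start_marker in line:
            # Each start marker begins a new record.
            if current:
                records.append(current)
            current, in_record = [line], True
        elif in_record:
            # Continuation line belonging to the open record.
            current.append(line)
    if current:
        records.append(current)
    return records


# Hypothetical page text for an invoice table.
page_lines = [
    "Invoice 1001",
    "Item # 1  Widget  2  10.00",
    "  blue, large",
    "Item # 2  Gadget  1  5.50",
    "Subtotal 25.50",
    "Thank you",
]
records = split_records(page_lines, start_marker="Item #", end_marker="Subtotal")
```

With the markers from the example above ("Item #" and "Subtotal"), the header line and the footer are ignored and each line item becomes its own record.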
lexicalRelations supplies synonyms and antonyms for embedding-based classification:
lexicalRelations:
  - type: SYNONYM
    value: "Supplier, Provider, Seller, Merchant"

  - type: ANTONYM
    value: "Customer, Buyer, Client"
Use for:
  • Improving classification accuracy
  • Handling terminology variations
  • Training embedding models

Advanced Options

Display Configuration

Control how fields appear in the UI:
typeFeatures:
  overrideWidth: true
  displayWidth: 300            # Width in pixels
  expected: true               # Mark as required field

String Filters

Automatically clean extracted values using regex patterns. stringExtract keeps only the characters that match its pattern; stringReplace removes the characters that match its pattern. If both are set, stringExtract runs first. The original value is preserved separately.
typeFeatures:
  stringExtract: '\d'            # Keep only digits
  stringReplace: '[^a-zA-Z0-9]' # Remove non-alphanumeric characters
| Pattern | Effect |
|---|---|
| \d | Keep digits only (use with stringExtract) |
| [a-zA-Z] | Keep letters only (use with stringExtract) |
| [-\s] | Strip dashes and spaces (use with stringReplace) |
| [^a-zA-Z0-9] | Strip all non-alphanumeric (use with stringReplace) |
| [^a-zA-Z0-9 ] | Strip special characters but keep spaces (use with stringReplace) |
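
The extract-then-replace behavior can be emulated with Python's `re` module to preview what a pattern will do to a value. The function `apply_string_filters` is a hypothetical helper written for this illustration; it assumes the filters behave exactly as described above and says nothing about regex dialect differences between Python and the platform.

```python
import re
from typing import Optional


def apply_string_filters(value: str,
                         string_extract: Optional[str] = None,
                         string_replace: Optional[str] = None) -> str:
    """Emulate the described filters: extract runs first, then replace."""
    cleaned = value
    if string_extract:
        # Keep only the parts of the value that match the extract pattern.
        cleaned = "".join(m.group(0) for m in re.finditer(string_extract, cleaned))
    if string_replace:
        # Remove every part of the value that matches the replace pattern.
        cleaned = re.sub(string_replace, "", cleaned)
    return cleaned


digits_only = apply_string_filters("(555) 123-4567", string_extract=r"\d")
no_separators = apply_string_filters("AB-12 34", string_replace=r"[-\s]")
both = apply_string_filters("Ref: AB-12", string_extract=r"[A-Z0-9-]", string_replace=r"-")
```

In the last call, extract first reduces "Ref: AB-12" to "RAB-12" and replace then strips the dash, which shows why the order of the two filters matters.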

User Interaction

multiValue: true               # Allow multiple values
userEditable: true             # User can edit in forms
notUserLabelled: false         # Show in labeling interface
nullable: true                 # Allow null values
nullValue: "N/A"               # Display text for null

Common Patterns

Invoice Extraction

Complete example of a typical invoice data definition:
slug: invoice-extraction
name: Invoice Data Extraction
taxonomyType: CONTENT
enabled: true
taxons:
  # Header Information
  - name: invoice_number
    label: Invoice Number
    taxonType: STRING
    valuePath: VALUE_OR_ALL_CONTENT
    semanticDefinition: "The unique invoice number, typically at the top right"
    validationRules:
      - name: "Invoice number required"
        ruleFormula: "!isblank({invoice_number})"
        messageFormula: '"Invoice number is required"'
        overridable: false

  - name: invoice_date
    label: Invoice Date
    taxonType: DATE
    valuePath: VALUE_OR_ALL_CONTENT
    semanticDefinition: "The date the invoice was issued"
    typeFeatures:
      normalizeDate: true
      dateFormat: "yyyy-MM-dd"

  - name: due_date
    label: Due Date
    taxonType: DATE
    valuePath: VALUE_OR_ALL_CONTENT
    semanticDefinition: "The payment due date"

  # Vendor Information Group
  - name: vendor
    label: Vendor
    group: true
    children:
      - name: name
        label: Vendor Name
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        semanticDefinition: "The vendor's business name"

      - name: address
        label: Address
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        typeFeatures:
          longText: true

      - name: tax_id
        label: Tax ID
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT

  # Line Items (Repeating Group)
  - name: line_items
    label: Line Items
    group: true
    children:
      - name: description
        label: Description
        taxonType: STRING

      - name: quantity
        label: Quantity
        taxonType: NUMBER

      - name: unit_price
        label: Unit Price
        taxonType: CURRENCY

      - name: line_total
        label: Line Total
        taxonType: CURRENCY
        valuePath: FORMULA
        semanticDefinition: "{quantity} * {unit_price}"

  # Totals
  - name: subtotal
    label: Subtotal
    taxonType: CURRENCY
    valuePath: FORMULA
    semanticDefinition: "sum({line_items/line_total})"

  - name: tax_amount
    label: Tax Amount
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT

  - name: total_amount
    label: Total Amount
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT
    validationRules:
      - name: "Total calculation check"
        ruleFormula: "abs({total_amount} - ({subtotal} + ifnull({tax_amount}, 0))) < 0.01"
        messageFormula: '"Total amount mismatch"'
        overridable: true

Contract Data Extraction

slug: contract-extraction
name: Contract Data Extraction
taxons:
  - name: contract_metadata
    label: Contract Metadata
    group: true
    children:
      - name: contract_number
        label: Contract Number
        taxonType: STRING

      - name: contract_type
        label: Contract Type
        taxonType: SELECTION
        selectionOptions:
          - label: "Service Agreement"
            id: "service"
          - label: "Purchase Agreement"
            id: "purchase"
          - label: "NDA"
            id: "nda"
          - label: "License Agreement"
            id: "license"

  - name: parties
    label: Parties
    group: true
    children:
      - name: party_a
        label: Party A
        taxonType: STRING

      - name: party_b
        label: Party B
        taxonType: STRING

  - name: key_terms
    label: Key Terms
    group: true
    children:
      - name: effective_date
        label: Effective Date
        taxonType: DATE

      - name: term_length
        label: Term Length
        taxonType: STRING
        semanticDefinition: "Duration of the contract (e.g., '12 months', '2 years')"

      - name: auto_renewal
        label: Auto Renewal
        taxonType: BOOLEAN
        semanticDefinition: "Does the contract automatically renew?"

      - name: termination_notice
        label: Termination Notice Period
        taxonType: STRING
        semanticDefinition: "Required notice period for termination (e.g., '30 days')"

  - name: financial_terms
    label: Financial Terms
    group: true
    children:
      - name: total_value
        label: Total Contract Value
        taxonType: CURRENCY

      - name: payment_terms
        label: Payment Terms
        taxonType: STRING
        semanticDefinition: "Payment schedule and terms (e.g., 'Net 30', 'Monthly in advance')"

Best Practices

Naming Conventions

name: vendor_name           # Snake case for internal names
label: Vendor Name          # Title case for display
externalName: vendorName    # Camel case for APIs
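
The three conventions above can all be derived from the label. The Python sketch below shows one way to do that. The helper functions are hypothetical, and the rules the platform uses when generateName or externalName are auto-generated are not specified on this page, so its output may differ.

```python
import re
from typing import List


def _words(label: str) -> List[str]:
    """Split a label into lowercase alphanumeric words."""
    return [word.lower() for word in re.findall(r"[A-Za-z0-9]+", label)]


def to_internal_name(label: str) -> str:
    """Snake case, e.g. for the name property."""
    return "_".join(_words(label))


def to_external_name(label: str) -> str:
    """Camel case, e.g. for the externalName property."""
    words = _words(label)
    if not words:
        return ""
    return words[0] + "".join(word.capitalize() for word in words[1:])


internal = to_internal_name("Vendor Name")
external = to_external_name("Vendor Name")
```

For the label "Vendor Name" this yields the same `vendor_name` and `vendorName` values used in the convention above.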

Semantic Definitions

Write semantic definitions as if explaining to a human what to look for. Be specific about:
  • What the field represents
  • Where it typically appears
  • How to identify it
  • Edge cases to consider
Good example:
semanticDefinition: |
  The total amount due on the invoice, including all taxes and fees.
  Look for labels like "Total", "Amount Due", "Balance Due", or "Total Amount".
  This should be the final bottom-line number, not a subtotal.
  If multiple totals exist (e.g., by currency), extract the primary total.
Avoid:
semanticDefinition: "The total"  # Too vague

Group Structures

# Good: Line items are a repeating group
- name: line_items
  label: Line Items
  group: true
  children:
    - name: description
    - name: quantity
    - name: price

Validation Strategy

  1. Start Simple: Begin with basic “not empty” validations
  2. Add Business Rules: Implement domain-specific validations
  3. Make Critical Rules Non-Overridable: Block processing if essential data is wrong
  4. Allow Overrides for Quality Checks: Let users override formatting or minor issues
validationRules:
  # Critical: Don't allow override
  - name: "Invoice number required"
    ruleFormula: "!isblank({invoice_number})"
    overridable: false

  # Quality check: Allow override
  - name: "Total seems high"
    ruleFormula: "{total_amount} < 100000"
    messageFormula: '"Invoice total exceeds $100,000 - please verify"'
    overridable: true
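
One way to reason about the strategy above is as a gate on processing: a failed non-overridable rule always blocks, while a failed overridable rule blocks only until a user overrides it. The Python sketch below encodes that reading. The `can_proceed` function and the shape of the failure records are hypothetical, and the assumption that an overridable failure blocks until it is overridden is an interpretation, not documented behavior.

```python
from typing import Dict, List, Set


def can_proceed(failures: List[Dict[str, object]], overridden: Set[str]) -> bool:
    """Decide whether processing may continue given the failed rules."""
    for failure in failures:
        if not failure["overridable"]:
            # Critical rule failed: nothing can unblock it.
            return False
        if failure["name"] not in overridden:
            # Quality check failed and no user has overridden it yet.
            return False
    return True


# Hypothetical failure records built from the two rules above.
quality_failure = {"name": "Total seems high", "overridable": True}
critical_failure = {"name": "Invoice number required", "overridable": False}

blocked_until_review = can_proceed([quality_failure], overridden=set())
released_by_override = can_proceed([quality_failure], overridden={"Total seems high"})
always_blocked = can_proceed([critical_failure], overridden={"Invoice number required"})
```

The third call stays blocked even though the rule name appears in `overridden`, which is the point of marking critical rules as non-overridable.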

Formula Usage

# Arithmetic on sibling data elements
semanticDefinition: "{quantity} * {unit_price}"

# Aggregation over a repeating group
semanticDefinition: "sum({line_items/total})"

# Conditional logic
semanticDefinition: |
  if({total_amount} > 10000, "Requires Approval", "Auto-Approve")

# Date arithmetic
semanticDefinition: "datemath({invoice_date}, 'days', 30)"

Troubleshooting

Common Issues

Data element does not appear
Possible causes:
  • enabled: false is set
  • Parent data element is disabled (disabling cascades to children)
  • notUserLabelled: true is set, which hides the element in labeling interfaces
Solution: Check enabled status up the hierarchy
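
Checking enabled status up the hierarchy can be automated with a short walk over the definition. The Python sketch below takes a slash-separated path and reports which elements on it are disabled. The `disabled_on_path` function and the sample `taxons` list are hypothetical; it assumes the definition has been loaded into plain dictionaries and that a missing enabled key means true, as the default above states.

```python
from typing import Any, Dict, List


def disabled_on_path(taxons: List[Dict[str, Any]], path: str) -> List[str]:
    """Return the names of disabled elements along a 'group/field' path."""
    blockers: List[str] = []
    level = taxons
    for part in path.split("/"):
        match = next((t for t in level if t["name"] == part), None)
        if match is None:
            raise KeyError(f"No data element named {part!r} on path {path!r}")
        if not match.get("enabled", True):
            blockers.append(part)
        # Descend into the children of the matched element.
        level = match.get("children", [])
    return blockers


# Hypothetical definition: the vendor group is disabled, its child is not.
taxons = [
    {"name": "vendor", "enabled": False, "children": [{"name": "tax_id"}]},
    {"name": "invoice_number"},
]
blockers = disabled_on_path(taxons, "vendor/tax_id")
```

Here `tax_id` itself is enabled, but the result names `vendor` as the disabled ancestor that hides it.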
Extraction results are missing or wrong
Check:
  1. Is valuePath correct for your use case?
  2. Is semanticDefinition clear and specific?
  3. Are you using the right taxonType?
  4. Is the model trained for this document type?
Formula errors
Common mistakes:
  • Referencing data elements that don’t exist
  • Syntax errors in the formula
  • Circular references
Test: Use the formula builder to validate syntax
Validation rule does not fire
Check:
  • Is the rule's disabled property set to false?
  • Does conditionalFormula evaluate to true?
  • Is ruleFormula returning the expected boolean?

Next Steps

Data Types Reference

Reference guide for all supported data types

Formula Reference

Complete formula function reference

Selection Option Formulas

Compute dropdown options dynamically using JavaScript and service bridges

Event-Based Scripting

Attach reactive JavaScript behavior to group data elements

Scripting Reference

Complete API reference for Kodexa JavaScript scripting

Validation and Formatting

Common validation rule and conditional formatting patterns