Complete Data Definition Guide - Kodexa Developer Portal

Overview

Data definitions are the foundation of data extraction in Kodexa. They define the hierarchical structure of data elements you want to extract from documents, along with their types, validation rules, and extraction logic.

What is a Data Definition?

A data definition is a hierarchical structure of taxons (data elements) that defines:

What data to extract from documents
Where the data comes from (document content, metadata, formulas, expressions)
How to validate and format the data
What type of data it is (string, date, currency, etc.)
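
As a rough sketch, the four questions above can be mapped onto a single taxon. The field and rule names are illustrative; every property shown is described later in this guide.

```yaml
# Illustrative taxon: each commented line answers one of the four questions above.
- name: invoice_date                  # what to extract
  label: Invoice Date
  valuePath: VALUE_OR_ALL_CONTENT     # where the data comes from
  semanticDefinition: "The date the invoice was issued"
  taxonType: DATE                     # what type of data it is
  validationRules:                    # how to validate it
    - name: "Invoice date required"
      ruleFormula: "NOT_EMPTY(invoice_date)"
      messageFormula: '"Invoice date is required"'
      overridable: false
```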

Key Concepts

Data Definition

Top-level container defining the complete data structure for extraction

Taxon

Individual data element within a data definition (field or group)

Data Group

Organizational container that groups related taxons without storing data itself

Value Path

Defines where the taxon gets its data from (document, metadata, formula, etc.)

Data Definition Structure

Top-Level Configuration

Every data definition has these core properties:

slug: invoice-data
name: Invoice Data Extraction
description: Extract structured data from invoices
taxonomyType: CONTENT
enabled: true
taxons:
  - name: vendor_information
    label: Vendor Information
    # ... taxon configuration

Data Definition Properties

Property	Type	Default	Description
`slug`	string	-	Unique identifier for the data definition
`name`	string	-	Display name
`description`	string	-	Description of the data definition’s purpose
`taxonomyType`	enum	`CONTENT`	Type of data definition (typically CONTENT)
`enabled`	boolean	`true`	Whether the data definition is active
`externalDataTaxonomyRefs`	string[]	`[]`	References to external data definitions
`taxons`	Taxon[]	`[]`	Array of root-level taxons

Taxon Configuration

Taxons are the individual data elements within a data definition. Each taxon has extensive configuration options organized into several categories.

Basic Properties

Every taxon has these basic properties; only name and label are required:

id: "auto-generated-uuid"
name: vendor_name
label: Vendor Name
description: The name of the vendor or supplier
enabled: true
color: "#4F46E5"

Property	Type	Required / Default	Description
`name`	string	required	Internal identifier (alphanumeric, hyphens, underscores only)
`label`	string	required	Human-readable display name
`description`	string	-	Detailed explanation of what this data element represents
`enabled`	boolean	`true`	Whether this taxon is active (disabled taxons cascade to children)
`color`	string	-	Hex color code for UI display (auto-generated if not specified)
`generateName`	boolean	`true`	Auto-generate the internal name from the label
`externalName`	string	-	Name used when publishing to external systems (auto-generated from label if not specified)
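
A minimal sketch of overriding the auto-generated names. It assumes that setting generateName to false keeps a hand-written name instead of deriving one from the label; the field itself is hypothetical.

```yaml
name: po_number                # written by hand rather than derived from the label
label: Purchase Order Number
generateName: false            # assumption: false preserves the name above
externalName: poNumber         # name used when publishing to external systems
```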

Data Source (Value Path)

The valuePath determines where the taxon gets its data from:

Document (VALUE_OR_ALL_CONTENT)

Extracts data directly from document content using AI/ML models or pattern matching.

When to use: Standard document extraction (invoices, contracts, forms)

Configuration:

valuePath: VALUE_OR_ALL_CONTENT
semanticDefinition: "Extract the vendor's business name as it appears on the invoice"

Features:

Uses semantic definition as extraction prompt
Can leverage document structure and layout
Supports AI-assisted extraction

Metadata (METADATA)

Pulls data from document metadata (filename, creation date, owner, etc.).

When to use: Document properties, system fields, audit trail

Configuration:

valuePath: METADATA
metadataValue: FILENAME  # or CREATED_DATETIME, OWNER_NAME, etc.

Available metadata values:

FILENAME - Document filename
TRANSACTION_UUID - Unique transaction identifier
CREATED_DATETIME - Document creation timestamp
DOCUMENT_LABELS - Applied labels
OWNER_NAME - Document owner
DOCUMENT_STATUS - Processing status
PAGE_NUMBER - Current page number
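
For an audit trail, several metadata values might be captured side by side. This is a sketch only: the taxon names are hypothetical, and STRING is assumed to be a suitable type for each value.

```yaml
taxons:
  - name: source_filename
    label: Source Filename
    taxonType: STRING
    valuePath: METADATA
    metadataValue: FILENAME

  - name: document_owner
    label: Document Owner
    taxonType: STRING
    valuePath: METADATA
    metadataValue: OWNER_NAME

  - name: processing_status
    label: Processing Status
    taxonType: STRING
    valuePath: METADATA
    metadataValue: DOCUMENT_STATUS
```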

Formula (FORMULA)

Calculates values using formulas that reference other taxons.

When to use: Computed fields, calculations, aggregations

Configuration:

valuePath: FORMULA
semanticDefinition: |
  SUM(line_items.amount)

Features:

Reference other taxons by name
Built-in functions (SUM, AVG, COUNT, etc.)
Conditional logic support
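
The features listed above might look like this in practice. The taxon names are illustrative, and the formulas reuse only operators and functions that appear elsewhere in this guide (arithmetic on taxon names, and IF).

```yaml
# References two other taxons by name
- name: grand_total
  label: Grand Total
  taxonType: CURRENCY
  valuePath: FORMULA
  semanticDefinition: "subtotal + tax_amount"

# Conditional logic based on another taxon's value
- name: approval_status
  label: Approval Status
  taxonType: STRING
  valuePath: FORMULA
  semanticDefinition: |
    IF(total_amount > 10000, "Requires Approval", "Auto-Approve")
```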

Review (REVIEW)

Generates review templates using Jinja2 templating.

When to use: Human review interfaces, validation checklists

Configuration:

valuePath: REVIEW
semanticDefinition: |
  ## Review Checklist

  - [ ] Vendor name matches PO: {{ vendor_name }}
  - [ ] Total amount is correct: {{ total_amount }}
  - [ ] All line items present: {{ line_items|length }} items

External (EXTERNAL)

Populates data from external sources using Groovy expressions.

When to use: API integrations, database lookups, external systems

Configuration:

valuePath: EXTERNAL
expression: |
  // Fetch from external API
  def response = http.get("https://api.example.com/vendor/${vendor_id}")
  return response.name

Derived (DERIVED)

Placeholder for derived values. It is less common; use FORMULA instead.

Data Types

The taxonType defines how the data should be treated and validated:

Supported types include String, Number, Currency, Date, Date Time, Selection, Boolean, and others. A String taxon is configured as follows:

taxonType: STRING
typeFeatures:
  longText: true           # Multi-line text field
  maxTextRows: 10          # Maximum rows for display
  markdown: true           # Enable markdown formatting
  expected: true           # Field is expected to be present

Use for: Names, addresses, descriptions, any text content
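
The other types follow the same shape as the STRING example above. The fragment below is a sketch assembled from property names used in the examples later in this guide; the enum value for Date Time does not appear in this guide, so it is left out rather than guessed.

```yaml
# Number: plain numeric value
- name: quantity
  label: Quantity
  taxonType: NUMBER

# Currency: monetary amount
- name: unit_price
  label: Unit Price
  taxonType: CURRENCY

# Date: normalized to a fixed format
- name: invoice_date
  label: Invoice Date
  taxonType: DATE
  typeFeatures:
    normalizeDate: true
    dateFormat: "yyyy-MM-dd"

# Selection: value restricted to a fixed list of options
- name: contract_type
  label: Contract Type
  taxonType: SELECTION
  selectionOptions:
    - label: "Service Agreement"
      id: "service"
    - label: "NDA"
      id: "nda"

# Boolean: yes/no answer
- name: auto_renewal
  label: Auto Renewal
  taxonType: BOOLEAN
```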

Data Groups and Hierarchies

Groups organize related taxons and can represent repeating structures:

name: line_items
label: Line Items
group: true                    # This is a group, not a value
valuePath: EXTERNAL            # Groups can use EXTERNAL for API data
children:
  - name: description
    label: Description
    taxonType: STRING

  - name: quantity
    label: Quantity
    taxonType: NUMBER

  - name: unit_price
    label: Unit Price
    taxonType: CURRENCY

  - name: total
    label: Total
    taxonType: CURRENCY
    valuePath: FORMULA
    semanticDefinition: "quantity * unit_price"

Group Configuration

group (boolean, default: false): Mark as a group (container for other taxons)

children (Taxon[]): Array of child taxons nested under this group

cardinality (object): Define how many instances of this group can exist:

cardinality:
  min: 1      # Minimum required instances
  max: 100    # Maximum allowed instances

naturalKeys (object[]): Define unique identifiers for group instances:

naturalKeys:
  - taxonRef: "invoice_number"
  - taxonRef: "line_number"
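
Put together, a repeating group using these options might look like the sketch below. It assumes that taxonRef points at a child taxon by name; the line_number child is hypothetical.

```yaml
name: line_items
label: Line Items
group: true
cardinality:
  min: 1                        # at least one line item is required
  max: 100
naturalKeys:
  - taxonRef: "line_number"     # assumption: refers to the child taxon below
children:
  - name: line_number
    label: Line Number
    taxonType: NUMBER

  - name: description
    label: Description
    taxonType: STRING
```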

Validation Rules

Define business rules and data quality checks:

validationRules:
  - name: "Total matches sum of line items"
    description: "Ensure calculated total matches the invoice total"
    disabled: false
    conditional: false               # Apply always
    ruleFormula: |
      ABS(total_amount - SUM(line_items.total)) < 0.01
    messageFormula: |
      "Total mismatch: Invoice shows " + total_amount + " but line items sum to " + SUM(line_items.total)
    detailFormula: |
      "Check line items for accuracy"
    overridable: true                # User can override this validation
    exceptionId: "TOTAL_MISMATCH"    # Unique exception identifier
    supportArticleId: "9117988"      # Link to help article

  - name: "Due date after invoice date"
    conditional: true                # Only apply if condition met
    conditionalFormula: "NOT_EMPTY(due_date)"
    ruleFormula: |
      due_date > invoice_date
    messageFormula: |
      "Due date must be after invoice date"
    overridable: false               # Strict validation

Validation Rule Properties

Property	Type	Description
`name`	string	Rule name
`description`	string	Detailed explanation
`disabled`	boolean	Temporarily disable this rule
`conditional`	boolean	Only apply if condition is true
`conditionalFormula`	string	Formula determining if rule applies
`ruleFormula`	string	Formula that must be true (false = validation failure)
`messageFormula`	string	Formula generating the error message
`detailFormula`	string	Formula generating additional details
`overridable`	boolean	Can users override this validation?
`exceptionId`	string	Unique exception identifier
`supportArticleId`	string	Link to help documentation

Conditional Formatting

Apply visual formatting based on data values:

conditionalFormats:
  - name: "Highlight overdue"
    formula: "due_date < TODAY() AND status != 'PAID'"
    backgroundColor: "#FEE2E2"     # Light red
    textColor: "#991B1B"           # Dark red
    fontWeight: "bold"

  - name: "Flag large amounts"
    formula: "total_amount > 10000"
    backgroundColor: "#FEF3C7"     # Light yellow
    icon: "warning"

Classification Features

Help AI/ML models understand and classify content:

Semantic Definition

Provides guidance for AI extraction:

semanticDefinition: |
  The vendor's legal business name as registered with tax authorities.
  Look for names near "Bill To", "Vendor", or "From" sections.
  Should be a proper business name, not an individual's name.

Best practices:

Be specific about what to look for
Describe location hints
Clarify edge cases
Provide examples if helpful

Additional Context

Helps with record-based chunking and classification:

additionContexts:
  - type: RECORD_DEFINITION
    context: |
      Each line item represents a product or service being billed.
      Line items typically appear in a table format.

  - type: RECORD_START_MARKER
    context: "Item #"

  - type: RECORD_END_MARKER
    context: "Subtotal"

Context types:

RECORD_DEFINITION - Describes the record structure
RECORD_START_MARKER - Text indicating record start
RECORD_END_MARKER - Text indicating record end
RECORD_SECTION_STARTER_MARKER - Section start marker
RECORD_SECTION_END_MARKER - Section end marker
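
The section markers are not shown in the example above. A sketch of how they might be configured, assuming they take the same type and context keys as the record markers; the marker text is hypothetical.

```yaml
additionContexts:
  - type: RECORD_SECTION_STARTER_MARKER
    context: "Itemized Charges"         # hypothetical heading that opens the section

  - type: RECORD_SECTION_END_MARKER
    context: "Payment Instructions"     # hypothetical heading that closes the section
```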

Lexical Relations

Synonyms and antonyms for embedding-based classification:

lexicalRelations:
  - type: SYNONYM
    value: "Supplier, Provider, Seller, Merchant"

  - type: ANTONYM
    value: "Customer, Buyer, Client"

Use for:

Improving classification accuracy
Handling terminology variations
Training embedding models

Advanced Options

Fallback Expression

Provide alternative extraction logic if primary method fails:

enableFallbackExpression: true
fallbackExpression: |
  // If extraction failed, try alternate method
  def altValue = document.findPattern("Vendor:\\s*(.*)")
  return altValue ?: "UNKNOWN"

Serialization Expression

Custom logic for exporting data:

enableSerializationExpression: true
serializationExpression: |
  // Export phone number in E.164 format (assumes a US number stored without the country code)
  return "+1" + phone_number.replaceAll("[^0-9]", "")

Post-Extraction Expression

Transform data after initial extraction:

usePostExpression: true
postExpression: |
  // Standardize vendor name format
  return vendor_name.trim().toUpperCase()

Display Configuration

Control how fields appear in the UI:

typeFeatures:
  overrideWidth: true
  displayWidth: 300            # Width in pixels
  expected: true               # Mark as required field

User Interaction

multiValue: true               # Allow multiple values
userEditable: true             # User can edit in forms
notUserLabelled: false         # Show in labeling interface
nullable: true                 # Allow null values
nullValue: "N/A"              # Display text for null
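
As a sketch, a system-populated field that reviewers should see but not change could combine these flags. It assumes userEditable: false makes the field read-only in forms; the taxon itself is illustrative.

```yaml
- name: processing_status
  label: Processing Status
  taxonType: STRING
  valuePath: METADATA
  metadataValue: DOCUMENT_STATUS
  userEditable: false            # assumption: read-only in forms
  notUserLabelled: true          # keep it out of the labeling interface
  nullable: true
  nullValue: "N/A"               # shown when no status is available
```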

Common Patterns

Invoice Extraction

Complete example of a typical invoice data definition:

slug: invoice-extraction
name: Invoice Data Extraction
taxonomyType: CONTENT
enabled: true
taxons:
  # Header Information
  - name: invoice_number
    label: Invoice Number
    taxonType: STRING
    valuePath: VALUE_OR_ALL_CONTENT
    semanticDefinition: "The unique invoice number, typically at the top right"
    validationRules:
      - name: "Invoice number required"
        ruleFormula: "NOT_EMPTY(invoice_number)"
        messageFormula: '"Invoice number is required"'
        overridable: false

  - name: invoice_date
    label: Invoice Date
    taxonType: DATE
    valuePath: VALUE_OR_ALL_CONTENT
    semanticDefinition: "The date the invoice was issued"
    typeFeatures:
      normalizeDate: true
      dateFormat: "yyyy-MM-dd"

  - name: due_date
    label: Due Date
    taxonType: DATE
    valuePath: VALUE_OR_ALL_CONTENT
    semanticDefinition: "The payment due date"

  # Vendor Information Group
  - name: vendor
    label: Vendor
    group: true
    children:
      - name: name
        label: Vendor Name
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        semanticDefinition: "The vendor's business name"

      - name: address
        label: Address
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        typeFeatures:
          longText: true

      - name: tax_id
        label: Tax ID
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT

  # Line Items (Repeating Group)
  - name: line_items
    label: Line Items
    group: true
    children:
      - name: description
        label: Description
        taxonType: STRING

      - name: quantity
        label: Quantity
        taxonType: NUMBER

      - name: unit_price
        label: Unit Price
        taxonType: CURRENCY

      - name: line_total
        label: Line Total
        taxonType: CURRENCY
        valuePath: FORMULA
        semanticDefinition: "quantity * unit_price"

  # Totals
  - name: subtotal
    label: Subtotal
    taxonType: CURRENCY
    valuePath: FORMULA
    semanticDefinition: "SUM(line_items.line_total)"

  - name: tax_amount
    label: Tax Amount
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT

  - name: total_amount
    label: Total Amount
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT
    validationRules:
      - name: "Total calculation check"
        ruleFormula: "ABS(total_amount - (subtotal + tax_amount)) < 0.01"
        messageFormula: '"Total amount mismatch"'
        overridable: true

Contract Data Extraction

slug: contract-extraction
name: Contract Data Extraction
taxons:
  - name: contract_metadata
    label: Contract Metadata
    group: true
    children:
      - name: contract_number
        label: Contract Number
        taxonType: STRING

      - name: contract_type
        label: Contract Type
        taxonType: SELECTION
        selectionOptions:
          - label: "Service Agreement"
            id: "service"
          - label: "Purchase Agreement"
            id: "purchase"
          - label: "NDA"
            id: "nda"
          - label: "License Agreement"
            id: "license"

  - name: parties
    label: Parties
    group: true
    children:
      - name: party_a
        label: Party A
        taxonType: STRING

      - name: party_b
        label: Party B
        taxonType: STRING

  - name: key_terms
    label: Key Terms
    group: true
    children:
      - name: effective_date
        label: Effective Date
        taxonType: DATE

      - name: term_length
        label: Term Length
        taxonType: STRING
        semanticDefinition: "Duration of the contract (e.g., '12 months', '2 years')"

      - name: auto_renewal
        label: Auto Renewal
        taxonType: BOOLEAN
        semanticDefinition: "Does the contract automatically renew?"

      - name: termination_notice
        label: Termination Notice Period
        taxonType: STRING
        semanticDefinition: "Required notice period for termination (e.g., '30 days')"

  - name: financial_terms
    label: Financial Terms
    group: true
    children:
      - name: total_value
        label: Total Contract Value
        taxonType: CURRENCY

      - name: payment_terms
        label: Payment Terms
        taxonType: STRING
        semanticDefinition: "Payment schedule and terms (e.g., 'Net 30', 'Monthly in advance')"

Best Practices

Naming Conventions

name: vendor_name           # Snake case for internal names
label: Vendor Name          # Title case for display
externalName: vendorName    # Camel case for APIs

Semantic Definitions

Write semantic definitions as if explaining to a human what to look for. Be specific about:

What the field represents
Where it typically appears
How to identify it
Edge cases to consider

Good example:

semanticDefinition: |
  The total amount due on the invoice, including all taxes and fees.
  Look for labels like "Total", "Amount Due", "Balance Due", or "Total Amount".
  This should be the final bottom-line number, not a subtotal.
  If multiple totals exist (e.g., by currency), extract the primary total.

Avoid:

semanticDefinition: "The total"  # Too vague

Group Structures

# Good: Line items are a repeating group
- name: line_items
  label: Line Items
  group: true
  children:
    - name: description
      label: Description
    - name: quantity
      label: Quantity
    - name: price
      label: Price

Validation Strategy

Start Simple: Begin with basic “not empty” validations
Add Business Rules: Implement domain-specific validations
Make Critical Rules Non-Overridable: Block processing if essential data is wrong
Allow Overrides for Quality Checks: Let users override formatting or minor issues

validationRules:
  # Critical: Don't allow override
  - name: "Invoice number required"
    ruleFormula: "NOT_EMPTY(invoice_number)"
    overridable: false

  # Quality check: Allow override
  - name: "Total seems high"
    ruleFormula: "total_amount < 100000"
    messageFormula: '"Invoice total exceeds $100,000 - please verify"'
    overridable: true

Formula Usage

Simple Calculations

semanticDefinition: "quantity * unit_price"

Aggregations

semanticDefinition: "SUM(line_items.total)"

Conditional Logic

semanticDefinition: |
  IF(total_amount > 10000, "Requires Approval", "Auto-Approve")

Date Calculations

semanticDefinition: "DATE_ADD(invoice_date, 30, 'DAYS')"

Troubleshooting

Common Issues

Taxon not appearing in UI

Possible causes:

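enabled: false is set on the taxon
A parent taxon is disabled (disabling cascades to children)
notUserLabelled: true is set (hides the taxon from labeling interfaces)

Solution: Check the enabled status of the taxon and of every ancestor in the hierarchy, then check notUserLabelled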
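
The cascading case is the easiest to miss. In the illustrative fragment below, the child is marked enabled but still does not appear, because its parent group is disabled.

```yaml
- name: vendor
  label: Vendor
  group: true
  enabled: false          # disabling the group...
  children:
    - name: tax_id
      label: Tax ID
      taxonType: STRING
      enabled: true       # ...also disables this child, despite enabled: true
```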

Extraction not working

Check:

Is valuePath correct for your use case?
Is semanticDefinition clear and specific?
Are you using the right taxonType?
Is the model trained for this document type?

Formula errors

Common mistakes:

Referencing taxons that don’t exist
Syntax errors in formula
Circular references

Test: Use the formula builder to validate syntax
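
A circular reference can be hard to spot when the formulas live on different taxons. The sketch below uses illustrative names and shows the mistake, not a recommended configuration.

```yaml
# Circular: each formula depends on the other, so neither can be resolved first.
- name: subtotal
  label: Subtotal
  taxonType: CURRENCY
  valuePath: FORMULA
  semanticDefinition: "total_amount - tax_amount"

- name: total_amount
  label: Total Amount
  taxonType: CURRENCY
  valuePath: FORMULA
  semanticDefinition: "subtotal + tax_amount"
```

One way out is to extract one of the two values from the document (VALUE_OR_ALL_CONTENT) and compute only the other.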

Validation not triggering

Check:

Is the rule's disabled property set to false?
Does conditionalFormula evaluate to true?
Is ruleFormula returning the expected boolean?

Next Steps

Build Your First Data Definition

Step-by-step tutorial for creating a data definition

Formula Reference

Complete formula function reference

Validation Patterns

Common validation rule patterns

API Reference

Data definition API documentation
