Overview

Data definitions are the foundation of data extraction in Kodexa. They define the hierarchical structure of data elements you want to extract from documents, along with their types, validation rules, and extraction logic.

What is a Data Definition?

A data definition is a hierarchical structure of taxons (data elements) that defines:
  • What data to extract from documents
  • Where the data comes from (document content, metadata, formulas, expressions)
  • How to validate and format the data
  • What type of data it is (string, date, currency, etc.)

Key Concepts

Data Definition

Top-level container defining the complete data structure for extraction

Taxon

Individual data element within a data definition (field or group)

Data Group

Organizational container that groups related taxons without storing data itself

Value Path

Defines where the taxon gets its data from (document, metadata, formula, etc.)
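
As a rough sketch of how these four concepts fit together, the fragment below annotates each one in a small definition. It uses only property names that appear elsewhere on this page; the slug and taxon names are illustrative assumptions, not part of any shipped definition.

```yaml
# Data definition: the top-level container
slug: example-definition          # illustrative slug
name: Example Definition
taxons:
  # Data group: organizes related taxons and stores no value itself
  - name: vendor
    label: Vendor
    group: true
    children:
      # Taxon: an individual data element
      - name: vendor_name
        label: Vendor Name
        taxonType: STRING
        # Value path: where this taxon's data comes from
        valuePath: VALUE_OR_ALL_CONTENT
```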

Data Definition Structure

Top-Level Configuration

Every data definition has these core properties:
slug: invoice-data
name: Invoice Data Extraction
description: Extract structured data from invoices
taxonomyType: CONTENT
enabled: true
taxons:
  - name: vendor_information
    label: Vendor Information
    # ... taxon configuration

Data Definition Properties

| Property | Type | Default | Description |
|---|---|---|---|
| slug | string | - | Unique identifier for the data definition |
| name | string | - | Display name |
| description | string | - | Description of the data definition’s purpose |
| taxonomyType | enum | CONTENT | Type of data definition (typically CONTENT) |
| enabled | boolean | true | Whether the data definition is active |
| externalDataTaxonomyRefs | string[] | [] | References to external data definitions |
| taxons | Taxon[] | [] | Array of root-level taxons |

Taxon Configuration

Taxons are the individual data elements within a data definition. Each taxon has extensive configuration options organized into several categories.

Basic Properties

Every taxon has these fundamental properties (only name and label are required):
id: "auto-generated-uuid"
name: vendor_name
label: Vendor Name
description: The name of the vendor or supplier
enabled: true
color: "#4F46E5"
  • name (string, required): Internal identifier (alphanumeric, hyphens, underscores only)
  • label (string, required): Human-readable display name
  • description (string): Detailed explanation of what this data element represents
  • enabled (boolean, default: true): Whether this taxon is active (disabled taxons cascade to children)
  • color (string): Hex color code for UI display (auto-generated if not specified)
  • generateName (boolean, default: true): Auto-generate the internal name from the label
  • externalName (string): Name used when publishing to external systems (auto-generated from label if not specified)
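
The cascading behavior of enabled can be easy to miss, so here is a small sketch of it. The group and child names are made up for illustration; the point is only that the child carries no enabled flag of its own and is still inactive because its parent is disabled.

```yaml
- name: shipping                  # illustrative group name
  label: Shipping
  group: true
  enabled: false                  # disabling the parent...
  children:
    - name: carrier               # ...also deactivates this child
      label: Carrier
      taxonType: STRING
```

Re-enabling the parent should be enough to bring the child back, provided the child is not disabled itself.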

Data Source (Value Path)

The valuePath determines where the taxon gets its data from:
VALUE_OR_ALL_CONTENT: Extracts data directly from document content using AI/ML models or pattern matching.
When to use: Standard document extraction (invoices, contracts, forms)
Configuration:
valuePath: VALUE_OR_ALL_CONTENT
semanticDefinition: "Extract the vendor's business name as it appears on the invoice"
Features:
  • Uses semantic definition as extraction prompt
  • Can leverage document structure and layout
  • Supports AI-assisted extraction

METADATA: Pulls data from document metadata (filename, creation date, owner, etc.).
When to use: Document properties, system fields, audit trail
Configuration:
valuePath: METADATA
metadataValue: FILENAME  # or CREATED_DATETIME, OWNER_NAME, etc.
Available metadata values:
  • FILENAME - Document filename
  • TRANSACTION_UUID - Unique transaction identifier
  • CREATED_DATETIME - Document creation timestamp
  • DOCUMENT_LABELS - Applied labels
  • OWNER_NAME - Document owner
  • DOCUMENT_STATUS - Processing status
  • PAGE_NUMBER - Current page number
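
A short sketch of several metadata-backed taxons side by side, for example to build an audit trail. The taxon names and labels are illustrative assumptions; the metadataValue entries come from the list above.

```yaml
- name: source_filename           # illustrative taxon name
  label: Source Filename
  taxonType: STRING
  valuePath: METADATA
  metadataValue: FILENAME

- name: document_owner
  label: Document Owner
  taxonType: STRING
  valuePath: METADATA
  metadataValue: OWNER_NAME

- name: processing_status
  label: Processing Status
  taxonType: STRING
  valuePath: METADATA
  metadataValue: DOCUMENT_STATUS
```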

FORMULA: Calculates values using formulas that reference other taxons.
When to use: Computed fields, calculations, aggregations
Configuration:
valuePath: FORMULA
semanticDefinition: |
  SUM(line_items.amount)
Features:
  • Reference other taxons by name
  • Built-in functions (SUM, AVG, COUNT, etc.)
  • Conditional logic support
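
The features above might be combined as in the sketch below. It assumes that COUNT and AVG accept the same group.child argument form as SUM, and that IF takes a condition followed by the two result values; the taxon names are illustrative.

```yaml
# Built-in function over a child of a group (argument form assumed to match SUM)
- name: item_count
  label: Item Count
  taxonType: NUMBER
  valuePath: FORMULA
  semanticDefinition: "COUNT(line_items.description)"

- name: average_unit_price
  label: Average Unit Price
  taxonType: CURRENCY
  valuePath: FORMULA
  semanticDefinition: "AVG(line_items.unit_price)"

# Conditional logic referencing another taxon by name
- name: approval_route
  label: Approval Route
  taxonType: STRING
  valuePath: FORMULA
  semanticDefinition: |
    IF(total_amount > 10000, "Requires Approval", "Auto-Approve")
```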

REVIEW: Generates review templates using Jinja2 templating.
When to use: Human review interfaces, validation checklists
Configuration:
valuePath: REVIEW
semanticDefinition: |
  ## Review Checklist

  - [ ] Vendor name matches PO: {{ vendor_name }}
  - [ ] Total amount is correct: {{ total_amount }}
  - [ ] All line items present: {{ line_items|length }} items

EXTERNAL: Populates data from external sources using Groovy expressions.
When to use: API integrations, database lookups, external systems
Configuration:
valuePath: EXTERNAL
expression: |
  // Fetch from external API
  def response = http.get("https://api.example.com/vendor/${vendor_id}")
  return response.name
Placeholder for derived values (less common, use FORMULA instead).

Data Types

The taxonType defines how the data should be treated and validated:
  • String
  • Number
  • Currency
  • Date
  • Date Time
  • Selection
  • Boolean
  • Other Types
taxonType: STRING
typeFeatures:
  longText: true           # Multi-line text field
  maxTextRows: 10          # Maximum rows for display
  markdown: true           # Enable markdown formatting
  expected: true           # Field is expected to be present
Use for: Names, addresses, descriptions, any text content
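
For the other types, the sketch below uses only type values and options that appear in the examples on this page (normalizeDate, dateFormat, selectionOptions). It is not a complete list of what each type supports, and the taxon names are illustrative.

```yaml
- name: invoice_date
  label: Invoice Date
  taxonType: DATE
  typeFeatures:
    normalizeDate: true
    dateFormat: "yyyy-MM-dd"

- name: contract_type
  label: Contract Type
  taxonType: SELECTION
  selectionOptions:
    - label: "Service Agreement"
      id: "service"
    - label: "NDA"
      id: "nda"

- name: auto_renewal
  label: Auto Renewal
  taxonType: BOOLEAN

- name: quantity
  label: Quantity
  taxonType: NUMBER

- name: total_amount
  label: Total Amount
  taxonType: CURRENCY
```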

Data Groups and Hierarchies

Groups organize related taxons and can represent repeating structures:
name: line_items
label: Line Items
group: true                    # This is a group, not a value
valuePath: EXTERNAL            # Groups can use EXTERNAL for API data
children:
  - name: description
    label: Description
    taxonType: STRING

  - name: quantity
    label: Quantity
    taxonType: NUMBER

  - name: unit_price
    label: Unit Price
    taxonType: CURRENCY

  - name: total
    label: Total
    taxonType: CURRENCY
    valuePath: FORMULA
    semanticDefinition: "quantity * unit_price"

Group Configuration

  • group (boolean, default: false): Mark as a group (container for other taxons)
  • children (Taxon[]): Array of child taxons nested under this group
  • cardinality (object): Define how many instances of this group can exist:
cardinality:
  min: 1      # Minimum required instances
  max: 100    # Maximum allowed instances
  • naturalKeys (object[]): Define unique identifiers for group instances:
naturalKeys:
  - taxonRef: "invoice_number"
  - taxonRef: "line_number"
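
Putting the group options together, a repeating group might look like the sketch below. It assumes taxonRef takes the name of a child taxon, which the example above suggests but does not state; the line_number child is added here purely for illustration.

```yaml
- name: line_items
  label: Line Items
  group: true
  cardinality:
    min: 1                        # at least one line item expected
    max: 100
  naturalKeys:
    - taxonRef: "line_number"     # assumed to refer to the child below
  children:
    - name: line_number
      label: Line Number
      taxonType: NUMBER

    - name: description
      label: Description
      taxonType: STRING
```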

Validation Rules

Define business rules and data quality checks:
validationRules:
  - name: "Total matches sum of line items"
    description: "Ensure calculated total matches the invoice total"
    disabled: false
    conditional: false               # Apply always
    ruleFormula: |
      ABS(total_amount - SUM(line_items.total)) < 0.01
    messageFormula: |
      "Total mismatch: Invoice shows " + total_amount + " but line items sum to " + SUM(line_items.total)
    detailFormula: |
      "Check line items for accuracy"
    overridable: true                # User can override this validation
    exceptionId: "TOTAL_MISMATCH"    # Unique exception identifier
    supportArticleId: "9117988"      # Link to help article

  - name: "Due date after invoice date"
    conditional: true                # Only apply if condition met
    conditionalFormula: "NOT_EMPTY(due_date)"
    ruleFormula: |
      due_date > invoice_date
    messageFormula: |
      "Due date must be after invoice date"
    overridable: false               # Strict validation

Validation Rule Properties

| Property | Type | Description |
|---|---|---|
| name | string | Rule name |
| description | string | Detailed explanation |
| disabled | boolean | Temporarily disable this rule |
| conditional | boolean | Only apply if condition is true |
| conditionalFormula | string | Formula determining if rule applies |
| ruleFormula | string | Formula that must be true (false = validation failure) |
| messageFormula | string | Formula generating the error message |
| detailFormula | string | Formula generating additional details |
| overridable | boolean | Can users override this validation? |
| exceptionId | string | Unique exception identifier |
| supportArticleId | string | Link to help documentation |

Conditional Formatting

Apply visual formatting based on data values:
conditionalFormats:
  - name: "Highlight overdue"
    formula: "due_date < TODAY() AND status != 'PAID'"
    backgroundColor: "#FEE2E2"     # Light red
    textColor: "#991B1B"           # Dark red
    fontWeight: "bold"

  - name: "Flag large amounts"
    formula: "total_amount > 10000"
    backgroundColor: "#FEF3C7"     # Light yellow
    icon: "warning"

Classification Features

Help AI/ML models understand and classify content:
semanticDefinition provides guidance for AI extraction:
semanticDefinition: |
  The vendor's legal business name as registered with tax authorities.
  Look for names near "Remit To", "Vendor", or "From" sections.
  Should be a proper business name, not an individual's name.
Best practices:
  • Be specific about what to look for
  • Describe location hints
  • Clarify edge cases
  • Provide examples if helpful
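
As one possible way to apply all four practices at once, the sketch below writes a definition for a due date field. The wording, labels, and sample value are illustrative assumptions rather than recommended text.

```yaml
semanticDefinition: |
  The date by which payment must be received.
  Usually labeled "Due Date", "Payment Due", or "Pay By", near the invoice date or the totals.
  If only payment terms are given (e.g., "Net 30"), leave this empty rather than computing a date.
  Example: "Due Date: 15 March 2024".
```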
additionContexts help with record-based chunking and classification:
additionContexts:
  - type: RECORD_DEFINITION
    context: |
      Each line item represents a product or service being billed.
      Line items typically appear in a table format.

  - type: RECORD_START_MARKER
    context: "Item #"

  - type: RECORD_END_MARKER
    context: "Subtotal"
Context types:
  • RECORD_DEFINITION - Describes the record structure
  • RECORD_START_MARKER - Text indicating record start
  • RECORD_END_MARKER - Text indicating record end
  • RECORD_SECTION_STARTER_MARKER - Section start marker
  • RECORD_SECTION_END_MARKER - Section end marker
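
The two section marker types are not shown in the example above. The sketch below assumes they take literal text in the same way as the record start and end markers; the marker strings are invented for illustration.

```yaml
additionContexts:
  # Assumed usage: text that opens the section containing the records
  - type: RECORD_SECTION_STARTER_MARKER
    context: "Itemized Charges"

  # Assumed usage: text that closes that section
  - type: RECORD_SECTION_END_MARKER
    context: "Payment Summary"
```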
lexicalRelations define synonyms and antonyms for embedding-based classification:
lexicalRelations:
  - type: SYNONYM
    value: "Supplier, Provider, Seller, Merchant"

  - type: ANTONYM
    value: "Customer, Buyer, Client"
Use for:
  • Improving classification accuracy
  • Handling terminology variations
  • Training embedding models
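
The same pattern could be applied to an amount field to handle terminology variations, as in this sketch. The term lists are illustrative and follow the comma-separated form used above.

```yaml
# Illustrative relations for a total_amount taxon
lexicalRelations:
  - type: SYNONYM
    value: "Amount Due, Balance Due, Total Amount"

  - type: ANTONYM
    value: "Subtotal, Tax, Discount"
```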

Advanced Options

Fallback Expression

Provide alternative extraction logic if primary method fails:
enableFallbackExpression: true
fallbackExpression: |
  // If extraction failed, try alternate method
  def altValue = document.findPattern("Vendor:\\s*(.*)")
  return altValue ?: "UNKNOWN"

Serialization Expression

Custom logic for exporting data:
enableSerializationExpression: true
serializationExpression: |
  // Export a 10-digit US phone number in E.164 format (assumes no country code is present)
  return "+1" + phone_number.replaceAll("[^0-9]", "")

Post-Extraction Expression

Transform data after initial extraction:
usePostExpression: true
postExpression: |
  // Standardize vendor name format (null-safe: returns null when no value was extracted)
  return vendor_name?.trim()?.toUpperCase()

Display Configuration

Control how fields appear in the UI:
typeFeatures:
  overrideWidth: true
  displayWidth: 300            # Width in pixels
  expected: true               # Mark as required field

User Interaction

multiValue: true               # Allow multiple values
userEditable: true             # User can edit in forms
notUserLabelled: false         # Show in labeling interface
nullable: true                 # Allow null values
nullValue: "N/A"              # Display text for null
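
In context, these flags sit directly on a taxon. The sketch below shows a field that may hold several values and may legitimately be empty; the taxon name and the null display text are illustrative assumptions.

```yaml
- name: po_numbers                # illustrative taxon name
  label: PO Numbers
  taxonType: STRING
  multiValue: true                # an invoice may reference several purchase orders
  userEditable: true
  nullable: true
  nullValue: "None"               # shown when no PO number is present
```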

Common Patterns

Invoice Extraction

Complete example of a typical invoice data definition:
slug: invoice-extraction
name: Invoice Data Extraction
taxonomyType: CONTENT
enabled: true
taxons:
  # Header Information
  - name: invoice_number
    label: Invoice Number
    taxonType: STRING
    valuePath: VALUE_OR_ALL_CONTENT
    semanticDefinition: "The unique invoice number, typically at the top right"
    validationRules:
      - name: "Invoice number required"
        ruleFormula: "NOT_EMPTY(invoice_number)"
        messageFormula: '"Invoice number is required"'
        overridable: false

  - name: invoice_date
    label: Invoice Date
    taxonType: DATE
    valuePath: VALUE_OR_ALL_CONTENT
    semanticDefinition: "The date the invoice was issued"
    typeFeatures:
      normalizeDate: true
      dateFormat: "yyyy-MM-dd"

  - name: due_date
    label: Due Date
    taxonType: DATE
    valuePath: VALUE_OR_ALL_CONTENT
    semanticDefinition: "The payment due date"

  # Vendor Information Group
  - name: vendor
    label: Vendor
    group: true
    children:
      - name: name
        label: Vendor Name
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        semanticDefinition: "The vendor's business name"

      - name: address
        label: Address
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        typeFeatures:
          longText: true

      - name: tax_id
        label: Tax ID
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT

  # Line Items (Repeating Group)
  - name: line_items
    label: Line Items
    group: true
    children:
      - name: description
        label: Description
        taxonType: STRING

      - name: quantity
        label: Quantity
        taxonType: NUMBER

      - name: unit_price
        label: Unit Price
        taxonType: CURRENCY

      - name: line_total
        label: Line Total
        taxonType: CURRENCY
        valuePath: FORMULA
        semanticDefinition: "quantity * unit_price"

  # Totals
  - name: subtotal
    label: Subtotal
    taxonType: CURRENCY
    valuePath: FORMULA
    semanticDefinition: "SUM(line_items.line_total)"

  - name: tax_amount
    label: Tax Amount
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT

  - name: total_amount
    label: Total Amount
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT
    validationRules:
      - name: "Total calculation check"
        ruleFormula: "ABS(total_amount - (subtotal + tax_amount)) < 0.01"
        messageFormula: '"Total amount mismatch"'
        overridable: true

Contract Data Extraction

slug: contract-extraction
name: Contract Data Extraction
taxons:
  - name: contract_metadata
    label: Contract Metadata
    group: true
    children:
      - name: contract_number
        label: Contract Number
        taxonType: STRING

      - name: contract_type
        label: Contract Type
        taxonType: SELECTION
        selectionOptions:
          - label: "Service Agreement"
            id: "service"
          - label: "Purchase Agreement"
            id: "purchase"
          - label: "NDA"
            id: "nda"
          - label: "License Agreement"
            id: "license"

  - name: parties
    label: Parties
    group: true
    children:
      - name: party_a
        label: Party A
        taxonType: STRING

      - name: party_b
        label: Party B
        taxonType: STRING

  - name: key_terms
    label: Key Terms
    group: true
    children:
      - name: effective_date
        label: Effective Date
        taxonType: DATE

      - name: term_length
        label: Term Length
        taxonType: STRING
        semanticDefinition: "Duration of the contract (e.g., '12 months', '2 years')"

      - name: auto_renewal
        label: Auto Renewal
        taxonType: BOOLEAN
        semanticDefinition: "Does the contract automatically renew?"

      - name: termination_notice
        label: Termination Notice Period
        taxonType: STRING
        semanticDefinition: "Required notice period for termination (e.g., '30 days')"

  - name: financial_terms
    label: Financial Terms
    group: true
    children:
      - name: total_value
        label: Total Contract Value
        taxonType: CURRENCY

      - name: payment_terms
        label: Payment Terms
        taxonType: STRING
        semanticDefinition: "Payment schedule and terms (e.g., 'Net 30', 'Monthly in advance')"

Best Practices

Naming Conventions

name: vendor_name           # Snake case for internal names
label: Vendor Name          # Title case for display
externalName: vendorName    # Camel case for APIs

Semantic Definitions

Write semantic definitions as if explaining to a human what to look for. Be specific about:
  • What the field represents
  • Where it typically appears
  • How to identify it
  • Edge cases to consider
Good example:
semanticDefinition: |
  The total amount due on the invoice, including all taxes and fees.
  Look for labels like "Total", "Amount Due", "Balance Due", or "Total Amount".
  This should be the final bottom-line number, not a subtotal.
  If multiple totals exist (e.g., by currency), extract the primary total.
Avoid:
semanticDefinition: "The total"  # Too vague

Group Structures

# Good: Line items are a repeating group
- name: line_items
  label: Line Items
  group: true
  children:
    - name: description
    - name: quantity
    - name: price
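
For contrast, a hypothetical flattened layout is sketched below. It is not taken from the product documentation; it only illustrates what the repeating group above avoids, namely one taxon per row and column, which cannot grow with the number of line items.

```yaml
# Avoid (illustrative): one flat taxon per row and column
- name: line_item_1_description
- name: line_item_1_quantity
- name: line_item_1_price
- name: line_item_2_description
- name: line_item_2_quantity
- name: line_item_2_price
```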

Validation Strategy

  1. Start Simple: Begin with basic “not empty” validations
  2. Add Business Rules: Implement domain-specific validations
  3. Make Critical Rules Non-Overridable: Block processing if essential data is wrong
  4. Allow Overrides for Quality Checks: Let users override formatting or minor issues
validationRules:
  # Critical: Don't allow override
  - name: "Invoice number required"
    ruleFormula: "NOT_EMPTY(invoice_number)"
    overridable: false

  # Quality check: Allow override
  - name: "Total seems high"
    ruleFormula: "total_amount <= 100000"
    messageFormula: '"Invoice total exceeds $100,000 - please verify"'
    overridable: true

Formula Usage

# Arithmetic on sibling taxons
semanticDefinition: "quantity * unit_price"
# Aggregation over a group
semanticDefinition: "SUM(line_items.total)"
# Conditional logic
semanticDefinition: |
  IF(total_amount > 10000, "Requires Approval", "Auto-Approve")
# Date arithmetic
semanticDefinition: "DATE_ADD(invoice_date, 30, 'DAYS')"

Troubleshooting

Common Issues

Taxon not appearing
Possible causes:
  • enabled: false is set
  • Parent taxon is disabled (disabling cascades to children)
  • notUserLabelled: true for labeling interfaces
Solution: Check enabled status up the hierarchy

Data not extracted or extracted incorrectly
Check:
  1. Is valuePath correct for your use case?
  2. Is semanticDefinition clear and specific?
  3. Are you using the right taxonType?
  4. Is the model trained for this document type?

Formula errors
Common mistakes:
  • Referencing taxons that don’t exist
  • Syntax errors in formula
  • Circular references
Test: Use formula builder to validate syntax

Validation rule not triggering
Check:
  • Is the rule's disabled property set to false?
  • Does conditionalFormula evaluate to true?
  • Is ruleFormula returning the expected boolean?
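
The circular-reference mistake listed among the formula issues can be hard to spot, so here is a sketch of what it can look like. The taxon definitions are illustrative. In the first pair each formula depends on the other, so neither can be computed; the alternative breaks the cycle by extracting one value from the document.

```yaml
# Problem: subtotal needs total_amount, and total_amount needs subtotal
- name: subtotal
  label: Subtotal
  valuePath: FORMULA
  semanticDefinition: "total_amount - tax_amount"
- name: total_amount
  label: Total Amount
  valuePath: FORMULA
  semanticDefinition: "subtotal + tax_amount"

# One way out (replaces the total_amount taxon above): extract it instead
- name: total_amount
  label: Total Amount
  valuePath: VALUE_OR_ALL_CONTENT
```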
