
Overview

Data definitions in Kodexa provide the structure and rules for extracting, validating, and processing information from documents. They define what data to extract, how to validate it, and how to present it to users.

What Are Data Definitions?

Data definitions are the blueprints for your document processing workflows. They specify:
  • Structure: What data elements exist and how they relate
  • Types: What kind of data each field contains (text, numbers, dates, etc.)
  • Sources: Where data comes from (document content, metadata, calculations, external systems)
  • Validation: Business rules and data quality checks
  • Presentation: How data appears in forms and exports

Core Concepts

Data Structure

Data definitions are hierarchical structures of data elements (taxons) that define what to extract from documents. Example use cases:
  • Invoice data extraction (vendor, line items, totals)
  • Contract metadata (parties, dates, terms)
  • Form processing (applicant info, answers, signatures)

Data Types

Kodexa supports rich data types for accurate extraction and validation:
Types are grouped into three categories: basic, specialized, and complex. The basic types are:
  • STRING - Text of any length
  • NUMBER - Numeric values
  • BOOLEAN - True/false values
  • DATE - Calendar dates
  • DATE_TIME - Dates with timestamps

Data Sources

Define where each data element gets its value:

Extract directly from document content using AI/ML models:
valuePath: VALUE_OR_ALL_CONTENT
semanticDefinition: "Extract the invoice total amount"

Pull from document properties and system fields:
valuePath: METADATA
metadataValue: FILENAME

Calculate from other fields:
valuePath: FORMULA
semanticDefinition: "quantity * unit_price"

Fetch from APIs or databases:
valuePath: EXTERNAL
expression: |
  def response = http.get("https://api.example.com/vendor/${vendor_id}")
  return response.name

Common Patterns

Invoice Processing

Extract structured data from invoices:
taxons:
  - name: invoice_number
    label: Invoice Number
    taxonType: STRING

  - name: invoice_date
    label: Invoice Date
    taxonType: DATE

  - name: vendor
    label: Vendor
    group: true
    children:
      - name: name
      - name: address
      - name: tax_id

  - name: line_items
    label: Line Items
    group: true
    children:
      - name: description
      - name: quantity
      - name: unit_price
      - name: total
        valuePath: FORMULA
        semanticDefinition: "quantity * unit_price"

  - name: total_amount
    label: Total Amount
    taxonType: CURRENCY
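
As a rough illustration of what the computed `total` field above describes, the sketch below applies `quantity * unit_price` to each line item in plain Python. It is not Kodexa code: the `invoice` dictionary, its sample values, and the `fill_line_totals` helper are all assumptions made for this example.

```python
from decimal import Decimal

# Hypothetical extracted values, shaped like the invoice definition above.
invoice = {
    "invoice_number": "INV-001",
    "line_items": [
        {"description": "Widget", "quantity": Decimal("3"), "unit_price": Decimal("9.99")},
        {"description": "Gadget", "quantity": Decimal("2"), "unit_price": Decimal("24.50")},
    ],
}


def fill_line_totals(record):
    """Set each line item's total to quantity * unit_price."""
    for item in record["line_items"]:
        item["total"] = item["quantity"] * item["unit_price"]
    return record


fill_line_totals(invoice)
line_totals = [item["total"] for item in invoice["line_items"]]
```

Decimal arithmetic is used here so that currency values are not affected by binary floating-point rounding.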

Contract Metadata

Capture key contract information:
taxons:
  - name: contract_type
    label: Contract Type
    taxonType: SELECTION
    selectionOptions:
      - label: "Service Agreement"
      - label: "Purchase Order"
      - label: "NDA"

  - name: parties
    group: true
    children:
      - name: party_a
      - name: party_b

  - name: key_terms
    group: true
    children:
      - name: effective_date
        taxonType: DATE
      - name: term_length
      - name: termination_notice

Form Data

Process form submissions:
taxons:
  - name: applicant
    group: true
    children:
      - name: full_name
      - name: email
        taxonType: EMAIL_ADDRESS
      - name: phone
        taxonType: PHONE_NUMBER

  - name: responses
    group: true
    children:
      - name: question_1
      - name: question_2
      - name: agree_to_terms
        taxonType: BOOLEAN

Validation and Quality

Validation Rules

Define business rules to ensure data quality:
validationRules:
  - name: "Required field check"
    ruleFormula: "NOT_EMPTY(invoice_number)"
    messageFormula: '"Invoice number is required"'
    overridable: false

  - name: "Date logic check"
    ruleFormula: "due_date > invoice_date"
    messageFormula: '"Due date must be after invoice date"'
    overridable: false

  - name: "Total verification"
    ruleFormula: "ABS(total_amount - SUM(line_items.total)) < 0.01"
    messageFormula: '"Total mismatch detected"'
    overridable: true
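
To make the logic of these three rules concrete, here is a minimal Python sketch that evaluates equivalent checks against one record. It only mirrors the formulas above and is not how Kodexa evaluates them; the `check_invoice` function and the `sample` record are hypothetical.

```python
from datetime import date
from decimal import Decimal


def check_invoice(record):
    """Return (rule name, message, overridable) for every rule that fails."""
    failures = []

    # NOT_EMPTY(invoice_number)
    if not record.get("invoice_number"):
        failures.append(("Required field check", "Invoice number is required", False))

    # due_date > invoice_date
    if not record["due_date"] > record["invoice_date"]:
        failures.append(("Date logic check", "Due date must be after invoice date", False))

    # ABS(total_amount - SUM(line_items.total)) < 0.01
    line_sum = sum((item["total"] for item in record["line_items"]), Decimal("0"))
    if not abs(record["total_amount"] - line_sum) < Decimal("0.01"):
        failures.append(("Total verification", "Total mismatch detected", True))

    return failures


# Hypothetical record: missing invoice number and a total that does not add up.
sample = {
    "invoice_number": "",
    "invoice_date": date(2024, 1, 10),
    "due_date": date(2024, 2, 9),
    "line_items": [{"total": Decimal("29.97")}, {"total": Decimal("49.00")}],
    "total_amount": Decimal("80.00"),
}

failures = check_invoice(sample)
failed_names = [name for name, _, _ in failures]
blocking = [name for name, _, overridable in failures if not overridable]
```

In this sketch, only the failures with `overridable` set to false end up in `blocking`, which matches the split between critical rules and reviewable quality checks.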

Conditional Formatting

Apply visual cues based on data values:
conditionalFormats:
  - name: "Highlight overdue"
    formula: "due_date < TODAY() AND status != 'PAID'"
    backgroundColor: "#FEE2E2"
    textColor: "#991B1B"
    fontWeight: "bold"

  - name: "Flag high amounts"
    formula: "total_amount > 10000"
    backgroundColor: "#FEF3C7"
    icon: "warning"
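
The two format conditions can be read as ordinary boolean tests. The following Python sketch shows which formats would match a given record; `matching_formats` and the sample records are assumptions for illustration, and `today` is passed in explicitly in place of `TODAY()` so the result is reproducible.

```python
from datetime import date


def matching_formats(record, today):
    """Return the names of the conditional formats whose condition holds."""
    formats = []
    # due_date < TODAY() AND status != 'PAID'
    if record["due_date"] < today and record["status"] != "PAID":
        formats.append("Highlight overdue")
    # total_amount > 10000
    if record["total_amount"] > 10000:
        formats.append("Flag high amounts")
    return formats


# Hypothetical records: one unpaid and overdue, one already paid.
today = date(2024, 3, 1)
unpaid = {"due_date": date(2024, 1, 1), "status": "OPEN", "total_amount": 12500}
paid = {"due_date": date(2024, 1, 1), "status": "PAID", "total_amount": 12500}

unpaid_formats = matching_formats(unpaid, today)
paid_formats = matching_formats(paid, today)
```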

Best Practices

Design Principles

Begin with core fields and add complexity as needed. Don’t over-engineer initial data definitions.

Start with:
  • Essential fields only
  • Basic data types
  • Simple validation
Add later:
  • Computed fields
  • Complex validations
  • Conditional formatting

Write clear, specific extraction prompts.

Good:
semanticDefinition: |
  The vendor's legal business name as it appears at the top of the invoice.
  Look near 'Bill To', 'From', or 'Vendor' labels.
Avoid:
semanticDefinition: "vendor name"  # Too vague

Use groups to:
  • Organize related fields logically
  • Handle repeating structures (line items, signatories)
  • Improve UI presentation

Single-instance groups act as organizational containers:
- name: vendor
  group: true
  children: [name, address, tax_id]
Repeating groups hold collections:
- name: line_items
  group: true
  children: [description, quantity, price]

Critical validations (non-overridable):
  • Required fields
  • Data type constraints
  • Business logic rules
Quality checks (overridable):
  • Unusual values
  • Formatting issues
  • Threshold warnings

Naming Conventions

Use consistent naming across your data definitions:
name: vendor_name         # Snake case
name: invoice_date        # Descriptive, unambiguous
name: line_items          # Plural for groups
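
A naming convention is easier to keep when it is checked automatically. The sketch below is one possible snake case check in Python; the regular expression and the `is_snake_case` helper are assumptions for this example, not a rule enforced by Kodexa.

```python
import re

# Lowercase words made of letters and digits, joined by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name):
    """Return True if the whole name follows the snake case pattern."""
    return SNAKE_CASE.fullmatch(name) is not None


names = ["vendor_name", "invoice_date", "line_items", "VendorName", "invoice-date"]
non_conforming = [n for n in names if not is_snake_case(n)]
```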

Getting Started

1. Understand Your Documents

Analyze the documents you’ll process:
  • What data needs to be extracted?
  • What’s the document structure?
  • What validations are needed?

2. Design Your Data Definition

Sketch out the data structure:
  • List all required fields
  • Group related fields
  • Identify repeating sections

3. Configure Data Elements

For each field, define:
  • Data type
  • Value source
  • Semantic definition
  • Validation rules

4. Test and Iterate

Process sample documents:
  • Verify extraction accuracy
  • Refine semantic definitions
  • Adjust validation rules
