Overview
Data definitions are the foundation of data extraction in Kodexa. They define the hierarchical structure of the data elements you want to extract from documents, along with their types, validation rules, and extraction logic.
What is a Data Definition?
A data definition is a hierarchical structure of taxons (data elements) that defines:
- What data to extract from documents
- Where the data comes from (document content, metadata, formulas, expressions)
- How to validate and format the data
- What type of data it is (string, date, currency, etc.)
Key Concepts
Data Definition
Top-level container defining the complete data structure for extraction
Taxon
Individual data element within a data definition (field or group)
Data Group
Organizational container that groups related taxons without storing data itself
Value Path
Defines where the taxon gets its data from (document, metadata, formula, etc.)
Data Definition Structure
Top-Level Configuration
Every data definition has these core properties:
Data Definition Properties
| Property | Type | Default | Description |
|---|---|---|---|
| `slug` | string | - | Unique identifier for the data definition |
| `name` | string | - | Display name |
| `description` | string | - | Description of the data definition’s purpose |
| `taxonomyType` | enum | CONTENT | Type of data definition (typically CONTENT) |
| `enabled` | boolean | true | Whether the data definition is active |
| `externalDataTaxonomyRefs` | string[] | [] | References to external data definitions |
| `taxons` | Taxon[] | [] | Array of root-level taxons |
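
As a rough illustration of the table above, the following Python sketch models the top-level properties as a dataclass. The class is hypothetical and not part of any Kodexa SDK. Only the field names and defaults come from the table, and the empty-string default for `description` is an assumption, since the table lists none.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DataDefinition:
    """Hypothetical model of the top-level properties listed in the table."""

    slug: str  # unique identifier, no default in the table
    name: str  # display name, no default in the table
    description: str = ""  # assumption: the table gives no default
    taxonomyType: str = "CONTENT"
    enabled: bool = True
    externalDataTaxonomyRefs: List[str] = field(default_factory=list)
    taxons: List[Dict[str, Any]] = field(default_factory=list)


# Only slug and name are supplied; everything else falls back to its default.
invoice = DataDefinition(slug="invoice", name="Invoice")
print(invoice.taxonomyType, invoice.enabled, invoice.taxons)
```

Using `default_factory` for the two list fields gives each instance its own list rather than one shared between definitions.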
Taxon Configuration
Taxons are the individual data elements within a data definition. Each taxon has extensive configuration options organized into several categories.
Basic Properties
Every taxon has these fundamental properties:
- Internal identifier (alphanumeric characters, hyphens, and underscores only)
- Human-readable display name
- Detailed explanation of what this data element represents
- Whether this taxon is active (disabling a taxon cascades to its children)
- Hex color code for UI display (auto-generated if not specified)
- Whether to auto-generate the internal `name` from the label
- Name used when publishing to external systems (auto-generated from the label if not specified)
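
The identifier rule above (alphanumeric characters, hyphens, and underscores only) can be checked with a short regular expression. The sketch below is illustrative: `is_valid_name` and `name_from_label` are hypothetical helpers, and the label-to-name conversion is an assumed scheme, not necessarily the one Kodexa applies when it auto-generates names.

```python
import re

# Allowed characters for an internal identifier, per the rule above.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_name(name: str) -> bool:
    """True if the whole string uses only letters, digits, hyphens, underscores."""
    return NAME_PATTERN.fullmatch(name) is not None


def name_from_label(label: str) -> str:
    """Assumed conversion: replace runs of disallowed characters with '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", label.strip())
    return cleaned.strip("_").lower()


print(is_valid_name("invoice_number"))
print(is_valid_name("Invoice Number"))
print(name_from_label("Invoice Number"))
```

`fullmatch` is used instead of `match` so that a trailing newline or stray character cannot slip past the check.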
Data Source (Value Path)
The `valuePath` determines where the taxon gets its data from:
Document (VALUE_OR_ALL_CONTENT)
Extracts data directly from document content using AI/ML models or pattern matching.
When to use: Standard document extraction (invoices, contracts, forms)
Features:
- Uses semantic definition as extraction prompt
- Can leverage document structure and layout
- Supports AI-assisted extraction
Metadata (METADATA)
Pulls data from document metadata (filename, creation date, owner, etc.).
When to use: Document properties, system fields, audit trail
Available metadata values:
- `FILENAME` - Document filename
- `TRANSACTION_UUID` - Unique transaction identifier
- `CREATED_DATETIME` - Document creation timestamp
- `DOCUMENT_LABELS` - Applied labels
- `OWNER_NAME` - Document owner
- `DOCUMENT_STATUS` - Processing status
- `PAGE_NUMBER` - Current page number
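
To make the list of metadata values concrete, here is a small Python sketch that enumerates them and looks one up. The `MetadataValue` enum mirrors the names listed above. The `resolve_metadata` helper and the shape of `document_info` are assumptions made for illustration, not Kodexa APIs.

```python
from enum import Enum
from typing import Any, Dict


class MetadataValue(str, Enum):
    """The metadata values listed above."""

    FILENAME = "FILENAME"
    TRANSACTION_UUID = "TRANSACTION_UUID"
    CREATED_DATETIME = "CREATED_DATETIME"
    DOCUMENT_LABELS = "DOCUMENT_LABELS"
    OWNER_NAME = "OWNER_NAME"
    DOCUMENT_STATUS = "DOCUMENT_STATUS"
    PAGE_NUMBER = "PAGE_NUMBER"


def resolve_metadata(value: MetadataValue, document_info: Dict[str, Any]) -> Any:
    """Hypothetical lookup: return the metadata entry, or None if it is absent."""
    return document_info.get(value.value)


# Assumed input shape: a plain dict keyed by metadata value name.
document_info = {"FILENAME": "invoice-0042.pdf", "PAGE_NUMBER": 1}
print(resolve_metadata(MetadataValue.FILENAME, document_info))
```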
Formula (FORMULA)
Calculates values using formulas that reference other taxons.
When to use: Computed fields, calculations, aggregations
Features:
- Reference other taxons by name
- Built-in functions (SUM, AVG, COUNT, etc.)
- Conditional logic support
Review (REVIEW)
Generates review templates using Jinja2 templating.
When to use: Human review interfaces, validation checklists
External (EXTERNAL)
Populates data from external sources using Groovy expressions.
When to use: API integrations, database lookups, external systems
Derived (DERIVED)
Placeholder for derived values (less common, use FORMULA instead).
Data Types
The `taxonType` defines how the data should be treated and validated:
- String
- Number
- Currency
- Date
- Date Time
- Selection
- Boolean
- Other Types
Data Groups and Hierarchies
Groups organize related taxons and can represent repeating structures.
Group Configuration
- Mark the taxon as a group (a container for other taxons)
- Array of child taxons nested under this group
- Define how many instances of this group can exist
- Define unique identifiers for group instances
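
The relationship between groups and their child taxons can be pictured as a tree walk. The Python sketch below lists the data-bearing taxons under a nested definition. The keys `group` and `children` and the `/` path separator are assumptions chosen for illustration, because the property names for group configuration are not given here.

```python
from typing import Any, Dict, Iterator, List


def leaf_paths(taxons: List[Dict[str, Any]], prefix: str = "") -> Iterator[str]:
    """Yield the path of every non-group taxon, depth first."""
    for taxon in taxons:
        path = f"{prefix}/{taxon['name']}" if prefix else taxon["name"]
        if taxon.get("group", False):
            # A group stores no data itself, so only its children are yielded.
            yield from leaf_paths(taxon.get("children", []), path)
        else:
            yield path


# Hypothetical structure: one plain field and one group with two children.
definition = [
    {"name": "invoice_number"},
    {
        "name": "line_items",
        "group": True,
        "children": [{"name": "description"}, {"name": "amount"}],
    },
]
print(list(leaf_paths(definition)))
```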
Validation Rules
Define business rules and data quality checks.
Validation Rule Properties
| Property | Type | Description |
|---|---|---|
| `name` | string | Rule name |
| `description` | string | Detailed explanation |
| `disabled` | boolean | Temporarily disable this rule |
| `conditional` | boolean | Only apply if condition is true |
| `conditionalFormula` | string | Formula determining if rule applies |
| `ruleFormula` | string | Formula that must be true (false = validation failure) |
| `messageFormula` | string | Formula generating the error message |
| `detailFormula` | string | Formula generating additional details |
| `overridable` | boolean | Can users override this validation? |
| `exceptionId` | string | Unique exception identifier |
| `supportArticleId` | string | Link to help documentation |
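
One plausible reading of how these properties interact is sketched below in Python: a disabled rule is skipped, a conditional rule is skipped when its condition is false, and otherwise a false `ruleFormula` produces the message. Kodexa formulas are strings in its own formula language. Here plain Python callables stand in for them, and `evaluate_rule` is a hypothetical helper, so treat the evaluation order as an assumption rather than documented behavior.

```python
from typing import Any, Dict, Optional

Values = Dict[str, Any]


def evaluate_rule(rule: Dict[str, Any], values: Values) -> Optional[str]:
    """Return an error message if the rule fails, otherwise None."""
    if rule.get("disabled", False):
        return None  # rule is temporarily switched off
    if rule.get("conditional", False) and not rule["conditionalFormula"](values):
        return None  # condition not met, so the rule does not apply
    if rule["ruleFormula"](values):
        return None  # rule holds, nothing to report
    return rule["messageFormula"](values)


# Callables stand in for formula strings in this sketch.
rule = {
    "name": "Total required",
    "disabled": False,
    "conditional": True,
    "conditionalFormula": lambda v: v.get("document_type") == "invoice",
    "ruleFormula": lambda v: v.get("total") is not None,
    "messageFormula": lambda v: "Total is required for invoices",
    "overridable": False,
}

print(evaluate_rule(rule, {"document_type": "invoice"}))
print(evaluate_rule(rule, {"document_type": "invoice", "total": 120.0}))
```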
Conditional Formatting
Apply visual formatting based on data values.
Classification Features
Help AI/ML models understand and classify content:
Semantic Definition
Provides guidance for AI extraction.
Best practices:
- Be specific about what to look for
- Describe location hints
- Clarify edge cases
- Provide examples if helpful
Additional Context
Helps with record-based chunking and classification.
Context types:
- `RECORD_DEFINITION` - Describes the record structure
- `RECORD_START_MARKER` - Text indicating record start
- `RECORD_END_MARKER` - Text indicating record end
- `RECORD_SECTION_STARTER_MARKER` - Section start marker
- `RECORD_SECTION_END_MARKER` - Section end marker
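
To give a feel for what record start and end markers make possible, the following Python sketch splits a list of text lines into records using two marker strings. It is a simplified stand-in for record-based chunking, not Kodexa's implementation: `split_records`, the substring matching, and the choice to discard unterminated records are all assumptions.

```python
from typing import List, Optional


def split_records(lines: List[str], start_marker: str, end_marker: str) -> List[List[str]]:
    """Group lines into records delimited by start and end marker text."""
    records: List[List[str]] = []
    current: Optional[List[str]] = None
    for line in lines:
        if start_marker in line:
            # A new start marker opens a record; an unterminated one is dropped.
            current = [line]
        elif current is not None:
            current.append(line)
            if end_marker in line:
                records.append(current)
                current = None
        # Lines outside any record are ignored.
    return records


lines = ["header", "BEGIN A", "x", "END", "noise", "BEGIN B", "END"]
print(split_records(lines, "BEGIN", "END"))
```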
Lexical Relations
Synonyms and antonyms for embedding-based classification.
Use for:
- Improving classification accuracy
- Handling terminology variations
- Training embedding models
Advanced Options
Fallback Expression
Provide alternative extraction logic if the primary method fails.
Serialization Expression
Custom logic for exporting data.
Post-Extraction Expression
Transform data after initial extraction.
Display Configuration
Control how fields appear in the UI.
User Interaction
Common Patterns
Invoice Extraction
Complete example of a typical invoice data definition:
Contract Data Extraction
Best Practices
Naming Conventions
Semantic Definitions
Write semantic definitions as if explaining to a human what to look for. Be specific about:
- What the field represents
- Where it typically appears
- How to identify it
- Edge cases to consider
Group Structures
Validation Strategy
- Start Simple: Begin with basic “not empty” validations
- Add Business Rules: Implement domain-specific validations
- Make Critical Rules Non-Overridable: Block processing if essential data is wrong
- Allow Overrides for Quality Checks: Let users override formatting or minor issues
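
The split between critical and overridable rules can be expressed as a small decision function. The Python sketch below assumes that a failed rule blocks processing unless it is marked `overridable` and a user has overridden it. The `blocks_processing` helper and the idea of tracking overrides by `exceptionId` are illustrative assumptions.

```python
from typing import Any, Dict, Iterable, Set


def blocks_processing(failed_rules: Iterable[Dict[str, Any]], overridden_ids: Set[str]) -> bool:
    """True if any failed rule still prevents the document from proceeding."""
    for rule in failed_rules:
        if not rule.get("overridable", False):
            return True  # critical rule: cannot be overridden
        if rule.get("exceptionId") not in overridden_ids:
            return True  # overridable, but nobody has overridden it yet
    return False


# Two hypothetical failures: one critical rule and one quality check.
missing_total = {"name": "Total required", "overridable": False, "exceptionId": "TOTAL_MISSING"}
date_format = {"name": "Date format", "overridable": True, "exceptionId": "DATE_FORMAT"}

print(blocks_processing([date_format], {"DATE_FORMAT"}))
print(blocks_processing([missing_total, date_format], {"DATE_FORMAT"}))
```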
Formula Usage
Simple Calculations
Aggregations
Conditional Logic
Date Calculations
Troubleshooting
Common Issues
Taxon not appearing in UI
Possible causes:
- `enabled: false` is set
- Parent taxon is disabled (disabling cascades to children)
- `notUserLabelled: true` is set (for labeling interfaces)
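
These causes can be checked mechanically by walking the hierarchy and carrying the parent's enabled state down to its children. The Python sketch below does that. `enabled` and `notUserLabelled` are the properties named above, while the `children` key and the `visible_taxons` helper are assumptions made for illustration.

```python
from typing import Any, Dict, List


def visible_taxons(taxons: List[Dict[str, Any]], parent_enabled: bool = True) -> List[str]:
    """Return names of taxons that would show up in a labeling interface."""
    names: List[str] = []
    for taxon in taxons:
        # A taxon is effectively enabled only if every ancestor is enabled too.
        enabled = parent_enabled and taxon.get("enabled", True)
        if enabled and not taxon.get("notUserLabelled", False):
            names.append(taxon["name"])
        names.extend(visible_taxons(taxon.get("children", []), enabled))
    return names


# Hypothetical definition: a disabled parent hides its enabled child.
definition = [
    {"name": "vendor", "enabled": False, "children": [{"name": "vendor_name"}]},
    {"name": "internal_id", "notUserLabelled": True},
    {"name": "total"},
]
print(visible_taxons(definition))
```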
Extraction not working
Check:
- Is `valuePath` correct for your use case?
- Is `semanticDefinition` clear and specific?
- Are you using the right `taxonType`?
- Is the model trained for this document type?
Formula errors
Common mistakes:
- Referencing taxons that don’t exist
- Syntax errors in formula
- Circular references
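
Two of these mistakes, missing references and circular references, can be detected before a formula ever runs. The Python sketch below works on a dependency map from each taxon name to the names its formula references. How that map is extracted from formula text is left out, and both helper functions are hypothetical.

```python
from typing import Dict, List, Optional


def missing_references(dependencies: Dict[str, List[str]]) -> List[str]:
    """Names that are referenced by a formula but not defined as taxons."""
    return sorted({ref for refs in dependencies.values() for ref in refs if ref not in dependencies})


def find_cycle(dependencies: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one circular reference chain, or None if there is none."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {name: WHITE for name in dependencies}

    def visit(name: str, path: List[str]) -> Optional[List[str]]:
        state[name] = GRAY
        for ref in dependencies[name]:
            if ref not in state:
                continue  # missing reference, reported by missing_references
            if state[ref] == GRAY:
                return path[path.index(ref):] + [ref]
            if state[ref] == WHITE:
                found = visit(ref, path + [ref])
                if found:
                    return found
        state[name] = BLACK
        return None

    for name in dependencies:
        if state[name] == WHITE:
            cycle = visit(name, [name])
            if cycle:
                return cycle
    return None


ok = {"subtotal": [], "tax": ["subtotal"], "total": ["subtotal", "tax"]}
bad = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(find_cycle(ok), find_cycle(bad), missing_references({"total": ["subtotal"]}))
```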
Validation not triggering
Check:
- Is the validation rule `disabled: false`?
- Does `conditionalFormula` evaluate to true?
- Is `ruleFormula` returning the expected boolean?
