Data Definitions describe the structured information Kodexa should extract, validate, review, and use in downstream Activity Plans. They turn a business document into a clear data model: the fields you care about, the groups those fields belong to, the types those fields should normalize into, and the rules that determine whether the extracted data is ready to use.
In some APIs, SDKs, and configuration files, Data Definitions are still represented by the historical terms taxonomy and taxon. In user-facing documentation, think Data Definition for the overall model and Data Element for each field or group inside it.
What Is a Data Definition?
A Data Definition is a hierarchy of data elements. Each element represents one of the following:
- A piece of data to extract, validate, calculate, or review
- A group that organizes related data elements
- A repeating group, such as invoice line items or contract parties
Example Structure
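The hierarchy described above can be sketched in Python. This is a minimal illustration rather than Kodexa's SDK: the DataElement class, its attributes, and the invoice element names are all hypothetical, chosen only to show a field, a group, and a repeating group side by side.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataElement:
    """Hypothetical stand-in for one element of a Data Definition."""
    name: str
    is_group: bool = False    # True when the element only organizes children
    repeating: bool = False   # True for groups that can occur many times
    children: List["DataElement"] = field(default_factory=list)


def render(element: DataElement, depth: int = 0) -> List[str]:
    """Return one indented line per element, depth-first."""
    if element.repeating:
        kind = "repeating group"
    elif element.is_group:
        kind = "group"
    else:
        kind = "field"
    lines = [f"{'  ' * depth}{element.name} ({kind})"]
    for child in element.children:
        lines.extend(render(child, depth + 1))
    return lines


# Illustrative invoice model; the element names are assumptions.
invoice = DataElement("Invoice", is_group=True, children=[
    DataElement("InvoiceNumber"),
    DataElement("InvoiceDate"),
    DataElement("Vendor", is_group=True, children=[
        DataElement("VendorName"),
    ]),
    DataElement("LineItem", is_group=True, repeating=True, children=[
        DataElement("Description"),
        DataElement("Amount"),
    ]),
])

tree_lines = render(invoice)
print("\n".join(tree_lines))
```

Printing the tree makes it easy to check that repeating groups such as line items sit at the level where they actually repeat in the document.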
Data Definition Roles
Most business users think about Data Definitions as the final data they want from a document. Kodexa also uses Data Definition structures during processing so modules, Activity steps, and model outputs can share the same vocabulary.
Content Data Definition
Defines the business-level data extracted from documents. This is the main model used for final output, review, validation, and downstream integrations.
Use for: Business data extraction, final output structure
Processing Data Definition
Supports intermediate processing work. These structures can be provided by modules or Activity Plan steps and become available when those resources are bound into a project.
Use for: Intermediate labels, routing signals, AI model support
Module Data Definition
Comes from modules used for training or inference. These structures become available when you add a module to a project or reference it from an Activity Plan.
Use for: ML module training, module-specific labels
Key Concepts
Data Elements
In configuration, data elements are written under the API field taxons. Each element can be a simple field, a group, or a repeating group.
Simple Data Element:
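As a rough sketch, a simple data element might look like the dictionary below. Only the taxons field name comes from the text above. Every other key (name, label, taxonType, group, children) and every value is an assumption made for illustration and should be checked against the Data Definition Structure reference.

```python
import json

# Hypothetical shape of one simple data element. Only the "taxons" field name
# comes from the documentation; the other keys are illustrative assumptions.
simple_element = {
    "name": "InvoiceNumber",
    "label": "Invoice Number",
    "taxonType": "STRING",  # assumed name for the data type property
    "group": False,
    "children": [],
}

data_definition = {"taxons": [simple_element]}


def is_simple(element: dict) -> bool:
    """A simple element is not a group and has no child elements."""
    return not element.get("group", False) and not element.get("children")


print(json.dumps(data_definition, indent=2))
print(is_simple(simple_element))
```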
Hierarchy and Relationships
Data Definitions use parent-child relationships to organize data:
- Root elements: Top-level fields or groups
- Child elements: Fields nested inside a parent group
- Sibling elements: Fields at the same level in the model
A well-designed hierarchy helps you:
- Organize related data logically
- Mirror the way information appears in documents
- Improve extraction and review accuracy
- Produce output that downstream systems can understand
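The root, child, and sibling relationships can be illustrated with a small traversal. This sketch uses plain nested dictionaries. The name and children keys and the slash-separated path format are assumptions made for the example, not a documented Kodexa format.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical model: "name" and "children" are illustrative keys.
model: List[Dict] = [
    {"name": "Invoice", "children": [
        {"name": "InvoiceNumber", "children": []},
        {"name": "LineItem", "children": [
            {"name": "Amount", "children": []},
        ]},
    ]},
]


def walk(elements: List[Dict], parent: str = "") -> Iterator[Tuple[str, List[str]]]:
    """Yield (path, sibling_names) for every element, depth-first."""
    names = [e["name"] for e in elements]
    for element in elements:
        path = f"{parent}/{element['name']}" if parent else element["name"]
        siblings = [n for n in names if n != element["name"]]
        yield path, siblings
        yield from walk(element["children"], path)


relationships = dict(walk(model))
for path, siblings in relationships.items():
    print(path, "siblings:", siblings)
```

A path with no separator is a root element, each extra segment is one level of nesting, and the sibling list shows which elements share the same parent.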
Data Definition Lifecycle
1. Design Phase
Define the model from the business problem:
- What documents are involved?
- What data must be extracted?
- Which fields repeat?
- Which values need review or validation?
- Which downstream systems will consume the output?
2. Configuration Phase
Set properties for each data element:
- Data type
- Value source
- Semantic definition
- Validation rules
- Conditional formatting
- Event-based scripts
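To make two of these properties concrete, here is a small sketch of how a data type and validation rules could work together. It is not Kodexa's rule engine: the ElementConfig class, the normalizer table, the type names, and the rule format are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from decimal import Decimal
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical normalizers keyed by an assumed data type name.
NORMALIZERS: Dict[str, Callable[[str], Any]] = {
    "STRING": str.strip,
    "NUMBER": lambda raw: Decimal(raw.replace(",", "")),
    "DATE": lambda raw: datetime.strptime(raw, "%Y-%m-%d").date(),
}

Rule = Tuple[str, Callable[[Any], bool]]  # (failure message, check)


@dataclass
class ElementConfig:
    """Illustrative subset of a data element's properties."""
    name: str
    data_type: str
    rules: List[Rule] = field(default_factory=list)


def check(config: ElementConfig, raw: str) -> Tuple[Any, List[str]]:
    """Normalize raw text, then return the value and any failed rule messages."""
    try:
        value = NORMALIZERS[config.data_type](raw)
    except (ValueError, ArithmeticError):
        return None, [f"{config.name}: cannot read '{raw}' as {config.data_type}"]
    failures = [msg for msg, rule in config.rules if not rule(value)]
    return value, failures


total = ElementConfig("InvoiceTotal", "NUMBER", [("must be positive", lambda v: v > 0)])
print(check(total, "1,250.00"))
print(check(total, "-5"))
print(check(total, "n/a"))
```

Normalization runs before validation in this sketch so that rules compare typed values rather than raw text, which mirrors the order in which the properties are listed above.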
3. Training and Testing Phase
Use the Data Definition to:
- Label representative documents
- Train or evaluate extraction models
- Refine semantic definitions
- Test validation and review behavior
4. Production Phase
Use the Data Definition in live workflows to:
- Extract structured data from new documents
- Validate extracted values
- Present data to reviewers through Data Forms
- Feed Activity Plan steps and downstream integrations
Common Use Cases
Invoice Processing
Contract Metadata
Form Data
Best Practices
Design for Reuse
Create Data Definitions that can be reused across similar document types:
- Use business names that make sense across teams
- Factor out common structures
- Keep repeated document patterns consistent
Keep It Simple
Start with the fields that drive the workflow:
- Begin with core data elements
- Add groups only where they improve clarity
- Introduce validation incrementally
Mirror the Business Document
Align the hierarchy with how users understand the document:
- Match visual organization where it helps
- Follow natural reading order
- Group related information
Use Meaningful Names
Choose clear, stable names:
- Use business terminology
- Be specific and unambiguous
- Follow consistent naming conventions
Learn More
Data Definition Structure
Configure data elements, groups, value sources, and extraction behavior
Data Types Reference
Detailed information about available data types and normalization
Building Data Classes
Generate Python data classes from Data Definitions for programmatic access
Data Definitions Overview
Overall Data Definition concepts and patterns
Next Steps
Review the structure guide
Read Data Definition Structure for detailed configuration instructions.
Explore examples
Check out Data Definition examples for common document types.
Try it in a project
Create your first Data Definition in a Kodexa project and test it with sample documents.
