Taxonomy Concepts
Taxonomies (also called Data Definitions in the UI) are used to define the structure of the data that is extracted from documents. They provide a hierarchical framework for organizing and extracting structured information.
UI Terminology:
- Taxonomies in the API are called Data Definitions in the UI
- Taxons in the API are called Data Elements in the UI
What is a Taxonomy?
A taxonomy is made up of several nodes called taxons (a made-up word), each of which represents either:
- A piece of data to be extracted
- A grouping of related data elements
Example Structure
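As a minimal sketch of such a structure, the following Python models a taxonomy as a tree of taxons. The `Taxon` class, the `render` helper, and the invoice field names are hypothetical illustrations, not the Kodexa API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Taxon:
    """Hypothetical node: a data element, or a group when it has children."""
    name: str
    children: List["Taxon"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return bool(self.children)


def render(taxon: Taxon, depth: int = 0) -> List[str]:
    """Return one indented line per taxon, parents before children."""
    kind = "group" if taxon.is_group else "data"
    lines = [f"{'  ' * depth}{taxon.name} ({kind})"]
    for child in taxon.children:
        lines.extend(render(child, depth + 1))
    return lines


# Illustrative names only; a real taxonomy would use your own fields.
invoice = Taxon("Invoice", [
    Taxon("InvoiceNumber"),
    Taxon("InvoiceDate"),
    Taxon("LineItems", [Taxon("Description"), Taxon("Amount")]),
])

print("\n".join(render(invoice)))
```

Here `LineItems` is a grouping taxon, while `InvoiceNumber` and `InvoiceDate` are pieces of data to be extracted.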
Taxonomy Types
While users typically think of taxonomies in terms of the data they want to extract, Kodexa uses multiple types of taxonomies for different purposes. This is because the labeling process isn’t simply about identifying extraction targets; it’s also about labeling concepts, markers, and other information that aids the extraction process.
Types of Taxonomies
Content Taxonomy
Defines the structure of data extracted from documents. Represents the business-level data structure understood by users and stakeholders.
Use for: Business data extraction, final output structure
Processing Taxonomy
Used during document processing. These taxonomies are typically provided by assistants and become available when you add an assistant to a project.
Use for: Intermediate processing steps, AI model labels
Model Taxonomy
Provided by models used for training or inference. These taxonomies become available when you add a model to a project or reference one through an assistant.
Use for: ML model training, model-specific labels
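To make the distinction concrete, here is a small sketch that tags each taxonomy in a project with one of the three types and then selects only those that define the final output. The `TaxonomyType` enum, the taxonomy names, and the `names_of_type` helper are assumptions made for illustration; they are not taken from the Kodexa API.

```python
from enum import Enum
from typing import Dict, List


class TaxonomyType(Enum):
    """The three taxonomy types described above (identifiers are illustrative)."""
    CONTENT = "content"
    PROCESSING = "processing"
    MODEL = "model"


# Hypothetical project listing: taxonomy name -> type.
project_taxonomies: Dict[str, TaxonomyType] = {
    "invoice-data": TaxonomyType.CONTENT,
    "assistant-markers": TaxonomyType.PROCESSING,
    "layout-model-labels": TaxonomyType.MODEL,
}


def names_of_type(taxonomies: Dict[str, TaxonomyType],
                  wanted: TaxonomyType) -> List[str]:
    """Return the sorted names of all taxonomies of the requested type."""
    return sorted(name for name, kind in taxonomies.items() if kind is wanted)


# Only content taxonomies describe the business-level output structure.
print(names_of_type(project_taxonomies, TaxonomyType.CONTENT))
```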
Key Concepts
Taxons (Data Elements)
Each taxon in a taxonomy can represent either a simple data element or a group of related data elements.
Hierarchy and Relationships
Taxonomies use parent-child relationships to organize data:
- Root taxons: Top-level elements
- Child taxons: Nested within parent groups
- Sibling taxons: At the same level in the hierarchy
A hierarchical structure:
- Organizes related data logically
- Mirrors document structure
- Improves extraction accuracy
- Makes data easier to work with
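The root, child, and sibling relationships can be sketched with a plain child-to-parent map. The taxon names and helper functions below are hypothetical and only illustrate the relationships; they do not reflect how Kodexa stores a taxonomy.

```python
from typing import Dict, List, Optional

# Hypothetical taxonomy as a child -> parent map; None marks a root taxon.
parents: Dict[str, Optional[str]] = {
    "Invoice": None,
    "InvoiceNumber": "Invoice",
    "Vendor": "Invoice",
    "VendorName": "Vendor",
    "VendorAddress": "Vendor",
}


def roots(parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Top-level taxons: those without a parent."""
    return [name for name, parent in parent_of.items() if parent is None]


def children(parent_of: Dict[str, Optional[str]], name: str) -> List[str]:
    """Taxons nested directly under the given parent."""
    return [child for child, parent in parent_of.items() if parent == name]


def siblings(parent_of: Dict[str, Optional[str]], name: str) -> List[str]:
    """Other taxons that share the same parent."""
    parent = parent_of[name]
    return [other for other, p in parent_of.items()
            if p == parent and other != name]


def path(parent_of: Dict[str, Optional[str]], name: str) -> str:
    """Walk up to the root and return the path, root first."""
    parts: List[str] = []
    current: Optional[str] = name
    while current is not None:
        parts.append(current)
        current = parent_of[current]
    return "/".join(reversed(parts))


print(path(parents, "VendorName"))
```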
Taxonomy Lifecycle
1. Design Phase
Define the structure based on:
- Document analysis
- Business requirements
- Data relationships
- Extraction goals
2. Configuration Phase
Set properties for each taxon:
- Data type
- Value source
- Semantic definitions
- Validation rules
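The four properties above can be pictured as fields on a per-taxon configuration object. This is a sketch under stated assumptions: the `TaxonConfig` class, its field names, the example values, and the `INV-` number pattern are all invented for illustration and are not Kodexa's configuration schema.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaxonConfig:
    """Hypothetical per-taxon settings mirroring the four properties above."""
    name: str
    data_type: str            # e.g. "string", "number", "date" (assumed values)
    value_source: str         # e.g. "document" or "derived" (assumed values)
    semantic_definition: str  # plain-language description of the field
    validators: List[Callable[[str], bool]] = field(default_factory=list)

    def is_valid(self, value: str) -> bool:
        """A value is valid when every validation rule accepts it."""
        return all(check(value) for check in self.validators)


invoice_number = TaxonConfig(
    name="InvoiceNumber",
    data_type="string",
    value_source="document",
    semantic_definition="The unique identifier printed on the invoice",
    # Assumed rule: "INV-" followed by at least four digits.
    validators=[lambda v: re.fullmatch(r"INV-\d{4,}", v) is not None],
)

print(invoice_number.is_valid("INV-00123"))
```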
3. Training Phase
Use the taxonomy to:
- Label training documents
- Train ML models
- Refine extraction logic
4. Production Phase
Apply the taxonomy to:
- Extract data from new documents
- Validate extracted values
- Present data to users
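The validation step can be illustrated with a short sketch that parses raw extracted strings according to each taxon's data type and reports any failures. The `PARSERS` mapping, the field names, and the `validate` function are hypothetical; a real project would rely on the validation rules configured in the taxonomy.

```python
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Callable, Dict, Tuple

# Hypothetical mapping of taxon name -> parser for its data type.
PARSERS: Dict[str, Callable[[str], object]] = {
    "InvoiceNumber": str,
    "InvoiceDate": date.fromisoformat,
    "TotalAmount": Decimal,
}


def validate(extracted: Dict[str, str]) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Split raw extracted strings into parsed values and per-field errors."""
    values: Dict[str, object] = {}
    errors: Dict[str, str] = {}
    for name, parser in PARSERS.items():
        raw = extracted.get(name)
        if raw is None:
            errors[name] = "missing"
            continue
        try:
            values[name] = parser(raw)
        except (ValueError, InvalidOperation):
            errors[name] = f"cannot parse {raw!r}"
    return values, errors


values, errors = validate({
    "InvoiceNumber": "INV-1",
    "InvoiceDate": "2024-03-15",
    "TotalAmount": "12x",
})
print(values)
print(errors)
```

Returning errors alongside the parsed values, rather than raising on the first failure, lets all problems in a document be shown to the user at once.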
Common Use Cases
Invoice Processing
Contract Metadata
Form Data
Best Practices
Design for Reusability
Create taxonomies that can be reused across similar document types:
- Use generic names where appropriate
- Factor out common structures
- Create taxonomy templates for similar documents
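Factoring out a common structure might look like the sketch below, where one address group is defined once and reused in two taxonomies. The nested-dict shape, the `address_group` helper, and all names are assumptions for illustration, not a Kodexa format.

```python
from typing import Any, Dict


def address_group(name: str) -> Dict[str, Any]:
    """Build a reusable address group with a fresh list of child taxons."""
    return {
        "name": name,
        "children": [
            {"name": "Street"},
            {"name": "City"},
            {"name": "PostalCode"},
        ],
    }


# The same group definition is reused under different names.
invoice = {
    "name": "Invoice",
    "children": [
        address_group("BillingAddress"),
        address_group("ShippingAddress"),
    ],
}
purchase_order = {
    "name": "PurchaseOrder",
    "children": [address_group("DeliveryAddress")],
}

print([child["name"] for child in invoice["children"]])
```

Building the group through a function gives each taxonomy its own copy, so a later change to one taxonomy's address fields does not leak into the others.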
Keep It Simple
Start with essential fields and add complexity as needed:
- Begin with core data elements
- Add groups for organization
- Introduce validation incrementally
Mirror Document Structure
Align taxonomy hierarchy with document layout:
- Match visual organization
- Follow natural reading order
- Group related information
Use Meaningful Names
Choose clear, descriptive names:
- Use business terminology
- Be specific and unambiguous
- Follow consistent naming conventions
Learn More
- Complete Taxonomy Guide: Comprehensive guide to configuring taxonomies with all available options
- Data Types Reference: Detailed information about available data types and normalization
- Building Data Classes: Generate Python data classes from taxonomies for programmatic access
- Data Definitions Overview: Overall data definitions concepts and patterns
Next Steps
1. Review the Complete Guide
Read the Taxonomy Guide for detailed configuration instructions
2. Explore Examples
Check out example taxonomies for common document types
3. Try It Out
Create your first taxonomy in a Kodexa project and test with sample documents
4. Iterate and Refine
Use extraction results to improve semantic definitions and validation rules
