Taxonomy Concepts

Taxonomies (called Data Definitions in the UI) define the structure of the data extracted from documents. They provide a hierarchical framework for organizing that data.

UI Terminology:

Taxonomies in the API are called Data Definitions in the UI
Taxons in the API are called Data Elements in the UI

These terms are interchangeable.

What is a Taxonomy?

A taxonomy is made up of nodes called taxons (a term borrowed from biological classification), each of which represents either:

A piece of data to be extracted
A grouping of related data elements

This hierarchy of nodes defines the structure of data extraction from documents.

Example Structure

Invoice (Root)
├── Invoice Number (data element)
├── Invoice Date (data element)
├── Vendor (group)
│   ├── Name (data element)
│   ├── Address (data element)
│   └── Tax ID (data element)
└── Line Items (repeating group)
    ├── Description (data element)
    ├── Quantity (data element)
    ├── Unit Price (data element)
    └── Total (calculated element)
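
As a rough illustration, the tree above can be modelled as nested Python data and walked to list the path of every data element. The dictionary layout, the `repeating` key, and the `leaf_paths` helper are hypothetical names for this sketch, not part of the Kodexa API.

```python
# Hypothetical in-memory model of the Invoice example; not the Kodexa API.
INVOICE = {
    "name": "Invoice",
    "children": [
        {"name": "Invoice Number"},
        {"name": "Invoice Date"},
        {"name": "Vendor", "children": [
            {"name": "Name"}, {"name": "Address"}, {"name": "Tax ID"},
        ]},
        {"name": "Line Items", "repeating": True, "children": [
            {"name": "Description"}, {"name": "Quantity"},
            {"name": "Unit Price"}, {"name": "Total"},
        ]},
    ],
}


def leaf_paths(node, prefix=""):
    """Yield a slash-separated path for every leaf (data element) under node."""
    path = f"{prefix}/{node['name']}" if prefix else node["name"]
    children = node.get("children", [])
    if not children:
        yield path
        return
    for child in children:
        yield from leaf_paths(child, path)


paths = list(leaf_paths(INVOICE))
for p in paths:
    print(p)
```

Groups never appear in the output on their own; they only contribute a path segment to the elements nested inside them.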

Taxonomy Types

While users typically think of taxonomies in terms of the data they want to extract, Kodexa uses several types of taxonomy for different purposes. Labeling is not only about identifying extraction targets; it also covers concepts, markers, and other information that aids the extraction process.

Types of Taxonomies

Content Taxonomy

Defines the structure of data extracted from documents. Represents the business-level data structure understood by users and stakeholders.

Use for: Business data extraction, final output structure

Processing Taxonomy

Used during document processing. These taxonomies are typically provided by assistants and become available when you add an assistant to a project.

Use for: Intermediate processing steps, AI model labels

Module Taxonomy

Provided by modules used for training or inference. These taxonomies become available when you add a module to a project or reference one through an assistant.

Use for: ML module training, module-specific labels

Key Concepts

Taxons (Data Elements)

Each taxon in a taxonomy can represent one of the following.

Simple Data Element:

- name: invoice_number
  label: Invoice Number
  taxonType: STRING

Group Container:

- name: vendor
  label: Vendor Information
  group: true
  children:
    - name: name
    - name: address

Repeating Group:

- name: line_items
  label: Line Items
  group: true
  allowsMultipleEntries: true
  children:
    - name: description
    - name: quantity

Hierarchy and Relationships

Taxonomies use parent-child relationships to organize data:

Root taxons: Top-level elements
Child taxons: Nested within parent groups
Sibling taxons: At the same level in the hierarchy

This hierarchical structure:

Organizes related data logically
Mirrors document structure
Improves extraction accuracy
Makes data easier to work with
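
The root, child, and sibling relationships can be made concrete with a small lookup table. This is a sketch only: it assumes taxons are held as nested dictionaries and identifies each one by a slash-separated path, and both `index_by_path` and `siblings` are hypothetical helpers.

```python
TAXONS = [
    {"name": "invoice_number"},
    {"name": "vendor", "children": [{"name": "name"}, {"name": "address"}]},
]


def index_by_path(taxons, parent_path=None, index=None):
    """Map each taxon's path to its parent's path (None for root taxons)."""
    index = {} if index is None else index
    for taxon in taxons:
        name = taxon["name"]
        path = name if parent_path is None else f"{parent_path}/{name}"
        index[path] = parent_path
        index_by_path(taxon.get("children", []), path, index)
    return index


def siblings(index, path):
    """Return the paths that share a parent with the given path."""
    parent = index[path]
    return [p for p, par in index.items() if par == parent and p != path]


parents = index_by_path(TAXONS)
print(parents["vendor/name"])           # parent of a child taxon
print(siblings(parents, "vendor/name"))  # taxons at the same level
```

Paths are used as keys rather than bare names because the same name (for example `name`) can legitimately appear under more than one group.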

Taxonomy Lifecycle

1. Design Phase

Define the structure based on:

Document analysis
Business requirements
Data relationships
Extraction goals

2. Configuration Phase

Set properties for each taxon:

Data type
Value source
Semantic definitions
Validation rules
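
One simple check during configuration is to confirm that every data element has a data type set. The sketch below assumes taxons are loaded as nested dictionaries and that the type lives under the `taxonType` key shown earlier; `missing_types` is a hypothetical helper, and whether an untyped taxon is actually an error depends on your project.

```python
def missing_types(taxons, path=""):
    """Return dotted paths of data elements that have no taxonType set."""
    missing = []
    for taxon in taxons:
        here = f"{path}{taxon['name']}"
        if taxon.get("group"):
            # Groups carry no value of their own; check their children instead.
            missing += missing_types(taxon.get("children", []), f"{here}.")
        elif "taxonType" not in taxon:
            missing.append(here)
    return missing


taxons = [
    {"name": "invoice_number", "taxonType": "STRING"},
    {"name": "vendor", "group": True, "children": [
        {"name": "name"},
        {"name": "tax_id", "taxonType": "STRING"},
    ]},
]

untyped = missing_types(taxons)
print(untyped)
```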

3. Training Phase

Use the taxonomy to:

Label training documents
Train ML models
Refine extraction logic

4. Production Phase

Apply the taxonomy to:

Extract data from new documents
Validate extracted values
Present data to users
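
Validating extracted values starts with checking that a result has the shape the taxonomy describes. The following sketch assumes, purely for illustration, that extracted data arrives as nested dictionaries, with a list of dictionaries for each repeating group; the actual Kodexa output format may differ, and `check` is a hypothetical helper.

```python
def check(taxons, record, path=""):
    """Return a list of places where record does not match the taxon structure."""
    problems = []
    known = {t["name"] for t in taxons}
    for key in record:
        if key not in known:
            problems.append(f"{path}{key}: not defined in taxonomy")
    for taxon in taxons:
        name = taxon["name"]
        if name not in record or not taxon.get("group"):
            continue  # missing values and simple elements pass in this sketch
        value, here = record[name], f"{path}{name}"
        children = taxon.get("children", [])
        if taxon.get("allowsMultipleEntries"):
            if not isinstance(value, list):
                problems.append(f"{here}: expected a list of entries")
                continue
            for i, row in enumerate(value):
                problems += check(children, row, f"{here}[{i}].")
        elif not isinstance(value, dict):
            problems.append(f"{here}: expected a group")
        else:
            problems += check(children, value, f"{here}.")
    return problems


TAXONS = [
    {"name": "invoice_number"},
    {"name": "vendor", "group": True, "children": [{"name": "name"}]},
    {"name": "line_items", "group": True, "allowsMultipleEntries": True,
     "children": [{"name": "description"}, {"name": "quantity"}]},
]
record = {
    "invoice_number": "INV-1",
    "vendor": {"name": "Acme", "fax": "n/a"},
    "line_items": {"description": "Widget"},  # should be a list
}
problems = check(TAXONS, record)
print(problems)
```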

Common Use Cases

Invoice Processing

taxons:
  - name: header
    group: true
    children:
      - name: invoice_number
      - name: invoice_date
      - name: due_date

  - name: vendor
    group: true
    children:
      - name: name
      - name: address

  - name: line_items
    group: true
    allowsMultipleEntries: true
    children:
      - name: description
      - name: quantity
      - name: unit_price

Contract Metadata

taxons:
  - name: contract_type
    taxonType: SELECTION

  - name: parties
    group: true
    children:
      - name: party_a
      - name: party_b

  - name: key_dates
    group: true
    children:
      - name: effective_date
      - name: expiration_date

Form Data

taxons:
  - name: applicant
    group: true
    children:
      - name: full_name
      - name: email
      - name: phone

  - name: application_details
    group: true
    children:
      - name: application_type
      - name: submission_date

Best Practices

Design for Reusability

Create taxonomies that can be reused across similar document types:

Use generic names where appropriate
Factor out common structures
Create taxonomy templates for similar documents
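
Factoring out a common structure can be as simple as a function that returns a fresh copy of the shared group. This is a sketch under the assumption that taxonomies are assembled as Python dictionaries before being saved; `contact_group` and the `counterparty` taxon are hypothetical.

```python
def contact_group(name, label):
    """Return a new group taxon containing the shared contact fields."""
    return {
        "name": name,
        "label": label,
        "group": True,
        "children": [{"name": "name"}, {"name": "address"}],
    }


# The same structure is reused in two otherwise unrelated taxonomies.
invoice_taxons = [
    {"name": "invoice_number"},
    contact_group("vendor", "Vendor Information"),
]
contract_taxons = [
    {"name": "contract_type"},
    contact_group("counterparty", "Counterparty"),
]

print(invoice_taxons[1]["children"] == contract_taxons[1]["children"])
```

Building a new dictionary on every call keeps the two taxonomies independent, so a later change to one does not silently alter the other.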

Keep It Simple

Start with essential fields and add complexity as needed:

Begin with core data elements
Add groups for organization
Introduce validation incrementally

Mirror Document Structure

Align taxonomy hierarchy with document layout:

Match visual organization
Follow natural reading order
Group related information

Use Meaningful Names

Choose clear, descriptive names:

Use business terminology
Be specific and unambiguous
Follow consistent naming conventions
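
A naming convention is easier to keep when it is checked automatically. The sketch below assumes the convention is lowercase snake_case, as used in the examples on this page, and also flags duplicate names within the same group; `naming_issues` is a hypothetical helper and the rules are illustrative.

```python
import re

# Lowercase words separated by single underscores, e.g. invoice_number.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def naming_issues(taxons, path=""):
    """Return naming problems found in a nested list of taxon dictionaries."""
    issues = []
    seen = set()
    for taxon in taxons:
        name = taxon["name"]
        here = f"{path}{name}"
        if not SNAKE_CASE.match(name):
            issues.append(f"{here}: not snake_case")
        if name in seen:
            issues.append(f"{here}: duplicate sibling name")
        seen.add(name)
        issues += naming_issues(taxon.get("children", []), f"{here}.")
    return issues


taxons = [
    {"name": "invoice_number"},
    {"name": "Vendor Info", "children": [{"name": "name"}, {"name": "name"}]},
]
issues = naming_issues(taxons)
print(issues)
```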

Learn More

Complete Taxonomy Guide

Comprehensive guide to configuring taxonomies with all available options

Data Types Reference

Detailed information about available data types and normalization

Building Data Classes

Generate Python data classes from taxonomies for programmatic access

Data Definitions Overview

Overall data definitions concepts and patterns

Next Steps

Review the Complete Guide

Read the Taxonomy Guide for detailed configuration instructions

Explore Examples

Check out example taxonomies for common document types

Try It Out

Create your first taxonomy in a Kodexa project and test with sample documents

Iterate and Refine

Use extraction results to improve semantic definitions and validation rules
