Taxonomy Concepts

Taxonomies (called Data Definitions in the UI) define the structure of the data extracted from documents. They provide a hierarchical framework for organizing and extracting structured information.
UI Terminology:
  • Taxonomies in the API are called Data Definitions in the UI
  • Taxons in the API are called Data Elements in the UI
These terms are interchangeable.

What is a Taxonomy?

A taxonomy is made up of several nodes called taxons (a made-up word), each of which represents either:
  • A piece of data to be extracted
  • A grouping of related data elements
This hierarchy of nodes defines the structure of data extraction from documents.

Example Structure

Invoice (Root)
├── Invoice Number (data element)
├── Invoice Date (data element)
├── Vendor (group)
│   ├── Name (data element)
│   ├── Address (data element)
│   └── Tax ID (data element)
└── Line Items (repeating group)
    ├── Description (data element)
    ├── Quantity (data element)
    ├── Unit Price (data element)
    └── Total (calculated element)
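
The same tree can be held in memory as nested dictionaries. The sketch below is only an illustration, not a Kodexa API: the keys `label`, `group`, `allowsMultipleEntries`, and `children` mirror the YAML shown elsewhere on this page, and the `outline` helper is a hypothetical name.

```python
# Hypothetical in-memory form of the invoice example; not a Kodexa API.
invoice = {
    "label": "Invoice",
    "children": [
        {"label": "Invoice Number"},
        {"label": "Invoice Date"},
        {"label": "Vendor", "group": True, "children": [
            {"label": "Name"},
            {"label": "Address"},
            {"label": "Tax ID"},
        ]},
        {"label": "Line Items", "group": True, "allowsMultipleEntries": True,
         "children": [
            {"label": "Description"},
            {"label": "Quantity"},
            {"label": "Unit Price"},
            {"label": "Total"},
        ]},
    ],
}


def outline(taxon, depth=0):
    """Return one indented line per taxon, parents before children."""
    lines = ["  " * depth + taxon["label"]]
    for child in taxon.get("children", []):
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(invoice)))
```

Each level of nesting adds two spaces of indentation, so the printed outline has the same shape as the diagram above.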

Taxonomy Types

While users typically think of taxonomies in terms of the data they want to extract, Kodexa uses several types of taxonomies for different purposes. Labeling is not only about identifying extraction targets; it also covers concepts, markers, and other information that aids the extraction process.

Types of Taxonomies

Content Taxonomy

Defines the structure of data extracted from documents. Represents the business-level data structure understood by users and stakeholders.
Use for: Business data extraction, final output structure

Processing Taxonomy

Used during document processing. These taxonomies are typically provided by assistants and become available when you add an assistant to a project.
Use for: Intermediate processing steps, AI model labels

Model Taxonomy

Provided by models used for training or inference. These taxonomies become available when you add a model to a project or reference one through an assistant.
Use for: ML model training, model-specific labels

Key Concepts

Taxons (Data Elements)

Each taxon in a taxonomy can represent one of the following.
Simple Data Element:
- name: invoice_number
  label: Invoice Number
  taxonType: STRING
Group Container:
- name: vendor
  label: Vendor Information
  group: true
  children:
    - name: name
    - name: address
Repeating Group:
- name: line_items
  label: Line Items
  group: true
  allowsMultipleEntries: true
  children:
    - name: description
    - name: quantity
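
The three shapes above differ only in the `group` and `allowsMultipleEntries` flags. As a rough sketch, assuming a taxon has been loaded into a plain dictionary with those keys and that a missing flag means false (an assumption, not confirmed Kodexa behavior), a hypothetical `taxon_kind` helper could tell them apart like this:

```python
def taxon_kind(taxon):
    """Classify a taxon dict by the flags used in the examples above.

    Assumption: an absent flag is treated as False.
    """
    if not taxon.get("group", False):
        return "data element"
    if taxon.get("allowsMultipleEntries", False):
        return "repeating group"
    return "group"


print(taxon_kind({"name": "invoice_number", "taxonType": "STRING"}))
print(taxon_kind({"name": "vendor", "group": True}))
print(taxon_kind({"name": "line_items", "group": True,
                  "allowsMultipleEntries": True}))
```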

Hierarchy and Relationships

Taxonomies use parent-child relationships to organize data:
  • Root taxons: Top-level elements
  • Child taxons: Nested within parent groups
  • Sibling taxons: At the same level in the hierarchy
This hierarchical structure:
  • Organizes related data logically
  • Mirrors document structure
  • Improves extraction accuracy
  • Makes data easier to work with
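
To make the root, child, and sibling relationships concrete, the sketch below walks a taxon list and gives every taxon a slash-separated path. The path format and the helper names `walk` and `siblings` are assumptions made for this illustration; they are not taken from the Kodexa API.

```python
def walk(taxons, parent_path=""):
    """Yield (path, taxon) for every taxon, depth first."""
    for taxon in taxons:
        name = taxon["name"]
        path = f"{parent_path}/{name}" if parent_path else name
        yield path, taxon
        yield from walk(taxon.get("children", []), path)


def siblings(paths, path):
    """Return the other paths that share the same parent as `path`."""
    parent = path.rpartition("/")[0]
    return [p for p in paths if p != path and p.rpartition("/")[0] == parent]


taxons = [
    {"name": "invoice_number"},
    {"name": "vendor", "group": True,
     "children": [{"name": "name"}, {"name": "address"}]},
]

paths = [path for path, _ in walk(taxons)]
roots = [p for p in paths if "/" not in p]  # root taxons have no parent segment

print(paths)
print(roots)
print(siblings(paths, "vendor/name"))
```

A path with no slash is a root taxon, a path with a slash is a child of the taxon named by everything before the last slash, and two paths with the same prefix are siblings.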

Taxonomy Lifecycle

1. Design Phase

Define the structure based on:
  • Document analysis
  • Business requirements
  • Data relationships
  • Extraction goals

2. Configuration Phase

Set properties for each taxon:
  • Data type
  • Value source
  • Semantic definitions
  • Validation rules

3. Training Phase

Use the taxonomy to:
  • Label training documents
  • Train ML models
  • Refine extraction logic

4. Production Phase

Apply the taxonomy to:
  • Extract data from new documents
  • Validate extracted values
  • Present data to users

Common Use Cases

Invoice Processing

taxons:
  - name: header
    group: true
    children:
      - name: invoice_number
      - name: invoice_date
      - name: due_date

  - name: vendor
    group: true
    children:
      - name: name
      - name: address

  - name: line_items
    group: true
    allowsMultipleEntries: true
    children:
      - name: description
      - name: quantity
      - name: unit_price

Contract Metadata

taxons:
  - name: contract_type
    taxonType: SELECTION

  - name: parties
    group: true
    children:
      - name: party_a
      - name: party_b

  - name: key_dates
    group: true
    children:
      - name: effective_date
      - name: expiration_date

Form Data

taxons:
  - name: applicant
    group: true
    children:
      - name: full_name
      - name: email
      - name: phone

  - name: application_details
    group: true
    children:
      - name: application_type
      - name: submission_date

Best Practices

Create taxonomies that can be reused across similar document types:
  • Use generic names where appropriate
  • Factor out common structures
  • Create taxonomy templates for similar documents
Start with essential fields and add complexity as needed:
  • Begin with core data elements
  • Add groups for organization
  • Introduce validation incrementally
Align taxonomy hierarchy with document layout:
  • Match visual organization
  • Follow natural reading order
  • Group related information
Choose clear, descriptive names:
  • Use business terminology
  • Be specific and unambiguous
  • Follow consistent naming conventions
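
A naming convention is easier to keep when it is checked automatically. The sketch below assumes one possible convention, snake_case names that are unique among siblings, which matches the examples on this page but is not a documented Kodexa requirement. The helper `naming_problems` is hypothetical.

```python
import re

# Assumed convention: lowercase words joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def naming_problems(taxons, parent_path=""):
    """Return a list of human-readable naming issues in a taxon list."""
    problems = []
    seen = set()
    for taxon in taxons:
        name = taxon["name"]
        path = f"{parent_path}/{name}" if parent_path else name
        if not SNAKE_CASE.match(name):
            problems.append(f"{path}: not snake_case")
        if name in seen:
            problems.append(f"{path}: duplicate sibling name")
        seen.add(name)
        # Recurse into groups so nested taxons are checked too.
        problems.extend(naming_problems(taxon.get("children", []), path))
    return problems


form = [
    {"name": "applicant", "group": True, "children": [
        {"name": "full_name"},
        {"name": "Email"},
        {"name": "full_name"},
    ]},
]

for problem in naming_problems(form):
    print(problem)
```

Running a check like this before a taxonomy is shared keeps names consistent across similar document types.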

Next Steps

1. Review the Complete Guide

Read the Taxonomy Guide for detailed configuration instructions

2. Explore Examples

Check out example taxonomies for common document types

3. Try It Out

Create your first taxonomy in a Kodexa project and test with sample documents

4. Iterate and Refine

Use extraction results to improve semantic definitions and validation rules