Slug: llm-taxonomy-model
Version: 1.0.0
Infer: Yes
Event Aware: No

Overview

LLM Data Labeling Model

The Kodexa LLM Data Labeling model extracts structured data from documents, guided by a data definition (taxonomy). It combines pattern recognition with Large Language Models (LLMs) to identify and extract the data elements that the definition describes.

How It Works

Data Definition-Based Extraction: The model uses your data definition (taxonomy) to identify what data to extract from documents
Intelligent Chunking: Documents are broken into manageable pieces based on your selected strategy
Content Classification: AI analyzes content to identify relevant sections for each data element
LLM-Powered Extraction: Advanced language models process chunks to extract structured data
Merging and Validation: Results from multiple chunks are intelligently combined and validated
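
The steps above can be sketched as a small pipeline. This is a minimal illustration only, not the Kodexa implementation: the function names are hypothetical, keyword matching stands in for AI classification, and parsing "key: value" lines stands in for the LLM call.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    page: int   # index of the page this chunk came from
    text: str   # text content of the page


def chunk_by_page(pages):
    # Chunking: one chunk per page (one of several possible strategies).
    return [Chunk(page=i, text=text) for i, text in enumerate(pages)]


def classify(chunk, keywords):
    # Classification stand-in: a chunk is relevant if it mentions any keyword.
    text = chunk.text.lower()
    return any(keyword in text for keyword in keywords)


def extract(chunk):
    # Extraction stand-in: treat every "key: value" line as a data element.
    result = {}
    for line in chunk.text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            result[key.strip().lower()] = value.strip()
    return result


def merge(results):
    # Merging: keep the first value seen for each data element.
    merged = {}
    for result in results:
        for key, value in result.items():
            merged.setdefault(key, value)
    return merged


def run_pipeline(pages, keywords):
    chunks = chunk_by_page(pages)
    relevant = [c for c in chunks if classify(c, keywords)]
    return merge(extract(c) for c in relevant)


pages = [
    "Invoice Number: 123\nTotal: 45.00",
    "Terms and conditions apply",
    "Invoice Date: 2024-01-01\nTotal: 99",
]
print(run_pipeline(pages, keywords=["invoice"]))
```

In this sketch the second page is dropped at the classification step, and the conflicting `total` on the third page is ignored because the first value wins.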

Core Features

Flexible Chunking Strategies: Multiple approaches for breaking documents into processable pieces
Multi-Level Classification: Identifies document types and specific data regions
Automatic Labeling: Tags document pages and regions based on extracted content
Embedding Support: Optional vector-based similarity search for classification
Structure Review: AI-based validation of extracted structure
Intelligent Merging: Combines results from multiple document chunks

Options Configuration

Option	Description
taxonomy	The data definition to use for extraction, defining the structure of data to extract
label_document	When enabled, adds labels to the document based on extracted data
set_external_data	Sets external data properties based on extracted structure
apply_guidance	Uses pre-defined guidance to improve extraction accuracy
external_data_key	Key to use when setting external data

Process Flow

Feature Options

Chunking Strategies

Controls how documents are divided for processing:

Whole Document: Process entire document (limited by context window)
Page: Process one page at a time
First n Pages: Process only the first n pages
Records: Detect and process individual records, even across pages
Classified Content: Process all content of a specific type together
Page Classified Content: Process classified content by page
Consecutive Classified Content: Group consecutive classified content
Group Classified Content: Group classified content across pages
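
To make the difference between a few of these strategies concrete, here is a rough sketch that returns chunks as lists of page indexes. It is an illustration under assumptions: the function is hypothetical, the strategies are keyed by their display names from the list above rather than by real configuration values, and per-page classification labels are supplied by the caller.

```python
from itertools import groupby


def chunk_pages(num_pages, strategy, n_pages=5, page_labels=None):
    """Return chunks as lists of page indexes for a few of the strategies."""
    indexes = list(range(num_pages))

    if strategy == "Whole Document":
        # A single chunk holding every page.
        return [indexes] if indexes else []

    if strategy == "Page":
        # One chunk per page.
        return [[i] for i in indexes]

    if strategy == "First n Pages":
        # A single chunk with at most n_pages pages from the start.
        first = indexes[:n_pages]
        return [first] if first else []

    if strategy == "Consecutive Classified Content":
        # Group runs of adjacent pages that share the same label;
        # pages labeled None (unclassified) are skipped.
        if page_labels is None or len(page_labels) != num_pages:
            raise ValueError("page_labels must give one label (or None) per page")
        chunks = []
        for label, group in groupby(indexes, key=lambda i: page_labels[i]):
            if label is not None:
                chunks.append(list(group))
        return chunks

    raise ValueError(f"Unsupported strategy: {strategy}")


labels = ["invoice", "invoice", None, "receipt", "invoice"]
print(chunk_pages(5, "Consecutive Classified Content", page_labels=labels))
```

Note that the last "invoice" page forms its own chunk here because it is not adjacent to the first two; a strategy that groups across pages would instead combine all three.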

Classification Strategies

Controls how document sections are classified:

None: No classification
Data Element: Classify based on data element definition
Data Element and Children: Include child elements in classification
Feature: Use custom feature for classification

Classification Content

Determines what content is used for classification:

Text: Use text content (default)
Bounding Boxes: Use spatial layout information
Images & Bounding Boxes: Use visual and spatial information
Image: Use only image data
Embeddings: Use vector embeddings for similarity matching
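
As an illustration of the Embeddings option, the sketch below ranks pages by similarity to a query vector and keeps the best few. The source does not say which similarity measure the model uses; cosine similarity is an assumption here, and the function names and toy vectors are hypothetical. The `max_hits` parameter mirrors the idea of a cap on embedding hits.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_hits(query_vector, page_vectors, max_hits=5):
    """Return the page ids most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), page_id)
        for page_id, vector in page_vectors.items()
    ]
    # Sort by score only, highest first; ties keep their input order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [page_id for _, page_id in scored[:max_hits]]


# Toy two-dimensional "embeddings" keyed by page number.
page_vectors = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
print(top_hits([1.0, 0.0], page_vectors, max_hits=2))
```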

Example Usage

The LLM Data Labeling model is particularly useful for:

Extracting structured data from semi-structured documents
Building automated document processing pipelines
Creating training datasets for machine learning
Validating document content against expected structure
Integrating document information with downstream systems

Advanced Features

Structure Review: Validates extracted structure against expected schema
AI-Assisted Merging: Uses AI to resolve conflicts when merging data from multiple chunks
Line Fallback: Falls back to line-level labeling when content spans multiple lines and cannot otherwise be located
Thinking Mode: Enables the LLM's thinking mode, where the selected extraction model supports it, which may improve extraction quality
Custom Model Selection: Override default models for classification and extraction

Inference Options

The following options can be configured when using this model for inference:

Name	Label	Type	Description	Default	Required
`taxonomy`	Data Definition	taxonomy	The data definition to use for the model	-	No
`label_document`	Label Document	boolean	Label the document	True	No
`set_external_data`	Set External Data	boolean	Set the external data to the structure from the data classes	False	No
`apply_guidance`	Apply Guidance	boolean	Apply guidance, if found	False	No
`external_data_key`	External Data Key	string	Key to use when setting external data	-	No
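
One way to picture these options is as a dictionary of defaults with caller overrides. The sketch below only illustrates the table above: the helper function is hypothetical, not part of any Kodexa API, and the taxonomy reference and key in the usage line are made-up values.

```python
# Defaults taken from the inference options table; "-" is represented as None.
INFERENCE_DEFAULTS = {
    "taxonomy": None,
    "label_document": True,
    "set_external_data": False,
    "apply_guidance": False,
    "external_data_key": None,
}

BOOLEAN_OPTIONS = ("label_document", "set_external_data", "apply_guidance")


def build_inference_options(**overrides):
    """Combine the defaults with overrides, rejecting unknown names."""
    unknown = set(overrides) - set(INFERENCE_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown option(s): {sorted(unknown)}")
    options = dict(INFERENCE_DEFAULTS)
    options.update(overrides)
    for name in BOOLEAN_OPTIONS:
        if not isinstance(options[name], bool):
            raise TypeError(f"{name} must be a boolean")
    return options


# Hypothetical taxonomy reference and external data key.
options = build_inference_options(
    taxonomy="acme/invoice-taxonomy",
    set_external_data=True,
    external_data_key="invoice",
)
print(options)
```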

Model Details

Provider: Kodexa

Feature Configuration Options

These options allow for fine-tuning the model’s features during configuration:

Option Group 1

Name	Label	Type	Description	Default
`enable_line_fallback`	Enable Line Fallback	boolean	Fall back to line-level labeling when the content spans multiple lines and cannot be located	False
`raise_exception_on_fallback`	Raise Exception on Fallback	boolean	Raise an exception if the model falls back to line-level labeling	False
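
The sketch below shows one plausible reading of how these two options could interact. The source gives only the one-line descriptions above, so the matching logic, the function name, and the exact point at which the exception is raised are all assumptions.

```python
def locate_value(page_lines, value,
                 enable_line_fallback=False,
                 raise_exception_on_fallback=False):
    """Return the indexes of the page lines to label for an extracted value."""
    # First attempt: the whole value appears within a single line.
    for index, line in enumerate(page_lines):
        if value in line:
            return [index]

    # The value was not found as a whole. If it spans multiple lines and
    # the fallback is enabled, label each matching line individually.
    value_lines = [part.strip() for part in value.splitlines() if part.strip()]
    if len(value_lines) > 1 and enable_line_fallback:
        if raise_exception_on_fallback:
            raise RuntimeError("Fell back to line-level labeling")
        return [
            index
            for index, line in enumerate(page_lines)
            if any(part in line for part in value_lines)
        ]

    # Nothing could be located.
    return []


page_lines = ["Ship to:", "ACME Corp", "1 Main St", "Total: 5"]
address = "ACME Corp\n1 Main St"
print(locate_value(page_lines, address, enable_line_fallback=True))
```

Raising on fallback can be useful in testing, where a silent drop to line-level labeling might hide a problem with the extraction.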

Option Group 2

Name	Label	Type	Description	Default
`-`	N/A	article	N/A	-
`embedded`	Embedded	boolean	Treat as embedded	False
`cardinality`	Cardinality	string	The cardinality of the data element in the chunk	single
`classificationStrategy`	Classification Strategy	string	Should we chunk using the data element for classification	dataElement
`classificationContent`	Classification Content	string	What content should we use for classification	text
`maxHits`	Max Embedding Hits	number	The maximum number of hits to return from embeddings	5
`includeExplanation`	Include Explanation	boolean	Try to include an explanation of the classification in the response, useful for debugging	True
`ignoreNonWords`	Ignore Non-Words	boolean	Ignore non-word tokens when classifying	True
`restrictClassification`	Restrict Classification	boolean	Restrict to only this classification (no mixed classes allowed)	True
`rerank`	Rerank Classification Matches	boolean	Should we rerank the classification results	False
`maxPagesFromRerank`	Max Pages From Rerank	number	The maximum number of pages to return from the rerank	5
`chunkingStrategy`	Chunking Strategy	string	How should we chunk the document for the LLM	classifiedContent
`nPages`	Number of Pages	number	The number of pages to use for the chunking strategy	5
`tagPage`	Label Page	boolean	Label the page, if classified	True
`labelDocument`	Tag Document	boolean	Add a tag to the document if classified	True
`promptStrategy`	Prompt Strategy	string	Which prompt strategy should we use	layout
`image_width`	Image Width	number	The width of the image	350
`skipExtraction`	Skip Extraction	boolean	Should we skip extraction for this data element	False
`includeImages`	Include Images	boolean	Should we include images in the prompt even if the strategy doesn’t normally include them	False
`enableThinkingMode`	Enable Thinking Mode	boolean	Should the LLM use thinking mode (if it is available for the selected extraction model)?	False
`overrideExtractionModel`	Override Extraction Model	boolean	Should we override the extraction model	False
`extractionModel`	Extraction Model	cloudModel	Choose a model if you wish to override the extraction model	anthropic.claude-3-haiku-20240307-v1:0
`enableStructureReview`	Enable Structure Review	boolean	Should we enable the structure review	True
`structureReview`	Structure Review Model	cloudModel	Choose a model if you wish to review the structure	anthropic.claude-3-haiku-20240307-v1:0
`merge`	Merge	boolean	Merge the objects identified in the chunks	True
`mergeWithAI`	Merge with AI	boolean	Use AI to review chunks that are grouped and merge them into a single representation	False
`mergeInstructions`	Merge Instructions	string	Additional instructions for merging the results to be included in the merge prompt	-
