Data definitions are the foundation of data extraction in Kodexa. They define the hierarchical structure of data elements you want to extract from documents, along with their types, validation rules, and extraction logic.
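As a minimal sketch of what one such data element might look like, the fragment below combines properties that appear individually elsewhere on this page (name, label, taxonType, valuePath, semanticDefinition). The field name vendor_name and its label are illustrative, and the exact set of required properties is an assumption.

```yaml
# Illustrative single data element; vendor_name is a hypothetical field
name: vendor_name
label: Vendor Name
taxonType: STRING
valuePath: VALUE_OR_ALL_CONTENT
semanticDefinition: "Extract the vendor's business name as it appears on the invoice"
```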
The valuePath determines where the taxon gets its data from:
Document (VALUE_OR_ALL_CONTENT)
Extracts data directly from document content using AI/ML models or pattern matching.
When to use: Standard document extraction (invoices, contracts, forms)
Configuration:
valuePath: VALUE_OR_ALL_CONTENT
semanticDefinition: "Extract the vendor's business name as it appears on the invoice"
Features:
Uses semantic definition as extraction prompt
Can leverage document structure and layout
Supports AI-assisted extraction
Metadata (METADATA)
Pulls data from document metadata (filename, creation date, owner, etc.).
When to use: Document properties, system fields, audit trail
Configuration:
valuePath: METADATA
metadataValue: FILENAME # or CREATED_DATETIME, OWNER_NAME, etc.
Available metadata values:
FILENAME - Document filename
TRANSACTION_UUID - Unique transaction identifier
CREATED_DATETIME - Document creation timestamp
DOCUMENT_LABELS - Applied labels
OWNER_NAME - Document owner
DOCUMENT_STATUS - Processing status
PAGE_NUMBER - Current page number
Formula (FORMULA)
Calculates values using formulas that reference other taxons.
When to use: Computed fields, calculations, aggregations
Configuration:
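A minimal sketch of a formula-based taxon, modeled on the total field in the line items group example on this page, where the formula expression is supplied in semanticDefinition. The name and label are illustrative, and quantity and unit_price are assumed to be sibling taxons.

```yaml
# Sketch only: mirrors the `total` child in the line_items example
name: total
label: Total
taxonType: CURRENCY
valuePath: FORMULA
semanticDefinition: "quantity * unit_price"   # references sibling taxons by name
```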
The taxonType defines how the data should be treated and validated:
String
Number
Currency
Date
Date Time
Selection
Boolean
Other Types
taxonType: STRING
typeFeatures:
  longText: true          # Multi-line text field
  maxTextRows: 10         # Maximum rows for display
  markdown: true          # Enable markdown formatting
  expected: true          # Field is expected to be present
  stringExtract: '\d'     # Keep only matching characters (regex)
  stringReplace: '[-\s]'  # Remove matching characters (regex)
Use for: Names, addresses, descriptions, any text content
Use stringExtract and stringReplace to automatically clean extracted values. See String Filters below.
taxonType: NUMBER
typeFeatures:
  truncateDecimal: true  # Round to fixed decimal places
  decimalPlaces: 2       # Number of decimal places
Use for: Quantities, counts, measurements
taxonType: CURRENCY
typeFeatures:
  preferTwoDecimalPlaces: true  # Assume last 2 digits are decimal (1234 → 12.34)
Use for: Prices, totals, monetary amounts
taxonType: DATE
typeFeatures:
  normalizeDate: true          # Normalize for display
  normalizeDateInExport: true  # Normalize in exports
  dateFormat: "yyyy-MM-dd"     # Target format
Use for: Invoice dates, due dates, any date without time
Each item in selectionOptions supports these properties:
| Property | Type | Description |
|---|---|---|
| label | string | Required. Display text shown to the user in the dropdown |
| id | string | Unique identifier for the option (auto-generated if omitted) |
| value | string | The value stored when this option is selected. Defaults to label if empty. Use this to separate display text from stored codes (e.g., label “Net 30” with value “NET_30”) |
| description | string | Description text shown alongside the option |
| hint | string | Additional help text displayed with the option |
| hintMarkdown | boolean | When true, renders the hint as Markdown instead of plain text |
| disabled | string | Set to "true" to disable the option. Disabled options are excluded from AI extraction requests but remain visible (struck through) in the UI. Useful for deprecating options without breaking existing data |
| isConditional | boolean | Enables conditional visibility for this option |
| conditionalFormula | string | Formula evaluated per data object to determine if this option appears. Only used when isConditional is true |
| lexicalRelations | array | Semantic relationships that help AI/ML models understand option equivalences. See Lexical Relations below |
When value is set, the UI displays the label but stores the value. This is useful when you need human-readable display text but machine-friendly stored values:
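For instance, a payment-terms option list might separate display text from stored codes as sketched below. The label/value pair for Net 30 comes from the description of the value property above; the second option is illustrative.

```yaml
selectionOptions:
  - label: "Net 30"   # shown to the user
    value: "NET_30"   # stored when this option is selected
  - label: "Net 60"   # hypothetical second option
    value: "NET_60"
```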
Disabled options are excluded from AI extraction prompts (so the model won’t extract them from new documents) but remain visible in the UI for historical data:
selectionOptions:
  - label: "Net 30"
    id: "net_30"
    disabled: ""      # Active: included in AI requests
  - label: "Net 15"
    id: "net_15"
    disabled: "true"  # Deprecated: excluded from AI requests, shown struck-through
The disabled field is a string, not a boolean. Use "true" to disable and "" (empty string) or omit for enabled.
The conditionalFormula is evaluated against the current data object at runtime. Options where the formula evaluates to false are hidden from the dropdown. Options without isConditional always appear.
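A sketch of a conditional option follows. It assumes conditionalFormula accepts the same formula syntax used by the validation rules on this page (for example NOT_EMPTY(...)); the option labels and the due_date attribute are illustrative.

```yaml
selectionOptions:
  - label: "Due on receipt"   # no isConditional, so it always appears
  - label: "Net 30"
    isConditional: true
    # Assumed syntax: option is shown only when the data object has a due_date
    conditionalFormula: "NOT_EMPTY(due_date)"
```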
Lexical relations help AI models understand synonyms and related terms for each option, improving extraction accuracy when documents use varied terminology.
For dynamic selection options, the formula is evaluated by the GoJA scripting runtime and must return an array of {label, value} objects or plain strings. Options are re-evaluated automatically when referenced attributes change, and the results are persisted on the data object so they survive page reloads.
For the full guide on dynamic selection options — including service bridge integration, grid child formulas, dependency tracking, and troubleshooting — see Selection Option Formulas.
Groups organize related taxons and can represent repeating structures:
name: line_items
label: Line Items
group: true # This is a group, not a value
children:
  - name: description
    label: Description
    taxonType: STRING
  - name: quantity
    label: Quantity
    taxonType: NUMBER
  - name: unit_price
    label: Unit Price
    taxonType: CURRENCY
  - name: total
    label: Total
    taxonType: CURRENCY
    valuePath: FORMULA
    semanticDefinition: "quantity * unit_price"
Attach reactive scripts that run when child attributes change. Event subscriptions can derive values, enforce business rules, or call external systems. Only available on group taxons.
eventSubscriptions:
  - name: derive-total
    on: "changed:dataAttribute"
    dependsOn:
      - quantity
      - unit_price
    script: |
      if (!currentObject) return;
      var qty = currentObject.GetFirstAttributeValue("quantity");
      var price = currentObject.GetFirstAttributeValue("unit_price");
      if (qty && price) {
        bridge.data.setAttribute(currentObject.GetID(), "line_total", qty * price);
      }
validationRules:
  - name: "Total matches sum of line items"
    description: "Ensure calculated total matches the invoice total"
    disabled: false
    conditional: false # Apply always
    ruleFormula: |
      ABS(total_amount - SUM(line_items.total)) < 0.01
    messageFormula: |
      "Total mismatch: Invoice shows " + total_amount + " but line items sum to " + SUM(line_items.total)
    detailFormula: |
      "Check line items for accuracy"
    overridable: true # User can override this validation
    exceptionId: "TOTAL_MISMATCH" # Unique exception identifier
    supportArticleId: "9117988" # Link to help article
  - name: "Due date after invoice date"
    conditional: true # Only apply if condition met
    conditionalFormula: "NOT_EMPTY(due_date)"
    ruleFormula: |
      due_date > invoice_date
    messageFormula: |
      "Due date must be after invoice date"
    overridable: false # Strict validation
Help AI/ML models understand and classify content:
Semantic Definition
Provides guidance for AI extraction:
semanticDefinition: |
  The vendor's legal business name as registered with tax authorities.
  Look for names near "Bill To", "Vendor", or "From" sections.
  Should be a proper business name, not an individual's name.
Best practices:
Be specific about what to look for
Describe location hints
Clarify edge cases
Provide examples if helpful
Additional Context
Helps with record-based chunking and classification:
additionContexts:
  - type: RECORD_DEFINITION
    context: |
      Each line item represents a product or service being billed.
      Line items typically appear in a table format.
  - type: RECORD_START_MARKER
    context: "Item #"
  - type: RECORD_END_MARKER
    context: "Subtotal"
Context types:
RECORD_DEFINITION - Describes the record structure
RECORD_START_MARKER - Text indicating record start
RECORD_END_MARKER - Text indicating record end
String Filters
Automatically clean extracted values using regex patterns. stringExtract keeps only the characters that match its pattern; stringReplace removes the characters that match its pattern. If both are set, the extract step runs first. The original value is preserved separately.
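A sketch of both filters, reusing the patterns from the String type example on this page. The field names and sample inputs are hypothetical, and the results mentioned in the comments are what the description above implies rather than verified output.

```yaml
# Keep only digits: "(555) 123-4567" would be expected to become "5551234567"
name: phone_number
taxonType: STRING
typeFeatures:
  stringExtract: '\d'
---
# Remove hyphens and whitespace: "AB-12 34" would be expected to become "AB1234"
name: reference_code
taxonType: STRING
typeFeatures:
  stringReplace: '[-\s]'
```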
multiValue: true        # Allow multiple values
userEditable: true      # User can edit in forms
notUserLabelled: false  # Show in labeling interface
nullable: true          # Allow null values
nullValue: "N/A"        # Display text for null
Write semantic definitions as if explaining to a human what to look for. Be specific about:
What the field represents
Where it typically appears
How to identify it
Edge cases to consider
Good example:
semanticDefinition: |
  The total amount due on the invoice, including all taxes and fees.
  Look for labels like "Total", "Amount Due", "Balance Due", or "Total Amount".
  This should be the final bottom-line number, not a subtotal.
  If multiple totals exist (e.g., by currency), extract the primary total.