Overview
This example demonstrates a comprehensive invoice extraction data definition suitable for production use. It extracts header information, vendor details, customer information, line items, and totals with built-in validation.
Complete Invoice Data Definition
slug: invoice-extraction
name: Invoice Data Extraction
description: Extract structured data from invoices including header, vendor, customer, line items, and totals
taxonomyType: CONTENT
enabled: true
taxons:
  # ==========================================
  # Document Metadata
  # ==========================================
  - id: auto-generated
    name: document_filename
    label: Document Filename
    taxonType: STRING
    valuePath: METADATA
    metadataValue: FILENAME
    enabled: true
    group: false
    userEditable: false
    notUserLabelled: true
  - id: auto-generated
    name: processing_date
    label: Processing Date
    taxonType: DATE_TIME
    valuePath: METADATA
    metadataValue: CREATED_DATETIME
    enabled: true
    group: false
    userEditable: false
    notUserLabelled: true
    typeFeatures:
      normalizeDate: true
      dateFormat: "yyyy-MM-dd HH:mm:ss"
  # ==========================================
  # Invoice Header
  # ==========================================
  - id: auto-generated
    name: invoice_number
    label: Invoice Number
    description: Unique identifier for this invoice
    taxonType: STRING
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    color: "#3B82F6"
    semanticDefinition: |
      The unique invoice number, typically found at the top right of the invoice.
      Look for labels like "Invoice #", "Invoice No.", "Invoice Number", or just "#".
      This is usually a combination of letters and numbers (e.g., "INV-2024-001").
    validationRules:
      - name: Invoice number required
        description: Every invoice must have an invoice number
        disabled: false
        conditional: false
        ruleFormula: "NOT_EMPTY(invoice_number)"
        messageFormula: '"Invoice number is required"'
        detailFormula: '"Please verify the invoice has a visible invoice number"'
        overridable: false
        exceptionId: INV_NUMBER_REQUIRED
    typeFeatures:
      expected: true
  - id: auto-generated
    name: invoice_date
    label: Invoice Date
    description: The date the invoice was issued
    taxonType: DATE
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    color: "#10B981"
    semanticDefinition: |
      The date when the invoice was issued by the vendor.
      Look for labels like "Invoice Date", "Date", "Issue Date", or "Billing Date".
      This is different from the due date or payment date.
    typeFeatures:
      normalizeDate: true
      dateFormat: "yyyy-MM-dd"
      expected: true
    validationRules:
      - name: Invoice date required
        ruleFormula: "NOT_EMPTY(invoice_date)"
        messageFormula: '"Invoice date is required"'
        overridable: false
        exceptionId: INV_DATE_REQUIRED
      - name: Invoice date not in future
        ruleFormula: "invoice_date <= TODAY()"
        messageFormula: '"Invoice date cannot be in the future"'
        detailFormula: '"Invoice date: " + invoice_date + ", Today: " + TODAY()'
        overridable: true
        exceptionId: INV_DATE_FUTURE
  - id: auto-generated
    name: due_date
    label: Due Date
    description: Payment due date
    taxonType: DATE
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    color: "#F59E0B"
    semanticDefinition: |
      The date by which payment is due.
      Look for labels like "Due Date", "Payment Due", "Pay By", or "Due By".
    typeFeatures:
      normalizeDate: true
      dateFormat: "yyyy-MM-dd"
    validationRules:
      - name: Due date after invoice date
        conditional: true
        conditionalFormula: "NOT_EMPTY(due_date)"
        ruleFormula: "due_date >= invoice_date"
        messageFormula: '"Due date must be on or after the invoice date"'
        detailFormula: '"Invoice: " + invoice_date + ", Due: " + due_date'
        overridable: false
        exceptionId: DUE_DATE_BEFORE_INV
    conditionalFormats:
      - name: Overdue invoice
        formula: "due_date < TODAY() AND status != 'PAID'"
        backgroundColor: "#FEE2E2"
        textColor: "#991B1B"
        fontWeight: bold
  - id: auto-generated
    name: purchase_order
    label: Purchase Order
    description: Reference PO number if applicable
    taxonType: STRING
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: |
      The customer's purchase order number that this invoice is billing against.
      Look for labels like "PO #", "PO Number", "Purchase Order", "P.O.", or "Reference".
      May not be present on all invoices.
    nullable: true
  - id: auto-generated
    name: payment_terms
    label: Payment Terms
    description: Payment terms and conditions
    taxonType: SELECTION
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: |
      The agreed payment terms for this invoice.
      Look for terms like "Net 30", "Due on Receipt", "COD", etc.
    selectionOptions:
      - label: "Due on Receipt"
        id: "due_on_receipt"
        description: "Payment due immediately upon receipt"
        lexicalRelations:
          - type: SYNONYM
            value: "Immediate, Upon Receipt, COD, Cash on Delivery"
      - label: "Net 10"
        id: "net_10"
        description: "Payment due within 10 days"
        lexicalRelations:
          - type: SYNONYM
            value: "10 days, Within 10 days"
      - label: "Net 30"
        id: "net_30"
        description: "Payment due within 30 days"
        lexicalRelations:
          - type: SYNONYM
            value: "30 days, Within 30 days"
      - label: "Net 60"
        id: "net_60"
        description: "Payment due within 60 days"
        lexicalRelations:
          - type: SYNONYM
            value: "60 days, Within 60 days"
      - label: "Net 90"
        id: "net_90"
        description: "Payment due within 90 days"
        lexicalRelations:
          - type: SYNONYM
            value: "90 days, Within 90 days"
  # ==========================================
  # Vendor Information
  # ==========================================
  - id: auto-generated
    name: vendor
    label: Vendor Information
    description: Details about the vendor/supplier
    enabled: true
    group: true
    children:
      - id: auto-generated
        name: name
        label: Vendor Name
        description: Legal business name of the vendor
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: |
          The vendor's legal business name.
          Look in the top portion of the invoice, near "From", "Vendor", "Supplier", or "Bill From".
          This should be the company name, not an individual's name.
        typeFeatures:
          expected: true
        validationRules:
          - name: Vendor name required
            ruleFormula: "NOT_EMPTY(vendor.name)"
            messageFormula: '"Vendor name is required"'
            overridable: false
            exceptionId: VENDOR_NAME_REQUIRED
      - id: auto-generated
        name: address
        label: Address
        description: Vendor's business address
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: |
          The vendor's complete business address including street, city, state/province, and postal code.
        typeFeatures:
          longText: true
          maxTextRows: 4
      - id: auto-generated
        name: tax_id
        label: Tax ID / VAT Number
        description: Vendor's tax identification number
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: |
          The vendor's tax identification number.
          In the US, look for "EIN", "Tax ID", or "Federal ID".
          In EU, look for "VAT Number", "VAT Reg", or "BTW".
        nullable: true
      - id: auto-generated
        name: email
        label: Email
        description: Vendor contact email
        taxonType: EMAIL_ADDRESS
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: "Vendor's email address for correspondence"
        nullable: true
      - id: auto-generated
        name: phone
        label: Phone Number
        description: Vendor contact phone number
        taxonType: PHONE_NUMBER
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: "Vendor's phone number for contact"
        nullable: true
  # ==========================================
  # Customer/Bill To Information
  # ==========================================
  - id: auto-generated
    name: customer
    label: Customer Information
    description: Details about the customer being billed
    enabled: true
    group: true
    children:
      - id: auto-generated
        name: name
        label: Customer Name
        description: Name of the customer/organization being billed
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: |
          The customer's name or business name.
          Look near "Bill To", "Customer", "Sold To", or "Invoice To".
        typeFeatures:
          expected: true
      - id: auto-generated
        name: address
        label: Billing Address
        description: Customer's billing address
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: "The complete billing address for the customer"
        typeFeatures:
          longText: true
          maxTextRows: 4
  # ==========================================
  # Line Items (Repeating Group)
  # ==========================================
  - id: auto-generated
    name: line_items
    label: Line Items
    description: Individual items or services being billed
    enabled: true
    group: true
    additionContexts:
      - type: RECORD_DEFINITION
        context: |
          Each line item represents a product or service being billed.
          Line items typically appear in a table with columns for description, quantity, unit price, and total.
      - type: RECORD_START_MARKER
        context: "Description, Item, Product"
      - type: RECORD_END_MARKER
        context: "Subtotal, Total"
    children:
      - id: auto-generated
        name: line_number
        label: Line Number
        description: Sequential line item number
        taxonType: NUMBER
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: "The line number or position of this item in the invoice"
        nullable: true
      - id: auto-generated
        name: description
        label: Description
        description: Description of the item or service
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: |
          Description of the product or service being billed.
          This may include product codes, part numbers, or detailed descriptions.
        typeFeatures:
          longText: true
          expected: true
      - id: auto-generated
        name: quantity
        label: Quantity
        description: Number of units
        taxonType: NUMBER
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: "The quantity or number of units for this line item"
        typeFeatures:
          expected: true
        validationRules:
          - name: Quantity must be positive
            ruleFormula: "line_items.quantity > 0"
            messageFormula: '"Quantity must be greater than zero"'
            overridable: false
            exceptionId: QTY_NOT_POSITIVE
      - id: auto-generated
        name: unit_price
        label: Unit Price
        description: Price per unit
        taxonType: CURRENCY
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: "The price for one unit of this item"
        typeFeatures:
          expected: true
      - id: auto-generated
        name: line_total
        label: Line Total
        description: Total for this line (quantity × unit price)
        taxonType: CURRENCY
        valuePath: FORMULA
        enabled: true
        group: false
        semanticDefinition: "line_items.quantity * line_items.unit_price"
        validationRules:
          - name: Line total calculation check
            ruleFormula: "ABS(line_items.line_total - (line_items.quantity * line_items.unit_price)) < 0.01"
            messageFormula: '"Line total does not match quantity × unit price"'
            detailFormula: '"Expected: " + (line_items.quantity * line_items.unit_price) + ", Found: " + line_items.line_total'
            overridable: true
            exceptionId: LINE_TOTAL_MISMATCH
  # ==========================================
  # Totals and Amounts
  # ==========================================
  - id: auto-generated
    name: subtotal
    label: Subtotal
    description: Sum of all line items before tax
    taxonType: CURRENCY
    valuePath: FORMULA
    enabled: true
    group: false
    semanticDefinition: "SUM(line_items.line_total)"
    typeFeatures:
      expected: true
  - id: auto-generated
    name: tax_rate
    label: Tax Rate
    description: Applicable tax rate as percentage
    taxonType: PERCENTAGE
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: "The tax rate applied to this invoice, expressed as a percentage"
    nullable: true
  - id: auto-generated
    name: tax_amount
    label: Tax Amount
    description: Total tax amount
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: |
      The total tax amount charged.
      Look for labels like "Tax", "Sales Tax", "VAT", "GST", or "Tax Amount".
    nullable: true
  - id: auto-generated
    name: shipping_handling
    label: Shipping & Handling
    description: Shipping and handling charges
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: "Shipping and handling fees, if applicable"
    nullable: true
    nullValue: "0.00"
  - id: auto-generated
    name: discount_amount
    label: Discount Amount
    description: Total discount applied
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: "Any discounts applied to the invoice total"
    nullable: true
    nullValue: "0.00"
  - id: auto-generated
    name: total_amount
    label: Total Amount Due
    description: Final amount to be paid
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    color: "#EF4444"
    semanticDefinition: |
      The final total amount due, including all taxes, fees, and discounts.
      This is the bottom-line number the customer must pay.
      Look for labels like "Total", "Total Due", "Amount Due", "Balance Due", or "Grand Total".
    typeFeatures:
      expected: true
      overrideWidth: true
      displayWidth: 150
    validationRules:
      - name: Total amount required
        ruleFormula: "NOT_EMPTY(total_amount)"
        messageFormula: '"Total amount is required"'
        overridable: false
        exceptionId: TOTAL_REQUIRED
      - name: Total calculation verification
        conditional: true
        conditionalFormula: "NOT_EMPTY(subtotal) AND NOT_EMPTY(tax_amount)"
        ruleFormula: |
          ABS(total_amount - (subtotal + COALESCE(tax_amount, 0) + COALESCE(shipping_handling, 0) - COALESCE(discount_amount, 0))) < 0.01
        messageFormula: |
          "Total amount does not match calculated total"
        detailFormula: |
          "Expected: " + (subtotal + COALESCE(tax_amount, 0) + COALESCE(shipping_handling, 0) - COALESCE(discount_amount, 0)) + ", Found: " + total_amount
        overridable: true
        exceptionId: TOTAL_CALC_MISMATCH
      - name: Unusually high amount warning
        ruleFormula: "total_amount < 100000"
        messageFormula: '"Invoice total exceeds $100,000 - please verify accuracy"'
        overridable: true
        exceptionId: HIGH_AMOUNT_WARNING
    conditionalFormats:
      - name: High value invoice
        formula: "total_amount > 10000"
        backgroundColor: "#FEF3C7"
        textColor: "#92400E"
        icon: warning
  # ==========================================
  # Payment Information
  # ==========================================
  - id: auto-generated
    name: payment_status
    label: Payment Status
    description: Current payment status
    taxonType: SELECTION
    valuePath: REVIEW
    enabled: true
    group: false
    semanticDefinition: |
      Current payment status for this invoice
    selectionOptions:
      - label: "Pending"
        id: "pending"
        description: "Payment not yet received"
      - label: "Paid"
        id: "paid"
        description: "Payment received and processed"
      - label: "Overdue"
        id: "overdue"
        description: "Payment past due date"
        isConditional: true
        conditionalFormula: "due_date < TODAY()"
      - label: "Cancelled"
        id: "cancelled"
        description: "Invoice cancelled"
  - id: auto-generated
    name: notes
    label: Notes / Comments
    description: Additional notes or special instructions
    taxonType: STRING
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: |
      Any special notes, instructions, or comments on the invoice.
      Look for sections labeled "Notes", "Comments", "Terms", or "Special Instructions".
    typeFeatures:
      longText: true
      maxTextRows: 6
      markdown: true
    nullable: true
Usage
Extracting Invoice Data
- Create or Update Data Definition: Use this YAML as the data definition in your Kodexa platform
- Process Invoices: Upload invoice documents for extraction
- Review Results: Each defined field is extracted and checked against its validation rules
- Handle Exceptions: Review and resolve any validation failures
Validation Rules
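This data definition includes several validation rules:
- Required Fields: Invoice number, invoice date, vendor name, and total amount
- Date Logic: Due date must be on or after the invoice date, and the invoice date cannot be in the future
- Calculations: Line totals and the invoice total are verified to within 0.01
- Business Rules: Overridable warning for invoice totals of $100,000 or more
- Data Quality: Email and phone fields use the EMAIL_ADDRESS and PHONE_NUMBER types for format handling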
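As a rough illustration of how the invoice total check behaves, the fragment below repeats an abridged copy of the Total calculation verification rule from the definition above, with a worked example in the comments. The invoice figures are invented for illustration. The reading that a rule passes when its ruleFormula evaluates to true, and that overridable: true lets a reviewer accept a failure, is inferred from how the rules in this definition are written rather than from a formal specification.

```yaml
# Hypothetical extracted values (illustrative only, not from a real invoice):
#   subtotal          = 1000.00
#   tax_amount        =   80.00
#   shipping_handling =   25.00
#   discount_amount   = empty, so COALESCE(discount_amount, 0) supplies 0
#   total_amount      = 1105.00
#
# Expected total: 1000.00 + 80.00 + 25.00 - 0 = 1105.00
# Difference:     ABS(1105.00 - 1105.00) = 0.00, below 0.01, so the rule holds.
#
# If total_amount had been extracted as 1150.00 instead, the difference
# would be 45.00 and the rule would fail with TOTAL_CALC_MISMATCH.
- name: Total calculation verification
  conditional: true
  # The check only runs when both inputs are present
  conditionalFormula: "NOT_EMPTY(subtotal) AND NOT_EMPTY(tax_amount)"
  ruleFormula: |
    ABS(total_amount - (subtotal + COALESCE(tax_amount, 0) + COALESCE(shipping_handling, 0) - COALESCE(discount_amount, 0))) < 0.01
  overridable: true
  exceptionId: TOTAL_CALC_MISMATCH
```

The 0.01 tolerance absorbs rounding differences of less than one cent between the extracted total and the recomputed one. The line total check uses the same pattern per line item.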
Customization
Adapt this data definition by:
- Adding industry-specific fields (e.g., project codes for professional services)
- Modifying validation thresholds (e.g., high-value amount limit)
- Adding custom payment terms or statuses
- Including additional vendor or customer fields
- Adding currency-specific formatting rules
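A few of these customizations can be sketched using only keys that already appear in the definition above. The field name project_code, the 50000 threshold, and the Net 45 option are hypothetical examples, and the three entries are separate fragments that belong in different places in the definition, as the comments note.

```yaml
# All names and values below are hypothetical examples, not part of the
# original definition.

# 1. An industry-specific field, added as another entry under `taxons:`
- id: auto-generated
  name: project_code            # hypothetical field name
  label: Project Code
  description: Internal project code for professional services billing
  taxonType: STRING
  valuePath: VALUE_OR_ALL_CONTENT
  enabled: true
  group: false
  semanticDefinition: |
    The project or engagement code this invoice is billed against.
    Look for labels like "Project", "Project Code", or "Job #".
  nullable: true

# 2. A lower high-value threshold, replacing the 100000 limit in the
#    validationRules of total_amount
- name: Unusually high amount warning
  ruleFormula: "total_amount < 50000"
  messageFormula: '"Invoice total exceeds $50,000 - please verify accuracy"'
  overridable: true
  exceptionId: HIGH_AMOUNT_WARNING

# 3. A custom payment term, added under the selectionOptions of payment_terms
- label: "Net 45"
  id: "net_45"
  description: "Payment due within 45 days"
  lexicalRelations:
    - type: SYNONYM
      value: "45 days, Within 45 days"
```

When changing a threshold, keep the ruleFormula and the text in messageFormula in step so the message shown to reviewers matches the limit actually enforced.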
