Overview

This example shows a comprehensive invoice extraction data definition suitable for production use. It extracts document metadata, header fields, vendor and customer details, line items, totals, and payment status, with validation rules built in.

Complete Invoice Data Definition

slug: invoice-extraction
name: Invoice Data Extraction
description: Extract structured data from invoices including header, vendor, customer, line items, and totals
taxonomyType: CONTENT
enabled: true

taxons:
  # ==========================================
  # Document Metadata
  # ==========================================
  - id: auto-generated
    name: document_filename
    label: Document Filename
    taxonType: STRING
    valuePath: METADATA
    metadataValue: FILENAME
    enabled: true
    group: false
    userEditable: false
    notUserLabelled: true

  - id: auto-generated
    name: processing_date
    label: Processing Date
    taxonType: DATE_TIME
    valuePath: METADATA
    metadataValue: CREATED_DATETIME
    enabled: true
    group: false
    userEditable: false
    notUserLabelled: true
    typeFeatures:
      normalizeDate: true
      dateFormat: "yyyy-MM-dd HH:mm:ss"

  # ==========================================
  # Invoice Header
  # ==========================================
  - id: auto-generated
    name: invoice_number
    label: Invoice Number
    description: Unique identifier for this invoice
    taxonType: STRING
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    color: "#3B82F6"
    semanticDefinition: |
      The unique invoice number, typically found at the top right of the invoice.
      Look for labels like "Invoice #", "Invoice No.", "Invoice Number", or just "#".
      This is usually a combination of letters and numbers (e.g., "INV-2024-001").
    validationRules:
      - name: Invoice number required
        description: Every invoice must have an invoice number
        disabled: false
        conditional: false
        ruleFormula: "NOT_EMPTY(invoice_number)"
        messageFormula: '"Invoice number is required"'
        detailFormula: '"Please verify the invoice has a visible invoice number"'
        overridable: false
        exceptionId: INV_NUMBER_REQUIRED
    typeFeatures:
      expected: true

  - id: auto-generated
    name: invoice_date
    label: Invoice Date
    description: The date the invoice was issued
    taxonType: DATE
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    color: "#10B981"
    semanticDefinition: |
      The date when the invoice was issued by the vendor.
      Look for labels like "Invoice Date", "Date", "Issue Date", or "Billing Date".
      This is different from the due date or payment date.
    typeFeatures:
      normalizeDate: true
      dateFormat: "yyyy-MM-dd"
      expected: true
    validationRules:
      - name: Invoice date required
        ruleFormula: "NOT_EMPTY(invoice_date)"
        messageFormula: '"Invoice date is required"'
        overridable: false
        exceptionId: INV_DATE_REQUIRED

      - name: Invoice date not in future
        ruleFormula: "invoice_date <= TODAY()"
        messageFormula: '"Invoice date cannot be in the future"'
        detailFormula: '"Invoice date: " + invoice_date + ", Today: " + TODAY()'
        overridable: true
        exceptionId: INV_DATE_FUTURE

  - id: auto-generated
    name: due_date
    label: Due Date
    description: Payment due date
    taxonType: DATE
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    color: "#F59E0B"
    semanticDefinition: |
      The date by which payment is due.
      Look for labels like "Due Date", "Payment Due", "Pay By", or "Due By".
    typeFeatures:
      normalizeDate: true
      dateFormat: "yyyy-MM-dd"
    validationRules:
      - name: Due date after invoice date
        conditional: true
        conditionalFormula: "NOT_EMPTY(due_date)"
        ruleFormula: "due_date >= invoice_date"
        messageFormula: '"Due date must be on or after the invoice date"'
        detailFormula: '"Invoice: " + invoice_date + ", Due: " + due_date'
        overridable: false
        exceptionId: DUE_DATE_BEFORE_INV
    conditionalFormats:
      - name: Overdue invoice
        formula: "due_date < TODAY() AND payment_status != 'paid'"
        backgroundColor: "#FEE2E2"
        textColor: "#991B1B"
        fontWeight: bold

  - id: auto-generated
    name: purchase_order
    label: Purchase Order
    description: Reference PO number if applicable
    taxonType: STRING
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: |
      The customer's purchase order number that this invoice is billing against.
      Look for labels like "PO #", "PO Number", "Purchase Order", "P.O.", or "Reference".
      May not be present on all invoices.
    nullable: true

  - id: auto-generated
    name: payment_terms
    label: Payment Terms
    description: Payment terms and conditions
    taxonType: SELECTION
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: |
      The agreed payment terms for this invoice.
      Look for terms like "Net 30", "Due on Receipt", "COD", etc.
    selectionOptions:
      - label: "Due on Receipt"
        id: "due_on_receipt"
        description: "Payment due immediately upon receipt"
        lexicalRelations:
          - type: SYNONYM
            value: "Immediate, Upon Receipt, COD, Cash on Delivery"

      - label: "Net 10"
        id: "net_10"
        description: "Payment due within 10 days"
        lexicalRelations:
          - type: SYNONYM
            value: "10 days, Within 10 days"

      - label: "Net 30"
        id: "net_30"
        description: "Payment due within 30 days"
        lexicalRelations:
          - type: SYNONYM
            value: "30 days, Within 30 days"

      - label: "Net 60"
        id: "net_60"
        description: "Payment due within 60 days"
        lexicalRelations:
          - type: SYNONYM
            value: "60 days, Within 60 days"

      - label: "Net 90"
        id: "net_90"
        description: "Payment due within 90 days"
        lexicalRelations:
          - type: SYNONYM
            value: "90 days, Within 90 days"

  # ==========================================
  # Vendor Information
  # ==========================================
  - id: auto-generated
    name: vendor
    label: Vendor Information
    description: Details about the vendor/supplier
    enabled: true
    group: true
    children:
      - id: auto-generated
        name: name
        label: Vendor Name
        description: Legal business name of the vendor
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: |
          The vendor's legal business name.
          Look in the top portion of the invoice, near "From", "Vendor", "Supplier", or "Bill From".
          This should be the company name, not an individual's name.
        typeFeatures:
          expected: true
        validationRules:
          - name: Vendor name required
            ruleFormula: "NOT_EMPTY(vendor.name)"
            messageFormula: '"Vendor name is required"'
            overridable: false
            exceptionId: VENDOR_NAME_REQUIRED

      - id: auto-generated
        name: address
        label: Address
        description: Vendor's business address
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: |
          The vendor's complete business address including street, city, state/province, and postal code.
        typeFeatures:
          longText: true
          maxTextRows: 4

      - id: auto-generated
        name: tax_id
        label: Tax ID / VAT Number
        description: Vendor's tax identification number
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: |
          The vendor's tax identification number.
          In the US, look for "EIN", "Tax ID", or "Federal ID".
          In EU, look for "VAT Number", "VAT Reg", or "BTW".
        nullable: true

      - id: auto-generated
        name: email
        label: Email
        description: Vendor contact email
        taxonType: EMAIL_ADDRESS
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: "Vendor's email address for correspondence"
        nullable: true

      - id: auto-generated
        name: phone
        label: Phone Number
        description: Vendor contact phone number
        taxonType: PHONE_NUMBER
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: "Vendor's phone number for contact"
        nullable: true

  # ==========================================
  # Customer/Bill To Information
  # ==========================================
  - id: auto-generated
    name: customer
    label: Customer Information
    description: Details about the customer being billed
    enabled: true
    group: true
    children:
      - id: auto-generated
        name: name
        label: Customer Name
        description: Name of the customer/organization being billed
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: |
          The customer's name or business name.
          Look near "Bill To", "Customer", "Sold To", or "Invoice To".
        typeFeatures:
          expected: true

      - id: auto-generated
        name: address
        label: Billing Address
        description: Customer's billing address
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: "The complete billing address for the customer"
        typeFeatures:
          longText: true
          maxTextRows: 4

  # ==========================================
  # Line Items (Repeating Group)
  # ==========================================
  - id: auto-generated
    name: line_items
    label: Line Items
    description: Individual items or services being billed
    enabled: true
    group: true
    additionContexts:
      - type: RECORD_DEFINITION
        context: |
          Each line item represents a product or service being billed.
          Line items typically appear in a table with columns for description, quantity, unit price, and total.
      - type: RECORD_START_MARKER
        context: "Description, Item, Product"
      - type: RECORD_END_MARKER
        context: "Subtotal, Total"
    children:
      - id: auto-generated
        name: line_number
        label: Line Number
        description: Sequential line item number
        taxonType: NUMBER
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: "The line number or position of this item in the invoice"
        nullable: true

      - id: auto-generated
        name: description
        label: Description
        description: Description of the item or service
        taxonType: STRING
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: |
          Description of the product or service being billed.
          This may include product codes, part numbers, or detailed descriptions.
        typeFeatures:
          longText: true
          expected: true

      - id: auto-generated
        name: quantity
        label: Quantity
        description: Number of units
        taxonType: NUMBER
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: "The quantity or number of units for this line item"
        typeFeatures:
          expected: true
        validationRules:
          - name: Quantity must be positive
            ruleFormula: "line_items.quantity > 0"
            messageFormula: '"Quantity must be greater than zero"'
            overridable: false
            exceptionId: QTY_NOT_POSITIVE

      - id: auto-generated
        name: unit_price
        label: Unit Price
        description: Price per unit
        taxonType: CURRENCY
        valuePath: VALUE_OR_ALL_CONTENT
        enabled: true
        group: false
        semanticDefinition: "The price for one unit of this item"
        typeFeatures:
          expected: true

      - id: auto-generated
        name: line_total
        label: Line Total
        description: Total for this line (quantity × unit price)
        taxonType: CURRENCY
        valuePath: FORMULA
        enabled: true
        group: false
        semanticDefinition: "line_items.quantity * line_items.unit_price"
        validationRules:
          - name: Line total calculation check
            ruleFormula: "ABS(line_items.line_total - (line_items.quantity * line_items.unit_price)) < 0.01"
            messageFormula: '"Line total does not match quantity × unit price"'
            detailFormula: '"Expected: " + (line_items.quantity * line_items.unit_price) + ", Found: " + line_items.line_total'
            overridable: true
            exceptionId: LINE_TOTAL_MISMATCH

  # ==========================================
  # Totals and Amounts
  # ==========================================
  - id: auto-generated
    name: subtotal
    label: Subtotal
    description: Sum of all line items before tax
    taxonType: CURRENCY
    valuePath: FORMULA
    enabled: true
    group: false
    semanticDefinition: "SUM(line_items.line_total)"
    typeFeatures:
      expected: true

  - id: auto-generated
    name: tax_rate
    label: Tax Rate
    description: Applicable tax rate as percentage
    taxonType: PERCENTAGE
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: "The tax rate applied to this invoice, expressed as a percentage"
    nullable: true

  - id: auto-generated
    name: tax_amount
    label: Tax Amount
    description: Total tax amount
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: |
      The total tax amount charged.
      Look for labels like "Tax", "Sales Tax", "VAT", "GST", or "Tax Amount".
    nullable: true

  - id: auto-generated
    name: shipping_handling
    label: Shipping & Handling
    description: Shipping and handling charges
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: "Shipping and handling fees, if applicable"
    nullable: true
    nullValue: "0.00"

  - id: auto-generated
    name: discount_amount
    label: Discount Amount
    description: Total discount applied
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: "Any discounts applied to the invoice total"
    nullable: true
    nullValue: "0.00"

  - id: auto-generated
    name: total_amount
    label: Total Amount Due
    description: Final amount to be paid
    taxonType: CURRENCY
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    color: "#EF4444"
    semanticDefinition: |
      The final total amount due, including all taxes, fees, and discounts.
      This is the bottom-line number the customer must pay.
      Look for labels like "Total", "Total Due", "Amount Due", "Balance Due", or "Grand Total".
    typeFeatures:
      expected: true
      overrideWidth: true
      displayWidth: 150
    validationRules:
      - name: Total amount required
        ruleFormula: "NOT_EMPTY(total_amount)"
        messageFormula: '"Total amount is required"'
        overridable: false
        exceptionId: TOTAL_REQUIRED

      - name: Total calculation verification
        conditional: true
        conditionalFormula: "NOT_EMPTY(subtotal) AND NOT_EMPTY(tax_amount)"
        ruleFormula: |
          ABS(total_amount - (subtotal + COALESCE(tax_amount, 0) + COALESCE(shipping_handling, 0) - COALESCE(discount_amount, 0))) < 0.01
        messageFormula: |
          "Total amount does not match calculated total"
        detailFormula: |
          "Expected: " + (subtotal + COALESCE(tax_amount, 0) + COALESCE(shipping_handling, 0) - COALESCE(discount_amount, 0)) + ", Found: " + total_amount
        overridable: true
        exceptionId: TOTAL_CALC_MISMATCH

      - name: Unusually high amount warning
        ruleFormula: "total_amount <= 100000"
        messageFormula: '"Invoice total exceeds $100,000 - please verify accuracy"'
        overridable: true
        exceptionId: HIGH_AMOUNT_WARNING
    conditionalFormats:
      - name: High value invoice
        formula: "total_amount > 10000"
        backgroundColor: "#FEF3C7"
        textColor: "#92400E"
        icon: warning

  # ==========================================
  # Payment Information
  # ==========================================
  - id: auto-generated
    name: payment_status
    label: Payment Status
    description: Current payment status
    taxonType: SELECTION
    valuePath: REVIEW
    enabled: true
    group: false
    semanticDefinition: |
      Current payment status for this invoice
    selectionOptions:
      - label: "Pending"
        id: "pending"
        description: "Payment not yet received"

      - label: "Paid"
        id: "paid"
        description: "Payment received and processed"

      - label: "Overdue"
        id: "overdue"
        description: "Payment past due date"
        isConditional: true
        conditionalFormula: "due_date < TODAY()"

      - label: "Cancelled"
        id: "cancelled"
        description: "Invoice cancelled"

  - id: auto-generated
    name: notes
    label: Notes / Comments
    description: Additional notes or special instructions
    taxonType: STRING
    valuePath: VALUE_OR_ALL_CONTENT
    enabled: true
    group: false
    semanticDefinition: |
      Any special notes, instructions, or comments on the invoice.
      Look for sections labeled "Notes", "Comments", "Terms", or "Special Instructions".
    typeFeatures:
      longText: true
      maxTextRows: 6
      markdown: true
    nullable: true

Usage

Extracting Invoice Data

  1. Create or Update Data Definition: Add this YAML to the Kodexa platform as a new data definition, or use it to update an existing one
  2. Process Invoices: Upload invoice documents for extraction
  3. Review Results: Check the extracted fields and the outcome of each validation rule
  4. Handle Exceptions: Review and resolve any validation failures

Validation Rules

This data definition includes several validation rules:
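  • Required Fields: Invoice number, invoice date, vendor name, and total amount must be present
  • Date Logic: Invoice date cannot be in the future; due date, when present, must be on or after the invoice date
  • Calculations: Line totals are checked against quantity × unit price, and the invoice total against subtotal + tax + shipping - discount, each within a 0.01 tolerance
  • Business Rules: Line item quantities must be greater than zero; an overridable warning flags high-value invoices (threshold: $100,000)
  • Data Quality: Email and phone format validation through the EMAIL_ADDRESS and PHONE_NUMBER taxon types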
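
As a sketch of how a further rule could follow the same pattern, the fragment below adds a required-field check to the customer name taxon, which the definition above marks as expected but does not validate. It reuses only keys that already appear in this definition; the rule name, message text, and exceptionId value are illustrative assumptions.

```yaml
# Hypothetical addition under customer > children > name.
# Mirrors the existing "Vendor name required" rule; the exceptionId is made up.
validationRules:
  - name: Customer name required
    ruleFormula: "NOT_EMPTY(customer.name)"
    messageFormula: '"Customer name is required"'
    overridable: false
    exceptionId: CUSTOMER_NAME_REQUIRED
```

Setting overridable to false matches the other required-field rules in this definition.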

Customization

Adapt this data definition by:
  • Adding industry-specific fields (e.g., project codes for professional services)
  • Modifying validation thresholds (e.g., high-value amount limit)
  • Adding custom payment terms or statuses
  • Including additional vendor or customer fields
  • Adding currency-specific formatting rules
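
The fragments below sketch three of these customizations using only keys that already appear in the definition above. The project_code field, the "Net 45" option, and the 250000 threshold are hypothetical examples, not part of the original definition.

```yaml
# Fragment 1: a hypothetical industry-specific field, added as a new entry under taxons
- id: auto-generated
  name: project_code
  label: Project Code
  description: Project code for professional services billing
  taxonType: STRING
  valuePath: VALUE_OR_ALL_CONTENT
  enabled: true
  group: false
  semanticDefinition: |
    The project or engagement code this invoice is billed against.
    Look for labels like "Project", "Project Code", or "Job #".
  nullable: true
---
# Fragment 2: a hypothetical extra payment term, appended to payment_terms > selectionOptions
- label: "Net 45"
  id: "net_45"
  description: "Payment due within 45 days"
  lexicalRelations:
    - type: SYNONYM
      value: "45 days, Within 45 days"
---
# Fragment 3: the high-amount warning on total_amount with an assumed, raised threshold
- name: Unusually high amount warning
  ruleFormula: "total_amount <= 250000"
  messageFormula: '"Invoice total exceeds $250,000 - please verify accuracy"'
  overridable: true
  exceptionId: HIGH_AMOUNT_WARNING
```

Each fragment is meant to be merged into the matching list in the full definition rather than used on its own. When changing a threshold, keep the ruleFormula and the messageFormula text in step so the message reflects the actual limit.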

Next Steps