Kodexa supports full extraction pipeline processing for Excel files — the same parse, tag, group, extract, validate, and review workflow that works for PDFs and other document types.

How It Works

Excel processing follows the same two-layer architecture as PDF processing:
  1. Native layer — The original .xlsx file is rendered in a spreadsheet viewer for visual fidelity
  2. Content layer — The parsed KDDB document contains a content node tree (workbook → worksheet → row → cell) that holds tags, data objects, and extraction results
Cell references (A1, B2, etc.) bridge the two layers, similar to how bounding boxes bridge PDF rendering and spatial content nodes.
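
To make the bridging role of cell references concrete, here is a minimal sketch of turning an A1-style reference into zero-based row and column indices, which is the kind of lookup a viewer would need to locate a content node's cell on the grid. The helper name `parse_cell_ref` is hypothetical and is not part of any Kodexa API.

```python
import re

# Hypothetical helper, not a Kodexa API: column letters followed by a 1-based row.
_REF_PATTERN = re.compile(r"^([A-Z]+)([1-9][0-9]*)$")


def parse_cell_ref(ref: str) -> tuple[int, int]:
    """Return zero-based (row, column) indices for an A1-style reference."""
    match = _REF_PATTERN.match(ref.upper())
    if match is None:
        raise ValueError(f"not an A1-style reference: {ref!r}")
    letters, digits = match.groups()
    column = 0
    for letter in letters:
        # Column letters form a bijective base-26 number: A=1 ... Z=26, AA=27.
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return int(digits) - 1, column - 1


row, column = parse_cell_ref("B2")
```

Here `parse_cell_ref("B2")` yields row index 1 and column index 1, the second row and second column of the sheet.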

Uploading Excel Files

Upload Excel files to a document store just like any other document. Supported formats:
  • .xlsx — Native support
  • .xlsm — Native support (macro-enabled workbooks)
  • .xls, .ods, and other formats — Automatically converted via LibreOffice
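
As a rough illustration of the split above, the following sketch classifies a filename by extension before upload. It is a client-side convenience check under the assumption that only .xlsx and .xlsm are handled natively; the function name `needs_conversion` is hypothetical, and the platform performs the actual conversion itself.

```python
from pathlib import Path

# Extensions the text lists as natively supported.
NATIVE_EXTENSIONS = {".xlsx", ".xlsm"}


def needs_conversion(filename: str) -> bool:
    """Return True when the file would go through conversion rather than native parsing."""
    # Hypothetical helper: compares the lowercased suffix against the native set.
    return Path(filename).suffix.lower() not in NATIVE_EXTENSIONS
```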

The Workbook Content Structure

After parsing, the KDDB contains a content node tree that mirrors the workbook:
workbook
  └── worksheet ("Income Statement")
      ├── row
      │   ├── cell: "Revenue"    (ref: A1, sheet: Income Statement)
      │   └── cell: "1,234.00"   (ref: B1, sheet: Income Statement)
      └── row
          ├── cell: "Expenses"   (ref: A2, sheet: Income Statement)
          └── cell: "567.89"     (ref: B2, sheet: Income Statement)
Each cell node carries features:
Feature            Example            Description
workbook:ref       B2                 Cell reference (column + row)
workbook:sheet     Income Statement   Parent worksheet name
workbook:formula   =SUM(B2:B10)       Formula (if present)
workbook:merge     A1:D1              Merged range (top-left cell only)
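
The tree and its features can be modelled in a few lines of plain Python. This is only a sketch for reasoning about the structure: the `Node` class, the `cell` helper, and `iter_cells` are hypothetical stand-ins rather than the KDDB or SDK types, and storing the sheet name as the worksheet node's content is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a content node (not the real KDDB type)."""
    node_type: str
    content: str = ""
    features: dict[str, str] = field(default_factory=dict)
    children: list["Node"] = field(default_factory=list)


def cell(content: str, ref: str, sheet: str) -> Node:
    # Every cell carries its reference and parent sheet name as features.
    return Node("cell", content, {"workbook:ref": ref, "workbook:sheet": sheet})


SHEET = "Income Statement"
workbook = Node("workbook", children=[
    # Assumption: the worksheet node's content holds the sheet name.
    Node("worksheet", SHEET, children=[
        Node("row", children=[cell("Revenue", "A1", SHEET), cell("1,234.00", "B1", SHEET)]),
        Node("row", children=[cell("Expenses", "A2", SHEET), cell("567.89", "B2", SHEET)]),
    ]),
])


def iter_cells(node: Node):
    """Yield every cell node in document order (depth-first)."""
    if node.node_type == "cell":
        yield node
    for child in node.children:
        yield from iter_cells(child)


refs = [c.features["workbook:ref"] for c in iter_cells(workbook)]
```

Walking the tree depth-first visits cells row by row, so `refs` lists the references in the same order they appear in the sheet.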

Tagging and Extraction

Data-form-driven workflow

The recommended workflow for Excel extraction is data-form-driven:
  1. Define a data definition (taxonomy) for the data you want to extract
  2. Open the Excel file in the workspace — it renders in the spreadsheet viewer
  3. The data form on the left shows the data objects and attributes from your taxonomy
  4. Focus an attribute in the data form (e.g., “Revenue Q1”)
  5. Click or drag-select cells in the spreadsheet to link them to that attribute
  6. Tagged cells highlight with the taxon’s color
  7. Repeat for all attributes
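
Conceptually, steps 4 to 6 build a mapping from an attribute to the cells selected for it. The sketch below expands a drag-selected range into individual cell references and records them against an attribute. All names here (`expand_range`, `links`, and the attribute path `revenue/q1`) are hypothetical illustrations, not how the workspace stores tags.

```python
import re

_RANGE = re.compile(r"^([A-Z]+)([1-9][0-9]*):([A-Z]+)([1-9][0-9]*)$")


def column_number(letters: str) -> int:
    """Convert column letters to a 1-based number (A=1, Z=26, AA=27)."""
    number = 0
    for letter in letters:
        number = number * 26 + (ord(letter) - ord("A") + 1)
    return number


def column_letters(number: int) -> str:
    """Convert a 1-based column number back to letters."""
    letters = ""
    while number > 0:
        number, remainder = divmod(number - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def expand_range(cell_range: str) -> list[str]:
    """List every cell reference in a range such as 'B2:C3', row by row."""
    match = _RANGE.match(cell_range.upper())
    if match is None:
        raise ValueError(f"not an A1-style range: {cell_range!r}")
    start_col, start_row, end_col, end_row = match.groups()
    # Sorting makes the result independent of the drag direction.
    first_col, last_col = sorted((column_number(start_col), column_number(end_col)))
    first_row, last_row = sorted((int(start_row), int(end_row)))
    return [
        f"{column_letters(col)}{row}"
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# Hypothetical record of which cells are linked to which attribute.
links: dict[str, list[str]] = {}
links.setdefault("revenue/q1", []).extend(expand_range("B2:C3"))
```

After the last line, `links["revenue/q1"]` holds B2, C2, B3, and C3.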

AI-assisted extraction

The LLM extraction engine works with workbook content the same way it works with spatial content. The AI reads the cell content and structure, then automatically tags cells based on your data definition. You review and correct the results in the same data-form-driven workflow.

Selectors for Workbook Content

Query workbook content using the standard selector language:
//cell                                                    # All cells
//cell[hasFeatureValue('workbook', 'ref', 'A1')]         # Specific cell
//cell[contains(@content, 'Total')]                       # Cells containing text
//worksheet[contains(@content, 'Income')]//cell          # Cells in specific sheet
//*[hasTag('revenue/total')]                              # Tagged cells
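
One way to read these selectors is as filters over the cell nodes. The sketch below mirrors each selector with a plain-Python list comprehension over a hypothetical flat list of cells. It does not call the selector engine, and details such as case-sensitive matching are assumptions.

```python
# Hypothetical flat view of cell nodes; the real nodes live in the KDDB tree.
cells = [
    {"content": "Revenue", "ref": "A1", "sheet": "Income Statement", "tags": set()},
    {"content": "1,234.00", "ref": "B1", "sheet": "Income Statement", "tags": {"revenue/total"}},
    {"content": "Total Assets", "ref": "A1", "sheet": "Balance Sheet", "tags": set()},
]

# //cell
all_cells = list(cells)

# //cell[hasFeatureValue('workbook', 'ref', 'A1')]
cell_a1 = [c for c in cells if c["ref"] == "A1"]

# //cell[contains(@content, 'Total')]
with_total = [c for c in cells if "Total" in c["content"]]

# //worksheet[contains(@content, 'Income')]//cell
income_cells = [c for c in cells if "Income" in c["sheet"]]

# //*[hasTag('revenue/total')]  (restricted to cells in this sketch)
tagged = [c for c in cells if "revenue/total" in c["tags"]]
```

In this sample data the A1 filter returns two cells, one per sheet, which suggests that a reference-based selector may need to be combined with a worksheet filter when a workbook has several sheets.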

What’s Next

Content Structures

Learn about the workbook mixin and other content structures.

Data Definitions

Define taxonomies to extract structured data from spreadsheets.