Excel Document Processing

Kodexa supports full extraction pipeline processing for Excel files — the same parse, tag, group, extract, validate, and review workflow that works for PDFs and other document types.

How It Works

Excel processing follows the same two-layer architecture as PDF processing:

Native layer — The original .xlsx file is rendered directly in the workbook viewer, an OOXML-based canvas viewer that draws the native spreadsheet bytes for visual fidelity
Content layer — The parsed KDDB document contains a content node tree (workbook → worksheet → row → cell) that holds tags, data objects, and extraction results

Cell references (A1, B2, etc.) bridge the two layers, similar to how bounding boxes bridge PDF rendering and spatial content nodes.
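
To make the bridging role of cell references concrete, here is a minimal sketch of how an A1-style reference maps to zero-based grid coordinates. The helper name `parse_ref` is hypothetical and is not part of the Kodexa API; it only illustrates the standard column-letter arithmetic.

```python
import re

# Hypothetical helper for illustration only (not a Kodexa function).
_REF = re.compile(r"^([A-Z]+)([1-9][0-9]*)$")


def parse_ref(ref: str) -> tuple[int, int]:
    """Return (row_index, column_index), both zero-based, for a reference like 'B2'."""
    match = _REF.match(ref.upper())
    if match is None:
        raise ValueError(f"not an A1-style reference: {ref!r}")
    letters, digits = match.groups()
    column = 0
    for letter in letters:
        # Column letters are bijective base-26: A=1 ... Z=26, AA=27.
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return int(digits) - 1, column - 1


print(parse_ref("A1"))    # (0, 0)
print(parse_ref("B2"))    # (1, 1)
print(parse_ref("AA10"))  # (9, 26)
```

A viewer can use coordinates like these to locate the rectangle to highlight, while the content layer keeps the reference string itself on the cell node.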

The Workbook Viewer

Excel files open in the workbook viewer — a canvas-based OOXML viewer that renders the native .xlsx bytes directly, so the spreadsheet you see matches the source file’s fonts, column widths, and cell formatting. Tags and extraction results are drawn as overlays on top of the rendered grid. The viewer is read-only for cell values — you view the spreadsheet and tag cells for extraction, but you don’t edit the underlying data.

Navigating the workbook

Sheet tabs — When a workbook has more than one worksheet, each renders as a tab in source order; click a tab to switch sheets. Single-sheet workbooks show no tab bar.
Zoom — Zoom in and out from the toolbar. Zoom steps by a factor of 1.25 each click and is clamped between 25% and 400%; the toolbar shows the current percentage. A reset control returns the view to 100%.
Column and row resizing — Drag a column or row header boundary to resize it. Double-click a boundary to auto-fit: columns size to the widest cell value on that sheet (capped at 800px) and rows size to a single line of the default font. Resizes are session-only — reloading the document re-parses the original widths from the file.
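
The zoom behavior described above can be sketched as a multiply-and-clamp rule. This is an illustration under stated assumptions, not the viewer's actual code: the function names are hypothetical, and rounding the displayed percentage to a whole number is an assumption.

```python
ZOOM_STEP = 1.25  # factor applied per click, as described above
ZOOM_MIN = 0.25   # 25%
ZOOM_MAX = 4.00   # 400%


def zoom_in(level: float) -> float:
    """Multiply by the step factor, never exceeding the upper bound."""
    return min(level * ZOOM_STEP, ZOOM_MAX)


def zoom_out(level: float) -> float:
    """Divide by the step factor, never dropping below the lower bound."""
    return max(level / ZOOM_STEP, ZOOM_MIN)


def label(level: float) -> str:
    """Format the zoom level for display (whole-percent rounding is an assumption)."""
    return f"{round(level * 100)}%"


level = 1.0
for _ in range(3):
    level = zoom_in(level)
print(label(level))  # 195%
```

Because the step is multiplicative, three clicks in from 100% land on about 195% rather than a round number, and the clamp is reached on the seventh click in either direction.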

Finding and copying cell content

Find — The search box matches a case-insensitive substring against every cell value across all sheets. Press Enter (or use next/previous) to jump between matches, switching sheets automatically as needed. The current match is highlighted in amber with an outline; other matches use a lighter amber tint.
Copy — Select a cell or drag-select a range and press Ctrl/Cmd+C to copy the selection as tab-separated values (tabs between columns, newlines between rows). Off-screen cells inside the selected range are included in the copy.
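
As a rough illustration of the find and copy semantics above, the sketch below models a workbook as a plain dictionary of sheet name to rows of strings. The data model and the function names `find_matches` and `copy_range` are assumptions made for the example, not the viewer's internals.

```python
def find_matches(sheets: dict[str, list[list[str]]], query: str) -> list[tuple[str, int, int]]:
    """Return (sheet, row, column) for every cell whose value contains query, ignoring case."""
    needle = query.casefold()
    return [
        (name, r, c)
        for name, rows in sheets.items()
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
        if needle in value.casefold()
    ]


def copy_range(rows: list[list[str]], top: int, left: int, bottom: int, right: int) -> str:
    """Serialize an inclusive rectangular range: tabs between columns, newlines between rows."""
    return "\n".join("\t".join(row[left:right + 1]) for row in rows[top:bottom + 1])


sheets = {
    "Income Statement": [["Revenue", "1,234.00"], ["Expenses", "567.89"]],
    "Notes": [["Total revenue is unaudited", ""]],
}

print(find_matches(sheets, "revenue"))
# [('Income Statement', 0, 0), ('Notes', 0, 0)]
print(repr(copy_range(sheets["Income Statement"], 0, 0, 1, 1)))
# 'Revenue\t1,234.00\nExpenses\t567.89'
```

The match list spans both sheets, which is why stepping through results may switch tabs. The copy result pastes cleanly into another spreadsheet because tab-separated text is the common clipboard format for cell ranges.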

Tag highlights and linking

Tag overlays — Tagged cells are highlighted with their taxon’s color, resolved from the taxonomy’s tag metadata. The focused tag is drawn with a stronger fill and a 2px outline; other tagged cells use a lighter tint of the same color. Toggle overlays on and off with the highlights button in the toolbar (on by default).
Selection — The active cell or range is drawn as a translucent blue rectangle.
Linking cells — In a data-form linking workflow, left-click a cell to focus and link it to the focused attribute; drag to link a range. Right-click a cell to open the tag popup for tagging and other cell actions.

Uploading Excel Files

Upload Excel files to a document store just like any other document. Supported formats:

.xlsx — Native support
.xlsm — Native support (macro-enabled workbooks)
.xls, .ods, and other formats — Automatically converted via LibreOffice
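
If you pre-sort files before upload, the format list above can be expressed as a small lookup. This is only a sketch: `classify_upload` is a hypothetical helper, and because the text does not enumerate the "other formats", anything outside the four named extensions is reported as unknown rather than guessed.

```python
from pathlib import PurePath

NATIVE = {".xlsx", ".xlsm"}   # parsed directly
CONVERTED = {".xls", ".ods"}  # named above as converted via LibreOffice


def classify_upload(filename: str) -> str:
    """Return 'native', 'convert', or 'unknown' based on the file extension."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in NATIVE:
        return "native"
    if suffix in CONVERTED:
        return "convert"
    return "unknown"


for name in ("budget.xlsx", "Macros.XLSM", "legacy.xls", "notes.txt"):
    print(name, classify_upload(name))
```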

The Workbook Content Structure

After parsing, the KDDB contains a content node tree that mirrors the workbook:

workbook
  └── worksheet ("Income Statement")
      ├── row
      │   ├── cell: "Revenue"    (ref: A1, sheet: Income Statement)
      │   └── cell: "1,234.00"   (ref: B1, sheet: Income Statement)
      └── row
          ├── cell: "Expenses"   (ref: A2, sheet: Income Statement)
          └── cell: "567.89"     (ref: B2, sheet: Income Statement)

Each cell node carries features:

Feature	Example	Description
`workbook:ref`	`B2`	Cell reference (column + row)
`workbook:sheet`	`Income Statement`	Parent worksheet name
`workbook:formula`	`=SUM(B2:B10)`	Formula (if present)
`workbook:merge`	`A1:D1`	Merged range (top-left cell only)
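
The tree and feature table above can be modeled in a few lines of plain Python to show how the pieces relate. This is a simplified stand-in, not the KDDB or Kodexa SDK object model: the `Node` class and `cell` helper are hypothetical, features are flattened to `type:name` string keys, and storing the sheet name as the worksheet node's content is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified content node (illustrative only)."""
    node_type: str  # "workbook", "worksheet", "row", or "cell"
    content: str = ""
    features: dict[str, str] = field(default_factory=dict)
    children: list["Node"] = field(default_factory=list)

    def walk(self):
        """Yield this node and all descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def cell(content: str, ref: str, sheet: str) -> Node:
    """Build a cell node carrying the ref and sheet features from the table above."""
    return Node("cell", content, {"workbook:ref": ref, "workbook:sheet": sheet})


sheet = "Income Statement"
workbook = Node("workbook", children=[
    Node("worksheet", sheet, children=[
        Node("row", children=[cell("Revenue", "A1", sheet), cell("1,234.00", "B1", sheet)]),
        Node("row", children=[cell("Expenses", "A2", sheet), cell("567.89", "B2", sheet)]),
    ]),
])

# Index cells by (sheet, ref); the ref alone is not unique across worksheets.
index = {
    (n.features["workbook:sheet"], n.features["workbook:ref"]): n
    for n in workbook.walk()
    if n.node_type == "cell"
}
print(index[("Income Statement", "B2")].content)  # 567.89
```

Keying on the sheet name together with the reference matters because every worksheet has its own A1, so the reference on its own cannot identify a cell in a multi-sheet workbook.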

Tagging and Extraction

Data-form-driven workflow

The recommended workflow for Excel extraction is data-form-driven:

Define a data definition (taxonomy) for the data you want to extract
Open the Excel file in the workspace; it renders in the workbook viewer
The data form on the left shows the data objects and attributes from your taxonomy
Focus an attribute in the data form (e.g., “Revenue Q1”)
Click or drag-select cells in the spreadsheet to link them to that attribute
Tagged cells highlight with the taxon’s color
Repeat for all attributes

AI-assisted extraction

The LLM extraction engine works with workbook content the same way it works with spatial content. The AI reads the cell content and structure, then automatically tags cells based on your data definition. You review and correct the results in the same data-form-driven workflow.

Selectors for Workbook Content

Query workbook content using the standard selector language:

//cell                                                    # All cells
//cell[hasFeatureValue('workbook', 'ref', 'A1')]         # Specific cell
//cell[contains(@content, 'Total')]                       # Cells containing text
//worksheet[contains(@content, 'Income')]//cell          # Cells in specific sheet
//*[hasTag('revenue/total')]                              # Tagged cells

What’s Next

Content Structures

Data Definitions
