# Excel Document Processing

> Process, tag, and extract structured data from Excel spreadsheets in Kodexa using the same parse, tag, group, extract, validate, and review pipeline as PDFs.

Kodexa runs Excel files through the full extraction pipeline: the same parse, tag, group, extract, validate, and review workflow used for PDFs and other document types.

## How It Works

Excel processing follows the same two-layer architecture as PDF processing:

1. **Native layer** — The original `.xlsx` file is rendered directly in the workbook viewer, an OOXML-based canvas viewer that draws the native spreadsheet bytes for visual fidelity
2. **Content layer** — The parsed KDDB document contains a content node tree (`workbook → worksheet → row → cell`) that holds tags, data objects, and extraction results

Cell references (`A1`, `B2`, etc.) bridge the two layers, similar to how bounding boxes bridge PDF rendering and spatial content nodes.
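
To make the bridging role of cell references concrete, here is a minimal sketch that turns an A1-style reference into zero-based grid coordinates, the kind of lookup a viewer needs to place an overlay on a rendered cell. The `parse_ref` helper is hypothetical and not part of the Kodexa API.

```python
import re

# Hypothetical helper, not a Kodexa function: A1-style reference -> grid indexes.
_REF = re.compile(r"^([A-Z]+)([1-9][0-9]*)$")


def parse_ref(ref: str) -> tuple[int, int]:
    """Return (column_index, row_index), both zero-based, for a ref like 'B2'."""
    match = _REF.match(ref.upper())
    if match is None:
        raise ValueError(f"not an A1-style reference: {ref!r}")
    letters, digits = match.groups()
    column = 0
    for letter in letters:
        # Column letters are bijective base 26: A=1 ... Z=26, AA=27.
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return column - 1, int(digits) - 1


print(parse_ref("B2"))    # (1, 1)
print(parse_ref("AA10"))  # (26, 9)
```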

```mermaid
graph LR
    A[".xlsx file"] -->|"Excel Parser"| B["KDDB with workbook mixin"]
    B -->|"Content nodes"| C["Tag → Group → Extract"]
    A -->|"Native rendering"| D["Workbook Viewer"]
    C -->|"Tag overlays"| D
```

## The Workbook Viewer

Excel files open in the **workbook viewer** — a canvas-based OOXML viewer that renders the native `.xlsx` bytes directly, so the spreadsheet you see matches the source file's fonts, column widths, and cell formatting. Tags and extraction results are drawn as overlays on top of the rendered grid.

The viewer is read-only for cell values — you view the spreadsheet and tag cells for extraction, but you don't edit the underlying data.

### Navigating the workbook

* **Sheet tabs** — When a workbook has more than one worksheet, each renders as a tab in source order; click a tab to switch sheets. Single-sheet workbooks show no tab bar.
* **Zoom** — Zoom in and out from the toolbar. Zoom steps by a factor of 1.25 each click and is clamped between **25%** and **400%**; the toolbar shows the current percentage. A reset control returns the view to **100%**.
* **Column and row resizing** — Drag a column or row header boundary to resize it. Double-click a boundary to auto-fit: columns size to the widest cell value on that sheet (capped at 800px) and rows size to a single line of the default font. Resizes are **session-only** — reloading the document re-parses the original widths from the file.
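
The zoom behavior described above can be sketched as a multiply-and-clamp rule. This is an illustration only: the function and constant names are assumptions, and the viewer's actual implementation may round or display values differently.

```python
# Values taken from the description above; the names are illustrative.
ZOOM_STEP = 1.25
ZOOM_MIN = 0.25   # 25%
ZOOM_MAX = 4.0    # 400%


def zoom_in(zoom: float) -> float:
    """One click in: multiply by the step, never exceeding the maximum."""
    return min(zoom * ZOOM_STEP, ZOOM_MAX)


def zoom_out(zoom: float) -> float:
    """One click out: divide by the step, never dropping below the minimum."""
    return max(zoom / ZOOM_STEP, ZOOM_MIN)


def reset_zoom() -> float:
    """The reset control returns to 100%."""
    return 1.0


zoom = reset_zoom()
for _ in range(10):
    zoom = zoom_in(zoom)
print(f"{zoom:.0%}")  # 400%
```

Under this rule, repeated clicks from 100% pass through 125%, 156.25%, and so on, then settle on the 400% ceiling instead of overshooting it.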

### Finding and copying cell content

* **Find** — The search box matches a case-insensitive substring against every cell value across **all sheets**. Press Enter (or use next/previous) to jump between matches, switching sheets automatically as needed. The current match is highlighted in amber with an outline; other matches use a lighter amber tint.
* **Copy** — Select a cell or drag-select a range and press <kbd>Ctrl</kbd>/<kbd>Cmd</kbd>+<kbd>C</kbd> to copy the selection as tab-separated values (tabs between columns, newlines between rows). Off-screen cells inside the selected range are included in the copy.
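
As a rough model of the find and copy behavior above, the sketch below searches every sheet with a case-insensitive substring match and serializes a rectangular range as tab-separated values. The in-memory `Workbook` shape and both function names are assumptions made for illustration; they do not reflect the viewer's internals.

```python
from typing import Iterator

# Assumed shape: sheet name -> rows -> cell values as display strings.
Workbook = dict[str, list[list[str]]]


def find_matches(workbook: Workbook, query: str) -> Iterator[tuple[str, int, int]]:
    """Yield (sheet, row, column) for each cell containing the query, ignoring case."""
    needle = query.casefold()
    if not needle:
        return  # an empty query matches nothing in this sketch
    for sheet, rows in workbook.items():
        for r, row in enumerate(rows):
            for c, value in enumerate(row):
                if needle in value.casefold():
                    yield sheet, r, c


def copy_range(rows: list[list[str]], top: int, left: int, bottom: int, right: int) -> str:
    """Serialize an inclusive range: tabs between columns, newlines between rows."""
    return "\n".join("\t".join(row[left:right + 1]) for row in rows[top:bottom + 1])


workbook = {
    "Income Statement": [["Revenue", "1,234.00"], ["Expenses", "567.89"]],
    "Notes": [["Total revenue restated"]],
}
print(list(find_matches(workbook, "REVENUE")))
print(copy_range(workbook["Income Statement"], 0, 0, 1, 1))
```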

### Tag highlights and linking

* **Tag overlays** — Tagged cells are highlighted with their taxon's color, resolved from the taxonomy's tag metadata. The focused tag is drawn with a stronger fill and a 2px outline; other tagged cells use a lighter tint of the same color. Toggle overlays on and off with the **highlights** button in the toolbar (on by default).
* **Selection** — The active cell or range is drawn as a translucent blue rectangle.
* **Linking cells** — In a data-form linking workflow, left-click a cell to focus and link it to the focused attribute; drag to link a range. Right-click a cell to open the tag popup for tagging and other cell actions.

## Uploading Excel Files

Upload Excel files to a document store just like any other document. Supported formats:

* `.xlsx` — Native support
* `.xlsm` — Native support (macro-enabled workbooks)
* `.xls`, `.ods`, and other formats — Automatically converted via LibreOffice
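
If you pre-sort files before upload, the format list above reduces to a simple extension check. The sketch below is a client-side illustration with a hypothetical `needs_conversion` helper; the conversion itself happens inside Kodexa and is not shown.

```python
from pathlib import Path

# Extensions the list above describes as natively supported.
NATIVE_EXTENSIONS = {".xlsx", ".xlsm"}


def needs_conversion(filename: str) -> bool:
    """Return True when a file falls outside the natively supported formats."""
    return Path(filename).suffix.lower() not in NATIVE_EXTENSIONS


for name in ["report.xlsx", "macros.XLSM", "legacy.xls", "budget.ods"]:
    route = "converted first" if needs_conversion(name) else "parsed natively"
    print(f"{name}: {route}")
```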

## The Workbook Content Structure

After parsing, the KDDB contains a content node tree that mirrors the workbook:

```
workbook
  └── worksheet ("Income Statement")
      ├── row
      │   ├── cell: "Revenue"    (ref: A1, sheet: Income Statement)
      │   └── cell: "1,234.00"   (ref: B1, sheet: Income Statement)
      └── row
          ├── cell: "Expenses"   (ref: A2, sheet: Income Statement)
          └── cell: "567.89"     (ref: B2, sheet: Income Statement)
```

Each cell node carries features:

| Feature            | Example            | Description                       |
| ------------------ | ------------------ | --------------------------------- |
| `workbook:ref`     | `B2`               | Cell reference (column + row)     |
| `workbook:sheet`   | `Income Statement` | Parent worksheet name             |
| `workbook:formula` | `=SUM(B2:B10)`     | Formula (if present)              |
| `workbook:merge`   | `A1:D1`            | Merged range (top-left cell only) |
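
Because `workbook:merge` is stored only on the top-left cell, code that needs every covered cell has to expand the range itself. The sketch below models a cell as a plain dataclass holding the feature names from the table and expands a merged range into individual references. The `Cell` class and both helpers are illustrative assumptions, not the Kodexa content node API.

```python
import re
from dataclasses import dataclass, field

_REF = re.compile(r"^([A-Z]+)([1-9][0-9]*)$")


@dataclass
class Cell:
    """Illustrative stand-in for a cell node: text plus its feature values."""
    content: str
    features: dict[str, str] = field(default_factory=dict)


def _split(ref: str) -> tuple[int, int]:
    """Split 'B2' into one-based (column_number, row_number)."""
    match = _REF.match(ref)
    if match is None:
        raise ValueError(f"not an A1-style reference: {ref!r}")
    letters, digits = match.groups()
    number = 0
    for letter in letters:
        number = number * 26 + ord(letter) - ord("A") + 1
    return number, int(digits)


def column_letters(number: int) -> str:
    """Convert a one-based column number back to letters (27 -> 'AA')."""
    letters = ""
    while number > 0:
        number, remainder = divmod(number - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def merged_refs(merge_range: str) -> list[str]:
    """List every reference covered by a range such as 'A1:D1', row by row."""
    start, _, end = merge_range.partition(":")
    first_col, first_row = _split(start)
    last_col, last_row = _split(end or start)
    return [
        f"{column_letters(col)}{row}"
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


title = Cell(
    content="Income Statement",
    features={"workbook:ref": "A1", "workbook:sheet": "Income Statement", "workbook:merge": "A1:D1"},
)
print(merged_refs(title.features["workbook:merge"]))  # ['A1', 'B1', 'C1', 'D1']
```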

## Tagging and Extraction

### Data-form-driven workflow

The recommended workflow for Excel extraction is **data-form-driven**:

1. Define a **data definition** (taxonomy) for the data you want to extract
2. Open the Excel file in the workspace; it renders in the workbook viewer
3. The **data form** on the left shows the data objects and attributes from your taxonomy
4. Focus an attribute in the data form (e.g., "Revenue Q1")
5. Click or drag-select cells in the spreadsheet to link them to that attribute
6. Tagged cells highlight with the taxon's color
7. Repeat for all attributes
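
The focus-then-link loop in steps 4 to 7 can be pictured as a small state machine: one attribute is focused at a time, and each click or drag attaches cells to it. The `LinkingSession` class below is a hypothetical model for illustration, and the `Sheet!Ref` key format is an assumption rather than how Kodexa stores links.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LinkingSession:
    """Hypothetical model of the data-form linking loop."""
    focused: Optional[str] = None
    links: dict[str, list[str]] = field(default_factory=dict)

    def focus(self, attribute: str) -> None:
        """Step 4: focus an attribute in the data form."""
        self.focused = attribute

    def link(self, sheet: str, refs: list[str]) -> None:
        """Step 5: attach the clicked or drag-selected cells to the focused attribute."""
        if self.focused is None:
            raise RuntimeError("focus an attribute before linking cells")
        target = self.links.setdefault(self.focused, [])
        for ref in refs:
            key = f"{sheet}!{ref}"  # assumed key format, for illustration only
            if key not in target:
                target.append(key)


session = LinkingSession()
session.focus("Revenue Q1")
session.link("Income Statement", ["B1"])
session.focus("Expenses Q1")
session.link("Income Statement", ["B2"])
print(session.links)
```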

### AI-assisted extraction

The LLM extraction engine works with workbook content the same way it works with spatial content. The AI reads the cell content and structure, then automatically tags cells based on your data definition. You review and correct the results in the same data-form-driven workflow.

## Selectors for Workbook Content

Query workbook content using the standard selector language:

```
//cell                                            # All cells
//cell[hasFeatureValue('workbook', 'ref', 'A1')]  # A specific cell
//cell[contains(@content, 'Total')]               # Cells containing text
//worksheet[contains(@content, 'Income')]//cell   # Cells in a specific sheet
//*[hasTag('revenue/total')]                      # Tagged cells
```
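
To clarify what these predicates select, the sketch below mirrors three of them as plain Python filters over an in-memory list of cells. It does not call the Kodexa selector engine. The dictionary shape, the function names, the `revenue/total` tag placement, and the case-sensitive `contains` behavior are all assumptions made for illustration.

```python
# Illustrative cells based on the example tree; the tag on B1 is made up.
cells = [
    {"content": "Revenue", "ref": "A1", "sheet": "Income Statement", "tags": []},
    {"content": "1,234.00", "ref": "B1", "sheet": "Income Statement", "tags": ["revenue/total"]},
    {"content": "Expenses", "ref": "A2", "sheet": "Income Statement", "tags": []},
    {"content": "567.89", "ref": "B2", "sheet": "Income Statement", "tags": []},
]


def by_ref(cells: list[dict], ref: str) -> list[dict]:
    """Mirror of //cell[hasFeatureValue('workbook', 'ref', ref)]."""
    return [cell for cell in cells if cell["ref"] == ref]


def containing(cells: list[dict], text: str) -> list[dict]:
    """Mirror of //cell[contains(@content, text)], assumed case-sensitive."""
    return [cell for cell in cells if text in cell["content"]]


def tagged(cells: list[dict], tag: str) -> list[dict]:
    """Mirror of //*[hasTag(tag)], restricted here to cells."""
    return [cell for cell in cells if tag in cell["tags"]]


print([cell["ref"] for cell in by_ref(cells, "A1")])
print([cell["ref"] for cell in containing(cells, "Expenses")])
print([cell["ref"] for cell in tagged(cells, "revenue/total")])
```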

## What's Next

<CardGroup cols={2}>
  <Card title="Content Structures" icon="table-cells" href="/guides/kodexa-document/content-structures">
    Learn about the workbook mixin and other content structures.
  </Card>

  <Card title="Data Definitions" icon="list-tree" href="/guides/data-definitions/index">
    Define taxonomies to extract structured data from spreadsheets.
  </Card>
</CardGroup>
