Extracting Data from Documents - Kodexa Developer Portal

Data forms render alongside the spatial document viewer, creating a connected workflow where users can select text in the document and extract values into form fields. Two extraction mechanisms are available: direct extract for verbatim text copying, and AI extraction for LLM-powered multi-field inference.

Direct Extract

Direct extract copies selected text from the document viewer into a form field verbatim, applying automatic type conversion (e.g., parsing a date string into a date value). It is the simplest extraction method and is best suited for fields where the document text exactly matches the desired attribute value. Enable direct extract by setting allowDirectExtract: true in the editorOptions on a v2:attributeEditor:

{
  "component": "v2:attributeEditor",
  "props": {
    "tagPath": "Invoice/InvoiceNumber",
    "editorOptions": {
      "allowDirectExtract": true
    }
  }
}

When the user selects text in the document viewer, a copy icon appears on the field. Clicking it tags the selected text region and sets the attribute value directly from the selection. When the field is empty, the editor also displays the placeholder “Select text in document to copy value” to guide the user toward the extraction workflow. Use direct extract for simple corrections, manual data entry, and any field where the exact document text is the intended value.

AI Extraction on Attribute Editors

AI extraction sends the page text and the user’s selected text to an LLM, which extracts values for multiple target fields simultaneously. This is useful when a single text selection contains information for several related fields. For example, selecting an invoice header block can populate the invoice number, date, and vendor name at once. Enable AI extraction by adding an aiExtraction object to editorOptions:

{
  "component": "v2:attributeEditor",
  "props": {
    "tagPath": "Invoice/InvoiceNumber",
    "editorOptions": {
      "aiExtraction": {
        "prompt": "Extract the invoice header fields from the selected text.",
        "modelType": "SMALL",
        "targetPaths": [
          { "tagPath": "Invoice/InvoiceNumber", "description": "The invoice or reference number" },
          { "tagPath": "Invoice/InvoiceDate", "description": "The date the invoice was issued" },
          { "tagPath": "Invoice/VendorName", "description": "The name of the vendor or supplier" }
        ]
      }
    }
  }
}

The AIExtractionConfig object accepts the following properties:

| Property | Type | Description |
| --- | --- | --- |
| `prompt` | `string` | Inline prompt text sent to the LLM. Mutually exclusive with `promptRef`. |
| `promptRef` | `string` | Reference to a stored prompt template (e.g., `"acme/extract-invoice"`). Mutually exclusive with `prompt`. |
| `modelType` | `"SMALL" \| "LARGE"` | Model size. `SMALL` is faster and cheaper; `LARGE` is more capable. Defaults to `SMALL`. |
| `targetPaths` | `AIExtractionTarget[]` | The fields the LLM should populate. Each entry specifies a `tagPath` and an optional `description` to help the model understand what to extract. |
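The property table can be read as a TypeScript shape. The sketch below is an illustration only: the interface bodies are inferred from the table, and `validateAIExtractionConfig` and `resolveModelType` are hypothetical helper names, not part of the Kodexa API. It assumes `targetPaths` is required on an attribute editor and does not decide whether omitting both `prompt` and `promptRef` is an error, since the table does not say.

```typescript
// Hypothetical shapes inferred from the property table; the platform's
// actual type declarations may differ.
type ModelType = "SMALL" | "LARGE";

interface AIExtractionTarget {
  tagPath: string;
  description?: string; // optional hint that helps the model
}

interface AIExtractionConfig {
  prompt?: string;    // inline prompt text
  promptRef?: string; // reference to a stored prompt template
  modelType?: ModelType;
  targetPaths: AIExtractionTarget[]; // assumed required at attribute level
}

// Hypothetical helper: returns a list of problems found in a config.
// It only checks rules stated in the table (plus non-empty tag paths).
function validateAIExtractionConfig(config: AIExtractionConfig): string[] {
  const errors: string[] = [];
  if (config.prompt !== undefined && config.promptRef !== undefined) {
    errors.push("prompt and promptRef are mutually exclusive");
  }
  config.targetPaths.forEach((target, index) => {
    if (target.tagPath.trim() === "") {
      errors.push(`targetPaths[${index}] is missing a tagPath`);
    }
  });
  return errors;
}

// Hypothetical helper: applies the documented default of SMALL.
function resolveModelType(config: AIExtractionConfig): ModelType {
  return config.modelType ?? "SMALL";
}
```

A check like this could run when a form definition is saved, so that a config carrying both `prompt` and `promptRef` is caught before any LLM call is made.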

When aiExtraction is configured, showAddFromSelection is implied: the editor displays a sparkle button whenever the user has an active text selection in the document viewer. Clicking the button triggers the LLM call, and the returned values are written into each target field. An empty field shows the placeholder “Select text in document, then click to extract” by default.

AI Extraction on Grids

Grids support their own AI extraction configuration for tabular and repeating data. When configured on a v2:grid, an “AI Extract” button appears in the grid toolbar. Clicking it sends the page text and selection to an LLM, which extracts multiple rows and creates a data object for each one.

{
  "component": "v2:grid",
  "props": {
    "groupTaxon": "Invoice/LineItems",
    "aiExtraction": {
      "promptRef": "acme/extract-line-items",
      "modelType": "SMALL",
      "targetPaths": [
        { "tagPath": "Invoice/LineItems/Description", "description": "Line item description" },
        { "tagPath": "Invoice/LineItems/Quantity", "description": "Quantity ordered" },
        { "tagPath": "Invoice/LineItems/UnitPrice", "description": "Price per unit" },
        { "tagPath": "Invoice/LineItems/Amount", "description": "Total line amount" }
      ]
    }
  }
}

The AIGridExtractionConfig shares the same prompt, promptRef, and modelType properties as the attribute-level config. The key difference is targetPaths, which is optional on a grid. When it is omitted, the grid automatically derives targets from all enabled, non-group children of the grid’s taxon, so you only need to specify targetPaths to limit or annotate the extracted fields.
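The fallback described above can be sketched as a small function. This is not the grid's actual implementation: `TaxonSketch`, `GridExtractionTarget`, and `resolveGridTargets` are hypothetical names, the taxon shape is reduced to the fields the rule mentions, and the sketch assumes that only direct children are considered and that only an absent `targetPaths` (not an empty array) triggers derivation.

```typescript
// Hypothetical, simplified taxon shape; the real taxonomy model has more fields.
interface TaxonSketch {
  path: string;
  enabled: boolean;
  group: boolean;
  description?: string;
  children?: TaxonSketch[];
}

interface GridExtractionTarget {
  tagPath: string;
  description?: string;
}

// Returns the explicit targets when given; otherwise derives one target per
// enabled, non-group direct child of the grid's taxon.
function resolveGridTargets(
  gridTaxon: TaxonSketch,
  targetPaths?: GridExtractionTarget[],
): GridExtractionTarget[] {
  if (targetPaths !== undefined) {
    return targetPaths;
  }
  return (gridTaxon.children ?? [])
    .filter((child) => child.enabled && !child.group)
    .map((child) =>
      child.description === undefined
        ? { tagPath: child.path }
        : { tagPath: child.path, description: child.description },
    );
}
```

Under these assumptions, a disabled child or a nested group under `Invoice/LineItems` would be skipped, while passing an explicit `targetPaths` array bypasses derivation entirely.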

Choosing an Extraction Method

| | Direct Extract | AI Extraction |
| --- | --- | --- |
| Mechanism | Verbatim text copy | LLM inference |
| Fields | Single field | Multiple fields at once |
| Accuracy | Exact match | Interpreted by model |
| Speed | Instant | Network round-trip |
| Best for | Simple values, corrections | Complex or multi-field extraction |

allowDirectExtract and aiExtraction are mutually exclusive on a given attribute editor: configure one or the other, not both. Direct extract is designed for single-field verbatim copying, while AI extraction handles multi-field inference, and combining them on the same editor is not supported.
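One way to catch a conflicting configuration early is a small check over `editorOptions`. The sketch below is an assumption-laden illustration: `EditorOptionsSketch` and `extractionMode` are hypothetical names, and how the platform itself reacts to a config that sets both options is not documented here.

```typescript
// Hypothetical subset of editorOptions; only the two extraction settings are modeled.
interface EditorOptionsSketch {
  allowDirectExtract?: boolean;
  aiExtraction?: Record<string, unknown>; // stand-in for the AI extraction config
}

type ExtractionMode = "direct" | "ai" | "none";

// Reports which extraction mechanism an editor is configured for and
// rejects configs that enable both on the same editor.
function extractionMode(options: EditorOptionsSketch): ExtractionMode {
  const direct = options.allowDirectExtract === true;
  const ai = options.aiExtraction !== undefined;
  if (direct && ai) {
    throw new Error(
      "allowDirectExtract and aiExtraction are mutually exclusive on one attribute editor",
    );
  }
  if (direct) return "direct";
  if (ai) return "ai";
  return "none";
}
```

Throwing here is a design choice for the sketch; a form linter could just as well collect the problem as a warning and continue.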

For a complete reference of v2:attributeEditor and v2:grid props, see Data Components.
