AWS Textract
Extracts data from forms and tables using OCR and machine learning
Slug: aws-textract-model
Version: 1.0.0
Infer: Yes
Overview
AWS Textract Model
The AWS Textract model extracts text, forms, and tables from documents using Amazon Textract, the OCR and machine learning service from AWS. Beyond plain text recognition, it detects structured data such as key-value pairs and tables, which makes it well suited to automating document-based workflows.
How It Works
- The model uploads your document to Amazon Textract
- Textract analyzes the document using specialized machine learning algorithms
- The model processes the results, including:
  - Detected text with positioning information
  - Form fields with key-value pairs
  - Table structures with row and column data
- Results are converted into a rich Kodexa document structure with spatial information
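The result-processing step can be illustrated with a minimal sketch that pairs form keys with their values. It assumes the raw response shape returned by Textract's `AnalyzeDocument` operation (a flat `Blocks` list linked by `Relationships`); the sample data and the helper names `related_ids`, `block_text`, and `extract_key_values` are hypothetical, and this is not the model's own conversion code.

```python
# Hand-built sample in the shape of a Textract AnalyzeDocument response.
SAMPLE_RESPONSE = {
    "Blocks": [
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:", "Confidence": 99.1},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane", "Confidence": 98.4},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe", "Confidence": 97.9},
        {
            "Id": "k1",
            "BlockType": "KEY_VALUE_SET",
            "EntityTypes": ["KEY"],
            "Relationships": [
                {"Type": "VALUE", "Ids": ["v1"]},
                {"Type": "CHILD", "Ids": ["w1"]},
            ],
        },
        {
            "Id": "v1",
            "BlockType": "KEY_VALUE_SET",
            "EntityTypes": ["VALUE"],
            "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}],
        },
    ]
}


def related_ids(block, rel_type):
    """Return the ids a block points to through one relationship type."""
    return [
        block_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == rel_type
        for block_id in rel["Ids"]
    ]


def block_text(block, by_id):
    """Join the text of the WORD children of a block."""
    children = (by_id[i] for i in related_ids(block, "CHILD"))
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def extract_key_values(response):
    """Map each KEY block's text to the text of its linked VALUE block(s)."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if is_key:
            values = [by_id[i] for i in related_ids(block, "VALUE")]
            pairs[block_text(block, by_id)] = " ".join(
                block_text(v, by_id) for v in values
            )
    return pairs


print(extract_key_values(SAMPLE_RESPONSE))
```

The lookup table `by_id` matters because Textract returns blocks as a flat list; the hierarchy exists only through the id references.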
Options Configuration
Option | Description |
---|---|
ignore_dash_lines | When enabled, removes dash-only lines from the extracted document structure |
apply_skew | When enabled, corrects for document skew in the text positioning calculations |
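As an illustration of what `ignore_dash_lines` describes, the sketch below drops lines made up only of dashes. The exact rule the model applies is not documented here, so the pattern `DASH_ONLY` and the function `drop_dash_lines` are assumptions for illustration only.

```python
import re

# Assumption: a "dash-only" line holds at least one dash character and
# otherwise nothing but dashes (hyphen, en dash, em dash) and whitespace.
DASH_ONLY = re.compile(r"^\s*[-\u2013\u2014][-\u2013\u2014\s]*$")


def drop_dash_lines(lines, ignore_dash_lines=False):
    """Return the lines unchanged, or without dash-only lines when enabled."""
    if not ignore_dash_lines:
        return list(lines)
    return [line for line in lines if not DASH_ONLY.match(line)]


lines = ["Total", "-----", "- - -", "42", "a-b"]
print(drop_dash_lines(lines, ignore_dash_lines=True))
```

Lines that merely contain a dash, such as `a-b`, are kept; only separator-style lines are removed.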
Document Enhancements
The model adds several enhancements to the processed document:
- Spatial Mixin: Every text element includes precise coordinates
- Bounding Boxes: All document elements contain bounding box coordinates
- Confidence Scores: Each extracted element includes a confidence score
- Handwriting Detection: Identifies handwritten text elements
- Form Field Recognition: Labels form elements as key-value pairs
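To make the spatial and confidence data more concrete, here is a small sketch that works on the raw Textract word shape, where bounding boxes are expressed as ratios of the page size, `Confidence` ranges from 0 to 100, and `TextType` is `PRINTED` or `HANDWRITING`. How these values are stored in the Kodexa document is not shown in this page, so `PixelBox`, `to_pixel_box`, and `confident_words` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class PixelBox:
    x0: float
    y0: float
    x1: float
    y1: float


def to_pixel_box(bbox, page_width, page_height):
    """Scale a ratio-based Textract BoundingBox to pixel coordinates."""
    left = bbox["Left"] * page_width
    top = bbox["Top"] * page_height
    return PixelBox(
        x0=left,
        y0=top,
        x1=left + bbox["Width"] * page_width,
        y1=top + bbox["Height"] * page_height,
    )


def is_handwritten(word):
    """True when Textract tagged the word as handwriting."""
    return word.get("TextType") == "HANDWRITING"


def confident_words(words, min_confidence=90.0):
    """Keep only words at or above the confidence threshold."""
    return [w for w in words if w.get("Confidence", 0.0) >= min_confidence]


word = {
    "Text": "Approved",
    "Confidence": 93.5,
    "TextType": "HANDWRITING",
    "Geometry": {
        "BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
    },
}
print(to_pixel_box(word["Geometry"]["BoundingBox"], 1000, 800))
```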
Extraction Capabilities
AWS Textract excels at extracting:
- Text Content: Words, lines, and paragraphs with positioning
- Form Fields: Automatically identifies key-value pairs in forms
- Tables: Detects tabular structures with row and column relationships
- Handwriting: Identifies and extracts handwritten text
- Document Layout: Preserves the visual structure of the document
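Table extraction can be sketched by rebuilding a grid from Textract's `TABLE` and `CELL` blocks, which carry 1-based `RowIndex` and `ColumnIndex` values. The sample blocks and the function `table_to_rows` are hypothetical, and the sketch ignores merged cells for brevity.

```python
# Hand-built sample: a 2x2 table whose bottom-right cell is empty.
BLOCKS = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Blue"},
    {"Id": "w4", "BlockType": "WORD", "Text": "pen"},
]


def child_ids(block):
    """Ids referenced by a block's CHILD relationships."""
    return [
        i
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for i in rel["Ids"]
    ]


def table_to_rows(table_block, by_id):
    """Rebuild a list of rows (lists of cell strings) from a TABLE block."""
    cells = [by_id[i] for i in child_ids(table_block)]
    cells = [c for c in cells if c["BlockType"] == "CELL"]
    if not cells:
        return []
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        words = [by_id[i] for i in child_ids(cell)]
        text = " ".join(w["Text"] for w in words if w["BlockType"] == "WORD")
        # Textract indices are 1-based; Python lists are 0-based.
        grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = text
    return grid


by_id = {b["Id"]: b for b in BLOCKS}
print(table_to_rows(by_id["t1"], by_id))
```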
Use Cases
This model is particularly useful for:
- Forms Processing: Extracting data from invoices, applications, and forms
- Table Extraction: Converting tabular information into structured data
- Document Digitization: Converting paper or image-based documents to digital formats
- Content Indexing: Making document content searchable and analyzable
- Form Field Automation: Identifying key-value pairs for automated data entry
AWS Integration Notes
- The model requires AWS credentials to be configured in your environment
- Processing occurs in your AWS account and may incur AWS Textract charges
- Processing time grows with document size and page count, so large documents take longer to return results
- The model supports most common document formats including PDF, PNG, JPEG, and TIFF
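For readers calling Textract directly, the notes above translate roughly into the following sketch. It uses the `boto3` Textract client's synchronous `analyze_document` call, which reads credentials from the environment or shared AWS configuration; whether the model uses this call internally is not stated here, and `SUPPORTED_SUFFIXES`, `is_supported`, and `analyze_with_textract` are hypothetical names. Multi-page documents generally go through Textract's asynchronous operations with the file stored in S3, which this sketch does not cover.

```python
from pathlib import Path

# Assumption: file formats are recognized by extension only.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def is_supported(path):
    """Check the file extension against the formats listed above."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


def analyze_with_textract(path):
    """Send one document to Textract and return the raw response.

    Requires boto3 and configured AWS credentials; each call may incur
    Textract charges in the calling account.
    """
    import boto3  # imported lazily so the format check works without it

    if not is_supported(path):
        raise ValueError(f"Unsupported document format: {path}")
    client = boto3.client("textract")
    with open(path, "rb") as handle:
        return client.analyze_document(
            Document={"Bytes": handle.read()},
            FeatureTypes=["FORMS", "TABLES"],
        )


print(is_supported("scan.PDF"), is_supported("notes.docx"))
```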
Additional Features
- Line Grouping: Intelligently groups text elements into lines
- Overlapping Text Handling: Resolves overlapping text elements
- Skew Correction: Optionally applies correction for skewed documents
- Configuration Options: The `ignore_dash_lines` and `apply_skew` options adjust how the extracted document structure is built
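Line grouping can be pictured with a simple heuristic: words whose top edges fall within a small tolerance of each other belong to the same line. This is only an illustrative sketch, not the model's actual grouping logic; the word dictionaries, the `tolerance` value, and `group_into_lines` are assumptions.

```python
def group_into_lines(words, tolerance=0.01):
    """Group words into text lines by the vertical position of their top edge.

    Each word is a dict with "text", "top", and "left", where positions are
    ratios of the page size. A word joins the current line when its top edge
    is within `tolerance` of the first word placed on that line.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, read the words from left to right.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["left"]))
        for line in lines
    ]


words = [
    {"text": "world", "top": 0.102, "left": 0.3},
    {"text": "Hello", "top": 0.100, "left": 0.1},
    {"text": "line", "top": 0.199, "left": 0.3},
    {"text": "Second", "top": 0.200, "left": 0.1},
]
print(group_into_lines(words))
```

A fixed tolerance works for level text; on skewed scans the top edges drift across the page, which is one reason skew correction is applied before positions are compared.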
Inference Options
The following options can be configured when using this model for inference:
Name | Label | Type | Description | Default | Required |
---|---|---|---|---|---|
ignore_dash_lines | Ignore Dash Line | boolean | Remove dash-only lines from the extracted document structure | False | No |
apply_skew | Apply Skew | boolean | Apply skew correction to the document | True | No |
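The defaults in the table can be expressed as a small options dictionary. The sketch below merges caller overrides onto those defaults and rejects unknown names or non-boolean values; how options are actually passed to the model depends on the platform, so `DEFAULT_OPTIONS` and `resolve_options` are hypothetical.

```python
# Defaults taken from the table above.
DEFAULT_OPTIONS = {
    "ignore_dash_lines": False,
    "apply_skew": True,
}


def resolve_options(overrides=None):
    """Merge overrides onto the defaults, validating names and types."""
    options = dict(DEFAULT_OPTIONS)
    for name, value in (overrides or {}).items():
        if name not in DEFAULT_OPTIONS:
            raise KeyError(f"Unknown option: {name}")
        if not isinstance(value, bool):
            raise TypeError(f"Option {name} must be a boolean")
        options[name] = value
    return options


print(resolve_options({"ignore_dash_lines": True}))
```

Since neither option is required, calling `resolve_options()` with no arguments yields the documented defaults.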
Model Details
- Provider: Amazon Web Services