Excel Parser
Creates a Kodexa Document from the Excel file
Slug: excel-parser
Version: 1.0.0
Infer: Yes
Overview
Excel Parser Model
The Excel Parser model converts Excel spreadsheets (XLSX, XLS, XLSM) into structured Kodexa documents, preserving the hierarchical structure, cell values, formulas, and relationships. This enables Excel data to be integrated into document processing pipelines and analyzed alongside other document types.
How It Works
- The model reads the input Excel document and detects its format
- For non-XLSX formats, the file is converted to XLSX using LibreOffice
- The model processes the workbook in two passes:
  - First pass: Loads the workbook with data_only=True to read the calculated results
  - Second pass: Loads the workbook with data_only=False to read the formula text
- For each worksheet, the model extracts:
  - Sheet title and structure
  - Row organization
  - Cell content with precise positioning
  - Cell formulas and references
- The extracted content is organized into a hierarchical Kodexa document structure
- A workbook mixin is added to provide specialized Excel functionality
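The two-pass read described above can be sketched with openpyxl. This is a minimal illustration, not the model's actual implementation: the function name read_two_pass and the returned dictionary layout are assumptions made for the example. Note that openpyxl does not evaluate formulas itself, so the calculated value is only available when the file was last saved by an application that cached it.

```python
from openpyxl import load_workbook


def read_two_pass(path):
    """Return {sheet title: {cell reference: {"value", "formula"}}}.

    Hypothetical helper for illustration only.
    """
    # First pass: data_only=True yields the cached calculated results.
    values_wb = load_workbook(path, data_only=True)
    # Second pass: data_only=False yields the formula text.
    formulas_wb = load_workbook(path, data_only=False)

    result = {}
    for formula_sheet in formulas_wb.worksheets:
        value_sheet = values_wb[formula_sheet.title]
        cells = {}
        for row in formula_sheet.iter_rows():
            for cell in row:
                if cell.value is None:
                    continue  # skip empty cells
                is_formula = cell.data_type == "f"
                cells[cell.coordinate] = {
                    # None when no cached result is stored in the file
                    "value": value_sheet[cell.coordinate].value,
                    "formula": cell.value if is_formula else None,
                }
        result[formula_sheet.title] = cells
    return result
```

Keeping both passes keyed by the same cell reference makes it straightforward to attach the value and the formula to a single cell entry.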
Document Structure
The resulting document has a hierarchical structure that mirrors the workbook, from worksheets down to rows and cells.
Each node includes:
- Cell content: The text or numeric value of each cell
- Cell references: Excel-style references (e.g., “A1”, “B2”)
- Formulas: Original Excel formulas when present
- Positional information: Row and column positions preserved
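To show how row and column positions relate to Excel-style references such as "A1" and "B2", here is a minimal sketch of the conversion in both directions. The function names to_a1 and from_a1 are hypothetical and are not part of the Kodexa API; positions are assumed to be 1-based.

```python
import re

_A1_PATTERN = re.compile(r"^([A-Z]+)([1-9][0-9]*)$")


def to_a1(row, col):
    """Convert 1-based (row, col) to an A1-style reference."""
    if row < 1 or col < 1:
        raise ValueError("row and col must be >= 1")
    letters = ""
    while col > 0:
        # Column letters are a bijective base-26 numbering (A=1 ... Z=26).
        col, remainder = divmod(col - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return f"{letters}{row}"


def from_a1(reference):
    """Convert an A1-style reference to 1-based (row, col)."""
    match = _A1_PATTERN.match(reference.upper())
    if match is None:
        raise ValueError(f"not an A1 reference: {reference!r}")
    letters, digits = match.groups()
    col = 0
    for letter in letters:
        col = col * 26 + (ord(letter) - ord("A") + 1)
    return int(digits), col
```

For example, to_a1(2, 2) gives "B2", and from_a1("AA1") gives (1, 27).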
Data Handling
The Excel Parser carefully handles various data types:
- Numbers: Preserved with full precision, avoiding scientific notation
- Text: Maintained exactly as it appears in the spreadsheet
- Formulas: Both the formula text and calculated results are captured
- Merged Cells: Properly handled to maintain spreadsheet layout
- References: Cell references are converted to standard Excel notation
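As an illustration of rendering numbers without scientific notation, the sketch below converts a numeric cell value to plain decimal text using Python's decimal module. It is one possible approach under stated assumptions, not necessarily the one the model uses; the function name plain_number is hypothetical.

```python
from decimal import Decimal


def plain_number(value):
    """Render an int or float as text without an exponent.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, int):
        return str(value)
    # repr() gives the shortest text that round-trips the float;
    # the "f" format then expands any exponent into plain digits.
    return format(Decimal(repr(value)), "f")
```

With this helper, 1e-7 is rendered as "0.0000001" rather than "1e-07".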
Use Cases
This model is particularly useful for:
- Data Extraction: Extracting structured data from Excel spreadsheets
- Report Processing: Converting Excel reports into processable documents
- Tabular Data Analysis: Preparing Excel data for further analysis
- Formula Auditing: Examining formulas and calculations
- Excel Integration: Incorporating Excel files into document processing pipelines
- Data Transformation: Converting Excel data for use in other systems
Technical Details
- The model uses openpyxl for primary Excel processing
- LibreOffice is used for converting other Excel formats to XLSX
- Memory optimization techniques are employed for handling large spreadsheets
- Cell references are standardized to Excel A1 notation
- Temporary files are used and cleaned up during processing
- Merged cells are handled appropriately to maintain document structure
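The conversion and temporary-file cleanup described above might look like the following sketch. It assumes the LibreOffice command-line binary is available on the PATH as soffice; the function names build_convert_command and convert_to_xlsx are hypothetical, and the model's actual invocation may differ.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_convert_command(source, out_dir, binary="soffice"):
    """Build a headless LibreOffice command that converts source to XLSX."""
    return [
        binary,
        "--headless",
        "--convert-to", "xlsx",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_xlsx(source, dest_dir):
    """Convert a spreadsheet to XLSX and return the path of the result.

    The conversion runs in a temporary directory that is removed
    automatically, even if LibreOffice fails.
    """
    source = Path(source)
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(
            build_convert_command(source, tmp),
            check=True,           # raise if the conversion fails
            capture_output=True,  # keep LibreOffice output off the console
        )
        # LibreOffice names the output after the input file's stem.
        converted = Path(tmp) / (source.stem + ".xlsx")
        target = Path(dest_dir) / converted.name
        return Path(shutil.move(str(converted), str(target)))
```

Separating the command construction from the subprocess call keeps the invocation easy to inspect without LibreOffice installed.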
Limitations
- Very complex spreadsheets with custom functions may have limited formula support
- Visual elements like charts and graphs are not currently extracted
- Macros and VBA code are not included in the document structure
- Custom formatting is not preserved in the Kodexa document
Model Details
- Provider: Kodexa