Data types are used by the Extraction Engine, which is part of the core platform. The engine runs when you designate one or more data structures and a data store in a pipeline.
The final document is passed to the Extraction Engine, which then builds the data objects and data attributes linked back to the labeled document. Data types affect only the data attribute, which is designed to hold multiple representations of a piece of data. The following data types are currently available:
Type | Description |
String | The most basic data type that can hold any type of information as a string of characters |
Date | Supports capturing a date without a time element; the date is treated as local to UTC |
Date/Time | Supports capturing a date with a time element; the date is treated as local to UTC |
Phone Number | Tries to normalize a phone number |
Email Address | Tries to convert the labeled content to a valid email address |
Selectable Option | Tries to match the value labeled to a list of available options |
Number | Tries to convert the labeled content to a number |
Currency | Tries to convert the labeled content to a valid currency (decimal) |
Boolean | Tries to convert the labeled content to a boolean value |
Understanding Normalization (Coalescing)
When we label text in a document, it is always a “string”. This means we are capturing the text as-is and not trying to standardize (or normalize) it in any way.
However, most systems that use the data from Kodexa need to know that the data is of a specific type, such as a number or a date. This conversion happens when a data type is set on a data attribute. The Extraction Engine takes the text that is labeled in the document and tries to coalesce it into a specific form. For example, it might take “1.0” as a string and turn it into a number.
This is important because it means the system using the data from Kodexa knows the data is “valid” for that data type. For example, if the data type is a number, “abc” will not be accepted for that data attribute.
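The idea of coalescing can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the Extraction Engine's actual implementation; the function name `coalesce_number` is hypothetical.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def coalesce_number(raw: str) -> Optional[Decimal]:
    """Try to turn labeled text into a decimal; return None if it is not a number.

    Hypothetical helper for illustration only.
    """
    try:
        value = Decimal(raw.strip())
    except InvalidOperation:
        return None
    # Decimal also accepts "NaN" and "Infinity"; treat those as invalid here
    return value if value.is_finite() else None


print(coalesce_number("1.0"))  # prints 1.0
print(coalesce_number("abc"))  # prints None
```

In this sketch a failed conversion yields `None`, so a consumer can tell a valid number from text that could not be coalesced.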
Algorithms for Coalescing
The following table describes how labeled data is coalesced into each data type.
Data Type | Description |
Date or Date/Time | The Extraction Engine uses an NLP framework to try to convert the labeled text to a date/time |
Boolean | If the text (in lowercase) is “true” then it is true, else it is false |
Currency | Attempt to convert to a decimal |
Email Address | Use the regular expression "^[a-zA-Z0-9_!#$%&'*+/=?`{|}~^.-]+@[a-zA-Z0-9.-]+$" to extract the email address |
Number | Parse as a decimal number |
Phone Number | Parse the phone number using Google’s LibPhoneNumber |
Selectable Option | No coalescing is applied at present |
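Two of the rules in the table are simple enough to sketch directly. The Python below mirrors the Boolean and Email Address rows; the function names are hypothetical, and using `fullmatch` is an assumption about how the pattern is applied, not a statement of how the Extraction Engine is written.

```python
import re
from typing import Optional

# The email pattern from the table above
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9_!#$%&'*+/=?`{|}~^.-]+@[a-zA-Z0-9.-]+$")


def coalesce_boolean(raw: str) -> bool:
    """True only when the lowercased text is exactly "true"; anything else is False."""
    return raw.lower() == "true"


def coalesce_email(raw: str) -> Optional[str]:
    """Return the text if the whole string matches the pattern, else None."""
    return raw if EMAIL_PATTERN.fullmatch(raw) else None


print(coalesce_boolean("TRUE"))                # prints True
print(coalesce_boolean("yes"))                 # prints False
print(coalesce_email("jane.doe@example.com"))  # prints jane.doe@example.com
print(coalesce_email("not an email"))          # prints None
```

Note that under the Boolean rule, values such as “yes” or “1” coalesce to false rather than being rejected.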
How is Typed Data Stored?
A data attribute can store multiple representations of a piece of extracted data. Depending on the data type defined in the data structure, one or more of the properties of the data attribute will be updated.
Property | Description | Applies to |
value | This is the raw value that was captured from the label | All |
stringValue | This is the raw value as a string | Selectable Option, String |
dateValue | This is the date/time in ISO format (YYYY-MM-DD and YYYY-MM-DDThh:mm) | Date, Date/Time |
booleanValue | This is the boolean value | Boolean |
decimalValue | This is the number or currency value | Currency, Number |
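To make the table concrete, here is a minimal Python sketch of how the properties might be populated for a given data type. The class `DataAttributeSketch` and the function `populate` are hypothetical stand-ins; only the property names come from the table. Date handling is left out because it relies on an NLP framework.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass
class DataAttributeSketch:
    """Hypothetical stand-in for a data attribute; field names follow the table."""
    value: str                              # raw labeled value, always set
    stringValue: Optional[str] = None
    dateValue: Optional[str] = None         # not populated in this sketch
    booleanValue: Optional[bool] = None
    decimalValue: Optional[Decimal] = None


def populate(raw: str, data_type: str) -> DataAttributeSketch:
    """Fill in the typed property that applies to the given data type."""
    attribute = DataAttributeSketch(value=raw)
    if data_type in ("String", "Selectable Option"):
        attribute.stringValue = raw
    elif data_type == "Boolean":
        attribute.booleanValue = raw.lower() == "true"
    elif data_type in ("Number", "Currency"):
        try:
            attribute.decimalValue = Decimal(raw)
        except InvalidOperation:
            attribute.decimalValue = None  # text could not be coalesced
    return attribute


price = populate("12.50", "Currency")
print(price.value, price.decimalValue)  # prints 12.50 12.50
```

The raw `value` is kept regardless of the data type, so the original labeled text remains available alongside the typed representation.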
Content Source
When we extract data from a document label, we capture the text that is labeled. This “raw” value is stored in the “value” property of the data attribute. However, we also need to know where that raw value comes from in the document. This is controlled by the 'Content Source' property of the taxon.
We support the following types of content source:
Content Source | Description |
Value or All Content | We look at the label and check whether it has been given a value. If so, we use it. If the label does not have a specified value, we take all the text that the label has been applied to and use that as the value |
Value Only | We look at the label and check whether it has been given a value. If so, we use it. If the label did not specify a value, we return null |
All Content | We take all the text that the label has been applied to and use that as the value |
Expression | This allows the user to define an expression that will be used to capture the value, see Expressions below |
Script | This allows the user to define a script that will be used to capture the value, see Scripts below |
Metadata | This allows the user to choose a metadata object that will be used as the value, see Metadata below |
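The first three content sources differ only in how they fall back between the label's value and the labeled text. The following Python sketch shows that logic under stated assumptions: the function name and the string constants are hypothetical, and the Expression, Script, and Metadata sources are not covered.

```python
from typing import Optional


def resolve_raw_value(content_source: str,
                      label_value: Optional[str],
                      labeled_text: str) -> Optional[str]:
    """Pick the raw value for a label according to the content source.

    Hypothetical helper; the constant names are illustrative only.
    """
    if content_source == "VALUE_OR_ALL_CONTENT":
        # Prefer the label's own value, fall back to the labeled text
        return label_value if label_value is not None else labeled_text
    if content_source == "VALUE_ONLY":
        # No fallback: a label without a value yields None (null)
        return label_value
    if content_source == "ALL_CONTENT":
        # Always use the labeled text, ignoring any label value
        return labeled_text
    raise ValueError(f"Content source not covered by this sketch: {content_source}")


print(resolve_raw_value("VALUE_OR_ALL_CONTENT", None, "Invoice 42"))  # prints Invoice 42
print(resolve_raw_value("VALUE_ONLY", None, "Invoice 42"))            # prints None
print(resolve_raw_value("ALL_CONTENT", "42", "Invoice 42"))           # prints Invoice 42
```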
Expressions
An expression computes the value that will be assigned to the data attribute. Expressions are written in the Spring Expression Language (SpEL). When you write an expression, the context is the data object that you are working with, and the result of the expression is the value assigned to the attribute.
Since the data object is the context, you can use methods from the data object in the expression. For example, if you wanted to get the value of another attribute, you can use:
getAttribute('attributeName').getValue()
We also have other objects available as variables to use in the expression. For example, if you wanted to get a piece of information from the metadata of the source document, you can use the metadata variable:
#metadata['CorrelationId']
The objects that are available to the expression are:
Object Name | Description |
document | The document that the data object is associated with |
dataObject | The data object that the expression is being evaluated against |
metadata | The metadata of the document that the data object is associated with |
family | The document family that the document is associated with |
Scripts
A script computes the value that will be assigned to the data attribute. Scripts are written in the Groovy language. A script works slightly differently from an expression: the attribute is available to the script as a variable, and you can assign the value to the attribute directly in the script.
The objects that are available to the script are:
Object Name | Description |
attribute | The attribute that the script is being evaluated for |
document | The document that the data object is associated with |
dataObject | The data object that the script is being evaluated against |
metadata | The metadata of the document that the data object is associated with |
family | The document family that the document is associated with |