Data Types

Data types are used by the Extraction Engine, which is part of the core platform. The engine runs when you designate one or more data structures and a data store in a pipeline.

The final document is passed to the Extraction Engine, which then builds the data objects and data attributes linked back to the labeled document. Data types affect only the data attribute, which is designed to hold multiple representations of a piece of data. The following data types are currently available:

| Type | Description |
| --- | --- |
| String | The most basic data type; it can hold any type of information as a string of characters |
| Date | Supports capturing a date without a time element; the date is defined as local to UTC |
| Date/Time | Supports capturing a date with a time element; the date is defined as local to UTC |
| Phone Number | Tries to normalize a phone number |
| Email Address | Tries to convert the labeled content to a valid email address |
| Selectable Option | Tries to match the labeled value to a list of available options |
| Number | Tries to convert the labeled content to a number |
| Currency | Tries to convert the labeled content to a valid currency (decimal) |
| Boolean | Tries to convert the labeled content to a boolean value |

Understanding Normalization (Coalescing)

When we label text in a document, the result is always a “string”: we capture the text as it appears and do not try to standardize (or normalize) it.

However, most systems that use the data from Kodexa need to know that the data is of a specific type, such as a number or a date. This conversion happens when a data type is set on a data attribute. The Extraction Engine takes the text that is labeled in the document and tries to coalesce it into a specific form. For example, it might take “1.0” as a string and turn it into a number.

This is important because it means the system using the data from Kodexa knows the data is “valid” for that data type. For example, if the data type is a number, “abc” will not be accepted for that data attribute.
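As a rough illustration of this idea, the sketch below coalesces labeled text into a number in Python. It is not the Extraction Engine's code; the helper name `coalesce_number` and the choice to return `None` for invalid text are assumptions made for the example.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def coalesce_number(raw: str) -> Optional[Decimal]:
    """Return the labeled text as a Decimal, or None if it is not a number.

    Hypothetical helper for illustration only.
    """
    try:
        number = Decimal(raw.strip())
    except InvalidOperation:
        return None
    # Decimal also parses "NaN" and "Infinity"; treat those as invalid here.
    return number if number.is_finite() else None


print(coalesce_number("1.0"))  # 1.0
print(coalesce_number("abc"))  # None
```

The raw text is never changed by this step; only the typed representation depends on whether the conversion succeeds.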

Algorithms for Coalescing

The following table describes how labeled text is coalesced for each data type.

| Data Type | Description |
| --- | --- |
| Date or Date/Time | The Extraction Engine uses an NLP framework to try to convert the labeled text to a date/time |
| Boolean | If the text, converted to lowercase, is “true” the value is true; otherwise it is false |
| Currency | Attempt to convert to a decimal |
| Email | Use the regular expression shown below this table to extract the email address |
| Number | Parse as a decimal number |
| Phone Number | Parse the phone number using Google’s LibPhoneNumber |
| Selectable Option | No coalescing is currently applied |

The regular expression used for Email is:

^[a-zA-Z0-9_!#$%&'*+/=?`{|}~^.-]+@[a-zA-Z0-9.-]+$
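The Boolean and Email rules in the table can be sketched in a few lines of Python. This is an approximation for illustration, not the engine's implementation: the function names are hypothetical, the email pattern is written with the `|` inside the braces, and returning `None` for a non-matching email is an assumption.

```python
import re
from typing import Optional

# Email pattern from the table; the anchors mean the whole text must match.
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9_!#$%&'*+/=?`{|}~^.-]+@[a-zA-Z0-9.-]+$")


def coalesce_boolean(raw: str) -> bool:
    """True only when the lowercased text is exactly "true"."""
    return raw.lower() == "true"


def coalesce_email(raw: str) -> Optional[str]:
    """Return the text if it matches the email pattern, else None (assumed)."""
    return raw if EMAIL_PATTERN.fullmatch(raw) else None


print(coalesce_boolean("TRUE"))                  # True
print(coalesce_boolean("yes"))                   # False
print(coalesce_email("jane.doe@example.com"))    # jane.doe@example.com
print(coalesce_email("not an email"))            # None
```

Note that under the Boolean rule any text other than “true”, including “yes” or “1”, coalesces to false.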

How is Typed Data Stored?

A data attribute can store multiple representations of a piece of extracted data. Depending on the data type defined in the data structure, one or more of the properties of the data attribute will be updated.

| Property | Description | Applies to |
| --- | --- | --- |
| value | This is the raw value that was captured from the label | All |
| stringValue | This is the raw value as a string | Selectable Option, String |
| dateValue | This is the date/time in ISO format (YYYY-MM-DD and YYYY-MM-DDThh:mm) | Date, Date/Time |
| booleanValue | This is the boolean value | Boolean |
| decimalValue | This is the number or currency value | Currency, Number |
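To make the property table concrete, here is a small Python sketch of how a raw value might be stored alongside its typed representation. The `DataAttribute` class and `store_typed` function are hypothetical stand-ins whose field names mirror the table; leaving a typed property unset when conversion fails is an assumption, and dates are omitted because the NLP framework used for them is not named here.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass
class DataAttribute:
    """Hypothetical stand-in; field names mirror the property table."""
    value: str
    stringValue: Optional[str] = None
    dateValue: Optional[str] = None
    booleanValue: Optional[bool] = None
    decimalValue: Optional[Decimal] = None


def store_typed(raw: str, data_type: str) -> DataAttribute:
    attribute = DataAttribute(value=raw)  # "value" applies to all types
    if data_type in ("String", "Selectable Option"):
        attribute.stringValue = raw
    elif data_type == "Boolean":
        attribute.booleanValue = raw.lower() == "true"
    elif data_type in ("Number", "Currency"):
        try:
            attribute.decimalValue = Decimal(raw)
        except InvalidOperation:
            pass  # assumed: leave decimalValue unset if the text is not a number
    return attribute


attr = store_typed("12.50", "Currency")
print(attr.value, attr.decimalValue)  # 12.50 12.50
```

A consumer that only understands strings can still read `value`, while a consumer that needs a number reads `decimalValue`.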

Content Source

When we extract data from a document label, we capture the text that is labeled. This “raw” value is stored in the “value” property of the data attribute. We also need to define where that raw value comes from in the document, which is controlled by the 'Content Source' property of the taxon.

We support the following types of content source:

| Content Source | Description |
| --- | --- |
| Value or All Content | We look at the label and check whether it has been given a value. If so, we use that value. If the label does not have a specified value, we take all the text that the label has been applied to and use that as the value |
| Value Only | We look at the label and check whether it has been given a value. If so, we use that value. If the label did not specify a value, we return null |
| All Content | We look at the label and take all the text that the label has been applied to and use that as the value |
| Expression | Allows the user to define an expression that will be used to capture the value, see Expressions below |
| Script | Allows the user to define a script that will be used to capture the value, see Scripts below |
| Metadata | Allows the user to choose a metadata object that will be used as the value, see Metadata below |
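The first three content sources differ only in how they fall back when a label has no explicit value. The Python sketch below shows that decision under stated assumptions: the `Label` class and `resolve_raw_value` function are hypothetical, and `None` stands in for the null result described above. Expression, Script, and Metadata are left out because they depend on platform objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Label:
    """Hypothetical label: the text it covers plus an optional explicit value."""
    content: str
    value: Optional[str] = None


def resolve_raw_value(label: Label, content_source: str) -> Optional[str]:
    if content_source == "Value or All Content":
        # Prefer the explicit value; fall back to the labeled text.
        return label.value if label.value is not None else label.content
    if content_source == "Value Only":
        return label.value  # None when the label has no explicit value
    if content_source == "All Content":
        return label.content
    raise ValueError(f"Content source not covered by this sketch: {content_source}")


plain = Label(content="Total due: 42.00")
print(resolve_raw_value(plain, "Value or All Content"))  # Total due: 42.00
print(resolve_raw_value(plain, "Value Only"))            # None
```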

Expressions

An expression computes the value of the data attribute. Expressions are written in the Spring Expression Language (SpEL). When you write an expression, the context is the data object that you are working with, and the result of the expression is the value that is assigned to the attribute.

Since the data object is the context, you can use methods from the data object in the expression. For example, if you wanted to get the value of another attribute, you can use:

getAttribute('attributeName').getValue()

Other objects are also available as variables in the expression. For example, to read a piece of information from the metadata of the source document, you can use the metadata variable:

#metadata['CorrelationId']

The objects that are available to the expression are:

| Object Name | Description |
| --- | --- |
| document | The document that the data object is associated with |
| dataObject | The data object that the expression is being evaluated against |
| metadata | The metadata of the document that the data object is associated with |
| family | The document family that the document is associated with |

Scripts

A script is another way to compute the value of the data attribute. Scripts are written in the Groovy language. A script works slightly differently from an expression: the attribute is available to the script as a variable, and the script assigns the value to the attribute directly.

The objects that are available to the script are:

| Object Name | Description |
| --- | --- |
| attribute | The attribute that the script is being evaluated for |
| document | The document that the data object is associated with |
| dataObject | The data object that the script is being evaluated against |
| metadata | The metadata of the document that the data object is associated with |
| family | The document family that the document is associated with |
