Data Types

Data types are used by the Extraction Engine, which is part of the core platform. The engine runs when you designate one or more data structures and a data store in a pipeline: the final document is passed to the Extraction Engine, which builds the data objects and data attributes and links them back to the labeled document. Data types affect only the data attribute, which is designed to hold multiple representations of a piece of data. The following data types are currently available:

Type	Description
String	The most basic data type; it holds any information as a string of characters
Date	Captures a date without a time element; the local date is expressed relative to UTC
Date/Time	Captures a date with a time element; the local date and time are expressed relative to UTC
Phone Number	Tries to normalize a phone number
Email Address	Tries to convert the labeled content to a valid email address
Selection	Tries to match the labeled value to a list of available options
Number	Tries to convert the labeled content to a number
Currency	Tries to convert the labeled content to a currency amount (decimal)
Boolean	Tries to convert the labeled content to a boolean value

Understanding Normalization (Coalescing)

Text labeled in a document is always captured as a “string”: the text is recorded as-is, with no attempt to standardize (or normalize) it. However, most systems that consume data from Kodexa need the data in a specific type, such as a number or a date. This conversion happens when you set the data type on a data attribute. The Extraction Engine takes the labeled text and tries to coalesce it into that type; for example, it might take the string “1.0” and turn it into a number. This matters because the consuming system then knows the data is valid for that data type. If the data type is a number, for instance, “abc” will not be accepted for that data attribute.
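The idea of coalescing a labeled string into a number can be sketched in a few lines of Python. This is only an illustration and not Kodexa's implementation: the function name `coalesce_number` and the choice to return `None` for invalid text are assumptions made for the example.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def coalesce_number(raw: str) -> Optional[Decimal]:
    """Return the labeled text as a Decimal, or None if it is not numeric.

    Hypothetical helper for illustration only.
    """
    try:
        value = Decimal(raw.strip())
    except InvalidOperation:
        return None
    # Decimal also parses "NaN" and "Infinity"; treat those as invalid here.
    return value if value.is_finite() else None


print(coalesce_number("1.0"))  # a Decimal equal to 1.0
print(coalesce_number("abc"))  # None, because "abc" is not a number
```

A consuming system that receives the coalesced value can rely on it being numeric, which is the guarantee the data type is meant to provide.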

Algorithms for Coalescing

The following table describes how labeled text is coalesced into each data type.

Data Type	Description	Algorithm
Date or Date/Time	The Extraction Engine uses an NLP framework to try to convert the labeled text to a date/time	NLP conversion
Boolean	If the text (in lowercase) is “true”, the value is true; otherwise it is false	Simple string match
Currency	Attempt to convert to a decimal	Decimal conversion
Email Address	Extract the email address	Regex validation
Number	Parse as a decimal number	Numeric parsing
Phone Number	Parse the phone number using Google’s libphonenumber library	Library parsing
Selection	No coalescing is currently applied	N/A
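Two of the simpler rules in the table, the Boolean string match and the email extraction, can be sketched as follows. This is a minimal Python illustration rather than the engine's actual code: the function names and the regular expression are assumptions, and the pattern is deliberately simpler than full email validation.

```python
import re
from typing import Optional

# Deliberately simple pattern for illustration; the engine's real regex is not documented here.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def coalesce_boolean(raw: str) -> bool:
    """True only when the lowercased text is exactly "true"; anything else is False."""
    return raw.lower() == "true"


def coalesce_email(raw: str) -> Optional[str]:
    """Return the first email-like substring of the labeled text, or None."""
    match = EMAIL_PATTERN.search(raw)
    return match.group(0) if match else None


print(coalesce_boolean("TRUE"))
print(coalesce_boolean("yes"))
print(coalesce_email("Contact: jane.doe@example.com today"))
```

Note that under the rule in the table, text such as “yes” coalesces to false, because only the literal string “true” is matched.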

How is Typed Data Stored?

A data attribute can store multiple representations of a piece of extracted data. Depending on the data type defined in the data structure, one or more of the properties of the data attribute are updated.

Property	Description	Applies to
value	The raw value captured from the label	All
stringValue	The raw value as a string	Selection, String
dateValue	The date or date/time in ISO format (YYYY-MM-DD for Date, YYYY-MM-DDThh:mm for Date/Time)	Date, Date/Time
booleanValue	The boolean value	Boolean
decimalValue	The number or currency value	Currency, Number
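The property table can be read as a small data model: the raw `value` is always kept, and the typed property that matches the data type is filled in alongside it. The Python sketch below illustrates this under stated assumptions. The `DataAttribute` class and `build_attribute` function are hypothetical, the property names are taken from the table, and the date branch only handles ISO-formatted input instead of the NLP conversion the engine uses.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass
class DataAttribute:
    """Hypothetical model; field names follow the property table."""
    value: str
    stringValue: Optional[str] = None
    dateValue: Optional[str] = None
    booleanValue: Optional[bool] = None
    decimalValue: Optional[Decimal] = None


def build_attribute(raw: str, data_type: str) -> DataAttribute:
    attr = DataAttribute(value=raw)  # the raw value is always stored
    if data_type in ("String", "Selection"):
        attr.stringValue = raw
    elif data_type == "Boolean":
        attr.booleanValue = raw.lower() == "true"
    elif data_type in ("Number", "Currency"):
        try:
            attr.decimalValue = Decimal(raw)
        except InvalidOperation:
            pass  # leave decimalValue unset when the text is not numeric
    elif data_type == "Date":
        try:
            # Simplification: only ISO input is parsed in this sketch.
            attr.dateValue = date.fromisoformat(raw).isoformat()
        except ValueError:
            pass
    return attr


attr = build_attribute("12.50", "Currency")
print(attr.value, attr.decimalValue)
```

Keeping the raw value next to the typed one means a consumer can always see what was originally labeled, even when coalescing fails.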

Content Source

When we extract data from a document label, we capture the text that is labeled. This “raw” value is stored in the “value” property of the data attribute. However, we also need to define where that raw value comes from in the document. This is controlled by the “Content Source” property of the taxon. The following content sources are supported:

Content Source	Description
Value or All Content	If the label has been given a value, that value is used. If the label does not have a specified value, all the text that the label has been applied to is used as the value
Value Only	If the label has been given a value, that value is used. If the label does not specify a value, null is returned
All Content	All the text that the label has been applied to is used as the value
Expression	The user defines an expression that is used to capture the value; see Expressions below
Script	The user defines a script that is used to capture the value; see Scripts below
Metadata	The user chooses a metadata object that is used as the value; see Metadata below
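The first three content sources differ only in how they fall back between the label's explicit value and the text it covers. A minimal Python sketch of that decision is shown below. The `Label` class, the `resolve_raw_value` function, and the constant names such as `VALUE_OR_ALL_CONTENT` are hypothetical and chosen for the example; they are not identifiers from the platform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Label:
    """Hypothetical label: an optional explicit value plus the text it covers."""
    value: Optional[str]
    content: str


def resolve_raw_value(label: Label, content_source: str) -> Optional[str]:
    if content_source == "VALUE_OR_ALL_CONTENT":
        # Prefer the explicit value, fall back to the labeled text.
        return label.value if label.value is not None else label.content
    if content_source == "VALUE_ONLY":
        # No fallback: a label without a value yields None (null).
        return label.value
    if content_source == "ALL_CONTENT":
        # Always use the labeled text, ignoring any explicit value.
        return label.content
    raise ValueError(f"Unsupported content source: {content_source}")


unvalued = Label(value=None, content="Invoice 1234")
print(resolve_raw_value(unvalued, "VALUE_OR_ALL_CONTENT"))
print(resolve_raw_value(unvalued, "VALUE_ONLY"))
```

The Expression, Script, and Metadata sources are left out of the sketch because they compute the value from other objects rather than from the label itself.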

Expressions

An expression computes the value that is assigned to the data attribute. Expressions are written in the Spring Expression Language (SpEL). The evaluation context is the data object you are working with, and the result of the expression becomes the value of the attribute. Because the data object is the context, you can call its methods directly in the expression. For example, to get the value of another attribute, you can use:

getAttribute('attributeName').getValue()

Other objects are also available as variables in the expression. For example, to read a piece of information from the metadata of the source document, you can use the metadata variable:

metadata['CorrelationId']

The objects that are available to the expression are:

Object Name	Description
document	The document that the data object is associated with
dataObject	The data object that the expression is being evaluated against
metadata	The metadata of the document that the data object is associated with
family	The document family that the document is associated with

Scripts

A script also produces the value of the data attribute, but it works slightly differently from an expression. Scripts are written in the Groovy language. The attribute is available to the script as a variable, so the script assigns the value to the attribute directly instead of returning a result. The objects that are available to the script are:

Object Name	Description
attribute	The attribute that the script is being evaluated for
document	The document that the data object is associated with
dataObject	The data object that the script is being evaluated against
metadata	The metadata of the document that the data object is associated with
family	The document family that the document is associated with
