Training a Model

In this section, we discuss how a Model Runtime is used to train a model that we have deployed. To make a model trainable, we need to add a few things to our model.yml file.

# A very simple first model that IS trainable

slug: my-model
version: 1.0.0
orgSlug: kodexa
type: store
storeType: MODEL
name: My Model
metadata:
  atomic: true
  trainable: true
  modelRuntimeRef: kodexa/base-model-runtime
  type: model
  inferenceOptions:
    - name: my_option
      type: string
      default: "Hello World"
      description: "A simple option"
  trainingOptions:
    - name: my_training_option
      type: string
      default: "Hello World"
      description: "A simple option"
  contents:
    - model/*

The important change here is trainable: true. The model assistant allows the user to define a training store and can capture training options. If we want to allow our model to be trained, we also need to provide a train function.

import logging
logger = logging.getLogger(__name__)

def train(document, training_store, training_options, model_data):
    logger.info(f"Training option is {training_options['my_training_option']}")
    logger.info(f"Training store is {training_store} and I can store my data in {model_data}")
    return document

📘 Training Options

It is important to note that during inference the options are passed as individual parameters, whereas during training the training options are passed as a single dictionary.
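
The difference can be sketched as follows. The infer signature is an assumption based on the inferenceOptions declared in model.yml above, and the calls at the bottom only imitate how a runtime might invoke each function; they are not the runtime's actual code.

```python
import logging

logger = logging.getLogger(__name__)


def infer(document, my_option="Hello World"):
    # Inference: each declared option arrives as its own keyword parameter.
    logger.info(f"Inference option is {my_option}")
    return document


def train(document, training_store, training_options, model_data):
    # Training: all declared options arrive together in a single dictionary.
    logger.info(f"Training option is {training_options['my_training_option']}")
    return document


# Imitating the two calling conventions (illustrative only).
inference_options = {"my_option": "Hi"}
inferred = infer("doc", **inference_options)
trained = train("doc", None, {"my_training_option": "Hi"}, "model_data")
```

In practice this means a train function should look options up by key rather than expect them as named arguments.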

The train function is called by the model runtime when the model is trained. The training_store is the document store that the user has selected for training. The training_options are the options that the user has selected for training. The model_data is a directory that the model can use to store its “trained model”; a model can store anything in this directory. The contents of the model_data directory will be stored in the model store as a Model Training.

We place the responsibility for iterating over the documents in the training store, and for training on them, on the model itself. This gives the model flexibility in how it processes the training documents. Once completed, the model can save any “trained materials” in model_data.
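
As a minimal sketch of that flow, the train function below walks the training store, performs a placeholder "training" step, and writes the result into model_data. The FakeStore class and the trained.json file name are hypothetical and exist only so the example can run locally; the query(page_size=...).content pattern is assumed to behave as in the training loop shown later in this section.

```python
import json
import logging
import os
import tempfile
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def train(document, training_store, training_options, model_data):
    # Walk the document families in the training store (first page only here).
    paths = [family.path for family in training_store.query(page_size=1000).content]

    # "Training" is a placeholder: we simply record what we saw.
    trained = {"option": training_options["my_training_option"], "paths": paths}

    # Write the trained materials into the model_data directory.
    os.makedirs(model_data, exist_ok=True)
    with open(os.path.join(model_data, "trained.json"), "w", encoding="utf-8") as f:
        json.dump(trained, f)
    return document


# Hypothetical stand-in for a training store, for local experimentation only.
class FakeStore:
    def __init__(self, paths):
        self._families = [SimpleNamespace(path=p) for p in paths]

    def query(self, page_size=1000):
        return SimpleNamespace(content=self._families[:page_size])


model_data_dir = tempfile.mkdtemp()
train(None, FakeStore(["a.pdf", "b.pdf"]), {"my_training_option": "Hello World"}, model_data_dir)
with open(os.path.join(model_data_dir, "trained.json"), encoding="utf-8") as f:
    saved = json.load(f)
```

A real model would replace the placeholder step with its own fitting logic and save whatever artifacts it needs (weights, vocabularies, and so on) in the same directory.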

Supporting Model Testing in the UI

One of the powerful features in Kodexa is the Model Assistant, which allows a user to label a document and then test the model against that document. This lets the user see how the model performs against a document that they have labeled.

To support this, we need to include an extra parameter in the train function: additional_training_document. This will be an instance of KodexaDocument.

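import logging

# Assumption: KodexaDocument refers to the Kodexa SDK's Document class;
# adjust this import to match the SDK version you have installed.
from kodexa import Document as KodexaDocument

logger = logging.getLogger(__name__)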

def train(document, training_store, training_options, model_data, additional_training_document):
    logger.info(f"Training option is {training_options['my_training_option']}")
    logger.info(f"Training store is {training_store} and I can store my data in {model_data}")

    if isinstance(additional_training_document, KodexaDocument):
        logger.info(f"Additional training document is {additional_training_document}")
    return document

Adding this allows you to determine in the model how you wish to handle this additional training document.

📘 Additional Training Document

The additional training document should always be a KodexaDocument. Note, however, that you need to check (using the path of the Kodexa Document) that you do not also pick up the same document from the training store.

Below is an example of how you might write the training loop so that it skips this document:

for document_family in training_store.query(page_size=1000).content:
    logger.info(f'Using document {document_family.path}')

    if document_family.path == additional_training_document.metadata['path']:
        logger.info('Skipping additional training document')
        continue

    # Continue and train on Document
    pass
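
The same check can be wrapped in a small helper that also copes with the case where no additional training document is supplied. This is a sketch under stated assumptions: families_to_train_on is a hypothetical helper name, the document's metadata is assumed to behave like a dictionary, and the SimpleNamespace objects at the bottom are stand-ins so the example runs without a Kodexa platform.

```python
from types import SimpleNamespace


def families_to_train_on(training_store, additional_training_document=None):
    """Return document families, leaving out the additional training document."""
    skip_path = None
    if additional_training_document is not None:
        # As in the loop above, the path is read from the document's metadata.
        skip_path = additional_training_document.metadata.get("path")

    return [
        family
        for family in training_store.query(page_size=1000).content
        if skip_path is None or family.path != skip_path
    ]


# Stand-in objects for local experimentation only.
store = SimpleNamespace(
    query=lambda page_size=1000: SimpleNamespace(
        content=[SimpleNamespace(path="a.pdf"), SimpleNamespace(path="b.pdf")]
    )
)
extra = SimpleNamespace(metadata={"path": "b.pdf"})
remaining = [family.path for family in families_to_train_on(store, extra)]
```

Keeping the filter in one place makes it harder to train on the same labeled document twice by accident.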

Using files you deployed with the Model

When you deploy a model, you can include files that will be deployed with it. These files can be used by the model at runtime. To access them, add the parameter model_base to your function; this will be the folder where the model code has been deployed.

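import logging

# Assumption: KodexaDocument refers to the Kodexa SDK's Document class;
# adjust this import to match the SDK version you have installed.
from kodexa import Document as KodexaDocument

logger = logging.getLogger(__name__)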

def train(document, training_store, training_options, model_data, additional_training_document, model_base):
    logger.info(f"Training option is {training_options['my_training_option']}")
    logger.info(f"Training store is {training_store} and I can store my data in {model_data}")

    logger.info(f"Model base is {model_base}")
    if isinstance(additional_training_document, KodexaDocument):
        logger.info(f"Additional training document is {additional_training_document}")
    return document
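
As a rough illustration of using model_base, the helper below reads a text file that was shipped with the model. The file name model/stop_words.txt is hypothetical, and the sketch assumes that files matched by contents: model/* keep their relative path under model_base; check your own deployment layout before relying on this.

```python
import os
import tempfile


def load_stop_words(model_base):
    """Read a word list that was shipped alongside the model code."""
    # "model/stop_words.txt" is a hypothetical file included via `contents: model/*`.
    path = os.path.join(model_base, "model", "stop_words.txt")
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Local demonstration with a temporary folder standing in for model_base.
demo_base = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_base, "model"))
with open(os.path.join(demo_base, "model", "stop_words.txt"), "w", encoding="utf-8") as f:
    f.write("the\nand\n\nof\n")

stop_words = load_stop_words(demo_base)
```

Inside train, the same helper could be called as load_stop_words(model_base) before iterating over the training store.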