aivis Engine v2 - State Detection - User Guide


aivis State Detection is one of the engines of the aivis Technology Platform by Vernaio and a component of multiple products such as aivis Insights, Process Booster, and more. aivis State Detection analyzes incidents and classifies them according to their root causes. A live score is inferred for each root cause, indicating whether the risk of an incident is high. The information provided helps to understand root causes and makes it possible not only to detect but also to prevent incidents.

Using revolutionary new mathematical concepts of multivariate & non-linear statistics (i.e., Geometrical Kernel Machines), aivis State Detection minimizes the user's input requirements and reaches a prediction quality never seen before.

Aivis needs historical data that includes known incidents. Before the actual training, Data Preparation steps are taken, such as time synchronization, automatic feature engineering, and automatic feature selection. Afterwards, Analysis sorts the incidents into segments, each of which stands for a root cause. Finally, the engine generates a model in the Training step. The resulting model is then used during Inference, which, on demand, returns a score between 0 and 1 for each root cause, based on recent values of the relevant signals.

Introduction

API References

This documentation explains the usage and principles behind aivis State Detection to data and software engineers. For detailed API descriptions of docker images, web endpoints and SDK functions, please consult the reference manual of the respective component:

For additional support, go to Vernaio Support.

Workflow Overview

Using aivis State Detection consists of 3 steps that each fulfill a specific task:

  1. Analysis cleans the data and sorts incidents into segments, which each stand for a certain root cause. Additional information like signal correlations is provided for each segment, to enable the user to fully understand the root cause. The Training Data used for both analysis and training has to include incidents in the form of a boolean target signal.
  2. Training creates a compact model based on the Training Data, either using the segmentation suggested by aivis in the first step or using a segmentation adjusted by the user.
  3. Inference applies the model from the previous step to some Inference Data (without any target values) to create a score for each root cause, either for historical evaluation or live inference.

In contrast to other engines, analysis and training are two different steps. This enables the users to adjust the proposed segmentation, if they wish to.

Workflow Overview

Artifact Distribution

Currently, aivis State Detection is distributed to a closed user base only. To gain access to the artifacts, as well as for any other questions, you can open a support ticket via aivis Support.

Example Use Case

As an illustrative use case example, we will use aivis State Detection for machine failures. Any machine (or machine part) has a finite lifetime. Replacing it too early is neither economical nor environmentally friendly. However, waiting until it finally breaks results in a long downtime in the factory or even dangerous situations which need to be avoided. Thus, deeper knowledge about the remaining lifetime is crucial. With training data of machines that have been running until failures occurred, an aivis State Detection model can be trained which indicates an upcoming failure in live inference by rising scores.

The example data used in this documentation was sourced from Kaggle. It describes turbofan failures at NASA. Turbofans are jet engines widely used in aerospace; failed turbofans of course need to be avoided at all costs. On a test stand, the turbofans ran until they broke, and the data of every test stand was then merged one after another to obtain consecutive time series for all 24 signals on the test stand. Minor alterations were performed on the data for pedagogical reasons. First, the remaining useful lifetime was replaced by a boolean signal indicating the engine's failure. Additionally, UNIX-like timestamps were introduced (milliseconds since 1970-01-01). Results are independent of the time unit chosen, but for simplicity we use milliseconds throughout this documentation. It can be used freely and shared under the following terms.

In the following two chapters we will train a model. We will evaluate it on a historical time window outside of the training period ("out-of-sample").

Getting Started (Docker)

The docker images of aivis State Detection are prepared for easy usage. They use the Java SDK internally, but have a simpler file-based interface. If you have a working docker workflow system like Argo, you can build your own automated workflow based on these images.

In this chapter, we will show you how to get started using docker images. Usage of the SDK will be covered by the next chapter.

Run Example Code

A working example that builds on the code explained below can be downloaded directly here: state-detection-examples.zip.

This zip file contains example code for docker, python and java in respective subfolders. All of them use the same dataset which is in the data subfolder.

Prerequisites: Additionally to the state-detection-examples.zip you just downloaded, you need the following artifacts. To gain access, you can open a support ticket via aivis Support.

  • The docker images aivis-engine-v2-sd-training-worker, aivis-engine-v2-sd-inference-worker and (optionally for HTML report generation) aivis-engine-v2-toolbox
  • An aivis licensing key, see licensing

As a Kubernetes user, even without deeper Argo knowledge, the aivis-engine-v2-example-sd-argo.yaml best shows how the containers are executed one after another, how the training and inference workers are provided with folders that contain the data CSVs, and how the toolbox assembles an HTML report at the end.

Artifacts

There are 4 different docker images:

  • The Analysis Worker creates an analysis, i.e. a segmentation and its report files:
    {REGISTRY}/{NAMESPACE}/aivis-engine-v2-sd-analysis-worker:{VERSION}
  • The Training Worker first either creates an analysis from scratch or needs a training-preparation from an analysis worker. Then, it creates a model:
    {REGISTRY}/{NAMESPACE}/aivis-engine-v2-sd-training-worker:{VERSION}
  • The Inference Worker creates scores for a predefined time window in a bulk manner. This is convenient for evaluating a model:
    {REGISTRY}/{NAMESPACE}/aivis-engine-v2-sd-inference-worker:{VERSION}
  • The Inference Service offers a RESTful web API that allows the triggering of individual scores for a specified time via an HTTP call:
    {REGISTRY}/{NAMESPACE}/aivis-engine-v2-sd-inference-service:{VERSION}

All docker images are Linux-based.

Requirements

You need an installation of Docker on your machine as well as access to the engine artifacts:

docker -v
docker pull {REGISTRY}/{NAMESPACE}/aivis-engine-v2-sd-training-worker:{VERSION}

Licensing

A valid licensing key is necessary for every aivis calculation in every engine and every component. It has to be set (exported) as environment variable AIVIS_ENGINE_V2_API_KEY.

If aivis returns a licensing error despite the environment variable being set, please check the following items:

  • Terminals usually need to be restarted to learn newly set environment variables.
  • Licensing keys have the typical form <FirstPartOfKey>.<SecondPartOfKey> with first and second part being UUIDs. In particular, there is no whitespace.
  • A common error source is that the user's firewall does not let HTTPS requests to v3.aivis-engine-v2.vernaio-licensing.com (before release 2.7: v2.aivis-engine-v2.vernaio-licensing.com, before release 2.3: aivis-engine-v2.perfectpattern-licensing.de) pass and the licensing request never reaches the licensing server. In that case outgoing connections to that hostname and TCP port 443 need to be whitelisted.

Analysis and Training

First, we need to make an analysis and train the model (workflow steps 1: Analysis and 2: Training). The docker container Training Worker can do both, if the user doesn't wish to adjust the proposed segmentation.

At the beginning, we create a folder docker, a subfolder training-config and add the configuration file config.yaml:

data:
  folder: /srv/data
  dataTypes:
    defaultType: FLOAT
    booleanSignals: ["failure"]
analysis:
  _type: Build 
  analysisConfig: 
    target:
      signal: "failure"
    selfLabeling:
      mesh: 3600000
      coherencePeriod: 432000000
    sampling:
      additionalSampleMesh: 
training:  
  modeling: 
    controlPointCount: 2000
output:
  folder: /srv/output

For the moment, you may take this file as it is. The different keys will become more clear from the later sections and the docker reference manual. As a next step, we create a second folder data and add the Training Data CSV file train_sd.csv to the folder. Afterwards, we create a blank folder output.

Our folder structure should now look like this:

+- docker
|  +- training-config
|      +- config.yaml
|
+- data
|  +- train_sd.csv
|
+- output

Finally, we can start our training via one of the following commands (the first is for a Linux/macOS shell, the second for Windows PowerShell):

docker run --rm -it \
  -v $(pwd)/docker/training-config:/srv/conf \
  -v $(pwd)/data/train_sd.csv:/srv/data/train_sd.csv \
  -v $(pwd)/output:/srv/output \
  -e AIVIS_ENGINE_V2_API_KEY={LICENSE_KEY} \
  {REGISTRY}/{NAMESPACE}/aivis-engine-v2-sd-training-worker:{VERSION}
docker run --rm -it `
  -v ${PWD}/docker/training-config:/srv/conf `
  -v ${PWD}/data/train_sd.csv:/srv/data/train_sd.csv `
  -v ${PWD}/output:/srv/output `
  -e AIVIS_ENGINE_V2_API_KEY={LICENSE_KEY} `
  {REGISTRY}/{NAMESPACE}/aivis-engine-v2-sd-training-worker:{VERSION}

After a short time, this should lead to three output files in the output folder:

  • analysis-report.json can be inspected to get information about the analysis step, e.g. segmentation and dependencies between various signals.
  • training-report.json can be inspected to get information about the training step, e.g. the label history.
  • model.json holds all model information for the following Inference.
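
If you want a quick look at these results without generating the toolbox HTML report, you can open the JSON files directly. The following Python sketch merely loads analysis-report.json and lists its top-level keys; the exact field names are not documented here, so treat this as a starting point for exploration rather than a fixed schema.

import json

# load the analysis report produced by the training worker
with open("output/analysis-report.json", encoding="utf-8") as f:
    report = json.load(f)

# print the top-level keys to see where segmentation and correlation details live
print(list(report.keys()))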

Evaluation / Inference

After the training has finished, we can evaluate it by running a historical evaluation (bulk inference) on the second data file. This is the out-of-sample evaluation. To assess the quality of the model, we want to obtain a continuous stream of score values — exactly as it would be desired by the machine operator.

For this, we create a second subfolder inference-config inside the docker folder and add the configuration file config.yaml:

data:
  folder: /srv/data
  dataTypes:
    defaultType: FLOAT
    booleanSignals: ["failure"]
inference:
  config:
    skipOnInsufficientData: true
  modelFile: /srv/output/model.json
  timestamps:
  - _type: Equidistant
    startTime: 1331128800000
    endTime: 1336572000000
    interval: 3600000
output:
  folder: /srv/output

Note that there is also a different, experimental prediction method infer with next normal to help identify the causes of observed anomalies, see section Infer With Next Normal.

After that, we add the Inference Data CSV file eval_sd.csv to the data folder. Our folder structure should now look like this:

+- docker
|  +- training-config
|      +- config.yaml
|  +- inference-config
|      +- config.yaml
|
+- data
|  +- train_sd.csv
|  +- eval_sd.csv
|
+- output

Finally, we can run the Inference via one of the following commands (the first is for a Linux/macOS shell, the second for Windows PowerShell):

docker run --rm -it \
  -v $(pwd)/docker/inference-config:/srv/conf \
  -v $(pwd)/data/eval_sd.csv:/srv/data/eval_sd.csv \
  -v $(pwd)/output:/srv/output \
  -e AIVIS_ENGINE_V2_API_KEY={LICENSE_KEY} \
  {REGISTRY}/{NAMESPACE}/aivis-engine-v2-sd-inference-worker:{VERSION}
docker run --rm -it `
  -v ${PWD}/docker/inference-config:/srv/conf `
  -v ${PWD}/data/eval_sd.csv:/srv/data/eval_sd.csv `
  -v ${PWD}/output:/srv/output `
  -e AIVIS_ENGINE_V2_API_KEY={LICENSE_KEY} `
  {REGISTRY}/{NAMESPACE}/aivis-engine-v2-sd-inference-worker:{VERSION}

Successful execution should lead to the file scores.json in the output folder, which holds the predicted scores for each segment.
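
As a quick check before plotting, the scores file can be inspected with a few lines of Python. The snippet below only loads scores.json and prints the beginning of its content; verify the exact structure (one score per segment and timestamp) against the file itself before building a plot, e.g. with matplotlib.

import json

# load the bulk inference result
with open("output/scores.json", encoding="utf-8") as f:
    scores = json.load(f)

# show the beginning of the content to understand its structure before plotting
print(json.dumps(scores, indent=2)[:1000])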

When plotted, the output looks like this (three scores in yellow, green and blue):

Scores

It is of course possible to make a live inference (just-in-time) for the current timestamp. You can then feed the predicted scores back as new signal into your hot storage / time series database / historian and reach our initial goal of providing the machine operator with a continuous score for incident risks. For this purpose, the Inference Service may be preferable as it offers a RESTful API to trigger inferences via HTTP in contrast to the Inference Worker, which uses a file based API and is designed for a bulk evaluation.

Next, we will do the same calculations with direct function calls via an SDK.

Getting Started (SDK)

The SDK of aivis State Detection allows for direct calls from your C, Java or Python program code. All language SDKs internally use our native shared library (FFI). As C APIs can be called from various other languages as well, the C-SDK can also be used with languages such as R, Go, Julia, Rust, and more. Compared to the docker images, the SDK enables a more fine-grained usage and tighter integration.

In this chapter we will show you how to get started using the SDK.

Run Example Code

A working SDK example that builds on the code explained below can be downloaded directly here: state-detection-examples.zip.

This zip file contains example code for docker, python and java in respective subfolders. All of them use the same dataset which is in the data subfolder.

In addition to the state-detection-examples.zip you just downloaded, you need the following artifacts. To gain access, you can open a support ticket via aivis Support.

Required artifacts:

  • These aivis engine v2 .whl-files which you will receive in a libs.zip directly from aivis Support:
    • aivis_engine_v2_sd_runtime_python_full-{VERSION}-py3-none-win_amd64.whl: A state detection full python runtime
      (here for windows, fitting your operating system - see artifacts for other options on linux and macos.)
    • aivis_engine_v2_base_sdk_python-{VERSION}-py3-none-any.whl: The base python sdk
    • aivis_engine_v2_sd_sdk_python-{VERSION}-py3-none-any.whl: The state detection python sdk
    • aivis_engine_v2_toolbox-{TOOLBOX-VERSION}-py3-none-any.whl: The toolbox python sdk - optional for HTML report generation
  • An aivis licensing key, see licensing, which you will receive directly from aivis Support

Preparations:

  • Make sure you have a valid Python (>= 3.9) installation.
  • To apply the aivis licensing key, create an environment variable AIVIS_ENGINE_V2_API_KEY and assign the licensing key to it.
  • Make sure you have an active internet connection so that the licensing server can be contacted.
  • Download and unzip the state-detection-examples.zip. The data CSVs train_sd.csv and eval_sd.csv need to stay in **/data.
  • Download and unzip the libs.zip. These .whl-files need to be in **/libs.

The folder now has the following structure:

+- data
|  +- train_sd.csv
|  +- eval_sd.csv
|
+- docker
|  +- # files to run the example via docker images, which we will not need now
|
+- java
|  +- # files to run the example via java sdk, which we will not need now 
|
+- libs
|  +- # the .whl files to run aivis
|
+- python
|  +- # files to run the example via python sdk 

Running the example code:

  • Navigate to the **/python subfolder. Here, you find the classic python script example_sd.py and the jupyter notebook example_sd.ipynb. Both run the exact same example and output the same result. Choose which one you want to run.
  • There are various ways to install dependencies from .whl files. We will now explain two options, which are installing them via pip install or installing them via poetry. Many other options are also possible, of course.

Option A: pip install (only for the classic python script example_sd.py, not for the jupyter notebook example_sd.ipynb)

  • open a console in the **/python subfolder and run the following commands:
      # installs the `.whl` files
      pip install -r requirements-<platform>.txt
    
      # runs the classic python script `example_sd.py`
      python example_sd.py --input=../data --output=output
    

Option B: poetry install

  • If not already installed, install poetry, a Python package manager:
      # installs poetry (a package manager)
      python -m pip install poetry
    
  • Run either the classic python script example_sd.py
      # installs the `.whl` files
      poetry install --no-root
    
      # runs the classic python script `example_sd.py`
      poetry run python example_sd.py --input=../data --output=output
    
  • Or run jupyter notebook example_sd.ipynb by executing the following commands in the console opened in the **/python subfolder. The first one might take a while, the third one opens a tab in your browser.
      # installs the `.whl` files
      poetry install --no-root
    
      # installs jupyter kernel
      poetry run ipython kernel install --user --name=test_sd
    
      # runs the jupyter python script `example_sd.ipynb`
      poetry run jupyter notebook example_sd.ipynb
    

After running the scripts, you will find your computation results in **/python/output.

In addition to the state-detection-examples.zip you just downloaded, you need the following artifacts. To gain access, you can open a support ticket via aivis Support.

Required artifacts:

  • These aivis engine v2 .jar files which you will receive in a libs.zip directly from aivis Support:
    • aivis-engine-v2-sd-runtime-java-full-win-x8664-{VERSION}.jar: A state detection full java runtime, here for windows, fitting your operating system - see artifacts for other options on linux and macos.
    • aivis-engine-v2-base-sdk-java-{VERSION}.jar: The base java sdk
    • aivis-engine-v2-sd-sdk-java-{VERSION}.jar: The state detection java sdk
    • There is NO toolbox jar for HTML report generation.
  • An aivis licensing key, see licensing, which you will receive directly from aivis Support

Preparations:

  • Make sure you have a valid Java (>= 11) installation.
  • To apply the aivis licensing key, create an environment variable AIVIS_ENGINE_V2_API_KEY and assign the licensing key to it.
  • Make sure you have an active internet connection so that the licensing server can be contacted.
  • Download and unzip the state-detection-examples.zip. The data CSVs train_sd.csv and eval_sd.csv need to stay in **/data.
  • Download and unzip the libs.zip. These .jar-files need to be in **/libs.

The folder now has the following structure:

+- data
|  +- train_sd.csv
|  +- eval_sd.csv
|
+- docker
|  +- # files to run the example via docker images, which we will not need now
|
+- java
|  +- # files to run the example via java sdk 
|
+- libs
|  +- # the .jar files to run aivis
|
+- python
|  +- # files to run the example via python sdk, which we will not need now 

Running the example code:

  • We use Gradle as our Java-Package-Manager. It's easiest to directly use the gradle wrapper.
  • Navigate to the **/java subfolder. Here, you find the build.gradle. Check that the paths point correctly to your aivis engine v2 .jar files in the **/libs subfolder.
  • open a console in the **/java subfolder and run the following commands:
      # builds this Java project with gradle wrapper
      ./gradlew clean build
    
      # runs Java with parameters referring to input and output folder
      java -jar build/libs/example_sd.jar --input=../data --output=output
    

After running the scripts, you will find your computation results in **/java/output.

Artifacts

Our SDK artifacts come in two flavours:

  • full packages provide the full functionality and are available for mainstream targets only:
    • win-x8664
    • macos-armv8* (SDK >= 11.0; since release 2.3)
    • macos-x8664* (SDK >= 11.0; since release 2.3; until aivis engine version 2.9.0)
    • linux-x8664 (glibc >= 2.14)
  • inf packages contain only API functions regarding the inference of a model. As lightweight artifacts they are available for a broader target audience:
    • win-x8664
    • macos-armv8* (SDK >= 11.0; since release 2.3)
    • macos-x8664* (SDK >= 11.0; since release 2.3; until aivis engine version 2.9.0)
    • linux-x8664 (glibc >= 2.14)
    • linux-armv7 (glibc >= 2.18; until aivis engine version 2.9.0)
    • linux-armv8 (glibc >= 2.18; until aivis engine version 2.9.0)
    • linux-ppc64 (glibc >= 2.18; until aivis engine version 2.2.0)

* Only Python and C SDKs are supported. Java SDK is not available for this target.

In this chapter we want to demonstrate the full API functionality and thus always use the full package.

To use the Python-SDK you must download the SDK artifact (flavour and target generic) for your pythonpath at build time. Additionally at installation time, the runtime artifact must be downloaded with the right flavour and target.

The artifacts are distributed through a PyPI registry.

Using Poetry you can simply set a dependency on the artifacts specifying flavour and version. The target is chosen depending on your installation system:

aivis_engine_v2_sd_sdk_python = "{VERSION}"
aivis_engine_v2_sd_runtime_python_{FLAVOUR} = "{VERSION}"

The SDK supports the full API and will throw a runtime exception if a non-inference function is invoked with an inference-flavoured runtime.

To use the Java-SDK, you must download at build time:

  • SDK artifact (flavour and target generic) for your compile and runtime classpath
  • Runtime artifact with the right flavour and target for your runtime classpath

It is possible to include multiple runtime artifacts for different targets in your application to allow cross-platform usage. The SDK chooses the right runtime artifact at runtime.

The artifacts are distributed through a Maven registry.

Using Maven, you can simply set a dependency on the artifacts specifying flavour, version and target:

<dependency>
  <groupId>com.vernaio</groupId>
  <artifactId>aivis-engine-v2-sd-sdk-java</artifactId>
  <version>{VERSION}</version>
</dependency>
<dependency>
  <groupId>com.vernaio</groupId>
  <artifactId>aivis-engine-v2-sd-runtime-java-{FLAVOUR}-{TARGET}</artifactId>
  <version>{VERSION}</version>
  <scope>runtime</scope>
</dependency>

Alternatively, with Gradle:

implementation 'com.vernaio:aivis-engine-v2-sd-sdk-java:{VERSION}'
runtimeOnly    'com.vernaio:aivis-engine-v2-sd-runtime-java-{FLAVOUR}-{TARGET}:{VERSION}'

The SDK supports the full API and will throw a runtime exception if a non-inference function is invoked with an inference-flavoured runtime.

To use the C-SDK, you must download the SDK artifact at build time (flavour and target generic). For final linkage/execution you need the runtime artifact with the right flavour and target.

The artifacts are distributed through a Conan registry.

Using Conan, you can simply set a dependency on the artifact specifying flavour and version. The target is chosen depending on your build settings:

aivis-engine-v2-sd-sdk-c/{VERSION}
aivis-engine-v2-sd-runtime-c-{FLAVOUR}/{VERSION}

The SDK artifact contains:

  • Headers: include/aivis-engine-v2-sd-core-full.h

The runtime artifact contains:

  • Import library (LIB file), if Windows target: lib/aivis-engine-v2-sd-{FLAVOUR}-{TARGET}.lib
  • Runtime library (DLL file), if Windows target: bin/aivis-engine-v2-sd-{FLAVOUR}-{TARGET}.dll (also containing the import library)
  • Runtime library (SO file), if Linux target: lib/aivis-engine-v2-sd-{FLAVOUR}-{TARGET}.so (also containing the import library)

The runtime library must be shipped to the final execution system.

Licensing

A valid licensing key is necessary for every aivis calculation in every engine and every component. It has to be set (exported) as environment variable AIVIS_ENGINE_V2_API_KEY.

If aivis returns a licensing error despite the environment variable being set, please check the following items:

  • Terminals usually need to be restarted to learn newly set environment variables.
  • Licensing keys have the typical form <FirstPartOfKey>.<SecondPartOfKey> with first and second part being UUIDs. In particular, there is no whitespace.
  • A common error source is that the user's firewall does not let HTTPS requests to v3.aivis-engine-v2.vernaio-licensing.com (before release 2.7: v2.aivis-engine-v2.vernaio-licensing.com, before release 2.3: aivis-engine-v2.perfectpattern-licensing.de) pass and the licensing request never reaches the licensing server. In that case outgoing connections to that hostname and TCP port 443 need to be whitelisted.
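
If you want to verify the environment variable from Python before starting a long calculation, a small sanity check like the following can help; it only assumes the <UUID>.<UUID> key format described above.

import os
import re

# check that the licensing key is set and has the expected <UUID>.<UUID> form
key = os.environ.get("AIVIS_ENGINE_V2_API_KEY", "")
uuid = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
if not re.fullmatch(uuid + r"\." + uuid, key):
    print("AIVIS_ENGINE_V2_API_KEY is missing, contains whitespace, or is not of the form <UUID>.<UUID>")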

Setup

Before we can invoke API functions of our SDK, we need to set it up for proper usage and consider the following things.

Releasing Unused Objects

It is important to ensure the release of allocated memory for unused objects.

In Python, freeing objects and destroying engine resources like Data-, Training- and Inference-objects is done automatically. You can force resource destruction with the appropriate destroy function.

In Java, freeing objects is done automatically, but you need to destroy all engine resources like Data-, Training- and Inference-objects with the appropriate destroy function. As they all implement Java's AutoCloseable interface, we can also write a try-with-resources statement to auto-destroy them:

try(final StateDetectionData trainingData = StateDetectionData.create()) {

  // ... do stuff ...

} // auto-destroy when leaving block

In C, you must always

  • free every non-null pointer allocated by the engine with aivis_free (all pointers returned by functions and all double pointers used as output function parameter e.g. Error*)
    Note: aivis_free will only free own objects. Also, it will free objects only once and it disregards null pointers.
  • free your own objects with free as usual.
  • destroy all handles after usage with the appropriate destroy function.

Error Handling

Errors and exceptions report what went wrong on a function call. They can be caught and processed by the outside.

In Python, an Exception is thrown and can be caught conveniently.

In Java, an AbstractAivisException is thrown and can be caught conveniently.

In C, every API function can write an error to the given output function parameter &err (to disable this, just set it to NULL). This parameter can then be checked by a helper function similar to the following:

const Error *err = NULL;

void check_err(const Error **err, const char *action) {

  // everything is fine, no error
  if (*err == NULL)
    return;

  // print information
  printf("\taivis Error: %s - %s\n", action, (*err)->json);

  // release error pointer
  aivis_free(*err);
  *err = NULL;

  // exit program
  exit(EXIT_FAILURE);
}

Failures within function calls will never affect the state of the engine.

Logging

The engine emits log messages to report on the progress of each task and to give valuable insights. These log messages can be caught via registered loggers.

# create logger
class Logger(EngineLogger):
    def log(self, level, thread, module, message):
        if (level <= 3):
            print("\t... %s" % message)

# register logger
StateDetectionSetup.register_logger(Logger())
// create and register logger
StateDetectionSetup.registerLogger(new EngineLogger() {
            
    public void log(int level, String thread, String module, String message) {
        if (level <= 3) {
            System.out.println(String.format("\t... %s", message));
        }
    }
});
// create logger
void logger(const uint8_t level, const char *thread, const char *module, const char *message) {
  if (level <= 3)
    printf("\t... %s\n", message);
}

// register logger
aivis_setup_register_logger(&logger, &err);
check_err(&err, "Register logger");

Thread Management

During the usage of the engine, a lot of calculations are done. Parallelism can drastically speed things up. Therefore, set the maximal threads to a limited number of CPU cores or set it to 0 to use all available cores (defaults to 0).

# init thread count
StateDetectionSetup.init_thread_count(4)
// init thread count
StateDetectionSetup.initThreadCount(4);
// init thread count
aivis_setup_init_thread_count(4, &err);
check_err(&err, "Init thread count");

Data Input

Now that we are done setting up the SDK, we need to create a data store that holds our historical Training Data which is to be used for analysis and training. In general, all data must always be provided through data stores. You can create as many as you want.

After the creation of the data store, you can fill it with signal data.

# create empty data context for training data
training_data = StateDetectionData.create()

# add sample data
training_data.add_float_signal("signal-name", [
  DtoFloatDataPoint(1640995200000, 1.0),
  DtoFloatDataPoint(1640998800000, 2.0),
  DtoFloatDataPoint(1641002400000, 4.0),
])

# ... use training data ...
// create empty data context for training data
try(final StateDetectionData trainingData = StateDetectionData.create()) {

  // add sample data
  trainingData.addFloatSignal("signal-name", Arrays.asList(
    new DtoFloatDataPoint(1640995200000L, 1.0),
    new DtoFloatDataPoint(1640998800000L, 2.0),
    new DtoFloatDataPoint(1641002400000L, 4.0)
  ));

  // ... use training data ...

} // auto-destroy training data
// create empty data context for training data
TimeseriesDataHandle training_data = aivis_timeseries_data_create(&err);
check_err(&err, "Create training data context");

const DtoFloatDataPoint points[] = {
  {1640995200000, 1.0},
  {1640998800000, 2.0},
  {1641002400000, 4.0},
};

// add sample data
aivis_timeseries_data_add_float_signal(training_data, "signal-name", &points[0], sizeof points / sizeof *points, &err);
check_err(&err, "Adding signal");

// ... use training data ...

// destroy data context
aivis_timeseries_data_destroy(training_data, &err);
check_err(&err, "Destroy data context");
training_data = 0;
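
In practice, the data points usually come from a CSV file like train_sd.csv rather than being written out by hand. The following Python sketch shows one way to fill the data store from such a file; the import path and the signal column names are assumptions and must be adapted to your installation and data.

import csv

# the import path is an assumption - adapt it to the package you installed
from aivis_engine_v2_sd_sdk_python import StateDetectionData, DtoFloatDataPoint

def add_csv_float_signals(store, path, signal_ids):
    # collect the data points of every requested column
    points = {signal_id: [] for signal_id in signal_ids}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            timestamp = int(row["timestamp"])
            for signal_id in signal_ids:
                value = row.get(signal_id, "")
                if value != "":  # empty fields are unknown values and are skipped
                    points[signal_id].append(DtoFloatDataPoint(timestamp, float(value)))
    # add each column as one float signal
    for signal_id, data_points in points.items():
        store.add_float_signal(signal_id, data_points)

training_data = StateDetectionData.create()
# "sensor-1" and "sensor-2" are placeholder column names
add_csv_float_signals(training_data, "../data/train_sd.csv", ["sensor-1", "sensor-2"])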

Analysis

With the data store filled with historical Training Data, we can now create our analysis:

# build analysis config
analysis_config = json.dumps({
  "target": {
    "signal": "failure"
  },
  "selfLabeling": {
    "coherencePeriod": 432000000,
    "mesh": 3600000
  },
  "sampling": {
    "additionalSampleMesh": None
  },
})

# create and perform analysis
analysis = StateDetectionAnalysis.create(training_data, analysis_config)

# ... use analysis ...
// build analysis config
final DtoAnalysisConfig analysisConfig = new DtoAnalysisConfig(
    new DtoTargetConfig("failure"),
    new DtoSelfLabelingConfig(432000000L, 3600000L),
    new DtoAnalysisSamplingConfig()
);

// create and perform analysis
try(final StateDetectionAnalysis analysis = StateDetectionAnalysis.create(trainingData, analysisConfig)) {

  // ... use analysis ...

} // auto-destroy analysis
// build analysis config
const char *analysis_config = "{"
  "\"target\": {"
    "\"signal\": \"failure\""
  "},"
  "\"selfLabeling\": {"
    "\"coherencePeriod\": 432000000,"
    "\"mesh\": 3600000"
  "}",
  "\"sampling\": {"
    "\"additionalSampleMesh\": null"
  "}"
"}";

For the moment, you may take the analysis configuration as it is. The different configuration options will be explained later on.

// create and perform analysis
StateDetectionAnalysisHandle analysis_handle = aivis_state_detection_analysis_create(
  training_data,
  (uint8_t *) analysis_config,
  strlen(analysis_config),
  &err
);
check_err(&err, "Create analysis");

// ... use analysis ...

// destroy analysis
aivis_state_detection_analysis_destroy(analysis_handle, &err);
check_err(&err, "Destroy analysis");
analysis_handle = 0;

Training

With the data store filled with historical Training Data and the analysis, we can now create our training. The training config can contain changes on the proposed segmentation, but in this example we only change the maximal sample count.

# build training config
training_config = json.dumps({
  "sampling": {
    "maximalSampleCount": 300
  }
})

# create training and train the model
training = StateDetectionTraining.create_by_analysis(training_data, analysis, training_config)

# ... use training ...
// build training config
final DtoTrainingConfig trainingConfig = new DtoTrainingConfig().withSampling(
    new DtoTrainingSamplingConfig().withMaximalSampleCount(300)
);

// create training and train the model
try(final StateDetectionTraining training = StateDetectionTraining.createByAnalysis(trainingData, analysis, trainingConfig)) {

  // ... use training ...

} // auto-destroy training
// build training config
const char *training_config = "{"
  "\"sampling\": {"
    "\"maximalSampleCount\": 300"
  "}"
"}";

// create training and train the model
StateDetectionTrainingHandle training_handle = aivis_state_detection_training_create_by_analysis(
  training_data,
  analysis_handle,
  (uint8_t *) training_config,
  strlen(training_config),
  &err
);
check_err(&err, "Create training");

// ... use training ...

// destroy training
aivis_state_detection_training_destroy(training_handle, &err);
check_err(&err, "Destroy training");
training_handle = 0;

Evaluation / Inference

After the training has finished, we can evaluate it by running a historical evaluation (bulk inference) on the inference data (out-of-sample). This way, we obtain a continuous stream of values — exactly as it would be desired by the machine operator.

As we do the inference in the same process with the training, we can create the inference directly from the training. If these two processes were separated we could get the model explicitly from the training and write it to a file. The inference could then be created based on the content of the model file.

# build inference config
inference_config = json.dumps({"skipOnInsufficientData": True})

# create inference
inference = StateDetectionInference.create_by_training(training, inference_config)

# ... use inference ...
// build inference config
final DtoInferenceConfig inferenceConfig = new DtoInferenceConfig(true);

// create inference
try(final StateDetectionInference inference = StateDetectionInference.createByTraining(training, inferenceConfig)) {

  // ... use inference ...

} // auto-destroy inference
// build inference config
const char *inference_config = "{\"skipOnInsufficientData\": true}";

// create inference
StateDetectionInferenceHandle inference_handle = aivis_state_detection_inference_create_by_training_handle(
  training_handle,
  (uint8_t *) inference_config,
  strlen(inference_config),
  &err
);
check_err(&err, "Create inference");

// ... use inference ...

// destroy inference
aivis_state_detection_inference_destroy(inference_handle, &err);
check_err(&err, "Destroy inference");
inference_handle = 0;

Finally, we want to infer some scores for a list of Inference Timestamps. Therefore we again need to provide a filled data store which this time holds our Inference Data, just in the same way as we created our Training Data store. We then invoke the appropriate infer function. Note that the infer with next normal function is an alternative, experimental function to better understand how some state has been detected, see the section on Infer With Next Normal.
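
For a historical evaluation, the Inference Timestamps are typically an equidistant grid over the evaluation window, analogous to the Equidistant block in the docker inference config above. A minimal Python sketch (whether the end time is included is a choice made here, not something prescribed by the engine):

# equidistant inference timestamps in UNIX milliseconds (one per hour)
start_time = 1331128800000
end_time = 1336572000000
interval = 3600000

timestamps = list(range(start_time, end_time + 1, interval))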

# choose inference timestamps
timestamps = ...

# infer scores
scores = inference.infer(inference_data, timestamps)

# ... use scores e.g. for plotting ...
// choose inference timestamps
final List<Long> timestamps = ...

// infer scores
final List<DtoSegmentsFloatDataPoint> scores = inference.infer(inferenceData, timestamps);

// ... use scores e.g. for plotting ...
// choose inference timestamps
Time *timestamps = ...

// infer scores
const List_DtoSegmentsFloatDataPoint *scores = aivis_state_detection_inference_infer(
  inference_handle,
  inference_data,
  timestamps,
  timestamps_len,
  &err
);
check_err(&err, "Infer scores");

// ... use scores e.g. for plotting ...

// free scores
aivis_free(scores);
scores = NULL;

// free timestamps
free(timestamps);
timestamps = NULL;

When plotted, the output looks like this (three scores in yellow, green and blue):

Scores

Besides plotting, you can feed the predicted values back as a new signal into your hot storage / time series database / historian and reach our initial goal of providing the machine operator with continuous live scores for incident risks.

In the next chapter we will focus on the nature of input data.

Data Specification

In the course of using aivis, large amounts of data are ingested. This chapter explains the terminology as well as the required format, quality and quantity.

Timeseries Data / Signals

Most aivis engines work on time series data that is made up of signals. Every signal consists of two things, these being

  • an ID, which is any arbitrary String except timestamp and availability. The ID needs to be unique within the data.
  • a list of data points. Each data point consists of a signal value and a specific point in time, the Detection Timestamp (optionally there can also be an Availability Timestamp, but more on that later). Usually the values are the result of a measurement happening in a physical sensor like a thermometer, photocell or electroscope, but you can also use market KPIs like stock indices or resource prices as a signal.

The data points for one or more signals for a certain detection time range are called a history.

Timeseries

The values of a signal can be boolean values, 64-bit Floating Point numbers or Strings. Non-finite numbers (NAN and infinity) and empty strings are regarded as being unknown and are therefore skipped.

Points in time are represented by UNIX Timestamps in milliseconds (64-bit Integer). This means the number of milliseconds that have passed since 01.01.1970 00:00:00 UTC.
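
For illustration, a timestamp in this format can be computed from a calendar date with a few lines of Python:

from datetime import datetime, timezone

# 2022-01-01 00:00:00 UTC expressed as UNIX milliseconds
dt = datetime(2022, 1, 1, tzinfo=timezone.utc)
timestamp_ms = int(dt.timestamp() * 1000)
print(timestamp_ms)  # 1640995200000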

Detection Timestamp

The point in time that a signal value belongs to is called the Detection Timestamp. This usually is the timestamp when the measurement originally has taken place. If the measurement is a longer offline process, it should refer to the point in time at which the measured property was established, e.g. the time point of sample drawing or the production time for delayed sampling. In case of the target signal, the Detection Timestamp should be set to the time you would have liked to have measured the signal online. In the aivis Signal Prediction example use case, the paper quality is such a signal. It is measured around 2 hours after the production of the paper in a laboratory and must be backdated to a fictitious, but instantaneous quality measurement in the process.

Different signals may have different Detection Timestamps. Some might have a new value every second, some every minute, some just when a certain event happens. aivis automates the process of synchronizing them internally. This includes dealing with holes in the data.

Availability Timestamp

When doing a historical evaluation, we want to know what the engine would have inferred/predicted for a list of Inference Timestamps that lie in the past (Inference Timestamps are the moments for which you want to get an inference). For a realistic inference, the engine must ignore all signal values that were not yet available to the database at the Inference Timestamp. A good example for such a case is a measurement that is recorded by a human. The value of this measurement will be backdated by the person to the Detection Timestamp, but it may have taken e.g. 5 minutes to extract the value and report it to the system. So, it would be wrong to assume that one minute after this fictitious Detection Timestamp, the value would already have been available to the Inference. Another example is the fully automated, lagged data ingestion of distributed systems (especially cloud systems).

There are multiple ways to handle availability. Which strategy you use depends on the concrete use case.

Availability

To allow for these different strategies, every data point can have an additional Availability Timestamp that tells the system when this value became available or would have been available. Signal values for which the Availability Timestamp lies after the Inference Timestamp are not taken into account for an inference at this Inference Timestamp.

If there is no knowledge about when data became available, the Availability Timestamp can be set to the Detection Timestamp, but then you must keep in mind that your historical evaluation might look better than it would have been in reality.
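
Conceptually, honoring the Availability Timestamp in a historical evaluation means discarding every data point that only became available after the Inference Timestamp. A minimal sketch, assuming each data point is a (detection_ts, availability_ts, value) tuple:

def available_at(data_points, inference_timestamp):
    # keep only data points that were already available at the inference timestamp
    return [p for p in data_points if p[1] <= inference_timestamp]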

Data Recommendations

aivis works best on raw, unprocessed data. It is important to keep the following rules in mind:

  • Remove signals beforehand only if you are absolutely sure that they are unrelated to your objective! The engine will select all relevant signals, anyway, and removing signals may reduce quality.
  • Avoid linear interpolation (or similar data processing steps), as this would include information from the future and therefore invalidate or worsen the results.
  • It is okay (except for aivis Signal Monitor) to drop consecutive duplicate values of one signal (e.g., if the value stays the same for a long period of time); see the sketch after this list. This is because the engine assumes the value of a signal to be constant until a new data point is given, though there are subtleties for the target signal. It is, however, not advisable to drop duplicate values when using the aivis Signal Monitor SignalInactive trigger, since the engine learns how often the signal gets a new data point.
  • Do not train the engine on signals that will not be there in live operation (i.e., during the inference phase). Doing so could harm the prediction quality because the engine might choose to use these soon-to-be-missing signals for prediction. For aivis Signal Monitor, this may produce unnecessary (or false) warnings (e.g., SignalInactive).
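
As an illustration of the duplicate-dropping rule mentioned above, the following sketch removes consecutive data points whose value did not change, assuming (timestamp, value) tuples sorted by timestamp:

def drop_consecutive_duplicates(points):
    # points: list of (timestamp, value) tuples, sorted by timestamp
    result = []
    last_value = object()  # sentinel that never equals a real value
    for timestamp, value in points:
        if value != last_value:
            result.append((timestamp, value))
            last_value = value
    return result

# the middle point is dropped because the value did not change
print(drop_consecutive_duplicates([(0, 1.0), (60000, 1.0), (120000, 2.0)]))
# [(0, 1.0), (120000, 2.0)]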

Training Data Filtering

There is the possibility of filtering the Training Data in multiple ways:

  • The overall time window can be restricted.
  • Signals can be excluded and included as a whole.
  • Specific time windows of specific signals can be excluded or included.

The filtering is configurable:

  • The docker image Training Worker can be configured in the main config file.
  • The SDK Training API has filter nodes in its config structure.

This means that two models could be trained on the same data set, but on different time windows or signal sets. Alternatively, the user can of course also restrict the data that enters the engine beforehand.

Training vs. Inference Data

aivis uses data at two distinct points in the workflow:

  1. Training Data is used to train a model from knowledge that was derived from historical data. To ensure high quality of the model, you should use as many signals as possible over a period of time in a fine resolution that fits to your objective. The engine can ingest several thousands of signals and time ranges over multiple years. The idea is to simply put in all the data you have. The engine will filter out irrelevant signals by itself.
  2. Inference Data is the small amount of live data that is used as the direct input to make an inference/prediction. For each Inference Timestamp, the engine needs a small and recent history of the relevant signals to understand the current situation of the system. You can find more information on this in the next section Inference Data Specification.

Inference Data Specification

When making an inference, aivis must know the current state of the real system by including a small portion of history.

In Training, the engine calculates which signals among the many signals in the Training Data will be relevant for the Inference. Furthermore, for each relevant signal a time window is specified relative to the Inference Timestamp. This time window determines which values of the signal must be included in order to make a prediction for said timestamp. This doesn't only include the values within the time window, but also either the value right at the start, or the last value before the time window (see "Nearest Predecessor"). This information is called the Inference Data Specification and must be obeyed strictly when triggering Inference, as the engine relies on this data.

You can inspect a model for its Inference Data Specification.

It is possible to set the maximum amount of time to be included in the local history. This is done in the configuration of the Training via the parameter Maximal Lag.

The following diagram gives you a visual representation of what an Inference Data Specification could look like:

Inference Data Specification

In the diagram you see that a start lag and end lag is specified for every signal. For the Inference, this means that for each signal we need all data points whose detection timestamps lie in the window [ inference timestamp - start lag; inference timestamp - end lag ] as well as the nearest predecessor (see below).

Nearest Predecessor

As previously mentioned, it is essential that you provide data for the whole time window. In particular, it must be clear what the value at the beginning is, i.e. at inference timestamp - start lag.

Typically, there is no measurement for exactly this point in time. Then, you must provide the nearest predecessor. This is the last value before the beginning of the time window. Then, the engine can at least take this value as an estimate. Of course this first data point must also be available at the Inference Timestamp (regarding the Availability Timestamp).

Nearest Predecessor

Depending on the configuration, the engine will either throw an error or ignore timestamps for which you provide neither a value at the beginning of the time window nor a nearest predecessor. This implies that you always need at least one available value per relevant signal. Sending more data outside the demanded time window will have no effect on the inference, though.
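
To make the rule concrete, the following Python sketch selects the data points for one signal and one Inference Timestamp: everything inside the lag window plus, if needed, the nearest predecessor. It assumes (timestamp, value) tuples sorted by timestamp and is an illustration of the specification, not engine code.

def select_window(points, inference_ts, start_lag, end_lag):
    # points: list of (timestamp, value) tuples, sorted by timestamp
    window_start = inference_ts - start_lag
    window_end = inference_ts - end_lag
    selected = [p for p in points if window_start <= p[0] <= window_end]
    # add the nearest predecessor if there is no value exactly at the window start
    if not any(p[0] == window_start for p in selected):
        predecessors = [p for p in points if p[0] < window_start]
        if predecessors:
            selected.insert(0, predecessors[-1])
    return selected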

CSV Format

All artifacts use CSV as the input data format. As the CSV format is highly non-standardized, we will discuss it briefly in this section.

CSV files must be stored in a single folder specified in the config under data.folder. Within this folder the CSV files can reside in an arbitrary subfolder hierarchy. In some cases (e.g. for HTTP requests), the folder must be passed as a ZIP file.

General CSV rules:

  • The file’s charset must be UTF-8.
  • Records must be separated by Windows or Unix line ending (CR LF/LF). In other words, each record must be on its own line.
  • Fields must be separated by comma.
  • The first line of each CSV file represents the header, which must contain column headers that are file-unique.
  • Every record including the header must have the same number of fields.
  • Text values must be enclosed in quotation marks if they contain literal line endings, commas or quotation marks.
  • Quotation marks inside such a text value have to be prefixed (escaped) with another quotation mark.

Special rules:

  • One column must be called timestamp and contain the Detection Timestamp as UNIX Timestamps in milliseconds (64-bit Integer)
  • Another column can be present that is called availability. This contains the Availability Timestamp in the same format as the Detection Timestamp.
  • All other columns, i.e. the ones that are not called timestamp or availability, are interpreted as signals.
  • Signal IDs are defined by their column headers
  • If there are multiple files containing the same column header, this data is regarded as belonging to the same signal
  • Signal values can be boolean values, numbers and strings
  • Empty values are regarded as being unknown and are therefore skipped
  • Files directly in the data folder or in one of its subfolders are ordered by their full path (incl. filename) and read in this order
  • If there are multiple rows with the same Detection Timestamp, the data reader passes all of them to the engine, which uses the last value that has been read

Boolean Format

Boolean values must be written in one of the following ways:

  • true/false (case insensitive)
  • 1/0
  • 1.0/0.0 with an arbitrary number of additional zeros at the end

Regular expression: (?i:true)|(?i:false)|1(\.0+)?|0(\.0+)?

Number Format

Numbers are stored as 64-bit Floating Point numbers. They are written in scientific notation like -341.4333e-44, so they consist of the compulsory part Significand and an optional part Exponent that is separated by an e or E.

The Significand contains one or multiple figures and optionally a decimal separator (.). In such a case, figures before or after the separator can be omitted and are assumed to be 0. It can be prefixed with a sign (+ or -).

The Exponent contains one or multiple figures and can be prefixed with a sign, too.

The 64-bit Floating Point specification also allows for 3 non-finite values (not a number, positive infinity and negative infinity) that can be written as nan, inf/+inf and -inf (case insensitive). These values are valid, but the engine regards them as being unknown and they are therefore skipped.

Regular expression: (?i:nan)|[+-]?(?i:inf)|[+-]?(?:\d+\.?|\d*\.\d+)(?:[Ee][+-]?\d+)?

String Format

String values must be encoded as UTF-8. Empty strings are regarded as being unknown values and are therefore skipped.

Example

timestamp,availability,SIGNAL_1,SIGNAL_2,SIGNAL_3,SIGNAL_4,SIGNAL_5
1580511660000,1580511661000,99.98,74.33,1.94,true,
1580511720000,1580511721000,95.48,71.87,-1.23,false,MODE A
1580511780000,1580511781000,100.54,81.19,,1e-5,MODE A
1580511840000,1580511841000,76.48,90.01,2.46,0.0,MODE C
...
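
If you export data from your own system, a file in this format can be written with Python's standard csv module. The column names below are only placeholders; note that timestamps and availability are UNIX milliseconds and booleans may be written as true/false.

import csv

# placeholder rows - timestamps and availability in UNIX milliseconds
rows = [
    {"timestamp": 1580511660000, "availability": 1580511661000, "SIGNAL_1": 99.98, "SIGNAL_4": "true"},
    {"timestamp": 1580511720000, "availability": 1580511721000, "SIGNAL_1": 95.48, "SIGNAL_4": "false"},
]

with open("export.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "availability", "SIGNAL_1", "SIGNAL_4"])
    writer.writeheader()
    writer.writerows(rows)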

Preparation and Incidents

Previous sections gave an introduction on how to use aivis State Detection and also shed some light on how it works. The following sections will provide a more profound background. It is not necessary to know this background to use aivis State Detection! However, you may find convenient solutions for specific problems, such as special data kinds or restrictions, or ways to reduce training computation time. The following sections are organized in the natural order of the workflow. By workflow, we mean the cycle of data preparation, data analysis, model training, and finally utilizing it by making inferences with it. It will become clear that only minimal user input is required for this workflow. Nevertheless, the user has the option to control the process with several input parameters which will be presented below.

Before approaching the workflow steps and their configuration, two things are required. First, there must be a data set that meets the previously described data specifications. Second, and this is equally important, the incidents in the data must be clear. They need to have an exact occurrence time, indicated e.g. by a certain signal switching to True. Applying this to our introductory use case example, the data set would consist of sensor measurements from the test stand and a signal called failure which is usually False and switches to True when an engine breaks. It makes sense to also take a step back and contemplate the goal of the task at hand. In the case of aivis State Detection, the questions will be similar to “How can I have fewer incidents?”, “What are the root causes of my incidents?”, “How can I understand these root causes?” and “Can I be informed that my process reached a state where an incident is likely?”

Incidents can be any kind of events that happen from time to time, and are related to some short-term change of the system. In our example use case, the incidents are machine failures. It is not recommended to apply aivis State Detection to thresholds on continuous signals. For example, you may want to assure the temperature of some component to stay below 80°C. Technically, you could create a boolean signal that switches when this threshold is exceeded, and use this signal as target in aivis State Detection. But temperature changes slowly, by a continuous process, from below to above 80°C. In that case, another engine like aivis Signal Prediction or aivis Response Analysis may be more suitable.

Analysis

The analysis comprises all the steps to reach a segmentation. In a segmentation incidents are clustered according to their root causes. aivis relieves the user of many steps that are typically associated with the machine learning process. Some of the steps are performed automatically, others are rendered superfluous. While domain knowledge can be employed, this is not a necessity. The following section will illuminate these topics in more detail. It is explained how the raw data is transformed, additional features are extracted automatically or manually, and relevant features are identified. In the following sections, the analysis configuration parameters are explained in some detail. Here we focus on explaining the underlying concepts. When it comes to actually making the inputs, syntactical questions will be answered by the aivis reference manuals, which define the exact inputs that can be made depending on whether you are using one of the SDKs or a Docker container.

Workflow

Feature Engineering

The requirements for the input data that were formulated in section Data Specification serve the purpose of making it possible for aivis to read the data. Typically, several additional steps are required to make the data appropriate to be fed into a machine learning algorithm. Among others, these include:

  • synchronizing timestamps
  • dealing with missing values
  • standardization
  • handling outliers
  • filtering white noise

All of the above is handled by aivis automatically. Here, data preparation steps that go beyond anything described in the “Data Specification” section are not necessary and even discouraged as they may alter the underlying information. Synchronization is not necessary as aivis treats each signal separately as a time series, and this also eliminates the need for imputation of missing values. Standardization is a routine task that is executed internally (if necessary at all). Outliers don’t need to be removed beforehand or cut back thanks to the way the aivis engine builds its models. Some minimal outlier handling might still be beneficial as will be explained below.

When the data is brought into a form that can directly be used for analysis and model building, it is referred to as “features”. Sometimes, signals are noisy and fluctuate a lot, while the change in what the signal actually is measuring may be only very small. To reduce such noise, a common approach would be to calculate a moving mean and use it as a feature. This is just one example for data aggregation over some time interval but there may be other cases, involving e.g. the maximum of some value in a certain time range, or similar. In aivis, such feature engineering using moving time windows is not necessary. Here, it pays off that aivis understands the data as time series and automatically extracts powerful features. Again, this is explained in more detail below.

As mentioned above, aivis takes over many time-consuming data preparation and feature engineering tasks. As the aivis analysis and training algorithm is very powerful, in most cases the automatically generated features already suffice to obtain an optimal model. For some special kinds of signals such as angles, audio data, or categorical data, there are built-in interpreters that take care of the relevant peculiarities. Nevertheless, there are still some situations in which the model performance may improve by adding manually engineered features, for which you already know or expect that they are related to the target. For this purpose, the expression language is provided that facilitates manual feature engineering. This way domain knowledge can easily be taken into account. However, you will be surprised how well aivis predicts signals even without any manual efforts.

Signal Cleaning

The philosophy of aivis is that the user does not need to care about which signals may be relevant. Therefore, you may insert a large number of signals. From each signal, typically several more features are auto-engineered, e.g. to reflect the behaviour of the signal within some relevant time window. This means that we have many features that can all potentially be used for analysis and model training. In most cases however, only a small selection of the available features will be of use for predicting the target. Although inclusion of irrelevant features may not worsen prediction accuracy, it will have a negative impact on computational resources for model training. While identifying the relevant features has traditionally required domain knowledge, aivis is capable of making the selection automatically.

Finding the relevant signals is done with the help of calculating the distance correlations of the features to the target. The distance correlation is a measure of how much two vectors are related to each other. For example, if the distance correlation between some feature and the target signal equals 0, then the feature and the target signal are statistically independent. Therefore, nothing can be concluded from this feature. On the other hand, if the distance correlation between some feature and the target signal equals 1, the target signal could be perfectly predicted by this signal alone. Therefore, in this way signals which are (practically) independent of the target are cleaned out automatically in the analysis.
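
For readers who want to get a feeling for this measure, the (biased) empirical distance correlation of two samples can be computed in a few lines of NumPy. This is a generic illustration of the statistic, not the engine's internal implementation.

import numpy as np

def distance_correlation(x, y):
    # empirical distance correlation of two 1-D samples of equal length
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])  # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # double centering
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    if dvar_x == 0 or dvar_y == 0:
        return 0.0
    return float(np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y)))

# independent noise yields a value near 0, a deterministic relation a clearly higher value
rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(distance_correlation(x, rng.normal(size=500)))  # close to 0
print(distance_correlation(x, x ** 2))                # clearly above 0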

Even if some feature (distance) correlates with the target signal, it does not necessarily mean that it adds relevant information. Often, two or several features behave so similarly that they effectively become redundant. In this case, the more important ones are kept, thus enabling a more effective training stage.

Hereinafter, we will go back to using the word “signal” instead of “feature”. While there are subtle differences, in most cases the terms can be used interchangeably.

Segmentation

Although all aivis engines perform some kind of segmentation, which is a distinctive virtue of aivis, this step has a special role in aivis State Detection. Each segment contains incidents of a certain state of the process. Therefore, different root causes underlie incidents from different segments. It is crucial to understand a root cause in order to apply countermeasures: Some countermeasures may be one-time fixes, while others may have to be applied every time the root cause's score rises in live inference. For this, the analysis provides a set of information for every segment.

Segments

First of all, every incident in the training data is clearly sorted into a segment. For the example use case, we have three segments: Segment 0 holds 34 incidents, segment 1 holds 33 incidents and segment 2 holds 23 incidents.

The sorting is done by evaluating the signal histories before incidents: Have the signals behaved similarly? Or do they show suspicious behaviour changes before an incident? Simply put, incidents where, e.g., signals A and B suspiciously rise five minutes before the incident will be sorted into a different segment than incidents where signal C suspiciously decreases ten minutes before the incident. The way signal histories are analyzed can be configured in the self labeling config.

Understanding a root cause

So, the first step to understand a segment is to look at the list of its incidents. The analysis report not only provides their start times (when the boolean target signal switched to true) and end times (when the boolean target signal switched back to false - algorithmically not a very relevant time) but also their origin times. The origin time is the time before an incident at which aivis detected a process change. It is also a rough indication of how early you can typically expect an alarm from aivis.

The second step to understand a segment is to look at its signal correlations. For each segment, there are signal correlations for all signals which aivis selected as relevant. A signal correlation is a value between 0 and 1. To understand the root cause of a segment, it is usually sufficient to look at the signals with the highest correlations. Advanced users can retrieve even more information on a signal's correlation, for example via lagged correlations.

The third step is to inspect the signals with the highest correlations more closely. This can be done by looking directly at their behaviour just before incidents occur. A convenient alternative are behaviour plots. They compare the signals' mean behaviour before the incidents (abnormal curves) to the behaviour far away from incidents (normal curves).

Some segmentations have a segmentation residue, which is a bucket for incidents that did not fit into any other segment. This segment is special, as incidents in the same residue generally do not have the same root cause. Therefore this segment is listed separately.

Zoom Level

Aivis proposes the segmentation for which the distance between the segments is highest. This is usually a sensible choice. Nevertheless, mainly for advanced users, the engine can propose several segmentations of different granularity. The user can then pick the segmentation that best fits their needs, as will be explained below.

At the beginning of the segmentation process, all incidents are in a single segment. This segmentation pertains to zoom level 0. Then a "zooming" process is started. In each iteration of the process, one segment is split up into two segments such that segments have a high distance to each other. Each iteration pertains to a new zoom level. In the end, all segmentations are rated. The rating is highest for the zoom level for which segments have the highest distance to each other, so they are "most distinct". With the configuration parameter output multiple zoom levels, the analysis contains all segmentations of all zoom levels.

Still, the user should start by looking at the segmentation with the highest rating. However, when, e.g., a segment seems to contain incidents of two root causes, the zoom level can be increased until said segment splits up.

For model training, the user must select which segments to use. Typically, all segments corresponding to a certain zoom level are chosen. But it is even possible to combine segments from different zoom levels.

A Fully Loaded Analysis Configuration

Before going into the details of how the analysis may be configured, we first provide an overview of all possible configuration keys. We stress that the vast majority of the keys are optional. A minimal analysis configuration was used above in SDK analysis, respectively, in Docker analysis. This example may mainly serve as a quick reference. The definition of the syntax is given in the reference manuals.

analysis:
  dataFilter:
    startTime: 1262304000000
    endTime: 1328468400000
    excludeSignals:
    - signal: sensor_measurement_14
      startTime: 1262304000000
      endTime: 1285490000000
    # includeSignals: ... similar
    includeRanges:
    - startTime: 1262304000000
      endTime: 1262305000000
    - startTime: 1285490000000
      endTime: 1328468400000  
    # excludeRanges: ... similar
  target:
    signal: failure
  selfLabeling:
    mesh: 3600000
    coherencePeriod: 432000000
  sampling:
    additionalSampleMesh: 1800000
    maximalSampleCount: 100000
  operativePeriods:
    signal: MY_BOOLEAN_OPERATIVE_SIGNAL
    batches: false
  signals:
  - signal: sensor_measurement_11
    forceRetain: true
  - signal: sensor_measurement_13
    lagging:
      maximalLag: 300000
      minimalLag: 0
      mesh: 30000
  - signal: sensor_measurement_15
    interpreter:
      _type: Cyclic
      cycleLength: 0.2
  - signal: sensor_measurement_9
    interpreter:
      _type: Oscillatory
      windowLength: 18000000
      mesh: 1800000
  - signal: MY_CATEGORICAL_SIGNAL
    interpreter:
      _type: Categorical
  lagging:
    maximalLag: 9000000
    minimalLag: 1800000
    mesh: 1800000
  segmenting:
    minimalSegmentSize: 5
    outputMultipleZoomLevels: true
analysis_config = json.dumps({
  "dataFilter": {
    "startTime": 1262304000000,
    "endTime": 1328468400000,
    "excludeSignals": [{
      "signal": "sensor_measurement_14",
      "startTime": 1262304000000,
      "endTime": 1285490000000
    }], 
    # "includeSignals": ... similar
    "includeRanges" : [{
      "startTime": 1262304000000,
      "endTime": 1262305000000
    },{
      "startTime": 1285490000000,
      "endTime": 1328468400000
    }],
    # "excludeRanges": ... similar
  },
  "target": {
    "signal": "failure",
  },
  "selfLabeling": {
    "mesh": 3600000,
    "coherencePeriod": 432000000
  },
  "sampling": {
    "additionalSampleMesh": 1800000,
    "maximalSampleCount": 100000
  },
  "operativePeriods": {
    "signal": "MY_BOOLEAN_OPERATIVE_SIGNAL",
    "batches": False
  },
  "signals": [{
    "signal": "sensor_measurement_11",
    "forceRetain": True
  }, {
    "signal": "sensor_measurement_13",
    "lagging": {
      "maximalLag": 300000,
      "minimalLag": 0,
      "mesh": 30000
  }}, {
    "signal": "sensor_measurement_15",
    "interpreter": {
      "_type": "Cyclic",
      "cycleLength": 0.2
  }}, {
    "signal": "sensor_measurement_9",
    "interpreter": {
      "_type": "Oscillatory",
      "windowLength": 18000000,
      "mesh": 1800000
  }}, {
    "signal": "MY_CATEGORICAL_SIGNAL",
    "interpreter": {
      "_type": "Categorical"
  }}],
  "lagging": {
    "maximalLag": 9000000,
    "minimalLag": 1800000,
    "mesh": 1800000
  },
  "segmenting": {
    "minimalSegmentSize": 5,
    "outputMultipleZoomLevels": True
}})
final DtoAnalysisConfig analysisConfig = new DtoAnalysisConfig(
  new DtoTargetConfig("failure"),
  new DtoSelfLabelingConfig(432000000L, 3600000L),
  new DtoAnalysisSamplingConfig()
    .withAdditionalSampleMesh(1800000L)
    .withMaximalSampleCount(100000)
)
  .withDataFilter(new DtoDataFilter()
    .withStartTime(1262304000000L)
    .withEndTime(1328468400000L)
    .withExcludeSignals(new DtoDataFilterRange[] { 
      new DtoDataFilterRange("sensor_measurement_14")
        .withStartTime(1262304000000L)
        .withEndTime(1285490000000L)
    })
    // .withIncludeSignals ... similar
    .withIncludeRanges(new DtoInterval[] {
      new DtoInterval()
        .withStartTime(1262304000000L)
        .withEndTime(1262305000000L),
      new DtoInterval()
        .withStartTime(1285490000000L)
        .withEndTime(1328468400000L)
    })
    // .withExcludeRanges ... similar
  )
  .withOperativePeriods(new DtoOperativePeriodsConfig("MY_BOOLEAN_OPERATIVE_SIGNAL", false))
  .withSignals(new DtoSignalConfig[] {
    new DtoSignalConfig("sensor_measurement_11")
      .withForceRetain(true),
    new DtoSignalConfig("sensor_measurement_13")
      .withLagging(new DtoSignalLaggingConfig(300000L, 30000L)
        .withMinimalLag(0L)),
    new DtoSignalConfig("sensor_measurement_15")
      .withInterpreter(new DtoCyclicSignalInterpreter(0.2)),
    new DtoSignalConfig("sensor_measurement_9")
      .withInterpreter(new DtoOscillatorySignalInterpreter(18000000L,1800000L)),
    new DtoSignalConfig("MY_CATEGORICAL_SIGNAL")
      .withInterpreter(new DtoCategoricalSignalInterpreter()),
  })
  .withLagging(new DtoLaggingConfig(9000000L, 1800000L)
    .withMinimalLag(1800000L))
  .withSegmenting(new DtoSegmentingConfig()
    .withMinimalSegmentSize(5)
    .withOutputMultipleZoomLevels(true)
  );
const char *analysis_config = "{"
  "\"dataFilter\": {"
    "\"startTime\": 1262304000000,"
    "\"endTime\": 1328468400000,"
    "\"excludeSignals\": [{"
      "\"signal\": \"sensor_measurement_14\","
      "\"startTime\": 1262304000000,"
      "\"endTime\": 1285490000000"
    "}]" 
     // "\"includeSignals\": ... similar
    "\"includeRanges\": [{"
      "\"startTime\": 1262304000000,"
      "\"endTime\": 1262305000000"
      "}, {"
      "\"startTime\": 1285490000000,"
      "\"endTime\": 1328468400000"
      "}]" 
     // "\"excludeRanges\": ... similar             
  "},"
  "\"target\": {"
    "\"signal\": \"failure\","
  "},"
  "\"selfLabeling\": {"
    "\"mesh\": 3600000,"
    "\"coherencePeriod\": 432000000"
  "},"
  "\"sampling\": {"
    "\"additionalSampleMesh\": 1800000,"
    "\"maximalSampleCount\": 100000"
  "},"
  "\"operativePeriods\": {"
    "\"signal\": \"MY_BOOLEAN_OPERATIVE_SIGNAL\","
    "\"batches\": false"
  "},"
  "\"signals\": [{"
    "\"signal\": \"sensor_measurement_11\","
    "\"forceRetain\": true"
  "}, {"
    "\"signal\": \"sensor_measurement_13\","
    "\"lagging\": {"
      "\"maximalLag\": 300000,"
      "\"minimalLag\": 0,"
      "\"mesh\": 30000"
  "}}, {"
    "\"signal\": \"sensor_measurement_15\","
    "\"interpreter\": {"
      "\"_type\": \"Cyclic\","
      "\"cycleLength\": 0.2"
  "}}, {"
    "\"signal\": \"sensor_measurement_9\","
    "\"interpreter\": {"
      "\"_type\": \"Oscillatory\","
      "\"windowLength\": 18000000,"
      "\"mesh\": 1800000"
  "}}, {"
    "\"signal\": \"MY_CATEGORICAL_SIGNAL\","
    "\"interpreter\": {"
      "\"_type\": \"Categorical\""
  "}}],"
  "\"lagging\": {"
    "\"maximalLag\": 9000000,"
    "\"minimalLag\": 1800000,"
    "\"mesh\": 1800000"
  "},"
  "\"segmenting\": {"
    "\"minimalSegmentSize\": 5,"
    "\"outputMultipleZoomLevels\": true"
"}}";

Data Filter: Exclude Parts of the Data

The following sections list and explain the parameters the user may configure to control the training. The sections are organized along the structure of the configuration classes.

The data filter allows you to define the signals and the time range that are used for training. Concretely, the data filter allows you to choose signals for training in one of two ways: provide a list of signal names to exclude (exclude signals), or, alternatively, provide a list of signal names to include (include signals). Beyond that, the data filter allows you to determine the time range of data that is used for training, and even to include or exclude separate time ranges for specific signals.

There are several situations for which data filters come in handy. The global end time may be used to split the data set into a training set and an inference data set. As the name suggests, the training set is used to train a model. The inference data set may afterwards be used to assess the model’s quality. Typically, 10 – 15 % of the data are reserved for evaluation. It is important to make sure that the model does not use any data from the inference data set during training. In fact, it should at no point have access to any additional information that is not present under productive circumstances. This is necessary to optimize the model to the real situation, and to assure that model performance tests are meaningful.
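As a minimal sketch of such a split (the split timestamp is hypothetical, and the snippets only show the data filter fragments of the respective configurations in the style used throughout this guide):

import json

# hypothetical boundary between training data and evaluation data
SPLIT_TIME = 1328468400000

# the analysis/training data filter ends at the split ...
training_config_fragment = json.dumps({
    "dataFilter": {"endTime": SPLIT_TIME}
})

# ... and the inference data filter starts at the very same timestamp
inference_config_fragment = json.dumps({
    "dataFilter": {"startTime": SPLIT_TIME},
    "skipOnInsufficientData": True
})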

As for selecting the global start time, having more data from a longer time range is almost always advantageous. This allows aivis to get a clear picture of the various ways in which signal behaviors influence the target. That being said, the time range that is chosen should be representative of the time period for which predictions are to be made. If some process at your plant was completely revised, this may have affected signal values or the relationships between the signals. In such a case, it is advisable to include sufficient data from the time after the revision. For major revisions, it might even be preferable to restrict the training to data from after that revision. Such a restriction can easily be made using the start time.

It is also possible to include/exclude several time intervals globally, instead of just selecting/excluding one global time interval. This is carried out using the fields include ranges and exclude ranges 2.5. It is important to understand how the global and the signal-based includes/excludes interact. When both include signals and include ranges are set, first the signals listed in include signals are taken with their respective time ranges, and then those time ranges are intersected with the global time ranges defined in include ranges. The opposite holds for exclude ranges and exclude signals: the time ranges excluded globally and the time ranges of the signals listed in exclude signals are united.

Analogous to the global start time and end time, such intervals can also be specified for individual signals. Note, however, that it is usually advisable to apply time constraints globally.

Data excluded by data filters is still available for the expression language but does not directly enter the model building. Control on signal selection for model building is provided by the signal configuration. Finally, note that complementary to the data filter, training periods can conveniently be defined by operative periods.

Target Configuration: Define your Incidents

Perhaps the most important input parameter is the signal that defines the incidents. We call this signal the target signal. In most cases, it will simply be one of the signals in your data. However, it is also possible to define a target based on multiple signals using various mathematical functions and operators. Such data transformations can be achieved with the powerful Expression Language, explained in detail in “Appendix 1: Expression Language”.

In state detection, the target signal must be boolean. The target signal defines the incidents. Incidents are the timestamps when the target signal switches from false to true. Algorithmically, it is not important when exactly the signal switches back to false.
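Conceptually, the incident start times are simply the false-to-true transitions of the target signal. A small illustration with pandas (not an aivis API):

import pandas as pd

# boolean target signal as a time series (timestamps in milliseconds)
target = pd.Series(
    [False, False, True, True, False, False, True],
    index=[1000, 2000, 3000, 4000, 5000, 6000, 7000],
)

# an incident starts wherever the value switches from False to True
switched_to_true = target & ~target.shift(1, fill_value=False)
incident_start_times = list(target[switched_to_true].index)
print(incident_start_times)  # [3000, 7000]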

When you have chosen or constructed your target signal, you need to inform the engine by passing its signal ID. Note that if you have synthesized the target signal with the Expression Language, all signals that are part of the expression are automatically excluded from the training. This way, signals that are used to build the target are not used later on to determine the target. This corresponds to the common case in which these signals are only available during training but will not be available during live prediction. However, there are also other cases, and this automatic exclusion can be overruled by setting signal configurations.

Operative Periods: Exclude Downtimes or Define Batches

Sometimes, the target includes data points for which a prediction is not desired. A typical situation is the time during which a machine is off, possibly including warm-up and cool-down phases. Also during maintenance, typically no prediction is desired. To restrict the model training to times of interest, a signal may be assigned to define the operative periods. An operative signal must be boolean. Training is then restricted to target timestamps for which the operative signal is “true”. Often, there is no such signal in the raw data, but it may be easy to derive the operative times from other signals. For example, some motor may be stopped when production is off. If motor speed is above a certain threshold, this may be used to define operative periods. For such situations, an operative signal can easily be created with the help of the Expression Language.

Batches

Batches are processes that have a start and an end. After one batch has finished (and maybe some follow-up and preparation steps have been performed), the next batch starts. For example, a heat in a steel making process corresponds to a batch: a certain process is executed during each heat, and signals often behave in a similar way. For example, some temperature sensor might always be low at the batch start and increase until the batch end. Another principle of batches is that they are independent of each other: The previous batch does not influence the next batch. The opposite of batch processes are continuous processes. In a continuous process, the present state is always affected by some recent history. If you configure your operative periods as batches, aivis State Detection will assume each operative period to be a batch. Otherwise, each operative period will be assumed to be a randomly selected interval from a continuous process.

Signal Configuration: If Signals Require Special Treatment

The signal configuration is the place to pass additional information about feature signals in order to enforce a special treatment. Each signal configuration refers to one specific signal.

Interpreter

At the core of the signal configuration is the interpreter. The interpreter defines which features are built from a signal and how these enter the engine. We call the features produced by an interpreter the aspects of the signal. Very often the default configuration is the best choice and you don't need to set any interpreter. However, it is important to configure any non-standard behavior here, as it may strongly affect the outcome. Below you find an overview of the different interpreters, followed by some more in-depth explanations.

  • Default: Corresponds to a numerical interpreter for float signals, and to a categorical one for string and boolean signals.
  • Numerical: No special aspect generation; the signal is taken as it is. Examples: speed, temperature, weight, ...
  • Categorical: Each signal value corresponds to some category; categories have no order. Examples: color, operation mode, on/off, ...
  • Cyclic: Signal values can be mapped to a finite interval; the lower and upper bounds of this interval are identified with each other. Examples: angles (0° to 360°), time of the day (0:00 to 24:00), ...
  • Oscillatory: Signal contains periodically recurrent parts; the interest is rather in the frequency of recurrences than in the actual signal values. Examples: audio data, vibrations, ...

By default, all float signals are interpreted as numerical. This interpreter should be used for all signals for which the order of numbers is meaningful and which don't require any special treatment. A thermometer, for example, generates numerical data: the smaller the number, the colder the temperature. It is irrelevant whether the scale is continuous, or whether the thermometer's reading precision is limited to integer degrees. The numerical signal kind is quite common for float signals, but there are also situations in which it does not fit. Therefore, float signals may also be declared as any of the other signal kinds.

String and boolean signals are always interpreted as categorical. Categorical data has nominal scale, i.e. it takes only specific levels and does not necessarily follow any order. In practice, this would express the information about certain states, such as “green”, “red”, or “blue”. This information may be present in form of strings, booleans, or also encoded in numbers. An example could be a signal for which "1.0" stands for "pipe open", "2.0" for "pipe blocked", and "3.0" for "pipe sending maintenance alarm".

For a cyclic signal, only the remainder from division by the cycle length is accounted for. This means the order of numbers is meaningful, but it wraps at the cycle length. A common example are angles. Angles are usually defined in the interval \(0\) to \(2 \pi\), which means a cycle length of \(2 \pi\). If the signal takes a value outside this range, it is automatically mapped into it. For example, \(2.1 \pi\) is identified with \(0.1 \pi\). And, of course, 0 and \(1.99 \pi\) are considered to be close to each other. Another example can be derived from a continuous time signal. Let's say time is measured in hours. Then applying an interpreter with cycle length 24 yields an aspect that describes the time of the day.
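The wrapping itself is just a modulo operation; a small numerical illustration (not engine code) of the angle and time-of-day examples:

import numpy as np

def wrap(values, cycle_length):
    """Map values onto the half-open interval [0, cycle_length)."""
    return np.mod(values, cycle_length)

# angles with cycle length 2*pi: 2.1*pi is identified with 0.1*pi
print(wrap(2.1 * np.pi, 2 * np.pi) / np.pi)   # ~0.1

# hours since some reference with cycle length 24: 26 h and 50 h are both 2 o'clock
print(wrap(np.array([26.0, 50.0]), 24.0))     # [2. 2.]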

Finally, audio, vibration, or any other data that oscillates with some periodicity may best be interpreted as oscillatory. Oscillatory signals are interpreted in the frequency domain. In order to calculate a frequency spectrum and automatically derive the most relevant aspects, two configuration parameters are necessary.

The mesh describes the shortest timespan to consider, i.e. the inverse sampling frequency. For example, a mesh of 2 milliseconds means a sample rate of 0.5 kHz. Within this documentation, the unit of the timestamps is usually assumed to be milliseconds to keep explanations concise. However, the unit of timestamps is irrelevant internally. Oscillatory signals may well have sample rates above 1 kHz, for which a more fine-grained time unit is necessary. For example, a 32 kHz audio signal has a signal value every 0.03125 milliseconds. In this case, the usual notion of timestamps as milliseconds does not work anymore. Instead, timestamps may be provided in units of nanoseconds, and full information is retained for a mesh of 31250. (Alternatively, timestamps may also be provided in units of one thirty-second of a millisecond, and full information is retained for a mesh of 1.) If the highest frequencies of the signal are expected not to be relevant, i.e. if the microphone or detector records at a higher rate than actually needed, the mesh may be chosen larger than the difference between timestamps in the data. In the above example, a mesh of 62500 nanoseconds would retain only every second value.

The other parameter is the window length. It describes the longest time span to consider for a frequency spectrum. Therefore, it should reflect some reasonable time to "listen" to the signal before trying to get information out of it. The window length defines the period of the lowest frequency that can be analysed. Therefore, at the very least, it should be as long as the period of the lowest relevant frequency. If no data is provided during some interval larger than twice the mesh, no frequency spectrum is calculated for this gap. Instead, the frequency spectrum is calculated for the last period without a gap over the window length. This behavior allows for discontinuous signal transmission. To reduce the amount of data, the signal may, for example, be provided in bunches which are sent every 2 minutes, each covering a time period of 10 seconds. The drawbacks are some delay of results, up to 2 minutes in the above example, and loss of information about anything happening within the gap.
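The relation between sample rate, time unit, and mesh from the 32 kHz example can be written down explicitly (plain arithmetic, not an aivis call):

# 32 kHz audio: one sample every 1/32000 s
sample_rate_hz = 32_000
sample_period_s = 1.0 / sample_rate_hz           # 3.125e-05 s

# with millisecond timestamps the sample period would be 0.03125 ms, not an integer timestamp
sample_period_ms = sample_period_s * 1_000        # 0.03125

# with nanosecond timestamps full information is retained with mesh = 31250
sample_period_ns = sample_period_s * 1_000_000_000
mesh_full_information = int(sample_period_ns)     # 31250

# a mesh of 62500 ns keeps only every second sample
mesh_downsampled = 2 * mesh_full_information      # 62500
print(sample_period_ms, mesh_full_information, mesh_downsampled)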

Sometimes, it is not clear which interpreter to choose. As an example, take a signal for which "0.0" may stand for "no defects", "1.0" for "isolated microscopic defects" and "3.0" for "microscopic connected defects". A priori, one may assume that the effect of isolated defects may be somewhere in between no defects and connected defects, and thus assign the numerical scale. On the other hand, isolated microscopic defects may have no relevant effect: It may be irrelevant whether there are no or isolated defects. In this case, a categorical scale would be preferable. Such a situation can easily be dealt with: create a duplicate of the signal with the help of the expression language, configure one of the signals as numerical and the other as categorical, and let aivis make the best out of both.
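In the analysis configuration, this could look as follows (the signal IDs are hypothetical, and the duplicate would first have to be synthesized with the expression language):

import json

signals_config_fragment = json.dumps({
    "signals": [
        # the original float signal: default (numerical) interpretation
        {"signal": "DEFECT_LEVEL"},
        # hypothetical duplicate of DEFECT_LEVEL, explicitly declared categorical
        {"signal": "DEFECT_LEVEL_AS_CATEGORY", "interpreter": {"_type": "Categorical"}},
    ]
})
print(signals_config_fragment)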

Enforce Retaining Signals

This configuration option was already mentioned in the section on the data filter. If force retain is set to true, the corresponding signal will not be cleaned away and thus definitely enters the model building. As aivis performs very well in feature selection, you should not expect to improve model performance by forcing some signals to be included. Nevertheless, you may be interested in how the model or its performance changes when some signal is included. In a first run, you may have noticed that some signal has been cleaned away although you expect or even know from previous data analysis that it is related to the target. In this case, you may force the engine to retain it in order to calculate and retrieve its correlation to the target and its relevance compared to other signals. Note that the same information is often contained in different signals; therefore, even signals with a high correlation to the target may be excluded because they are redundant.

Signal Specific Lagging Configuration

2.3

The use of a global lagging configuration is explained below. Sometimes it can make sense to adjust the lagging for specific signals. This helps, among other things, to keep computational effort low and can be done if you have prior information for some signals about which (range of) lags is most relevant: For a few signals, a very high maximal lag may be useful, or a very small mesh. For example, in a production line, it may take a few seconds or minutes from one process step to the next. Let's assume you are interested in the third process step. Then signals that refer to this step may be configured with zero lag, while signals that describe the first two process steps may be configured with some lag > 0, as sketched below.
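A sketch of what this could look like in the signals part of the configuration (signal IDs and lag values are hypothetical, and the exact set of required keys may differ; compare the fully loaded configuration above):

import json

signals_lagging_fragment = json.dumps({
    "signals": [
        # signal of the third process step itself: only the current value is relevant
        {"signal": "STEP_3_TEMPERATURE",
         "lagging": {"maximalLag": 0, "minimalLag": 0, "mesh": 60000}},
        # signal of an upstream step: its influence arrives a few minutes later
        {"signal": "STEP_1_PRESSURE",
         "lagging": {"maximalLag": 600000, "minimalLag": 120000, "mesh": 60000}},
    ]
})
print(signals_lagging_fragment)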

A typical case for the application of signal configurations arises when the target is synthesized by the expression language. In this case, all signals that are part of the expression are automatically excluded, see the section on the target configuration. However, this exclusion is not performed for signals with a specific signal configuration. In this scenario, the minimal lag may be of particular relevance, to allow modeling of delayed information availability in the database, see the sections on the lagging and on the target configuration.

Lagging Configuration: Including the local history

In many situations, predictions can be improved by taking into account parts of the history of some signal values, and not only the most recent value. However, signal values from very far in the past are unlikely to be relevant, and their inclusion would unnecessarily increase computation time. Therefore, the maximum time window to be included for each signal can be controlled by the parameter maximal lag [milliseconds]. The maximal lag tells aivis how far it should look into the past to find dependencies. For example, if data from 10 minutes ago affects the current target value, then the maximal lag should be at least that high. Consequently, the best value for maximal lag depends on the process at hand. Start out with lower values if possible, as this value has a large impact on the calculation effort. If this configuration is not set, the target signal is predicted only from the current signal values (or, as always, their nearest predecessors).

The number of sample points within this past time window is controlled by the mesh [milliseconds]. aivis analyzes said time window by slicing it up and taking a sample data point from each slice. The width and therefore the number of these slices is determined by the mesh. This means that for a target at time t, signal values are retrieved at times t, t – mesh, t – 2 * mesh, and so on until the maximal lag constraint is reached, as sketched below. A smaller mesh is associated with higher computing time and memory demand but may lead to better predictions. Keep in mind that if the mesh is larger than the sample interval of the data, some information is lost. Therefore, it is important to choose the mesh according to the time scale on which relevant signals may change.
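The retrieval times for one target timestamp can be written down directly; a small sketch of the arithmetic (illustration only, the values are hypothetical):

def lag_times(t, maximal_lag, mesh):
    """Timestamps t, t - mesh, t - 2*mesh, ... down to t - maximal_lag."""
    return [t - k * mesh for k in range(maximal_lag // mesh + 1)]

# maximal lag of 10 hours with a mesh of 1 hour (milliseconds):
# the current value plus 10 lagged values per signal
print(lag_times(t=36_000_000, maximal_lag=36_000_000, mesh=3_600_000))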

Simplified, you can imagine that maximal lag and mesh take data points from the past and add them as an additional “lagged signal” to the original problem. For example, a maximal lag of 10 hours and a mesh of 1 hour would require as much additional computational effort as setting a maximal lag of 5 hours, but a mesh of 30 minutes, as both would add up to 10 “lagged” signals per original signal. For each lagged signal, the distance correlation is calculated. The results constitute a correlation trend, and an example is depicted in the figure below. In the example the correlation is highest shortly before the evaluation time. This may be indicative that a shorter maximal lag might suffice for this signal. The correlations are part of the report. This way, the lagging configuration can be checked after the training.

Correlation Trend

These different lagged signals are almost always strongly correlated with each other and thus partially redundant. However, you don't need to care about this problem. The different lagged signals are automatically combined and selected to distill full information with a low number of final features.

In analogy to the maximal lag, there is also the option to define a minimal lag 2.3. If this parameter is set to some positive value, no information is used from the most recent past. This parameter is mainly useful for special cases: First, if you know that some range of lags is particularly important, it allows you to point aivis to exactly this range of lags. The second case applies to delayed information during live prediction. If you expect signals to be available only with some delay in your database, this situation can be emulated by setting a minimal lag. Then aivis will not use any information more recent than the minimal lag, and performance will therefore be independent of any delay shorter than this minimal lag. Regarding information delays, also note the availability timestamps.

Self Labeling Configuration: Where to look for root causes

For segmentation, aivis evaluates the signal histories before incidents. Have the signals behaved similarly? Or do they show suspicious behaviour changes before an incident? The histories are defined by two time parameters similar to those used in the lagging config: The coherence period determines how far aivis looks into the past, and the mesh defines the characteristic timescale for this.

The coherence period should be chosen larger than the lagging's maximal lag. Also, just like in the lagging config, combining a particularly small mesh with a large coherence period results in long calculation times and high memory consumption. As a rule of thumb, a coherence period more than 250 times larger than the mesh might sometimes be necessary but should be justified by the data.
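For orientation, the self labeling values of the fully loaded configuration above stay well below that ratio; a quick plausibility check in plain Python:

coherence_period = 432_000_000  # 5 days in milliseconds, as in the example config
mesh = 3_600_000                # 1 hour in milliseconds

print(coherence_period / mesh)              # 120.0, well below the ~250 rule of thumb
print(coherence_period / (24 * 3_600_000))  # 5.0 days of history per incident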

Choosing a self labeling config that does not match the data at all might result in an engine error. For example, when we look at a particularly slow process where each signal gets a new value every ten minutes, a coherence period of one minute will cause no incident to be included in the analysis, because nothing happens within the minute before an incident and no cause can be found.

Output: Report and Analysis

There are two outputs that can be retrieved from the analysis, which differ in their intended usage. First, the analysis report lists the most relevant information about the segmentation(s). This includes, for each segment, a list of all contained incidents and a list of all signals with their signal correlations. The report also includes a list of all other signals together with a justification for their exclusion from the analysis. So, the report is meant to provide the user with information about the analysis. Second, a training preparation is retrieved. The training preparation is not for user information. Instead, it contains internal data to be used as input for the next step in the workflow, the training.

Training

The training comprises all the steps to reach a model after an analysis is done.

Workflow

Analysis Source

There are two ways to refer to an analysis when starting a training: First, when using an SDK, a state detection training can be created directly via the analysis object. Second, a state detection training can be created from a training preparation. The training preparation is an output of the analysis with exactly the purpose of training a model that is based on the analysis results.

In both cases, analysis and training must run on the very same data context.

Referring to segments

The model can be trained using the segmentation with the highest rating; in this case, no segments have to be configured before training. Alternatively, if the user has configured the analysis to output the segmentations of all zoom levels, individual segments can be picked. Each segment has an ID which can be listed in the training configuration. The model will then infer a score for each selected segment.

Modeling Configuration: How the model is built

The control point count can be set to tweak model building. It controls the granularity of the model and has a strong influence on the required computing time. The larger this number, the more details may potentially be included. However, training computing time scales approximately with the third power of the control point count. By default, 5,000 control points are used. Just like with the maximal sample count, a higher control point count should technically yield better results. However, there is always a ceiling above which increasing the value will no longer lead to significant improvements, but only to longer computation times.
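Because of the roughly cubic scaling, even moderate increases are expensive; a quick back-of-the-envelope calculation (plain arithmetic, not an aivis call):

default_control_points = 5_000

for control_points in (2_500, 5_000, 10_000):
    relative_cost = (control_points / default_control_points) ** 3
    print(control_points, f"~{relative_cost:.3g}x the default training time")
# 2500 -> ~0.125x, 5000 -> 1x, 10000 -> ~8x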

A Fully Loaded Preparation Based Training Configuration

For quick reference, the various possible training configuration keys are summed up in the code snippet below. We again stress that the majority of the keys are optional. A minimal training configuration was used above in SDK training, respectively, in Docker training. The code snippet applies to the situation in which an analysis has been performed and the subsequent training is based on the analysis results. The definition of the syntax is given in the reference manuals.

training:
  dataFilter:
    startTime: 1262304000000
    endTime: 1328468400000
    excludeSignals:
    - signal: sensor_measurement_14
      startTime: 1262304000000
      endTime: 1285490000000
    # includeSignals: ... similar
    includeRanges:
    - startTime: 1262304000000
      endTime: 1262305000000
    - startTime: 1285490000000
      endTime: 1328468400000  
    # excludeRanges: ... similar
  sampling:
    maximalSampleCount: 100000
  segments: [1,3,4]
  modeling:
    controlPointCount: 2500
training_config = json.dumps({
  "dataFilter": {
    "startTime": 1262304000000,
    "endTime": 1328468400000,
    "excludeSignals": [{
      "signal": "sensor_measurement_14",
      "startTime": 1262304000000,
      "endTime": 1285490000000
    }],
    # "includeSignals": ... similar
    "includeRanges" : [{
      "startTime": 1262304000000,
      "endTime": 1262305000000
    },{
      "startTime": 1285490000000,
      "endTime": 1328468400000
    }],
    # "excludeRanges": ... similar
  },
  "sampling": {
    "maximalSampleCount": 100000
  },
  "segments": [1,3,4],
  "modeling": {
    "controlPointCount": 2500
}})
final DtoPreparationBasedTrainingConfig trainingConfig = new DtoPreparationBasedTrainingConfig()
  .withDataFilter(new DtoDataFilter()
    .withStartTime(1262304000000L)
    .withEndTime(1328468400000L)
    .withExcludeSignals(new DtoDataFilterRange[] { 
      new DtoDataFilterRange("sensor_measurement_14")
        .withStartTime(1262304000000L)
        .withEndTime(1285490000000L)
    })
    // .withIncludeSignals ... similar
    .withIncludeRanges(new DtoInterval[] {
        new DtoInterval()
          .withStartTime(1262304000000L)
          .withEndTime(1262305000000L),
        new DtoInterval()
          .withStartTime(1285490000000L)
          .withEndTime(1328468400000L)
      })
    // .withExcludeRanges ... similar
  )
  .withSampling(new DtoTrainingSamplingConfig()
    .withMaximalSampleCount(100000))
  .withSegments(new Long[] {1L, 3L, 4L})
  .withModeling(new DtoModelingConfig()
    .withControlPointCount(2500)
  );
const char *training_config = "{"
  "\"dataFilter\": {"
    "\"startTime\": 1262304000000,"
    "\"endTime\": 1328468400000,"
    "\"excludeSignals\": [{"
      "\"signal\": \"sensor_measurement_14\","
      "\"startTime\": 1262304000000,"
      "\"endTime\": 1285490000000"
    "}]," 
    // "\"includeSignals\": ... similar
    "\"includeRanges\": [{"
      "\"startTime\": 1262304000000,"
      "\"endTime\": 1262305000000"
      "}, {"
      "\"startTime\": 1285490000000,"
      "\"endTime\": 1328468400000"
      "}]"  
    // "\"excludeRanges\": ... similar           
  "},"
  "\"sampling\": {"
    "\"maximalSampleCount\": 100000"
  "},"
  "\"segments\": [1,3,4],"
  "\"modeling\": {"
    "\"controlPointCount\": 2500"
"}}";

Output: Report and Model

There are two outputs that can be retrieved from the training. First, the report lists the most relevant information from training. Second, a model is retrieved, which will later be used for inference. The model also specifies the Inference Data Specification.

Inference

When the model has been trained, it is ready for the ultimate goal, which is inference. Inference means that the model is provided with new (usually unseen) data around a certain timestamp and is asked for some value/estimation at said timestamp: In aivis Signal Prediction, that value is simply the prediction of the target. In aivis Anomaly Detection, the inferences are scores, and in aivis State Detection, the inferences provide scores per segment.

In general, there are two main scenarios in which you would want to make inferences. The first one is performance evaluation of the model. Here, inferences are made on historical data, i.e., some test data set. Again, it is important that this data was not part of the training set, as this would lead to unrealistic and therefore non-representative predictions. These historical inferences are used for determining model performance, see section “Measuring Model Performance”.

The second typical scenario for making inferences is using them in a productive setting. Here, the true target value or process state is technically not known at the time of the inference. This is called live inference. For live inference, inferences are usually made on an ongoing basis, as this is typically what you would want for most productive use cases. This is contrary to performance evaluation, where all inferences are made in one go.

For each of the above scenarios, there is a dedicated docker image. The Inference Worker creates predictions for a predefined time window in a bulk manner for model evaluation. In contrast, the Inference Service is optimized for live inference. It offers a RESTful web API that allows the triggering of individual predictions for a specified time via an HTTP call. Due to the different application modes, APIs differ between the different docker images and the SDK. These differences will be noted in the following sections.

Inference Timestamps

In the SDK, in order to make inferences, it is necessary to pass a list of timestamps for which the inferences are to be made. This allows for requesting a single live inference result but also for bulk inference for model evaluation on historical data. Typically, it is easy to generate such lists of timestamps in the programming language that calls the SDK. Docker images, on the other hand, are not necessarily called from within a powerful programming language. This is not an issue for the Inference Service: for live inference, typically only a single timestamp is requested, the most recent one. However, it could be cumbersome to derive a list of timestamps for the Inference Worker. Therefore, for the Inference Worker, timestamps are selected via a list of timestamps configs. There are two different methods:

  • Equidistant: Provides equidistant inference timestamps with a fixed interval (for example one inference each minute). Typical use case: obtaining continuous inferences in some time interval.
  • AtNextSignalValue: Selects those timestamps for inference for which there are data points for some specified signal. Typical use case: model validation, where inferences are needed for timestamps for which target values are known.

For both timestamps configs, there are a start time and an end time. An operative signal can be used to further restrict the timestamps. A minimum interval may also be set to avoid calculating too many inferences within a short time interval. This can speed up the computation, or may be useful to balance the distribution of inferences. Finally, note that further flexibility can be gained by providing several timestamps configs, in which case all timestamps are combined. An example was already provided in the Getting Started section. A simple way of generating such timestamp lists for the SDK is sketched below.
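The following Python sketch generates an equidistant timestamp list for the SDK; the start/end values are hypothetical and timestamps are in milliseconds. The commented line indicates the corresponding at-next-signal-value pattern, assuming a hypothetical target_series keyed by timestamp:

# equidistant timestamps: one inference per minute within an evaluation window
start_time = 1328472000000
end_time = 1328475600000
interval = 60_000  # one minute in milliseconds
timestamps = list(range(start_time, end_time + 1, interval))

# "at next signal value" pattern: infer exactly where target values exist
# timestamps = [t for t in target_series.keys() if start_time <= t <= end_time]

print(len(timestamps), timestamps[:3])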

Inference Data

Regarding the data, the same general specifications hold as for the training. In addition, the data is expected to obey the Inference Data Specification that can be inspected from the training output. The Inference Data Specification ensures that all data needed for inference has been provided. Note that this may include also signals synthesized by the expression language, which then need to be synthesized by the same expression. If the Inference Data Specification is not satisfied for some timestamp, this timestamp is either skipped on insufficient data, or an error is thrown. This behavior needs to be configured for the SDK and the Inference Worker. For the Inference Service, an error is always returned.

Data filtering can be performed as for the training. As described in the subsection on training data filtering, evaluating the performance of the model is done best by splitting the available data with the data filter. That way, it is not necessary to provide different data files for training and inference. Instead, some training end timestamp is specified, and the same timestamp for inference start. With this approach, the Inference Data Specification is automatically satisfied. For the Inference Service, data filtering is not possible.

Signal availabilities are an important concept for checking model performance with historical data. Only that data should be used for inference which would have been available if the inference had been performed online. However, the availability timestamp does not always correspond to the time at which the data actually became available. For example, availability timestamps may have been overridden during some migration of the database. For such cases, the Inference Worker provides the option to ignore signal availabilities. This option may also come in handy to check the impact of information delays on the model performance.

Infer With Next Normal

2.7

For computing the scores per segment in both docker inference and SDK inference two methods are given: infer and infer with next normal. The standard infer function returns the scores per segment for the specified timestamps.

The function infer with next normal requires an additional input, the next normal config. In this configuration, a threshold used for all segments (global normal threshold) or segment-individual thresholds (segment normal thresholds), as well as an optional feature filter, can be set. Scores below these thresholds are considered normal and no additional computation is performed. For any timestamp at which the feature values yield an incident score above the threshold, the next normal point is sought. The next normal point is the collection of feature values closest to the observed feature values whose incident score is below the threshold. Moreover, a rating is calculated from the difference between the observed and the next normal feature values. It reflects the influence of each feature on the score, i.e. the feature with the highest rating is the most influential factor causing the score to be above the provided threshold. The feature filter in the next normal config allows the user to exclude features from the next normal seeking process, or to fix the features which are allowed to vary in this process via the include features entry. If the algorithm cannot find a collection of feature values below the threshold, it will return the feature values with the smallest reachable incident score.

In summary, the function infer with next normal returns, in addition to the scores per segment at the timestamp of interest, the observed feature values, the next normal feature values, the score of the next normal feature values, and a rating of the difference between observed and next normal feature values.

In the output and the feature filter the features are indexed by an integer id which can be looked up in the report to relate this id to the underlying signals, aspects and lags.

Warning: The infer with next normal function is experimental. It performs an expensive optimization and should be used carefully for models with many features, in particular when a lagging config is present. In such cases, the result might also not yet be optimal.

We provide here a minimal example of the syntax of the infer with next normal method:

data:
    ...
inference: 
  config: 
    ...
  timestamps: 
    ...
  nextNormal:
    globalNormalThreshold: 0.4
output: 
    ...
# choose inference timestamps
timestamps = ...

# build next normal config
next_normal_config = json.dumps({"globalNormalThreshold": 0.4})

# infer scores and next normal point 
scores_with_next_normal = inference.infer_with_next_normal(inference_data, timestamps, next_normal_config)
// choose inference timestamps
final List<Long> timestamps = ...

// build next normal config
final DtoNextNormalConfig nextNormalConfig = new DtoNextNormalConfig(0.4);

// infer scores and next normal point 
final List<DtoFloatDataPointWithNextNormal> scoresWithNextNormal = inference.inferWithNextNormal(inferenceData, timestamps, nextNormalConfig);
// choose inference timestamps
Time *timestamps = ...

// build next normal config
const char *next_normal_config = "{\"globalNormalThreshold\": 0.4}";

// infer scores and next normal point 
const List_DtoFloatDataPointWithNextNormal *scores_with_next_normal = aivis_state_detection_inference_infer_with_next_normal(
  inference_handle,
  inference_data,
  timestamps,
  timestamps_len,
  (uint8_t *) next_normal_config,
  strlen(next_normal_config),
  &err
); 
check_err(&err, "Infer Scores with next normal");

// free scores_with_next_normal
aivis_free(scores_with_next_normal);
scores_with_next_normal = NULL;    

// free timestamps
free(timestamps);
timestamps = NULL;

A Fully Loaded Inference Configuration

For quick reference, the various possible inference configuration keys are summed up in the code snippet below. A minimal inference configuration is provided in SDK inference, respectively, in Docker inference. The definition of the syntax is given in the reference manuals. For the docker images, we focus here on the Inference Worker, as it features more configuration keys. Note that, on the other hand, the Inference Service can be asked to evaluate several models in one go.

config:
  dataFilter:
    startTime: 1328472000000
    endTime: 1336572000000
    excludeSignals:
    - signal: sensor_measurement_14
      startTime: 1332662000000
      endTime: 1333333000000
    # includeSignals: ... similar
    includeRanges:
    - startTime: 1328472000000
      endTime: 1332662000000
    # excludeRanges: ... similar
  skipOnInsufficientData: true
inference_config = json.dumps({
  "dataFilter": {
    "startTime": 1328472000000,
    "endTime": 1336572000000,
    "excludeSignals": [{
      "signal": "sensor_measurement_14",
      "startTime": 1332662000000,
      "endTime": 1333333000000
    }],
    # "includeSignals": ... similar
    "includeRanges" : [{
      "startTime": 1328472000000,
      "endTime": 1332662000000
    }]
    # "excludeRanges": ... similar
  },
  "skipOnInsufficientData": True
})
final DtoInferenceConfig inferenceConfig = new DtoInferenceConfig(true)
  .withDataFilter(new DtoDataFilter()
    .withStartTime(1328472000000L)
    .withEndTime(1336572000000L)
    .withExcludeSignals(new DtoDataFilterRange[] { 
      new DtoDataFilterRange("sensor_measurement_14")
        .withStartTime(1332662000000L)
        .withEndTime(1333333000000L)
    })
    // .withIncludeSignals ... similar
    .withIncludeRanges(new DtoInterval[] {
      new DtoInterval()
        .withStartTime(1328472000000L)
        .withEndTime(1332662000000L)
    })
    // .withExcludeRanges ... similar
  );
const char *inference_config = "{"
  "\"dataFilter\": {"
    "\"startTime\": 1328472000000,"
    "\"endTime\": 1336572000000,"
    "\"excludeSignals\": [{"
      "\"signal\": \"sensor_measurement_14\","
      "\"startTime\": 1332662000000,"
      "\"endTime\": 1333333000000"
    "}]," 
   // "\"includeSignals\": ... similar
    "\"includeRanges\": [{"
      "\"startTime\": 1328472000000,"
      "\"endTime\": 1332662000000"
      "}]" 
    // "\"excludeRanges\": ... similar           
  "},"
  "\"skipOnInsufficientData\": true"
"}";

Measuring Model Performance

After the model has been built, scores can be inferred on unseen inference/test data. The perfect behaviour would be that before each incident a score goes up, and that the segment of this score is sensible. However, all kinds of deviations may occur, and assessing the model performance is thus necessary. For signal prediction models, performance is often evaluated based on the Coefficient of Determination \((r^2)\). However, there is no such universal KPI for state detection. Therefore, we need to look a bit closer. The following steps may serve as a guideline.

  • The first step is qualitative: We look at the scores and at the incidents of the evaluation period. Does some score go up before each incident? Is it the score of the segment we would have expected? Maybe several scores go up. This can be perfectly fine if the incident fulfills the criteria of different segments.
  • The scores are not probabilities in the sense of probability theory. Therefore, the second step is playing with alarm thresholds for each score. There will most likely be different thresholds for different segments. A threshold of 0.5 is often a good starting point, but other choices such as 0.7 or 0.15 may fit better. For some segments, it might even be reasonable to set a threshold of 1.0, i.e. to ignore this score. In particular, this might happen for the residue segment.
  • The third step is playing with moving average, activation and deactivation times: Sometimes the score goes up just for a very short time, which should not count as an alarm. On the other hand, the longer you wait to trigger an alarm, the shorter the period for potential countermeasures. The question is: When to trigger an alarm?
  • The fourth step is looking at false positives. If some score clearly goes up and stays up for a significant amount of time, aivis recognized a state change and a high risk of an incident. Even if no incident occurred, aivis is not necessarily wrong. Sometimes nothing happens although the risk is high. Is it somehow possible to analyze this?
  • The fifth step is a more quantitative analysis to get a confusion matrix. For this, evaluation timeframes are needed which contain incidents (rather at the end) or are free of incidents / far away from them. In the case of batches, these timeframes are of course the batches themselves. The confusion matrix is typically built from these two criteria: Was there an incident in the timeframe? Was there an alarm in the timeframe? For selecting the parameters mentioned above, in particular the thresholds, ROC curves can help; they plot the false positive rate versus the true positive rate for various thresholds. Usually, the threshold is chosen that maximizes the difference between true positive rate and false positive rate.
  • The sixth step is calculating KPIs from this confusion matrix, such as accuracy, precision, recall, and F1 score (see the sketch after this list).
  • The seventh step can be a cost calculation. How expensive is an incident? Which countermeasures could have been applied thanks to early warnings? How expensive and efficient are the countermeasures?
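A minimal sketch of the fifth and sixth steps on hypothetical per-timeframe labels (whether each evaluation timeframe contained an incident and whether it raised an alarm); this is plain Python, not an aivis API:

# one boolean pair per evaluation timeframe: (incident occurred, alarm was raised)
timeframes = [
    (True, True), (True, True), (True, False),    # incidents: 2 detected, 1 missed
    (False, False), (False, True), (False, False), # no incidents: 1 false alarm
]

tp = sum(1 for incident, alarm in timeframes if incident and alarm)
fn = sum(1 for incident, alarm in timeframes if incident and not alarm)
fp = sum(1 for incident, alarm in timeframes if not incident and alarm)
tn = sum(1 for incident, alarm in timeframes if not incident and not alarm)

accuracy = (tp + tn) / len(timeframes)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"confusion matrix: tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")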

Even though there are proper KPIs like the F1 score, a lot of parameters have to be set to get there. Some of these decisions can be suggested, but each use case might have different requirements, for example an alarm being only valuable if it comes at least a certain time period before the incident. Therefore, evaluating a state detection model will always require a closer, use-case specific look.

Appendix 1: Expression Language

Before starting the workflow, sometimes there is the need to add a new signal to the dataset (a synthetic signal) that is derived from other signals already present. There are various reasons for this, especially if

  • you want to predict a quantity that is not in your Training Data, but it could be calculated by a formula. For that task, you need to add the new signal via an expression and then use this new synthetic signal as target.
  • you want to restrict the training to operative periods but there is no signal that labels when your machines were off. However, you may be able to reconstruct these periods based on some other signals.
  • you possess domain knowledge and want to point the engine to some important derived quantity. Often certain derived quantities play a specific role in the application's domain, and might be easier to understand/verify than the raw quantities.

Technically, you can add synthetic signals using the docker images or any SDK Data API.

To create new synthetic signals in a flexible way, aivis State Detection features a rich Expression Language to articulate the formula.

The Expression Language is an extension of the scripting language Rhai. We have mainly added support for handling signals natively. Information on the basic usage of the language can be found in the very helpful Language Reference of the Rhai Book. This documentation will mainly focus on the added features.

Signal Type

A signal consists of a list of data points that represents a time series (timestamps and values of the same type).

The following value types are supported:

  • bool : Boolean
  • i64 : 64-bit Integer
  • f64 : 64-bit Floating Point
  • string : UTF-8 String

A signal type and its value type are written generically as signal<T> and specifically like e.g. signal<i64> for an integer signal.

It is not possible to write down a signal literally, but you can refer to an already existing signal in your dataset.

Signal References

Referring to an already existing signal is done via one of these two functions:

  • s(signal_id: string literal): signal<T>
  • s(signal_id: string literal, time_shift: integer literal): signal<T>

The optional time shift parameter shifts the data points into the future. For example, if the signal "a" takes the value 5.7 at timestamp 946684800000, then the following expression takes the same value 5.7 at timestamp 946684808000. The synthesized signal is therefore a lagged version of the original signal "a".

s("a", 8000)

These functions must be used exactly with the syntax above. It is not allowed to invoke them as methods on the signal id. Both parameters must be simple literals without any inner function invocation!

Examples:

s("my signal id")              // OK
s("my signal id", 8000)        // OK
s("my s" + "ignal id", 8000)   // FAIL
"my signal id".s(8000)         // FAIL
s("my signal id", 7000 + 1000) // FAIL

Examples

Let's start with a very simple example. Let "a" and "b" be the IDs of two float signals. Then

s("a") + s("b")

yields the sum of the two signals. The Rhai + operator has been overloaded to work directly on signals (as have many other operators; see below). Therefore, the above expression yields a new signal. It contains data points for all timestamps of "a" and "b".

A more common application of the expression language is interpolation over several timestamps. For example, "a" might fluctuate, and we may therefore be interested in a local linear approximation of "a" rather than in "a" itself:

trend_intercept(s("a"), t, -1000, 0)

Here, the literal t refers to the current timestamp. Therefore, the expression yields the present value as obtained from a linear approximation over the last second. As another example, the maximum within the last second:

max(slice(s("a"), t, -1000, 0))

A typical use of the expression language is synthesizing an operative signal. Assume you want to make inferences only when your production is running, and you are sure your production is off when some specific signal "speed" falls below a certain threshold, say 10. However, "speed" may also be above the threshold during maintenance, though typically only for a few hours at a time. This is in contrast to production, which usually runs stably for months. In this situation, an operative signal may thus be synthesized by adopting only intervals larger than one day, i.e. 86400000 ms:

set_sframe(s("speed") > 10, false, 86400000)

Additional Signal Functions

In the following, all functions are defined that operate directly on signals and do not have a Rhai counterpart (such as the + operator). Some functions directly return a signal. The others can be used to create signals via the t literal, as will be explained below. Note that a timeseries is always defined on a finite number of timestamps: all timestamps of all signals involved in the expression are used for the synthesized signal. Time shifts specified in the signal function s(signal_id: string literal, time_shift: integer literal) are taken into account. On the other hand, arguments of the functions below (in particular time, from and to) do not alter the evaluation timestamps. If you need more evaluation timestamps, please apply add_timestamps to some signal in the expression (see below). A combined example follows after the list.

  • add_timestamps(signal_1: signal<T>, signal_2: signal<S>): signal<T> – returns a new signal which extends signal_1 by the timestamps of signal_2. The signal values for the new timestamps are computed with respect to signal_1 using the latest predecessor, similar to the at() function below. The syntax for this expression is s("x1").add_timestamps(s("x2")). 2.4
  • at(signal: signal<T>, time: i64): T – returns the signal value at a given time
    If there is no value at that time, it will go back in history to find a nearest predecessor; if there is no predecessor, it returns NAN, 0, false or ""
  • set_lframe(signal: signal<bool>, new_value: bool, minimal_duration: i64) : signal<bool> – returns a new boolean signal, where large same-value periods of at least duration minimal_duration are set to new_value. Note that the duration of a period is only known after the end of the period. This affects the result of this function especially for live prediction.
  • set_sframe(signal: signal<bool>, new_value: bool, maximal_duration: i64) : signal<bool> – returns a new boolean signal, where small same-value periods of at most duration maximal_duration are set to new_value. Note that the duration of a period is only known after the end of the period. This affects the result of this function especially for live prediction.
  • slice(signal: signal<T>, time: i64, from: i64, to: i64): array<T> – returns an array with all values within a time window of the given signal.
    The time window is defined by [time + from; time + to]
  • steps(signal: signal<T>, time: i64, from: i64, to: i64, step: i64): array<T> – returns an array with values extracted from the given signal using the at function step by step.
    The following timestamps are used: (time + from) + (0 * step), (time + from) + (1 * step), ... (until time + to is reached inclusively)
  • time_since_transition(signal: signal<bool>, time: i64, max_time: i64) : f64 – returns a new float signal, which gives the time since the last switch of the signal from false to true. If this time exceeds max_time, max_time is returned. Times before the first switch and times t where the signal gives false in [t - max_time, t] are mapped to max_time. 2.4
  • times(signal: signal<T>): signal<i64> – returns a new signal constructed from the given one, where the value of each data point is set to the timestamp
  • trend_slope/trend_intercept(signal: signal<i64/f64>, time: i64, from: i64, to: i64): f64 – returns the slope/y-intercept of a simple linear regression model
    Any NAN value is ignored; returns NAN if there are no data points available; the following timestamps are used: [time + from; time + to]. The intercept at t = time is returned.
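
As a combined illustration (assuming "a" is the id of a float signal), the following expression samples "a" every 100 ms over the last minute before the evaluation timestamp t and returns the average of these values:

avg(steps(s("a"), t, -60000, 0, 100))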

Best practice combining expressions

When combining several expressions which operate on time windows, it might, from a performance point of view, be better to build the expression step by step than to write the combination as one expression.

For example, if we want to exclude periods smaller than 30 minutes and periods bigger than 12 hours from an existing boolean signal with signal id "control" we may use the expression:

(s("control")).set_lframe(false, 12*60*60*1000).set_sframe(false, 30*60*1000)

When evaluating this expression at a timestamp t, the synthesizer scans through the 30-minute time window before t, and for each timestamp in there it scans through another 12-hour window before that. This means constructing the desired synthesized signal is of complexity 12 × 60 × 30 × # timestamps. However, splitting the above into two expressions, we first generate a signal "helper" via

(s("control")).set_lframe(false, 12*60*60*1000)

and then we apply on the result the expression

(s("helper")).set_sframe(false, 30*60*1000)

In this case we end up with complexity 12 × 60 × # timestamps + 30 × # timestamps, which is considerably smaller than before.

Basics of Rhai

Working with signals

In this section, we will briefly show the potential behind Rhai and what you can create with it. Rhai supports many types, including collections. However, Rhai does not natively have a signal type. When working with signals, one approach therefore involves extracting the primitive values from signals and converting the results back into a signal format. This process uses the literal

t: i64 – the current timestamp

together with the function s to refer to some signal and some other function defined above to extract values from the signal. For example, the sum of two signals "a" and "b" could be written without use of the overloaded + operator:

s("a").at(t) + s("b").at(t)

The results of such an expression are automatically translated into a new signal. In order to construct a signal from the results, the expression must not terminate with a ;. Of course, the additional signal functions can be used like any other function in Rhai, and may thus be combined with the rest of Rhai's tools, where applicable.

Rhai is a scripting language

As such, you can script. A typical snippet would look like the following

let array = [[s("one").at(t), s("two").at(t)], [s("three").at(t), s("four").at(t)], [s("five").at(t), s("six").at(t)]];
let pair_avg = array.map(|sub| sub.avg());
pair_avg.filter(|x| !x.is_nan()).map(|cleaned| cleaned.abs().exp()).sum().ln()

Here, we used array functions (avg(), sum()) that are defined in the Additional Array Functions section below. The last line defines the result of the expression.

Rhai has the usual statements

In the same spirit as many other languages, you can control the flow using statements such as if, for, do, while, and more (see the Language Reference of the Rhai Book). Here's an example demonstrating their usage:

let val = s("one").at(t);
if (val >= 10.0) && (val <= 42.0) {
  1.0 - (val - 42.0)/(10.0-42.0)   // rises linearly from 0 at 10 to 1 at 42
} else if (val <= 60.0) && (val > 42.0) {
  1.0 - (val - 42.0)/(60.0-42.0)   // falls linearly from 1 at 42 to 0 at 60
} else {
  0.0/0.0                          // NAN outside [10, 60]
}

In this code snippet, we determine a value to return based on the current state of the "one" signal. Different expressions are evaluated depending on the signal's current value, here mapping it to a ramp that rises from 0 at 10 to 1 at 42 and falls back to 0 at 60. Note that 0.0/0.0 will evaluate to NAN.

Rhai allows you to create your own functions

Like most other languages, you can create your own functions and use them whenever needed.

fn add(x, y) {
    x + y
}

fn sub(x, y,) {     // trailing comma in parameters list is OK
    x - y
}
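
Such a function can then be used like any built-in function, for example on values extracted from signals (here "a" and "b" are hypothetical signal ids):

add(s("a").at(t), s("b").at(t))   // equivalent to the overloaded s("a") + s("b") evaluated at t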

Rhai allows you to do many more things than those described here. A careful reading of the Language Reference of the Rhai Book brings numerous benefits when using this scripting language.

Additional Array Functions

The following functions for arrays were additionally defined (a short example follows after the list):

  • some(items: array<bool>): bool – returns true if at least one item is true
  • all(items: array<bool>): bool – returns true if all items are true
  • sum(items: array<i64/f64>): f64 – returns the sum of all items and 0.0 on an empty array
  • product(items: array<i64/f64>): f64 – returns the product of all items and 1.0 on an empty array
  • max(items: array<i64/f64>): f64 – returns the largest array item; any NAN value is ignored; returns NAN on an empty array
  • min(items: array<i64/f64>): f64 – returns the smallest array item; any NAN value is ignored; returns NAN on an empty array
  • avg(items: array<i64/f64>): f64 – returns the arithmetic average of all array items; any NAN value is ignored; returns NAN on an empty array
  • median(items: array<i64/f64>): f64 – returns the median of all array items; any NAN value is ignored; returns NAN on an empty array
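
For example (with a hypothetical float signal "temperature"), the following expression is true whenever the temperature exceeded 100 at some point within the last minute before the evaluation timestamp t:

some(slice(s("temperature") > 100.0, t, -60000, 0))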

Constants

The following constants are defined in Rhai:

  • PI(): f64 – Archimedes' constant: 3.1415...
  • E(): f64 – Euler's number: 2.718...

Operators / Functions

Signals can be used with all normal operators and functions that are designed for primitive values. You can even mix signals and primitive values in the same invocation. If at least one parameter is a signal, the result will also be a signal.
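
For example, assuming "a" and "b" are the ids of two float signals, the following expression mixes signals and primitive values and yields a boolean signal:

(2.0 * s("a") + 1.0) > s("b")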

Operators

The following operators were defined:

  • Arithmetic:
    • +(i64/f64): i64/f64
    • -(i64/f64): i64/f64
    • +(i64/f64, i64/f64): i64/f64
    • -(i64/f64, i64/f64): i64/f64
    • *(i64/f64, i64/f64): i64/f64
    • /(i64/f64, i64/f64): i64/f64
    • %(i64/f64, i64/f64): i64/f64
    • **(i64/f64, i64/f64): i64/f64
  • Bitwise:
    • &(i64, i64): i64
    • |(i64, i64): i64
    • ^(i64, i64): i64
    • <<(i64, i64): i64
    • >>(i64, i64): i64
  • Logical:
    • !(bool): bool
    • &(bool, bool): bool
    • |(bool, bool): bool
    • ^(bool, bool): bool
  • String:
    • +(string, string): string
  • Comparison (returns false on different argument types):
    • ==(bool/i64/f64/string, bool/i64/f64/string): bool
    • !=(bool/i64/f64/string, bool/i64/f64/string): bool
    • <(i64/f64, i64/f64): bool
    • <=(i64/f64, i64/f64): bool
    • >(i64/f64, i64/f64): bool
    • >=(i64/f64, i64/f64): bool

Binary arithmetic and comparison operators can handle mixed i64 and f64 arguments properly; the other parameter is then implicitly converted beforehand via to_float. Binary arithmetic operators will return f64 if at least one f64 argument is involved.
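
For example, if "counter" is the id of an integer signal (a hypothetical name), the first of the following expressions performs integer division and yields an i64 signal, while the second yields an f64 signal because a float literal is involved:

s("counter") / 2
s("counter") / 2.0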

Functions

The following functions were defined:

  • Arithmetic:
    • abs(i64/f64): i64/f64
    • sign(i64/f64): i64
    • sqrt(f64): f64
    • exp(f64): f64
    • ln(f64): f64
    • log(f64): f64
    • log(f64, f64): f64
  • Trigonometry:
    • sin(f64): f64
    • cos(f64): f64
    • tan(f64): f64
    • sinh(f64): f64
    • cosh(f64): f64
    • tanh(f64): f64
    • asin(f64): f64
    • acos(f64): f64
    • atan(f64): f64
    • asinh(f64): f64
    • acosh(f64): f64
    • atanh(f64): f64
    • hypot(f64, f64): f64
    • atan(f64, f64): f64
  • Rounding:
    • floor(f64): f64
    • ceiling(f64): f64
    • round(f64): f64
    • int(f64): f64
    • fraction(f64): f64
  • String:
    • len(string): i64
    • trim(string): string – with whitespace characters as defined in UTF-8
    • to_upper(string): string
    • to_lower(string): string
    • sub_string(value: string, start: i64, end: i64): string
  • Conversion:
    • to_int(bool): i64 – returns 1/0
    • to_float(bool): f64 – returns 1.0/0.0
    • to_string(bool): string – returns "true"/"false"
    • to_float(i64): f64
    • to_string(i64): string
    • to_int(f64): i64 – returns 0 on NAN; values beyond INTEGER_MAX/INTEGER_MIN are capped
    • to_string(f64): string
    • to_degrees(f64): f64
    • to_radians(f64): f64
    • parse_int(string): i64 – throws error if not parsable
    • parse_float(string): f64 – throws error if not parsable
  • Testing:
    • is_zero(i64/f64): bool
    • is_odd(i64): bool
    • is_even(i64): bool
    • is_nan(f64): bool
    • is_finite(f64): bool
    • is_infinite(f64): bool
    • is_empty(string): bool
  • Comparison (returns other parameter on NAN):
    • max(i64/f64, i64/f64): i64/f64
    • min(i64/f64, i64/f64): i64/f64

The comparison functions max and min can handle mixed i64 and f64 arguments properly; the other parameter is then implicitly converted beforehand via to_float. They will return f64 if at least one f64 argument is involved.

The Boolean conversion and comparison functions were added and are not part of the official Rhai.
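
For example, the min and max functions can be combined to clamp a (hypothetical) float signal "a" to the range [0, 100]:

min(max(s("a"), 0.0), 100.0)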

Appendix 2: Integration Scenarios

Usually, the steps of the workflow will run as part of two different service applications: Training App and Inference App.

The diagrams below display typical blueprints of these service applications using different available components of the engine, as well as where they might be located in the end-customer infrastructure landscape (execution environments).

The following color code is used:

  • Blue boxes denote aivis State Detection components
  • Purple boxes stand for components that are provided by the service application provider, which can be Vernaio, an industry partner, a reseller or the customer themselves
  • Grey boxes symbolize 3rd party components (typical vendor systems/services that can be used are suggested in the speech balloons)

Training App

The service application Training App covers the workflow step Training, as well as any bulk inference, e.g. for historical evaluation.

It is executed in the so-called Cold World, which means that it consists of long-running tasks that are executed infrequently and have a high resource consumption. Training App works on historical data that was previously archived and thus needs to be retrieved in an extra step from the Data Lake / Cold Storage.

Because of its high resource consumption it is usually not located in the OT network, but is a good fit for the cloud or an on-premise datacenter.

Via Docker

Training App via Docker

Via SDK

Training App via SDK

Inference App

The service application Inference App provides the means for live prediction.

In contrast to the Training App, it runs within the Hot World. Usually it is an ongoing process which serves to predict the current value and only needs minimal resources. Inference App works on live data that is easily available from the Historian / Hot Storage.

As the outcome often influences the production systems (e.g. Advanced Process Control), it usually runs in the OT network. Thanks to its low resource consumption, it can run on practically any environment/device, be it in the cloud, on-premise, on-edge or even embedded.

Via Docker

Inference App via Docker

Via SDK

Inference App via SDK

Infrastructure Landscape

Infrastructure Landscape

Appendix 3: Toolbox

aivis engine v2 toolbox is a side project of aivis engine v2. It mainly provides tools to turn the output artifacts of aivis engine v2 into technical, single-file HTML reports.

Disclaimer

It is explicitly not an official part of aivis engine v2. Therefore, its API and behaviour are subject to change and not necessarily thoroughly tested. It is very important to note that these HTML reports are not a designed UI but rather a visualization testing playground:
The aivis engine v2 toolbox targets researchers and data scientists who already know the concepts of aivis engine v2 and wish to quickly visualize and adapt its outputs.

Furthermore:

  • With exceptionally large input files (e.g. too many inferences) or the wrong configuration, the generated HTML pages can become too slow to handle.
  • The HTMLs are optimized for a wide screen.

Setup

The aivis engine v2 toolbox does not need a licensing key. The Python code is free to look into or even adapt. The toolbox release corresponding to an aivis engine v2 release {VERSION} is available as:

  • Python Whl aivis_engine_v2_toolbox-{VERSION}-py3-none-any.whl
  • Docker Image aivis-engine-v2-toolbox:{VERSION}

Create Engine Report

Each call to construct a toolbox HTML report for engine xy has the following structure:

from aivis_engine_v2_toolbox.api import build_xy_report

config = {
    "title": "My Use Case Title", 
    ...
    "outputFile": "/path/to/my-use-case-report.html"}
build_xy_report(config)

Additionally, the config needs to contain references to the respective engine's output files, e.g. "analysisReportFile": "/path/to/analysis-report.json". The full call to create a report for any engine can be found in python or argo examples of the respective engine.

Expert Configuration

There are many optional expert configurations to customize your HTML report. Some examples:

  • The aivis engine v2 toolbox always assumes timestamps to be UNIX timestamps and translates them into readable dates. This behaviour can be switched off via "advancedConfig": {"unixTime": False}, so that timestamps always remain long values.

  • By referring to a metadata file via "metadataFile": "/path/to/metadata.json", signals are not only described via their signal id but enriched with more information. The metadata JSON contains an array of signals with the key id (required) as well as name, description, unitName, unitSymbol (all optional):

    {"signals": [{
        "id": "fa6c65bb-5cee-45fa-ab19-355ba94889e9",
        "name": "et 1",
        "description": "extruder temperature nr. 1",
        "unitName": "Kelvin",
        "unitSymbol": "K"
      }, {
        "id": "dc3477e5-a83c-4485-b7f4-7528d336d9c4", 
        "name": "abc 2"
        }, 
       ...
    ]}
    
  • Additional signals can be added to every HTML report that contains a timeseries plot, so that they are displayed as well.

All custom configuration options can be seen in the api.py file in src/aivis_engine_v2_toolbox.