ml

The ml command applies machine learning (ML) algorithms from the ML Commons plugin to the search results returned by a PPL command. It supports various ML operations, including anomaly detection and clustering. The command can perform train, predict, or combined train-and-predict operations, depending on the algorithm and specified action.

To use the ml command, plugins.calcite.enabled must be set to false.

The ml command supports the following algorithms:

Random Cut Forest (RCF) for anomaly detection, with support for both time-series and non-time-series data
K-means for clustering data points into groups

Syntax

The ml command supports different syntax options, depending on the algorithm and, for RCF, on whether the data is time-series data.

Anomaly detection for time-series data

Use this syntax to detect anomalies in time-series data. This method uses the RCF algorithm optimized for sequential data patterns:

ml action='train' algorithm='rcf' <number_of_trees> <shingle_size> <sample_size> <output_after> <time_decay> <anomaly_rate> <time_field> <date_format> <time_zone> <category_field>

Parameters

The time-series (fixed-in-time) RCF algorithm supports the following parameters.

Parameter	Required/Optional	Description
`number_of_trees`	Optional	The number of trees in the forest. Default is `30`.
`shingle_size`	Optional	The number of records in a shingle. A shingle is a consecutive sequence of the most recent records. Default is `8`.
`sample_size`	Optional	The sample size used by the stream samplers in this forest. Default is `256`.
`output_after`	Optional	The number of points required by the stream samplers before results are returned. Default is `32`.
`time_decay`	Optional	The decay factor used by the stream samplers in this forest. Default is `0.0001`.
`anomaly_rate`	Optional	The anomaly rate. Default is `0.005`.
`time_field`	Required	The field containing the timestamp of each record, which RCF uses to treat the input as time-series data.
`date_format`	Optional	The format for the `time_field`. Default is `yyyy-MM-dd HH:mm:ss`.
`time_zone`	Optional	The time zone for the `time_field`. Default is `UTC`.
`category_field`	Optional	The category field used to group input values. The predict operation is applied to each category independently.
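
The following query is a minimal sketch of time-series anomaly detection, not a definitive usage pattern. It assumes a hypothetical `nyc_taxi` index with a numeric `value` field and a `timestamp` field in the default `yyyy-MM-dd HH:mm:ss` format; these names are illustrative and not part of the command. Parameters are written as `name=value` pairs, and any omitted parameter uses the default listed in the preceding table:

```ppl
source=nyc_taxi | fields value, timestamp | ml action='train' algorithm='rcf' time_field='timestamp' shingle_size=8 output_after=32
```

Restricting the input with `fields` before the `ml` stage keeps the model input limited to the columns of interest.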

Anomaly detection for non-time-series data

Use this syntax to detect anomalies in data where the order doesn’t matter. This method uses the RCF algorithm optimized for independent data points:

ml action='train' algorithm='rcf' <number_of_trees> <sample_size> <output_after> <training_data_size> <anomaly_score_threshold> <category_field>

Parameters

The non-time-series (batch) RCF algorithm supports the following parameters.

Parameter	Required/Optional	Description
`number_of_trees`	Optional	The number of trees in the forest. Default is `30`.
`sample_size`	Optional	The number of random samples provided to each tree from the training dataset. Default is `256`.
`output_after`	Optional	The number of points required by the stream samplers before results are returned. Default is `32`.
`training_data_size`	Optional	The size of the training dataset. Default is the full dataset size.
`anomaly_score_threshold`	Optional	The anomaly score threshold used to decide whether a data point is reported as anomalous. Default is `1.0`.
`category_field`	Optional	The category field used to group input values. The predict operation is applied to each category independently.
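
As a minimal sketch, the following query applies RCF to non-time-series data. It again assumes a hypothetical `nyc_taxi` index with a numeric `value` field. No `time_field` is specified, which is assumed here to be what distinguishes this form from the time-series form, because both use `algorithm='rcf'`:

```ppl
source=nyc_taxi | fields value | ml action='train' algorithm='rcf' number_of_trees=30 sample_size=256
```

The parameter values shown are the documented defaults and are written out only to illustrate the `name=value` form.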

K-means clustering

Use this syntax to group data points into clusters based on similarity:

ml action='train' algorithm='kmeans' <centroids> <iterations> <distance_type>

Parameters

The K-means clustering algorithm supports the following parameters.

Parameter	Required/Optional	Description
`centroids`	Optional	The number of clusters to group data points into. Default is `2`.
`iterations`	Optional	The number of iterations. Default is `10`.
`distance_type`	Optional	The distance type. Valid values are `COSINE`, `L1`, and `EUCLIDEAN`. Default is `EUCLIDEAN`.
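
The following query is a minimal sketch of K-means clustering. It assumes a hypothetical `iris_data` index with four numeric measurement fields; the index and field names are illustrative only. The query requests three clusters instead of the default two and raises the iteration count:

```ppl
source=iris_data | fields sepal_length_in_cm, sepal_width_in_cm, petal_length_in_cm, petal_width_in_cm | ml action='train' algorithm='kmeans' centroids=3 iterations=20
```

Because `distance_type` is omitted, the default `EUCLIDEAN` distance applies.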