ml
The ml command applies machine learning (ML) algorithms from the ML Commons plugin to the search results returned by a PPL command. It supports various ML operations, including anomaly detection and clustering. The command can perform train, predict, or combined train-and-predict operations, depending on the algorithm and specified action.
To use the ml command, plugins.calcite.enabled must be set to false.
The ml command supports the following algorithms:
-
Random Cut Forest (RCF) for anomaly detection, with support for both time-series and non-time-series data
-
K-means for clustering data points into groups
Syntax
The ml command supports different syntax options, depending on the algorithm.
Anomaly detection for time-series data
Use this syntax to detect anomalies in time-series data. This method uses the RCF algorithm optimized for sequential data patterns:
ml action='train' algorithm='rcf' <number_of_trees> <shingle_size> <sample_size> <output_after> <time_decay> <anomaly_rate> <time_field> <date_format> <time_zone>
Parameters
The fixed-in-time RCF algorithm supports the following parameters.
| Parameter | Required/Optional | Description |
|---|---|---|
number_of_trees | Optional | The number of trees in the forest. Default is 30. |
shingle_size | Optional | The number of records in a shingle. A shingle is a consecutive sequence of the most recent records. Default is 8. |
sample_size | Optional | The sample size used by the stream samplers in this forest. Default is 256. |
output_after | Optional | The number of points required by the stream samplers before results are returned. Default is 32. |
time_decay | Optional | The decay factor used by the stream samplers in this forest. Default is 0.0001. |
anomaly_rate | Optional | The anomaly rate. Default is 0.005. |
time_field | Required | The time field for RCF to use as time-series data. |
date_format | Optional | The format for the time_field. Default is yyyy-MM-dd HH:mm:ss. |
time_zone | Optional | The time zone for the time_field. Default is UTC. |
category_field | Optional | The category field used to group input values. The predict operation is applied to each category independently. |
Anomaly detection for non-time-series data
Use this syntax to detect anomalies in data where the order doesn’t matter. This method uses the RCF algorithm optimized for independent data points:
ml action='train' algorithm='rcf' <number_of_trees> <sample_size> <output_after> <training_data_size> <anomaly_score_threshold>
Parameters
The batch RCF algorithm supports the following parameters.
| Parameter | Required/Optional | Description |
|---|---|---|
number_of_trees | Optional | The number of trees in the forest. Default is 30. |
sample_size | Optional | The number of random samples provided to each tree from the training dataset. Default is 256. |
output_after | Optional | The number of points required by the stream samplers before results are returned. Default is 32. |
training_data_size | Optional | The size of the training dataset. Default is the full dataset size. |
anomaly_score_threshold | Optional | The anomaly score threshold. Default is 1.0. |
category_field | Optional | The category field used to group input values. The predict operation is applied to each category independently. |
K-means clustering
Use this syntax to group data points into clusters based on similarity:
ml action='train' algorithm='kmeans' <centroids> <iterations> <distance_type>
Parameters
The k-means clustering algorithm supports the following parameters.
| Parameter | Required/Optional | Description |
|---|---|---|
centroids | Optional | The number of clusters to group data points into. Default is 2. |
iterations | Optional | The number of iterations. Default is 10. |
distance_type | Optional | The distance type. Valid values are COSINE, L1, and EUCLIDEAN. Default is EUCLIDEAN. |
Example 1: Time-series anomaly detection
This example trains an RCF model and uses it to detect anomalies in time-series ridership data:
source=nyc_taxi
| fields value, timestamp
| ml action='train' algorithm='rcf' time_field='timestamp'
| where value=10844.0
The query returns the following results:
| value | timestamp | score | anomaly_grade |
|---|---|---|---|
| 10844.0 | 2014-07-01 00:00:00 | 0.0 | 0.0 |
Example 2: Time-series anomaly detection by category
This example trains an RCF model and uses it to detect anomalies in time-series ridership data across multiple category values:
source=nyc_taxi
| fields category, value, timestamp
| ml action='train' algorithm='rcf' time_field='timestamp' category_field='category'
| where value=10844.0 or value=6526.0
The query returns the following results:
| category | value | timestamp | score | anomaly_grade |
|---|---|---|---|---|
| night | 10844.0 | 2014-07-01 00:00:00 | 0.0 | 0.0 |
| day | 6526.0 | 2014-07-01 06:00:00 | 0.0 | 0.0 |
Example 3: Non-time-series anomaly detection
This example trains an RCF model and uses it to detect anomalies in non-time-series ridership data:
source=nyc_taxi
| fields value
| ml action='train' algorithm='rcf'
| where value=10844.0
The query returns the following results:
| value | score | anomalous |
|---|---|---|
| 10844.0 | 0.0 | False |
Example 4: Non-time-series anomaly detection by category
This example trains an RCF model and uses it to detect anomalies in non-time-series ridership data across multiple category values:
source=nyc_taxi
| fields category, value
| ml action='train' algorithm='rcf' category_field='category'
| where value=10844.0 or value=6526.0
The query returns the following results:
| category | value | score | anomalous |
|---|---|---|---|
| night | 10844.0 | 0.0 | False |
| day | 6526.0 | 0.0 | False |
Example 5: K-means clustering of the Iris dataset
This example uses k-means clustering to classify three Iris species (Iris setosa, Iris virginica, and Iris versicolor) based on the combination of four features measured from each sample (the lengths and widths of sepals and petals):
source=iris_data
| fields sepal_length_in_cm, sepal_width_in_cm, petal_length_in_cm, petal_width_in_cm
| ml action='train' algorithm='kmeans' centroids=3
The query returns the following results:
| sepal_length_in_cm | sepal_width_in_cm | petal_length_in_cm | petal_width_in_cm | ClusterID |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | 1 |
| 5.6 | 3.0 | 4.1 | 1.3 | 0 |
| 6.7 | 2.5 | 5.8 | 1.8 | 2 |