Rare terms aggregations

The rare_terms aggregation is a bucket aggregation that identifies infrequent terms in a dataset. In contrast to the terms aggregation, which finds the most common terms, the rare_terms aggregation finds terms that appear with the lowest frequency. The rare_terms aggregation is suitable for applications like anomaly detection, long-tail analysis, and exception reporting.

It is possible to use terms to search for infrequent values by ordering the returned values by ascending count ("order": {"count": "asc"}). However, we strongly discourage this practice because it can lead to inaccurate results when multiple shards are involved. A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results. Instead of the terms aggregation, we recommend using the rare_terms aggregation, which is specifically designed to handle these cases more accurately.
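The shard-level problem can be illustrated with a small simulation. The following is a minimal sketch in Python, not the OpenSearch implementation: the two "shards" are made-up lists of terms, and the helper names `ascending_terms` and `exact_counts` are hypothetical. It assumes each shard returns only its `size` least frequent terms and that the coordinating node merges those partial counts.

```python
from collections import Counter

def ascending_terms(shards, size):
    """Simplified model of a terms aggregation ordered by ascending count.

    Each shard reports only its `size` least frequent terms, and the
    coordinator merges just those partial counts.
    """
    merged = Counter()
    for shard in shards:
        counts = Counter(shard)
        # Sort by (count, term) so that ties are resolved deterministically.
        least = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:size]
        for term, count in least:
            merged[term] += count
    return sorted(merged.items(), key=lambda kv: (kv[1], kv[0]))[:size]

def exact_counts(shards):
    """Exact global counts, computed from the complete data on every shard."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return total

# Made-up data: "x" is rare on shard A but common on shard B.
shard_a = ["x"] * 1 + ["y"] * 5 + ["z"] * 2
shard_b = ["x"] * 9 + ["y"] * 1 + ["z"] * 3

approximate = ascending_terms([shard_a, shard_b], size=1)
exact = exact_counts([shard_a, shard_b])

print(approximate)                # [('x', 1)]
print(exact["x"])                 # 10
print(min(exact, key=exact.get))  # z
```

In this sketch, the merged result reports "x" with a count of 1 even though "x" is the most common term overall, while the globally rarest term, "z", is never returned by either shard.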

Approximated results

Computing exact results for the rare_terms aggregation necessitates compiling a complete map of the values on all shards, which requires excessive runtime memory. For this reason, the rare_terms aggregation results are approximated.

Most errors in rare_terms computations are false negatives or “missed” values, which define the sensitivity of the aggregation’s detection test. The rare_terms aggregation uses a CuckooFilter algorithm to achieve a balance of appropriate sensitivity and acceptable memory use. For a description of the CuckooFilter algorithm, see the paper "Cuckoo Filter: Practically Better Than Bloom" (Fan et al., 2014).

Controlling sensitivity

Sensitivity error in the rare_terms aggregation algorithm is measured as the fraction of rare values that are missed, or false negatives/target values. For example, if the aggregation misses 100 rare values in a dataset with 5,000 rare values, sensitivity error is 100/5000 = 0.02, or 2%.
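The calculation above can be expressed as a short Python helper. This is only an illustration of the arithmetic; the function name `sensitivity_error` is hypothetical and is not part of any OpenSearch API.

```python
def sensitivity_error(missed, total_rare):
    """Return the fraction of rare values that the aggregation missed."""
    if total_rare <= 0:
        raise ValueError("total_rare must be positive")
    if not 0 <= missed <= total_rare:
        raise ValueError("missed must be between 0 and total_rare")
    return missed / total_rare

print(sensitivity_error(100, 5000))  # 0.02
```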

You can adjust the precision parameter in rare_terms aggregations to control the trade-off between sensitivity and memory use.

These factors also affect the sensitivity-memory trade-off:

  • The total number of unique values
  • The fraction of rare items in the dataset

The following guidelines can help you decide which precision value to use.

Calculating memory use

Runtime memory use is described in absolute terms, typically in MB of RAM.

Memory use increases linearly with the number of unique items. The linear scaling factor varies from roughly 1.0 to 2.5 MB per 1 million unique values, depending on the precision parameter. For the default precision of 0.001, the memory cost is about 1.75 MB per 1 million unique values.
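As a rough planning aid, the linear relationship can be sketched in Python. The helper `estimate_memory_mb` is hypothetical. The text gives a scaling factor only for the default precision (about 1.75 MB per 1 million unique values) and a range of roughly 1.0 to 2.5, so for any other precision the caller must supply a factor; this sketch does not attempt to derive one from the precision value.

```python
def estimate_memory_mb(unique_values, mb_per_million=1.75):
    """Estimate runtime memory in MB, assuming purely linear scaling.

    The default factor of 1.75 corresponds to the default precision (0.001).
    """
    if unique_values < 0:
        raise ValueError("unique_values must be non-negative")
    return unique_values / 1_000_000 * mb_per_million

# Estimate for 20 million unique values at the default precision,
# followed by the low and high ends of the documented range.
print(estimate_memory_mb(20_000_000))
print(estimate_memory_mb(20_000_000, mb_per_million=1.0))
print(estimate_memory_mb(20_000_000, mb_per_million=2.5))
```

Under these assumptions, 20 million unique values need about 35 MB at the default precision and between 20 MB and 50 MB across the documented range.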

Managing sensitivity error

Sensitivity error increases linearly with the total number of unique values. For information about estimating the number of unique values, see Cardinality aggregation.

Sensitivity error rarely exceeds 2.5% at the default precision, even for datasets with 10–20 million unique values. For a precision of 0.00001, sensitivity error is rarely above 0.6%. However, a very low absolute number of rare values can cause large variances in the error rate (if there are only two rare values, missing one of them results in a 50% error rate).

Compatibility with other aggregations

The rare_terms aggregation uses breadth-first collection mode. In some subaggregation and nesting configurations, it is therefore incompatible with aggregations that require depth-first collection mode.

For more information about breadth-first collection in OpenSearch, see Collect mode.

Parameters

The rare_terms aggregation takes the following parameters.

| Parameter | Required/Optional | Data type | Description |
| --- | --- | --- | --- |
| field | Required | String | The field to analyze for rare terms. Must be of a numeric type or a text type with a keyword mapping. |
| max_doc_count | Optional | Integer | The maximum document count that a term can have and still be considered rare. Default is 1. Maximum is 100. |
| precision | Optional | Double | Controls the precision of the algorithm used to identify rare terms. Lower values provide more precise results but consume more memory. Default is 0.001. Minimum (most precise allowable) is 0.00001. |
| include | Optional | Array/regex | Terms to include in the result. Can be a regular expression or an array of values. |
| exclude | Optional | Array/regex | Terms to exclude from the result. Can be a regular expression or an array of values. |
| missing | Optional | String | The value to use for documents that do not have a value for the field being aggregated. |
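The documented limits can be checked on the client side before a request is sent. The following Python sketch builds a request body as a plain dictionary. The function `build_rare_terms_request` and the default aggregation name `rare_values` are hypothetical, and the lower bound of 1 for max_doc_count is an assumption that the parameter descriptions do not state explicitly.

```python
import json

MAX_DOC_COUNT_LIMIT = 100  # documented maximum for max_doc_count
MIN_PRECISION = 0.00001    # documented minimum (most precise) value

def build_rare_terms_request(field, max_doc_count=1, precision=0.001,
                             missing=None, agg_name="rare_values"):
    """Build a search request body containing one rare_terms aggregation."""
    # Assumption: values below 1 are rejected (not stated in the parameter list).
    if not 1 <= max_doc_count <= MAX_DOC_COUNT_LIMIT:
        raise ValueError("max_doc_count must be between 1 and 100")
    if precision < MIN_PRECISION:
        raise ValueError("precision cannot be lower than 0.00001")

    params = {
        "field": field,
        "max_doc_count": max_doc_count,
        "precision": precision,
    }
    if missing is not None:
        params["missing"] = missing

    # "size": 0 suppresses search hits so that only aggregations are returned.
    return {"size": 0, "aggs": {agg_name: {"rare_terms": params}}}

body = build_rare_terms_request("DestAirportID", max_doc_count=2,
                                agg_name="rare_destination")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent as the body of a `_search` request using any HTTP client.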

Example

The following request returns all destination airport codes that appear only once in the OpenSearch Dashboards sample flight data:

GET /opensearch_dashboards_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "rare_destination": {
      "rare_terms": {
        "field": "DestAirportID",
        "max_doc_count": 1
      }
    }
  }
}

The response shows that there are two airports that meet the criterion of appearing only once in the data:

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "rare_destination": {
      "buckets": [
        {
          "key": "ADL",
          "doc_count": 1
        },
        {
          "key": "BUF",
          "doc_count": 1
        }
      ]
    }
  }
}

Document count limit

Use the max_doc_count parameter to specify the largest document count that a term can have and still be returned by the rare_terms aggregation. There is no limit on the number of terms returned by rare_terms, so a large max_doc_count value can potentially return very large result sets. For this reason, 100 is the largest allowable max_doc_count.
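To make the meaning of max_doc_count concrete, the following Python sketch computes the exact result that rare_terms approximates. It is not how OpenSearch computes the aggregation: the function `exact_rare_terms` is hypothetical, the input values are made up, and each list element is assumed to stand for one document with a single value in the field.

```python
from collections import Counter

def exact_rare_terms(values, max_doc_count=1):
    """Return (term, doc_count) pairs for terms seen at most max_doc_count times."""
    if max_doc_count > 100:
        raise ValueError("max_doc_count cannot exceed 100")
    counts = Counter(values)
    rare = [(term, count) for term, count in counts.items()
            if count <= max_doc_count]
    # Sort by count, then by term, so the output is deterministic.
    return sorted(rare, key=lambda pair: (pair[1], pair[0]))

# Made-up field values, one per document.
values = ["ADL", "BUF", "ABQ", "ABQ", "AUH", "AUH", "ZRH", "ZRH", "ZRH"]

print(exact_rare_terms(values, max_doc_count=1))
# [('ADL', 1), ('BUF', 1)]
print(exact_rare_terms(values, max_doc_count=2))
# [('ADL', 1), ('BUF', 1), ('ABQ', 2), ('AUH', 2)]
```

Raising max_doc_count only ever adds terms to the result, which is why the number of returned terms can grow quickly on large datasets.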

The following request returns all destination airport codes that appear two times at most in the OpenSearch Dashboards sample flight data:

GET /opensearch_dashboards_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "rare_destination": {
      "rare_terms": {
        "field": "DestAirportID",
        "max_doc_count": 2
      }
    }
  }
}

The response shows that seven destination airport codes meet the criterion of appearing in two or fewer documents, including the two from the previous example:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "rare_destination": {
      "buckets": [
        {
          "key": "ADL",
          "doc_count": 1
        },
        {
          "key": "BUF",
          "doc_count": 1
        },
        {
          "key": "ABQ",
          "doc_count": 2
        },
        {
          "key": "AUH",
          "doc_count": 2
        },
        {
          "key": "BIL",
          "doc_count": 2
        },
        {
          "key": "BWI",
          "doc_count": 2
        },
        {
          "key": "MAD",
          "doc_count": 2
        }
      ]
    }
  }
}

Filtering (include and exclude)

Use the include and exclude parameters to filter values returned by the rare_terms aggregation. Both parameters can be included in the same aggregation. The exclude filter takes precedence; any excluded values are removed from the result, regardless of whether they were explicitly included.

The arguments to include and exclude can be regular expressions (regex), including string literals, or arrays. Mixing regex and array arguments results in an error. For example, the following combination is not allowed:

"rare_terms": {
  "field": "DestAirportID",
  "max_doc_count": 2,
  "exclude": ["ABQ", "AUH"],
  "include": "A.*"
}
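The precedence and type rules described above can be modeled with a short Python sketch. The function `apply_filters` is hypothetical and runs on a plain list of terms rather than on an index. It uses Python's `re.fullmatch` as a stand-in for the regular expression matching that OpenSearch performs, which is an assumption: the two regex syntaxes are not identical.

```python
import re

def apply_filters(terms, include=None, exclude=None):
    """Filter terms so that exclude takes precedence over include.

    A string argument is treated as a regex that must match the whole term.
    A list argument is treated as an array of exact values.
    """
    if (include is not None and exclude is not None
            and isinstance(include, str) != isinstance(exclude, str)):
        raise ValueError("include and exclude must both be regexes or both be arrays")

    def matches(term, spec):
        if isinstance(spec, str):
            return re.fullmatch(spec, term) is not None
        return term in spec

    kept = []
    for term in terms:
        if include is not None and not matches(term, include):
            continue
        # An excluded term is dropped even if it was explicitly included.
        if exclude is not None and matches(term, exclude):
            continue
        kept.append(term)
    return kept

codes = ["ADL", "BUF", "ABQ", "AUH", "BIL", "BWI", "MAD"]
print(apply_filters(codes, include="A.*", exclude="ABQ"))
# ['ADL', 'AUH']
print(apply_filters(codes, exclude=["ABQ", "BIL", "MAD"]))
# ['ADL', 'BUF', 'AUH', 'BWI']
```

Passing a regex for one parameter and an array for the other, as in the disallowed combination above, raises a `ValueError` in this sketch.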

Example: Filtering

The following example modifies the previous example to include all airport codes beginning with “A” but exclude the “ABQ” airport code:

GET /opensearch_dashboards_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "rare_destination": {
      "rare_terms": {
        "field": "DestAirportID",
        "max_doc_count": 2,
        "include": "A.*",
        "exclude": "ABQ"
      }
    }
  }
}

The response shows the two airport codes that meet the filtering requirements:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "rare_destination": {
      "buckets": [
        {
          "key": "ADL",
          "doc_count": 1
        },
        {
          "key": "AUH",
          "doc_count": 2
        }
      ]
    }
  }
}

Example: Filtering with array input

The following example returns all destination airport codes that appear two times at most in the OpenSearch Dashboards sample flight data but specifies an array of airport codes to exclude:

GET /opensearch_dashboards_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "rare_destination": {
      "rare_terms": {
        "field": "DestAirportID",
        "max_doc_count": 2,
        "exclude": ["ABQ", "BIL", "MAD"]
      }
    }
  }
}

The response omits the excluded airport codes:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "rare_destination": {
      "buckets": [
        {
          "key": "ADL",
          "doc_count": 1
        },
        {
          "key": "BUF",
          "doc_count": 1
        },
        {
          "key": "AUH",
          "doc_count": 2
        },
        {
          "key": "BWI",
          "doc_count": 2
        }
      ]
    }
  }
}