Pipeline aggregations
Pipeline aggregations chain together multiple aggregations by using the output of one aggregation as the input for another. They compute complex statistical and mathematical measures like derivatives, moving averages, and cumulative sums. Some pipeline aggregations duplicate the functionality of metric and bucket aggregations but, in many cases, are more intuitive to use.
Pipeline aggregations are executed after all other sibling aggregations. This has performance implications. For example, using the bucket_selector pipeline aggregation to narrow a list of buckets does not reduce the number of computations performed on omitted buckets.
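The ordering described above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not how OpenSearch is implemented: the bucket data, the `total` helper, and the threshold of 100 are all hypothetical.

```python
# Hypothetical buckets; "values" stands in for the documents in each bucket.
buckets = [
    {"key": "a", "values": [50, 70]},
    {"key": "b", "values": [5, 10]},
    {"key": "c", "values": [200]},
]

computed_keys = []  # records every bucket whose metric was computed


def total(bucket):
    """Stand-in for a metric subaggregation such as sum."""
    computed_keys.append(bucket["key"])
    return sum(bucket["values"])


# Step 1: the metric is computed for every bucket.
with_totals = [{"key": b["key"], "total": total(b)} for b in buckets]

# Step 2: a bucket_selector-style filter runs afterwards and only trims the output.
selected = [b for b in with_totals if b["total"] > 100]

print(computed_keys)                 # keys of every bucket that was computed
print([b["key"] for b in selected])  # keys of the buckets that passed the filter
```

In this model, bucket `b` is dropped from the output, but its total was still computed before the filter ran.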
Pipeline aggregations cannot be sub-aggregated but can be chained to other pipeline aggregations. For example, you can calculate a second derivative by chaining two consecutive derivative aggregations. Keep in mind that pipeline aggregations append to existing output. For example, computing a second derivative by chaining derivative aggregations outputs both the first and second derivatives.
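To make the "append to existing output" behavior concrete, the following Python sketch chains two difference computations over a hypothetical list of buckets. The `derivative` helper, the `sales` metric, and the output field names are assumptions made for illustration; the sketch treats a derivative as the difference between consecutive bucket values.

```python
from typing import List, Optional


def derivative(values: List[Optional[float]]) -> List[Optional[float]]:
    """Difference between consecutive values; the first entry has no predecessor."""
    out: List[Optional[float]] = [None]
    for prev, curr in zip(values, values[1:]):
        out.append(None if prev is None or curr is None else curr - prev)
    return out


# Hypothetical histogram buckets with a single metric each.
buckets = [{"key": k, "sales": v} for k, v in [(0, 10.0), (1, 25.0), (2, 45.0), (3, 50.0)]]

first = derivative([b["sales"] for b in buckets])
second = derivative(first)  # chained: the input is the first derivative

# Both results are appended to the buckets; nothing is replaced.
for bucket, d1, d2 in zip(buckets, first, second):
    if d1 is not None:
        bucket["first_derivative"] = d1
    if d2 is not None:
        bucket["second_derivative"] = d2

for bucket in buckets:
    print(bucket)
```

The first bucket receives neither value, the second receives only the first derivative, and later buckets carry both.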
Pipeline aggregation types
Pipeline aggregations are of two types: sibling and parent.
Sibling aggregations
A sibling pipeline aggregation takes the output of a nested aggregation and produces new buckets or new aggregations at the same level as the nested buckets.
The aggregation that a sibling pipeline aggregation references must be a multi-bucket aggregation (one that produces multiple grouped values for a certain field), and the referenced metric must be a numeric value.
Parent aggregations
A parent aggregation takes the output of an outer aggregation and produces new buckets or new aggregations at the same level as the existing buckets. Unlike sibling pipeline aggregations, which operate across all buckets and produce a single output, parent pipeline aggregations process each bucket individually and write the result back into each bucket.
The specified metric for a parent aggregation must be a numeric value.
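The difference between the two types can be sketched in plain Python. The functions below only mimic the shape of the output and are not OpenSearch code: `sum_bucket_like` and `cumulative_sum_like` are hypothetical names, and the bucket data is invented.

```python
# Hypothetical buckets produced by a multi-bucket aggregation.
buckets = [
    {"key": "jan", "sales": 100.0},
    {"key": "feb", "sales": 250.0},
    {"key": "mar", "sales": 50.0},
]


def sum_bucket_like(buckets, metric):
    """Sibling style: reads every bucket and returns one result next to the buckets."""
    return {"value": sum(b[metric] for b in buckets)}


def cumulative_sum_like(buckets, metric, name):
    """Parent style: writes a result into each bucket, one bucket at a time."""
    running = 0.0
    for bucket in buckets:
        running += bucket[metric]
        bucket[name] = {"value": running}
    return buckets


sibling_result = sum_bucket_like(buckets, "sales")
cumulative_sum_like(buckets, "sales", "running_sales")

print(sibling_result)
for bucket in buckets:
    print(bucket)
```

The sibling-style result is a single object, while the parent-style result appears as a new field inside every bucket.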
We strongly recommend setting min_doc_count to 0 (the default for histogram aggregations) for parent aggregations. If min_doc_count is greater than 0, then the aggregation omits buckets, which might lead to incorrect results.
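One way to see why omitted buckets matter is to compare bucket-to-bucket differences with and without an empty bucket. The Python sketch below is a simplified model with hypothetical data and helper names (`apply_min_doc_count`, `diffs`); it is not the OpenSearch implementation.

```python
# Hypothetical daily buckets; day2 contains no documents.
all_buckets = [
    {"key": "day1", "doc_count": 10},
    {"key": "day2", "doc_count": 0},
    {"key": "day3", "doc_count": 4},
]


def apply_min_doc_count(buckets, min_doc_count):
    """Keep only buckets that contain at least min_doc_count documents."""
    return [b for b in buckets if b["doc_count"] >= min_doc_count]


def diffs(buckets):
    """Difference between each bucket's doc_count and that of the bucket before it."""
    return [
        (curr["key"], curr["doc_count"] - prev["doc_count"])
        for prev, curr in zip(buckets, buckets[1:])
    ]


with_empty = diffs(apply_min_doc_count(all_buckets, 0))
without_empty = diffs(apply_min_doc_count(all_buckets, 1))

print(with_empty)     # one difference per day
print(without_empty)  # day3 is compared directly with day1
```

With the empty bucket removed, the difference reported for `day3` silently spans two days instead of one.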
Supported pipeline aggregations
OpenSearch supports the following pipeline aggregations.
Name | Type | Description |
---|---|---|
avg_bucket | Sibling | Calculates the average of a metric across the buckets of a previous aggregation. |
bucket_script | Parent | Executes a script to perform per-bucket numeric computations across a set of buckets. |
bucket_selector | Parent | Evaluates a script to determine whether buckets returned by a histogram (or date_histogram) aggregation should be included in the final result. |
bucket_sort | Parent | Sorts or truncates the buckets produced by its parent multi-bucket aggregation. |
cumulative_sum | Parent | Calculates the cumulative sum across the buckets of a previous aggregation. |
derivative | Parent | Calculates first-order and second-order derivatives of each bucket of an aggregation. |
extended_stats_bucket | Sibling | A more comprehensive version of the stats_bucket aggregation that provides additional metrics. |
max_bucket | Sibling | Calculates the maximum of a metric across the buckets of a previous aggregation. |
min_bucket | Sibling | Calculates the minimum of a metric across the buckets of a previous aggregation. |
moving_avg (Deprecated) | Parent | Calculates a sequence of averages of a metric contained in windows (adjacent subsets) of an ordered dataset. |
moving_fn | Parent | Executes a script over a sliding window. |
percentiles_bucket | Sibling | Calculates the percentile placement of bucketed metrics. |
serial_diff | Parent | Calculates the difference between metric values in the current bucket and a previous bucket. It stores the result in the current bucket. |
stats_bucket | Sibling | Returns a variety of stats (count, min, max, avg, and sum) for the buckets of a previous aggregation. |
sum_bucket | Sibling | Calculates the sum of a metric across the buckets of a previous aggregation. |
Buckets path
A pipeline aggregation uses the buckets_path parameter to reference the output of other aggregations. The buckets_path parameter has the following syntax:
buckets_path = <agg_name>[ > <agg_name> ... ][ .<metric_name> ]
This syntax uses the following elements.
Element | Description |
---|---|
<agg_name> | The name of the aggregation. |
> | A child selector used to navigate from one aggregation (parent) to another nested aggregation (child). |
.<metric_name> | Specifies a metric to retrieve from a multi-value aggregation. Required only if the target aggregation produces multiple metrics. |
To visualize the buckets path, suppose you have the following aggregation structure:
"aggs": {
  "parent_agg": {
    "terms": {
      "field": "category"
    },
    "aggs": {
      "child_agg": {
        "stats": {
          "field": "price"
        }
      }
    }
  }
}
To reference the average price from child_agg, which is nested in parent_agg, use parent_agg>child_agg.avg.
Examples:
- my_sum.sum: Refers to the sum metric from the my_sum aggregation.
- popular_tags>my_sum.sum: Refers to the sum metric from the my_sum aggregation, which is nested under the popular_tags aggregation.
For multi-value metric aggregations like stats or percentiles, you must include the metric name (for example, .min) in the path. For single-value metrics like sum or avg, the metric name is optional if unambiguous.
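The path syntax described in this section can be explored with a small Python sketch that splits a path into its parts and looks up the referenced metric in a response-shaped dictionary. It covers only the grammar shown above and assumes that aggregation names contain no `>` or `.` characters. The helper names (`parse_buckets_path`, `collect_values`) and the sample response are hypothetical, and the real resolver in OpenSearch may accept more forms than this.

```python
def parse_buckets_path(path):
    """Split 'agg>agg.metric' into (list of aggregation names, metric name or None)."""
    *parents, last = [part.strip() for part in path.split(">")]
    agg_name, _, metric = last.partition(".")
    return parents + [agg_name], (metric or None)


def collect_values(aggregations, path):
    """Return the referenced metric from every bucket of the first aggregation in the path."""
    names, metric = parse_buckets_path(path)
    values = []
    for bucket in aggregations[names[0]]["buckets"]:
        node = bucket
        for name in names[1:]:
            node = node[name]  # step down to the nested aggregation
        # Single-value metrics are read from "value" when no metric name is given.
        values.append(node[metric or "value"])
    return values


# Hypothetical response for the parent_agg/child_agg structure shown earlier.
response = {
    "parent_agg": {
        "buckets": [
            {"key": "books", "child_agg": {"min": 10.0, "max": 20.0, "avg": 15.0}},
            {"key": "games", "child_agg": {"min": 30.0, "max": 50.0, "avg": 40.0}},
        ]
    }
}

print(parse_buckets_path("parent_agg>child_agg.avg"))
print(collect_values(response, "parent_agg>child_agg.avg"))
```

Reading the path left to right mirrors walking down the nested `aggs` objects, with the optional suffix selecting one metric from a multi-value result.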
Buckets path example
The following example operates on the OpenSearch Dashboards logs sample data. It creates a histogram of values in the bytes field, sums the phpmemory field in each histogram bucket, and finally sums the buckets using the sum_bucket pipeline aggregation. The buckets_path follows the number_of_bytes>sum_total_memory path from the number_of_bytes parent aggregation to the sum_total_memory subaggregation:
GET opensearch_dashboards_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "number_of_bytes": {
      "histogram": {
        "field": "bytes",
        "interval": 10000
      },
      "aggs": {
        "sum_total_memory": {
          "sum": {
            "field": "phpmemory"
          }
        }
      }
    },
    "sum_copies": {
      "sum_bucket": {
        "buckets_path": "number_of_bytes>sum_total_memory"
      }
    }
  }
}
Note that the buckets_path contains the names of the component aggregations. Paths are directed, meaning that they cascade one way, downward from parents to children.
The pipeline aggregation returns the total memory summed from all the buckets:
{
  ...
  "aggregations": {
    "number_of_bytes": {
      "buckets": [
        {
          "key": 0,
          "doc_count": 13372,
          "sum_total_memory": {
            "value": 91266400
          }
        },
        {
          "key": 10000,
          "doc_count": 702,
          "sum_total_memory": {
            "value": 0
          }
        }
      ]
    },
    "sum_copies": {
      "value": 91266400
    }
  }
}
Count paths
You can direct the buckets_path to use a count rather than a value as its input. To do so, use the _count buckets path variable.
The following example computes basic stats on a histogram of the number of bytes from the OpenSearch Dashboards logs sample data. It creates a histogram of values in the bytes field and then computes the stats on the counts in the histogram buckets:
GET opensearch_dashboards_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "number_of_bytes": {
      "histogram": {
        "field": "bytes",
        "interval": 10000
      }
    },
    "count_stats": {
      "stats_bucket": {
        "buckets_path": "number_of_bytes>_count"
      }
    }
  }
}
The results show stats about the document counts of the buckets:
{
  ...
  "aggregations": {
    "number_of_bytes": {
      "buckets": [
        {
          "key": 0,
          "doc_count": 13372
        },
        {
          "key": 10000,
          "doc_count": 702
        }
      ]
    },
    "count_stats": {
      "count": 2,
      "min": 702,
      "max": 13372,
      "avg": 7037,
      "sum": 14074
    }
  }
}
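As a quick check of how the `_count` path feeds the statistics, the following Python sketch recomputes the count_stats values from the two doc_count values in the preceding response. It only reproduces the arithmetic and is not how OpenSearch computes the result internally.

```python
# doc_count values copied from the number_of_bytes buckets in the response above.
doc_counts = [13372, 702]

count_stats = {
    "count": len(doc_counts),
    "min": min(doc_counts),
    "max": max(doc_counts),
    "avg": sum(doc_counts) / len(doc_counts),
    "sum": sum(doc_counts),
}

print(count_stats)
```

Each statistic is computed over the bucket document counts rather than over a metric stored inside the buckets.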
Data gaps
Real-world data can be missing from nested aggregations for a number of reasons, including:
- Missing values in documents.
- Empty buckets anywhere in the chain of aggregations.
- Missing data needed to calculate a bucket value (for example, rolling functions such as derivative require one or more previous values to start).
You can specify a policy to handle missing data using the gap_policy property: either skip the missing data or replace the missing data with zeros.
The gap_policy parameter is valid for all pipeline aggregations.
Parameter | Required/Optional | Data type | Description |
---|---|---|---|
gap_policy | Optional | String | The policy to apply to missing data. Valid values are skip and insert_zeros. Default is skip. |
format | Optional | String | A DecimalFormat formatting string. Returns the formatted output in the aggregation’s value_as_string property. |
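The effect of the two gap_policy values can be sketched with a simplified average over bucket values, where `None` marks a bucket with missing data. The `avg_bucket_like` function and the input values are hypothetical; the sketch models only the two policies named in the table, not the full behavior of any specific pipeline aggregation.

```python
def avg_bucket_like(values, gap_policy="skip"):
    """Average bucket values, handling missing entries (None) per the chosen policy."""
    if gap_policy == "skip":
        usable = [v for v in values if v is not None]  # ignore missing buckets
    elif gap_policy == "insert_zeros":
        usable = [0.0 if v is None else v for v in values]  # treat missing as 0
    else:
        raise ValueError("unsupported gap_policy: " + gap_policy)
    return sum(usable) / len(usable) if usable else None


# Hypothetical metric values for three buckets; the middle bucket has no data.
values = [10.0, None, 14.0]

skipped = avg_bucket_like(values, "skip")
zero_filled = avg_bucket_like(values, "insert_zeros")

print(skipped)
print(zero_filled)
```

In this model, skipping the gap averages two values, while inserting zeros averages three, so the same data yields different results depending on the policy.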