Link Search Menu Expand Document Documentation Menu

Aggregation functions

Aggregation functions perform calculations across multiple rows to return a single result value. These functions are used with the stats, eventstats, and streamstats commands to analyze and summarize data.

The following table shows how NULL and missing values are handled by aggregation functions.

Function null Missing
COUNT Not counted Not counted
SUM Ignored Ignored
AVG Ignored Ignored
MAX Ignored Ignored
MIN Ignored Ignored
FIRST Ignored Ignored
LAST Ignored Ignored
LIST Ignored Ignored
VALUES Ignored Ignored

Functions

The following aggregation functions are available in PPL for data analysis and summarization.

COUNT

Usage: COUNT(expr), C(expr), c(expr), count(expr)

Counts the number of expr values in the retrieved rows. C(), c(), and count() are available as abbreviations for COUNT(). For filtered counting, use an eval expression to specify the filtering condition.

Parameters:

  • expr (Optional): The expression whose values are to be counted.

Return type: LONG

Example

source=accounts
| stats count(), c(), count, c

The query returns the following results:

count() c() count c
4 4 4 4

The following example counts only records that match a specific condition:

source=accounts
| stats count(eval(age > 30)) as mature_users

The query returns the following results:

mature_users
3

SUM

Usage: SUM(expr)

Returns the sum of expr values.

Parameters:

  • expr (Required): The expression whose values are to be summed.

Return type: Same as input type (INTEGER, LONG, FLOAT, or DOUBLE)

Example

source=accounts
| stats sum(age) by gender

The query returns the following results:

sum(age) gender
28 F
101 M

AVG

Usage: AVG(expr)

Returns the average value of expr.

Parameters:

  • expr (Required): The expression whose values are to be averaged.

Return type: DOUBLE for numeric inputs; same as input type for DATE, TIME, or TIMESTAMP inputs

Example

source=accounts
| stats avg(age) by gender

The query returns the following results:

avg(age) gender
28.0 F
33.666666666666664 M

MAX

Usage: MAX(expr)

Returns the maximum value of expr. For non-numeric fields, this function returns the value that comes last in alphabetical order.

Parameters:

  • expr (Required): The expression for which to find the maximum value.

Return type: Same as input type

Example

source=accounts
| stats max(age)

The query returns the following results:

max(age)
36

The following example returns the value from the firstname text field that comes last in alphabetical order:

source=accounts
| stats max(firstname)

The query returns the following results:

max(firstname)
Nanette

MIN

Usage: MIN(expr)

Returns the minimum value of expr. For non-numeric fields, this function returns the value that comes first in alphabetical order.

Parameters:

  • expr (Required): The expression for which to find the minimum value.

Return type: Same as input type

Example

source=accounts
| stats min(age)

The query returns the following results:

min(age)
28

The following example returns the value from the firstname text field that comes first in alphabetical order:

source=accounts
| stats min(firstname)

The query returns the following results:

min(firstname)
Amber

VAR_SAMP

Usage: VAR_SAMP(expr)

Returns the sample variance of expr.

Parameters:

  • expr (Required): The expression for which to calculate the sample variance.

Return type: DOUBLE

Example

source=accounts
| stats var_samp(age)

The query returns the following results:

var_samp(age)
10.916666666666666

VAR_POP

Usage: VAR_POP(expr)

Returns the population variance of expr.

Parameters:

  • expr (Required): The expression for which to calculate the population variance.

Return type: DOUBLE

Example

source=accounts
| stats var_pop(age)

The query returns the following results:

var_pop(age)
8.1875

STDDEV_SAMP

Usage: STDDEV_SAMP(expr)

Returns the sample standard deviation of expr.

Parameters:

  • expr (Required): The expression for which to calculate the sample standard deviation.

Return type: DOUBLE

Example

source=accounts
| stats stddev_samp(age)

The query returns the following results:

stddev_samp(age)
3.304037933599835

STDDEV_POP

Usage: STDDEV_POP(expr)

Returns the population standard deviation of expr.

Parameters:

  • expr (Required): The expression for which to calculate the population standard deviation.

Return type: DOUBLE

Example

source=accounts
| stats stddev_pop(age)

The query returns the following results:

stddev_pop(age)
2.8613807855648994

DISTINCT_COUNT, DC

Usage: DISTINCT_COUNT(expr), DC(expr)

Returns the approximate number of distinct values using the HyperLogLog++ algorithm. Both functions are equivalent. For more information about algorithm accuracy and precision control, see Controlling precision.

Parameters:

  • expr (Required): The expression for which to count distinct values.

Return type: LONG

Example

source=accounts
| stats dc(state) as distinct_states, distinct_count(state) as dc_states_alt by gender

The query returns the following results:

distinct_states dc_states_alt gender
1 1 F
3 3 M

DISTINCT_COUNT_APPROX

Usage: DISTINCT_COUNT_APPROX(expr)

Returns the approximate count of distinct values in expr using the HyperLogLog++ algorithm.

Parameters:

  • expr (Required): The expression for which to count approximate distinct values.

Return type: LONG

Example

source=accounts
| stats distinct_count_approx(gender)

The query returns the following results:

distinct_count_approx(gender)
2

EARLIEST

Usage: EARLIEST(field [, time_field])

Returns the earliest value of a field based on timestamp ordering.

Parameters:

  • field (Required): The field for which to return the earliest value.
  • time_field (Optional): The field to use for time-based ordering. Defaults to @timestamp if not specified.

Return type: Same as input field type

Example

source=events
| stats earliest(message) by host
| sort host

The query returns the following results:

earliest(message) host
Starting up server1
Initializing server2

The following example uses a custom time field instead of the default @timestamp field for ordering:

source=events
| stats earliest(status, event_time) by category
| sort category

The query returns the following results:

earliest(status, event_time) category
pending orders
active users

LATEST

Usage: LATEST(field [, time_field])

Returns the latest value of a field based on timestamp ordering.

Parameters:

  • field (Required): The field for which to return the latest value.
  • time_field (Optional): The field to use for time-based ordering. Defaults to @timestamp if not specified.

Return type: Same as input field type

Example

source=events
| stats latest(message) by host
| sort host

The query returns the following results:

latest(message) host
Shutting down server1
Maintenance mode server2

The following example uses a custom time field instead of the default @timestamp field for ordering:

source=events
| stats latest(status, event_time) by category
| sort category

The query returns the following results:

latest(status, event_time) category
cancelled orders
inactive users

TAKE

Usage: TAKE(field [, size])

Returns the original values from a field. This function does not guarantee the order of the returned values.

Parameters:

  • field (Required): A text field from which to extract values.
  • size (Optional): The number of values to return. Defaults to 10.

Return type: ARRAY

Example

source=accounts
| stats take(firstname)

The query returns the following results:

take(firstname)
[Amber,Hattie,Nanette,Dale]

PERCENTILE, PERCENTILE_APPROX

Usage: PERCENTILE(expr, percent), PERCENTILE_APPROX(expr, percent)

Returns the approximate percentile value of expr at the specified percentage.

Parameters:

  • expr (Required): The expression for which to calculate the percentile.
  • percent (Required): A constant number between 0 and 100.

Return type: Same as input type

Starting in version 3.1.0, the percentile implementation switched from AVLTreeDigest to MergingDigest. For more information, see the corresponding issue.

Example

source=accounts
| stats percentile(age, 90) by gender

The query returns the following results:

percentile(age, 90) gender
28 F
36 M

Percentile shortcut functions

For convenience, OpenSearch PPL provides shortcut functions for common percentiles:

  • PERC<percent>(expr) - Equivalent to PERCENTILE(expr, <percent>).
  • P<percent>(expr) - Equivalent to PERCENTILE(expr, <percent>).

Both integer and decimal percentiles from 0 to 100 are supported (for example, PERC95, P99.5):

source=accounts 
| stats perc99.5(age);

The query returns the following results:

perc99.5(age)
36
source=accounts 
| stats p50(age);

The query returns the following results:

p50(age)
33

MEDIAN

Usage: MEDIAN(expr)

Returns the median (50th percentile) value of expr. This is equivalent to PERCENTILE(expr, 50).

Parameters:

  • expr (Required): The expression for which to calculate the median.

Return type: Same as input type

Example

source=accounts
| stats median(age)

The query returns the following results:

median(age)
33

FIRST

Usage: FIRST(field)

Returns the first non-null value of a field based on natural document order. Returns NULL if no records exist or if all records have NULL values for the field.

Parameters:

  • field (Required): The field for which to return the first value.

Return type: Same as input field type

Example

source=accounts
| stats first(firstname) by gender

The query returns the following results:

first(firstname) gender
Nanette F
Amber M

LAST

Usage: LAST(field)

Returns the last non-null value of a field based on natural document order. Returns NULL if no records exist or if all records have NULL values for the field.

Parameters:

  • field (Required): The field for which to return the last value.

Return type: Same as input field type

Example

source=accounts
| stats last(firstname) by gender

The query returns the following results:

last(firstname) gender
Nanette F
Dale M

LIST

Usage: LIST(expr)

Collects all values from the specified expression into an array. Values are converted to strings, NULL values are filtered out, and duplicates are preserved. This function returns up to 100 values without a guaranteed order.

Parameters:

  • expr (Required): The field expression from which to collect values.

Return type: ARRAY

This aggregation function does not support array, struct, or object field types.

Example

The following example collects all values from a string field into an array:

source=accounts
| stats list(firstname)

The query returns the following results:

list(firstname)
[Amber,Hattie,Nanette,Dale]

VALUES

Usage: VALUES(expr)

Collects all unique values from the specified expression into a sorted array. Values are converted to strings, NULL values are filtered out, and duplicates are removed.

Parameters:

  • expr (Required): The expression from which to collect unique values.

Return type: ARRAY

The plugins.ppl.values.max.limit setting controls the maximum number of unique values returned:

  • The default value is 0, which returns an unlimited number of values.
  • Setting this to any positive integer limits the number of unique values.

Example

The following example collects unique values from a string field into a sorted array:

source=accounts
| stats values(firstname)

The query returns the following results:

values(firstname)
[Amber,Dale,Hattie,Nanette]