Codec processor combinations

At ingestion time, data received by the s3 source can be parsed by codecs. Codecs compress and decompress large data sets in a certain format before ingesting them through an OpenSearch Data Prepper pipeline processor.

While most codecs can be used with most processors, the following codec processor combinations can make your pipeline more efficient when used with the following input types.

JSON array

A JSON array is an ordered list of elements, which can be of different types. Because this input type requires an array, the data contained within the array must be tabular.

The JSON array does not require a processor.
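As a rough illustration of why no processor is needed, the following Python sketch mimics a codec that emits one event per object in a JSON array. It is not Data Prepper code; the function name `json_array_codec` is hypothetical and the behavior is a simplified assumption.

```python
import json

def json_array_codec(raw: str) -> list[dict]:
    """Hypothetical stand-in for a JSON codec: one event per array element."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Each object is already structured, so no later parsing step is required.
    return [dict(record) for record in records]

raw_object = '[{"id": 1, "status": "ok"}, {"id": 2, "status": "error"}]'
events = json_array_codec(raw_object)
```

Each element of `events` already carries named fields, which is why the pipeline can send it to a sink without a parsing processor.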

NDJSON

Unlike a JSON array, NDJSON allows for each row of data to be delimited by a newline, meaning data is processed per line instead of as an array.

The NDJSON input type is parsed using the newline codec, which parses each single line as a single log event. The parse_json processor then outputs each line as a single event.
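The two-step flow above can be sketched in Python. This is a conceptual model, not Data Prepper's implementation: the function names are hypothetical, and storing each raw line under a `message` key is an assumption made for the sketch.

```python
import json

def newline_codec(raw: str) -> list[dict]:
    """Hypothetical codec step: one event per line, raw text kept under "message"."""
    return [{"message": line} for line in raw.splitlines()]

def parse_json(event: dict, source: str = "message") -> dict:
    """Hypothetical processor step: parse the JSON text in `source` into fields."""
    parsed = json.loads(event[source])
    return {**event, **parsed}

ndjson = '{"id": 1, "status": "ok"}\n{"id": 2, "status": "error"}\n'
events = [parse_json(event) for event in newline_codec(ndjson)]
```

The codec only splits the object into lines; the fields become available after the processor step parses each line.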

CSV

The CSV data type inputs data as a table. It can be used without a codec or without a processor, but it does require one or the other, for example, either only the csv processor or only the csv codec.

The CSV input type is most effective when used with the following codec processor combinations.

csv codec

When the csv codec is used without a processor, it automatically detects headers from the CSV and uses them for index mapping.

newline codec

The newline codec parses each row as a single log event. The codec will only detect a header when header_destination is configured. The csv processor then outputs the event into columns. The header detected in header_destination from the newline codec can be used in the csv processor under column_names_source_key.
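The header handoff between the newline codec and the csv processor can be modeled with a short Python sketch. The function names are hypothetical, the `message` key and the `header` destination name are assumptions, and the sketch ignores options such as custom delimiters.

```python
import csv

def newline_codec(raw: str, header_destination: str) -> list[dict]:
    """Hypothetical codec step: treat the first line as the header and
    attach it to every row event under `header_destination`."""
    lines = raw.splitlines()
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    return [{"message": row, header_destination: header} for row in rows]

def csv_processor(event: dict, column_names_source_key: str,
                  source: str = "message") -> dict:
    """Hypothetical processor step: split the row into columns named by
    the header stored under `column_names_source_key`."""
    names = next(csv.reader([event[column_names_source_key]]))
    values = next(csv.reader([event[source]]))
    return {**event, **dict(zip(names, values))}

raw_csv = 'id,name\n1,alice\n2,"smith, bob"\n'
events = [csv_processor(event, "header")
          for event in newline_codec(raw_csv, header_destination="header")]
```

Because every event carries its own copy of the header, each row can be turned into columns independently of the others.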

Parquet

Apache Parquet is a columnar storage format built for Hadoop. When configuring a pipeline, you can use the parquet codec to read Parquet data directly from the Amazon Simple Storage Service (Amazon S3) object. This will retrieve all data from Parquet. Alternatively, you can use S3 Select instead of the codec. In this case, S3 Select parses the Parquet file directly. This can be more efficient if you are filtering or loading a subset of data.

Additional S3 charges apply when using S3 Select.

Avro

Apache Avro is a row-oriented data serialization format developed within the Apache Hadoop project. It is most efficient without the use of a codec. When used with S3 Select, Avro can deliver strong performance by enabling selective data retrieval.

event_json

The event_json output codec converts event data and metadata into JSON format to send to a sink, such as an S3 sink. The event_json input codec reads the event and its metadata to create an event in Data Prepper.
