Codec processor combinations
At ingestion time, data received by the s3 source can be parsed by codecs. Codecs compress and decompress large data sets in a certain format before the data is ingested through an OpenSearch Data Prepper pipeline processor.
While most codecs can be used with most processors, the following codec processor combinations can make your pipeline more efficient when used with the following input types.
JSON array
A JSON array is an ordered list of elements, which can be of different types. Because the input must be a JSON array, the data contained within the array must be tabular.
The JSON array does not require a processor.
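As an illustration of why no processor is needed, the following minimal Python sketch splits a JSON array into one event per element. It is not Data Prepper code; the function name events_from_json_array and the sample payload are assumptions made for this example.

```python
import json


def events_from_json_array(payload: str) -> list:
    """Split a top-level JSON array of objects into one event per element."""
    records = json.loads(payload)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    for record in records:
        # Tabular data: every element is expected to be an object (a row).
        if not isinstance(record, dict):
            raise ValueError("expected every array element to be a JSON object")
    return records


# Hypothetical sample input with two rows.
sample = '[{"status": 200, "path": "/a"}, {"status": 404, "path": "/b"}]'
events = events_from_json_array(sample)
```

Each element is already a structured record once the array is decoded, so there is nothing left for a downstream processor to parse.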
NDJSON
Unlike a JSON array, NDJSON delimits each row of data with a newline, meaning data is processed per line instead of as an array.
The NDJSON input type is parsed using the newline codec, which parses each line as a single log event. The parse_json processor then parses the JSON in each line, so each line is output as a single structured event.
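The two-stage flow described above can be sketched in Python as follows. This is only a conceptual model, not Data Prepper's implementation: the function names newline_stage and parse_json_stage are hypothetical, and storing the raw line under a message key is an assumption of this sketch.

```python
import json


def newline_stage(payload: str) -> list:
    """Codec stage: one event per non-empty line, raw text kept under 'message'."""
    return [{"message": line} for line in payload.splitlines() if line.strip()]


def parse_json_stage(event: dict, source: str = "message") -> dict:
    """Processor stage: parse the JSON text in `source` and merge its keys into the event."""
    parsed = json.loads(event[source])
    if not isinstance(parsed, dict):
        raise ValueError("expected each line to hold a JSON object")
    merged = dict(event)   # keep the original raw line
    merged.update(parsed)  # add the parsed fields alongside it
    return merged


# Hypothetical NDJSON sample: one JSON object per line.
sample = '{"status": 200}\n{"status": 404}\n'
events = [parse_json_stage(e) for e in newline_stage(sample)]
```

Splitting on newlines first means a malformed line affects only its own event rather than the whole object, which is the practical difference from the JSON array case.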
CSV
The CSV data type inputs data as a table. It can be used with only a codec or only a processor, but it requires at least one of the two, for example, either just the csv processor or just the csv codec.
The CSV input type is most effective when used with the following codec processor combinations.
csv codec
 When the csv codec is used without a processor, it automatically detects headers from the CSV and uses them for index mapping.
newline codec
 The newline codec parses each row as a single log event. The codec only detects a header when header_destination is configured. The csv processor then splits each event into columns. The header that the newline codec stores under header_destination can be used by the csv processor by setting column_names_source_key to that key.
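To make the difference between the two combinations concrete, here is a minimal Python sketch of both paths. It models the behavior described above rather than reproducing Data Prepper internals: the function names are hypothetical, and the message key and the header key name are assumptions of this example.

```python
import csv
import io


def csv_codec_style(payload: str) -> list:
    """csv codec alone: the first row is treated as the header for every row."""
    return list(csv.DictReader(io.StringIO(payload)))


def newline_then_csv_style(payload: str, header_destination: str = "header") -> list:
    """newline codec followed by a csv processing step."""
    lines = payload.splitlines()
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    # Codec stage: one event per data row, with the header line copied into each event.
    events = [{"message": row, header_destination: header} for row in rows]
    # Processor stage: read column names from the header key, then split the row.
    result = []
    for event in events:
        names = next(csv.reader([event[header_destination]]))
        values = next(csv.reader([event["message"]]))
        result.append({**event, **dict(zip(names, values))})
    return result


# Hypothetical CSV sample with a header row and two data rows.
sample = "id,name\n1,alpha\n2,beta\n"
direct = csv_codec_style(sample)
split = newline_then_csv_style(sample)
```

In the second path, the header travels with every event, which is why the processing step can recover column names without seeing the first line of the object.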
Parquet
Apache Parquet is a columnar storage format built for Hadoop. It is most efficient without the use of a codec. Positive results, however, can be achieved when it’s configured with S3 Select.
Avro
Apache Avro helps streamline streaming data pipelines. It is most efficient when used with the avro codec inside an s3 sink.
event_json
 The event_json output codec converts event data and metadata into JSON format to send to a sink, such as an S3 sink. The event_json input codec reads the event and its metadata to create an event in Data Prepper.
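The round trip described above can be pictured with the following Python sketch. The envelope layout used here (a data key and a metadata key) is purely hypothetical and is not the actual event_json wire format; the function names and sample values are likewise assumptions for illustration.

```python
import json


def encode_event(data: dict, metadata: dict) -> str:
    """Output side: serialize event data together with its metadata."""
    # Hypothetical envelope layout, chosen only for this sketch.
    return json.dumps({"data": data, "metadata": metadata}, sort_keys=True)


def decode_event(line: str) -> tuple:
    """Input side: rebuild the event data and metadata from the serialized form."""
    envelope = json.loads(line)
    return envelope["data"], envelope["metadata"]


# Hypothetical event and metadata values.
original_data = {"status": 200, "path": "/a"}
original_metadata = {"event_type": "LOG", "tags": ["s3"]}

wire = encode_event(original_data, original_metadata)
data, metadata = decode_event(wire)
```

The point of the sketch is that metadata survives the trip through the sink and back, which a plain JSON codec that writes only the event data would not guarantee.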