kafka
You can use the Apache Kafka source (kafka) in Data Prepper to read records from one or more Kafka topics. These records hold events that your Data Prepper pipeline can ingest. The kafka source uses Kafka’s Consumer API to consume messages from the Kafka broker and then creates Data Prepper events from those messages for further processing by the pipeline.
Usage
The following example shows the kafka source in a Data Prepper pipeline:
```yaml
kafka-pipeline:
  source:
    kafka:
      bootstrap_servers:
        - 127.0.0.1:9093
      topics:
        - name: Topic1
          group_id: groupID1
        - name: Topic2
          group_id: groupID1
```
Configuration
Use the following configuration options with the kafka source.
| Option | Required | Type | Description | 
|---|---|---|---|
| bootstrap_servers | Yes, when not using Amazon Managed Streaming for Apache Kafka (Amazon MSK) as a cluster. | IP address | The host and port for the initial connection to the Kafka cluster. You can configure multiple Kafka brokers by providing the host and port for each broker. When using Amazon MSK as your Kafka cluster, the bootstrap server information is obtained from MSK using the MSK Amazon Resource Name (ARN) provided in the configuration. |
| topics | Yes | JSON array | The Kafka topics that the Data Prepper kafka source uses to read messages. You can configure up to 10 topics. For more information about topics configuration options, see Topics. |
| schema | No | JSON object | The schema registry configuration. For more information, see Schema. | 
| authentication | No | JSON object | Set the authentication options for both the pipeline and Kafka. For more information, see Authentication. | 
| encryption | No | JSON object | The encryption configuration. For more information, see Encryption. | 
| aws | No | JSON object | The AWS configuration. For more information, see aws. | 
| acknowledgments | No | Boolean | If true, enables the kafka source to receive end-to-end acknowledgments when events are received by OpenSearch sinks. Default is false. |
| client_dns_lookup | Yes, when a DNS alias is used. | String | Sets Kafka’s client.dns.lookup option. Default is default. |
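For example, the following sketch enables end-to-end acknowledgments alongside the required options. The broker addresses and the topic and group names are placeholders:
```yaml
kafka-pipeline:
  source:
    kafka:
      bootstrap_servers:
        - broker1.example.com:9092   # placeholder broker host:port pairs
        - broker2.example.com:9092
      acknowledgments: true          # request end-to-end acknowledgments from OpenSearch sinks
      topics:
        - name: application-logs     # placeholder topic name
          group_id: log-consumers    # placeholder consumer group
```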
Topics
Use the following options in the topics array.
| Option | Required | Type | Description | 
|---|---|---|---|
| name | Yes | String | The name of each Kafka topic. | 
| group_id | Yes | String | Sets Kafka’s group.id option. |
| workers | No | Integer | The number of multithreaded consumers associated with each topic. Default is 2. The maximum value is 200. |
| serde_format | No | String | Indicates the serialization and deserialization format of the messages in the topic. Default is plaintext. |
| auto_commit | No | Boolean | When false, the consumer’s offset will not be periodically committed to Kafka in the background. Default is false. |
| commit_interval | No | Integer | When auto_commit is set to true, sets how frequently, in seconds, the consumer offsets are auto-committed to Kafka through Kafka’s auto.commit.interval.ms option. Default is 5s. |
| session_timeout | No | Integer | The amount of time during which the source detects client failures when using Kafka’s group management features, which can be used to balance the data stream. Default is 45s. | 
| auto_offset_reset | No | String | Automatically resets the offset to an earlier or the latest offset through Kafka’s auto.offset.reset option. Default is latest. |
| thread_waiting_time | No | Integer | The amount of time that threads wait for the preceding thread to complete its task and to signal the next thread. The Kafka consumer API poll timeout value is set to half of this setting. Default is 5s. | 
| max_partition_fetch_bytes | No | Integer | Sets the maximum limit, in megabytes, on the data returned from each partition through Kafka’s max.partition.fetch.bytes setting. Default is 1mb. |
| heart_beat_interval | No | Integer | The expected amount of time between heartbeats to the consumer coordinator when using Kafka’s group management facilities through Kafka’s heartbeat.interval.ms setting. Default is 5s. |
| fetch_max_wait | No | Integer | The maximum amount of time during which the server blocks a fetch request when there isn’t sufficient data to satisfy the fetch_min_bytes requirement through Kafka’s fetch.max.wait.ms setting. Default is 500ms. |
| fetch_max_bytes | No | Integer | The maximum record size accepted by the broker through Kafka’s fetch.max.bytes setting. Default is 50mb. |
| fetch_min_bytes | No | Integer | The minimum amount of data the server returns during a fetch request through Kafka’s fetch.min.bytes setting. Default is 1b. |
| retry_backoff | No | Integer | The amount of time to wait before attempting to retry a failed request to a given topic partition. Default is 10s. | 
| max_poll_interval | No | Integer | The maximum delay between invocations of a poll() when using group management through Kafka’s max.poll.interval.ms option. Default is 300s. |
| consumer_max_poll_records | No | Integer | The maximum number of records returned in a single poll() call through Kafka’s max.poll.records setting. Default is 500. |
| key_mode | No | String | Indicates how the key field of the Kafka message should be handled. The default setting is include_as_field, which includes the key in the event as the kafka_key field. The include_as_metadata setting includes the key in the event’s metadata. The discard setting discards the key. |
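As an illustrative sketch, the following topic entry tunes several of these options; the topic and group names and the chosen values are placeholders, not recommendations:
```yaml
topics:
  - name: application-logs           # placeholder topic name
    group_id: log-consumers          # placeholder consumer group (Kafka's group.id)
    workers: 4                       # four consumer threads for this topic
    serde_format: json               # deserialize message values as JSON instead of plaintext
    auto_commit: true                # commit offsets to Kafka in the background
    commit_interval: 10s             # auto-commit every 10 seconds
    auto_offset_reset: earliest      # new consumer groups start from the earliest offset
    consumer_max_poll_records: 250   # cap the records returned per poll() call
```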
Schema
The following option is required inside the schema configuration.
| Option | Type | Description | 
|---|---|---|
| type | String | Sets the schema registry type: either the AWS Glue Schema Registry, aws_glue, or the Confluent Schema Registry, confluent. When using the aws_glue registry, also set any needed AWS configuration options. |
The following configuration options are required only when using the confluent registry.
| Option | Type | Description | 
|---|---|---|
| registry_url | String | The URL of the Confluent Schema Registry. |
| version | String | The version of the schema to retrieve from the registry. |
| schema_registry_api_key | String | The schema registry API key. | 
| schema_registry_api_secret | String | The schema registry API secret. | 
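A minimal sketch of a confluent schema configuration follows; the registry URL, version, and credentials are placeholders:
```yaml
schema:
  type: confluent
  registry_url: https://registry.example.com:8081   # placeholder registry URL
  version: 1                                        # placeholder schema version
  schema_registry_api_key: my-api-key               # placeholder API key
  schema_registry_api_secret: my-api-secret         # placeholder API secret
```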
Authentication
The following option is required inside the authentication object.
| Option | Type | Description | 
|---|---|---|
| sasl | JSON object | The Simple Authentication and Security Layer (SASL) authentication configuration. | 
SASL
Use one of the following options when configuring SASL authentication.
| Option | Type | Description | 
|---|---|---|
| plaintext | JSON object | The PLAINTEXT authentication configuration. | 
| aws_msk_iam | String | The Amazon MSK AWS Identity and Access Management (IAM) configuration. If set to role, the sts_role_arn set in the aws configuration is used. Default is default. |
SASL PLAINTEXT
The following options are required when using the SASL PLAINTEXT protocol.
| Option | Type | Description | 
|---|---|---|
| username | String | The username for PLAINTEXT authentication. |
| password | String | The password for PLAINTEXT authentication. |
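Together with the sasl object described previously, a PLAINTEXT configuration might look like the following sketch; the credentials are placeholders:
```yaml
authentication:
  sasl:
    plaintext:
      username: kafka-user       # placeholder username
      password: kafka-password   # placeholder password
```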
Encryption
Use the following options when setting SSL encryption.
| Option | Required | Type | Description | 
|---|---|---|---|
| type | No | String | The encryption type. Use none to disable encryption. Default is ssl. |
| insecure | No | Boolean | A Boolean flag used to turn off SSL certificate verification. If set to true, certificate authority (CA) certificate verification is turned off and insecure HTTP requests are sent. Default is false. |
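For example, the following sketch keeps the default SSL encryption and leaves certificate verification enabled:
```yaml
encryption:
  type: ssl         # SSL encryption (the default)
  insecure: false   # keep CA certificate verification enabled (the default)
```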
AWS
Use the following options when setting up authentication for AWS services.
| Option | Required | Type | Description | 
|---|---|---|---|
| region | No | String | The AWS Region to use for credentials. Defaults to standard SDK behavior to determine the Region. | 
| sts_role_arn | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon MSK and the AWS Glue Schema Registry. Default is null, which will use the standard SDK behavior for credentials. |
| msk | No | JSON object | The MSK configuration settings. | 
MSK
Use the following options inside the msk object.
| Option | Required | Type | Description | 
|---|---|---|---|
| arn | Yes | String | The MSK ARN to use. | 
| broker_connection_type | No | String | The type of connection to use with the MSK broker, either public, single_vpc, or multi_vpc. Default is single_vpc. |
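Putting the aws, msk, and SASL IAM options together, a sketch for reading from an MSK cluster might look like the following; the Region, role ARN, cluster ARN, and topic and group names are placeholders:
```yaml
kafka:
  topics:
    - name: msk-topic       # placeholder topic name
      group_id: msk-group   # placeholder consumer group
  authentication:
    sasl:
      aws_msk_iam: role     # assume the sts_role_arn from the aws block
  aws:
    region: us-east-1       # placeholder Region
    sts_role_arn: arn:aws:iam::123456789012:role/example-role   # placeholder role ARN
    msk:
      arn: arn:aws:kafka:us-east-1:123456789012:cluster/example-cluster/uuid   # placeholder MSK ARN
      broker_connection_type: single_vpc   # brokers reached within a single VPC (the default)
```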