Apache Kafka – When to Use What

Published: May 2025

What is Apache Kafka?

Kafka is an event streaming platform: it is designed to handle and process continuous streams of events or data in real time.

When to Use Kafka

Apache Kafka is a versatile platform for scenarios that require high throughput, low latency, and real-time data processing. Here are some common use cases and situations where Kafka is particularly well suited:

  1. Real-time Data Streaming: Kafka is ideal for applications that require real-time data processing, such as log aggregation, event sourcing, and stream processing.
  2. Data Integration: Kafka can serve as a central hub for integrating data from multiple sources, allowing you to collect, transform, and distribute data across different systems.
  3. Event-Driven Architectures: Kafka is commonly used in event-driven architectures to decouple services and enable asynchronous communication between them.
  4. Analytics and Monitoring: Kafka can be used to collect and analyze large volumes of data in real time, making it suitable for monitoring applications and analytics pipelines.

Kafka Components

  1. Producers: Applications that publish data to Kafka topics. A producer is the source of data; it publishes messages/events.
  2. Consumers: Applications that subscribe to topics and process the data. A consumer acts as the receiver; it is responsible for receiving/consuming messages/events.
  3. Topics: Categories or feeds to which records are published.
  4. Brokers: A Kafka broker is a server that stores and manages the data. It acts as an intermediary between producers and consumers, handling message exchange.
  5. Clusters: A group of Kafka brokers that work together to provide high availability and fault tolerance. A cluster can contain one or more brokers.
  6. Zookeeper: A service for coordinating distributed applications, used by Kafka for managing brokers and topic metadata (newer Kafka versions can run without ZooKeeper in KRaft mode).
  7. Partitions: Subdivisions of topics that allow for parallel processing and scalability.
  8. Offsets: Unique identifiers for each record within a partition, used to track the position of consumers.
  9. Consumer Groups: A group of consumers that work together to consume data from a topic, allowing for load balancing and fault tolerance.
  10. Streams API: A library for building stream processing applications that can read from and write to Kafka topics.
  11. Connect API: A framework for integrating Kafka with external systems, such as databases and message queues.
  12. Schema Registry: A service for managing and enforcing data schemas in Kafka, ensuring data compatibility and consistency.
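
To make these components concrete, here is a minimal sketch (not from the original article; the topic name orders and group id order-readers are placeholder assumptions) of a plain Java consumer that connects to a broker, joins a consumer group, subscribes to a topic, and prints each record together with its partition and offset:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class SimpleConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // broker(s) in the cluster
            props.put("group.id", "order-readers");             // consumer group for load balancing
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("auto.offset.reset", "earliest");          // start from the earliest offset if none is committed

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));           // the topic to consume from
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // The partition and offset identify the record's position,
                        // which is how the consumer's progress is tracked.
                        System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                record.partition(), record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }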

How Kafka Partitions Work by Default

By default, when a producer sends messages without a key, Kafka distributes them across all available partitions for the topic: older clients use round-robin, while newer producers use a sticky partitioner that fills a batch for one partition before moving to the next. Either way, the messages are spread across partitions, allowing for parallel processing and increased throughput.

However, Kafka also allows for more targeted partitioning. Producers can specify a key when sending a message; the default partitioner hashes the key to choose the partition, so all messages with the same key are sent to the same partition. This is useful for ensuring that related messages are processed in order.

Partition Assignment: When you create a topic, you specify the number of partitions. Kafka clients (producers and consumers) automatically interact with these partitions to distribute or consume data.

Load Distribution: Kafka automatically distributes partitions across brokers in the cluster to ensure load balancing. This helps to prevent any single broker from becoming a bottleneck and allows for better resource utilization.

Message Routing: Producers send messages to partitions based on an explicitly specified partition (if one is set on the record), otherwise a hash of the message key, and otherwise the default partitioner's spread across partitions for keyless messages, as shown in the sketch below.
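
As a minimal sketch (not from the original article; the topic name orders, the key customer-42, and the sample values are placeholder assumptions), here is how a plain Java producer sends keyed and keyless messages, which determines how they are routed to partitions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class SimpleProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Keyed send: the default partitioner hashes "customer-42", so every
                // message with this key lands in the same partition (ordering preserved).
                RecordMetadata keyed = producer
                        .send(new ProducerRecord<>("orders", "customer-42", "order created"))
                        .get();
                System.out.println("Keyed message went to partition " + keyed.partition());

                // Keyless send: the producer spreads messages across partitions
                // (round-robin in older clients, sticky partitioning in newer ones).
                RecordMetadata keyless = producer
                        .send(new ProducerRecord<>("orders", "audit event"))
                        .get();
                System.out.println("Keyless message went to partition " + keyless.partition());
            }
        }
    }
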

What To Do After Messages Are Sent to a Dead Letter Topic (DLT)?

After messages are sent to a Dead Letter Topic (DLT), you can take several actions to handle the failed messages:

  1. Investigate the Cause: Analyze the messages in the DLT to understand why they failed. This may involve checking logs, error messages, and the original message content.
  2. Implement Fixes: Once you identify the root cause of the failures, implement the necessary fixes in your application or data pipeline.
  3. Reprocess Messages: After fixing the issues, you can reprocess the messages from the DLT. This may involve sending them back to the original topic or a different topic for reprocessing.
  4. Monitor and Alert: Set up monitoring and alerting for your DLT to catch future failures early. This can help you respond more quickly and reduce the impact of failed messages.

Steps to Manage Dead Letter Topic Messages

  1. Analyze the Failed Messages

    Once a message lands in the DLT, the first step is to investigate why it failed:

    • Was there a data issue? (e.g., malformed JSON, invalid payload).
    • Was there a system failure? (e.g., timeout, unavailable resource).
    • Did the consumer's logic contain errors? (e.g., bugs in the processing code).

    🛠️ How to Analyze:

    kafka-console-consumer.sh --topic error-dlt --bootstrap-server localhost:9092 --from-beginning

    Log and store the metadata for each failed message:

    • Original topic name.
    • Offset and partition number.
    • Error details (e.g., exception message).

    Example with Spring Boot Consumer:

    @KafkaListener(topics = "error-dlt", groupId = "dlt-analyzer-group")
    public void handleFailedMessages(ConsumerRecord<String, String> record) {
        System.out.println("DLT Message Key: " + record.key());
        System.out.println("DLT Message Value: " + record.value());
        System.out.println("Original Topic: " + record.topic());
        System.out.println("Partition: " + record.partition());
        System.out.println("Offset: " + record.offset());
    }
    
  2. Classify the Failures

    After analyzing the messages, classify the root cause of the failures into categories such as:

    • Recoverable Errors: Temporary problems such as missing data dependencies (e.g., database downtime) or timeouts when calling an external API. Retrying the processing can often resolve these.
    • Non-Recoverable Errors: Invalid input data (e.g., malformed JSON, missing required fields) or business rule violations (e.g., payment amount exceeds limits). These require manual intervention or reprocessing with corrected data.
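
    To help with this classification, here is an illustrative sketch (not from the original article) that inspects the exception headers Spring Kafka's DeadLetterPublishingRecoverer adds to DLT records, such as kafka_dlt-exception-fqcn; the rule that treats timeouts and connection problems as recoverable is an assumption for demonstration:

    @KafkaListener(topics = "error-dlt", groupId = "dlt-classifier-group")
    public void classifyFailure(ConsumerRecord<String, String> record) {
        // The fully-qualified class name of the exception that caused the failure
        // is stored in the "kafka_dlt-exception-fqcn" header of each DLT record.
        var fqcnHeader = record.headers().lastHeader("kafka_dlt-exception-fqcn");
        String exceptionClass = fqcnHeader != null ? new String(fqcnHeader.value()) : "unknown";

        // Hypothetical rule: timeouts and connectivity issues are retried; everything
        // else (e.g., deserialization or validation errors) is flagged for archiving.
        boolean recoverable = exceptionClass.contains("Timeout") || exceptionClass.contains("Connect");

        if (recoverable) {
            System.out.println("Recoverable failure (" + exceptionClass + "): candidate for replay");
        } else {
            System.out.println("Non-recoverable failure (" + exceptionClass + "): candidate for archiving");
        }
    }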
  3. Reprocess Failed Messages

    Once you fix the issue (bug or configuration), the most common action is to reprocess the failed messages from the DLT. This ensures no data is lost.

    🛠️ Reprocessing Options:

    • Option 1: Write a New Consumer for DLT

      Create a dedicated Spring Kafka Listener that reprocesses messages from the DLT.

      @KafkaListener(topics = "error-dlt", groupId = "dlt-reprocessor-group")
      public void reprocessMessage(ConsumerRecord<String, String> record) {
          try {
              System.out.println("Reprocessing message: " + record.value());
              // Simulate your business logic here
              process(record.value());
              System.out.println("Message reprocessed successfully!");
          } catch (Exception e) {
              System.err.println("Failed again: " + e.getMessage());
              // Optionally send back to another DLT or archive permanently
          }
      }
      
      private void process(String message) {
          // Your business logic here
          if (message.contains("error")) {
              throw new RuntimeException("Simulated processing failure");
          }
      }
      
    • Option 2: Replay Messages to the Original Topic

      You can move messages from the DLT back to the original topic so they can be retried by the original consumer after fixing the issue.

      kafka-console-consumer.sh --topic error-dlt --from-beginning --bootstrap-server localhost:9092 | \
      kafka-console-producer.sh --topic input-topic --bootstrap-server localhost:9092

      Or automate this with a Spring Boot service:

      @KafkaListener(topics = "error-dlt", groupId = "dlt-replay-group")
      public void replayMessageToOriginalTopic(ConsumerRecord<String, String> record,
                                               KafkaTemplate<String, String> kafkaTemplate) {
          final String originalTopic = "input-topic"; // The topic to resend the message to
          kafkaTemplate.send(originalTopic, record.key(), record.value());
          System.out.println("Replayed message to original topic: " + originalTopic);
      }
      
    • Option 3: Manual Reprocessing

      If the issue requires a data cleanup or manual analysis, export DLT messages and process them manually:

      kafka-run-class.sh kafka.tools.DumpLogSegments --files <log file path>

      Fix data or clean invalid messages manually.

      Inject cleaned messages back into the original topic:

      kafka-console-producer.sh --topic input-topic --bootstrap-server localhost:9092
  4. Archive Messages for Auditing or Recovery

    For non-recoverable errors, you may want to archive failed messages for future audits or compliance purposes.

    • Dedicated Archival Topic: Move permanent failures from DLT to another Kafka topic, like archived-errors:
      @KafkaListener(topics = "error-dlt", groupId = "dlt-archive-group")
      public void archiveMessage(ConsumerRecord<String, String> record, KafkaTemplate<String, String> template) {
          String archiveTopic = "archived-errors";
          template.send(archiveTopic, record.key(), record.value());
          System.out.println("Message archived to " + archiveTopic);
      }
      
    • Export to External Storage: Export failed messages to a durable storage solution:
      • Amazon S3/Google Cloud Storage: Store messages as files.
      • Elasticsearch: Store for fast searching and indexing.
      • Relational Databases: Archive for long-term auditability.
  5. Set Up Alerts for DLT Messages

    Monitor the arrival of messages in the DLT and set up alerts when thresholds are exceeded. For example:

    • Use Prometheus + Grafana to track message counts in error-dlt.
    • Integrate Alertmanager to notify your team (via Slack, email, etc.) if too many messages land in the DLT.

    Example Prometheus query (the exact metric name depends on your exporter; with kafka_exporter, the growth rate of the DLT partitions' latest offsets approximates the incoming message rate):

    sum(rate(kafka_topic_partition_current_offset{topic="error-dlt"}[5m]))
  6. Prevent Future Failures

    After resolving the immediate issue, take steps to prevent similar failures in the future:

    • Data Validation Before Publishing: Validate messages at the producer level to avoid sending invalid data. Example: Add validation logic in Spring Boot Producer.
      if (!isValid(message)) {
          throw new IllegalArgumentException("Invalid message format");
      }
      kafkaTemplate.send("input-topic", message);
      
    • Robust Retry Mechanisms: Configure retries with backoff so that transient failures are retried before a message is routed to the DLT. In Spring Kafka this is typically configured on the listener container's error handler rather than through simple properties; see the sketch after this list.
    • Improve Logging and Observability: Add detailed logs for failed messages. Implement observability tools like OpenTelemetry for distributed tracing.
    • Automate Data Sanitization: Clean up known problematic data patterns via automated transformations before publishing.
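
    As a minimal sketch of the retry-then-DLT setup mentioned above (assuming Spring Kafka's DefaultErrorHandler and DeadLetterPublishingRecoverer; the 3 attempts and 1-second backoff are illustrative values, not from the original article):

    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
    import org.springframework.kafka.core.ConsumerFactory;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
    import org.springframework.kafka.listener.DefaultErrorHandler;
    import org.springframework.util.backoff.FixedBackOff;

    @Configuration
    public class KafkaErrorHandlingConfig {

        @Bean
        public DefaultErrorHandler errorHandler(KafkaTemplate<String, String> kafkaTemplate) {
            // Failed records are published to "<original-topic>.DLT" by default;
            // a custom destination resolver can route them to "error-dlt" instead.
            DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(kafkaTemplate);
            // Retry each failed record up to 3 more times with a 1-second pause between
            // attempts, then hand it to the recoverer (i.e., publish it to the dead letter topic).
            return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3L));
        }

        @Bean
        public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
                ConsumerFactory<String, String> consumerFactory, DefaultErrorHandler errorHandler) {
            ConcurrentKafkaListenerContainerFactory<String, String> factory =
                    new ConcurrentKafkaListenerContainerFactory<>();
            factory.setConsumerFactory(consumerFactory);
            factory.setCommonErrorHandler(errorHandler);
            return factory;
        }
    }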

Suggested Workflow for Managing DLT Messages

  1. Monitor the DLT: Set up alerting to detect when messages arrive in the DLT.
  2. Investigate & Classify Errors: Analyze whether the errors are recoverable or non-recoverable.
  3. Reprocess Recoverable Messages: Replay messages to the original topic or handle them programmatically.
  4. Archive Non-Recoverable Messages: Export messages to a safe location for later inspection/auditing.
  5. Fix Root Cause: Address the bug or misconfiguration to avoid similar errors in the future.