Apache Kafka Example: How Rollbar Removed Technical Debt - Part 2

In the first part of our series of blog posts on how we remove technical debt using Apache Kafka at Rollbar, we covered some important topics such as:

Sizing the Kafka cluster
Measuring your expected throughput to size the topics correctly
Write and configure the Kafka producer so it gives the latency and throughput desired

In the second part of the series, we’ll give an overview of how our Kafka consumer works, how we monitor it, and which deployment and release process we followed so we could replace an old system without any downtime.

Kafka Consumer

Most of our backend projects are coded in Python so we wrote a process using Python 3.8 that would consume messages from a Kafka topic and write them to the database in batches. We decided to use the Confluent Kafka client since it has better performance and uses librdkafka.

We decided to write a CLI that allows us to run it like this:

$ ./bin/ingestion 
  --kafka-brokers kafka:9092 
  --kafka-topics stream.raw_items.raw 
  --kafka-group ingestion_raw_item 
  --db-host db 
  --db-user $DBUSER 
  --db-password $DBPASSWORD 
  --db-name mox_raw 
  --consumer-timeout 0.15

We basically set up the Kafka consumer configuration along the database configuration. An important setting that affects the user experience is consumer-timeout, which represents the maximum time the consumer will block consuming and/or waiting for new messages.

This value affects the delay seen by the user, having different delays for the first and last message of the batch. A small timeout period will provide lower latencies while will increase the number of parallel INSERT operations in DB and then cause gap deadlocks.

Cluster and Consumer Monitoring

We are monitoring our architecture using Prometheus so the new projects we write need to expose their metrics using a /metrics endpoint. We are using the official Python library to expose these metrics and our Prometheus server running in Kubernetes will scrape all the pods for our ingestion Deployment. We are using the Prometheus Helm chart that by default will scrape all pods with annotation prometheus.io/scrape: "true" in the Kubernetes Deployment.

Besides your own code, you'll want also to monitor the Kafka brokers so you have visibility on a few important metrics:

Incoming messages per topic
Bytes in/out per topic
Consumer groups lag
CPU usage
JVM memory used
Free disk space

We are using a couple of projects to monitor our Kafka cluster, the Prometheus JMX exporter and Kafka Exporter. While the Prometheus JMX exporter can be enabled changing the command to run Kafka, Kafka exporter needs to be deployed into your infrastructure, something that thanks to Kubernetes was a very easy task. Kafka exporter also brings a pre-built Grafana dashboard.

Deployment Strategy

As we've commented above, we're using Kubernetes to deploy our architecture. We are using Spinnaker to build our CD pipelines and deploy to our staging cluster and production cluster when everything goes fine in previous stages.

We do an intense use of feature flags in our development process, thanks to LaunchDarkly, and we could switch between our legacy ingestion process and the Kafka based one progressively using percentage rollouts. This allowed us to monitor the Kafka cluster and the performance of the system in a safe way without any downtime. In addition, this resulted in high transparency for our customers.

We've integrated Spinnaker with our own Rollbar account so we can track the new deployed versions of the ingestion project. At same time we keep statistics for each deployed version, something that is a key part of our CI/CD process. In our future posts, we'll go into more details of our Kubernetes configuration and our deployment process.

Conclusions

We successfully removed a legacy piece of our architecture thanks to a streaming technology like Kafka, that fits very well in our pipeline processing use case. With this project we have not only removed a legacy and custom logic piece of our architecture but prepared ourselves for the future, where we'll continue migrating stages of our processing pipeline to deliver a better and faster product to our users.

Kafka will become a core part of our architecture as we adopt its use more and more. On future blog posts we’ll explain our experience managing Kafka and preparing our processing pipeline to use it.

Apache Kafka Example: How Rollbar Removed Technical Debt – Part 2

Table of Contents

Kafka Consumer

Cluster and Consumer Monitoring

Deployment Strategy

Conclusions

Start continuously improving your code today.