Data Eng Weekly #331

After a few weeks off (hopefully folks in the US had a nice Thanksgiving!), we're back with your weekly fix of data engineering articles. Apache Kafka and Apache Airflow are covered from several angels in this issue, and there are posts on the future of data engineering, columnar file formats, bloom filters, and Cruise's platform for data pipelines. Lots of great posts from folks building large scale data platforms!

This article provides an overview of a talk at the recent QCon San Francisco on the future of data engineering. The talk covers six stages of data engineering and what it takes to evolve from one stage to the next through the lense of the data architecture a WePay. The talk also covers what's ahead in the field. If you want to dive in more, the article links out to the slides for the presentation.

Zeebee is a workflow engine that can be used to execute and/or monitor workflows that span multiple microservices. This post looks at how it can integrate with Apache Kafka—as a source of data for monitoring or as a sink for publishing information about the state of the workflow. There are several good diagrams in the post to illustrate the key concepts.

Zulily writes about how they have evolved their Apache Airflow architecture—moving from celery executors to the Kubernetes executor, leveraging AWS RDS for metadata, and using AWS EFS for the DAGs. The post also describes their CI/CD workflow, and more.

A good introduction to the Apache Kafka Client Consumer's PartitionAssignor strategies. The post covers the three builtin strategies (Range, RoundRobin, StickyAssignor), the StreamsPartitionAssignor from Kafka Streams, and how to implement a custom strategy. As an example, the post walks through building a FailoverAssignor that could be used for an active/passive setup.

This article provides a good introduction to columnar file formats—describing how they physically store data (with an example of translating a CSV to a columnar CSV format), the benefits of columnar formats, and some of the trade-offs.

Pinterest writes about the service they've built to support large amounts of offline updates to their sharded MySQL cluster. This service, which exposes APIs for batch write operations, groups writes/updates based on operation type and shard. It also uses Kafka as a buffer—consumers fetch batch operation details and write to MySQL at a configured rate limit to keep the load of offline operations from impacting user-interactive queries. The post dives into technical details, including how they handle hot shards, variation in write operations, and the improvements they've seen from this new system.

A look at several techniques for monitoring Apache Kafka and related components. The post describes Quantyca's approach using MetricBeat with Burrow and Elastalert. They have an example of sending an alert to Slack.

Azkarra Streams is a new framework for building Apache Kafka Streams applications. It provides a library that eliminates a lot of the boilerplate of a typical streams application, and it has a built in HTTP server to monitor the state of your application(s), a simple DAG visualizer, and a builtin HTTP request endpoint to query your Kafka Streams stores (along with a web UI to look at results).

Cruise writes about Terra, their platform built on the Apache Beam SDK for data pipelines. Terra supplements Beam's features by adding permissions management, job submission (including pulling python/C++ dependencies), lineage, and more. The post has some sample code that shows how it all fits together.

Airtunnel is a new open source project that provides blueprints for building Apache Airflow DAGs. The project is designed for several design principles: consistency (e.g. in naming of data sets, scripts, and workflows), declarative first (Airtunnel uses YAML to declare data assets), and metadata driven. Airtunnel, which includes custom operators, metadata extensions to collect data asset lineage, and more, is available on github.

Bloom Filters are ubiquitous in distributed data stores because they can eliminate certain expensive operations. This post dives into the features of a bloom filter, how it works, and contains a basic implementation in Python.


Curated by Datadog ( )


Off the Ground w/ Apache Airflow + Ordinary People w/ Ability for Extraordinary (Santa Monica) - Thursday, December 12


Apache Kafka Committer & Co-Founder Jun Rao on Why Kafka Needs No Keeper (Minneapolis) - Monday, December 9


HCSC Big Data Hadoop Meetup (Chicago) - Wednesday, December 11


Running Apache Airflow at Kabbage (Atlanta) - Tuesday, December 10

North Carolina

Building a Stream Processing Architecture with Apache Kafka (Charlotte) - Wednesday, December 11

New York

An Introduction to Kafka Streams and KSQL (Webster) - Tuesday, December 10


December Apache Spark Meetup (Cambridge) - Tuesday, December 10


Data Science and Engineering Club (Dublin) - Wednesday, December 11


Apache Kafka: Metamorphosis (Lisboa) - Thursday, December 12


From Hadoop to NoSQL & Graph to Translytical (Middelharnis) - Tuesday, December 10


The Learnings of Karate Kid Applied to Apache Kafka (Berlin) - Wednesday, December 11

Apache NiFi + Hacking Around the IoTree (Frankfurt) - Wednesday, December 11

Kubernetes with Kafka Flavor (Berlin) - Thursday, December 12


Using PySpark with Google Colab + Spark 3.0 Preview (Milano) - Wednesday, December 11

It's a Streamer World! A Journey Through Processing Flows of Data (Milano) - Wednesday, December 11


First Warsaw Apache Airflow Workshop (Warsaw) - Friday, December 13


Building Consciousness on Real Time Events: ksqlDB Recipes (Chennai) - Wednesday, December 11

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.