Data Eng Weekly #331

Dec 09, 2019

After a few weeks off (hopefully folks in the US had a nice Thanksgiving!), we're back with your weekly fix of data engineering articles. Apache Kafka and Apache Airflow are covered from several angles in this issue, and there are posts on the future of data engineering, columnar file formats, bloom filters, and Cruise's platform for data pipelines. Lots of great posts from folks building large-scale data platforms!

This article provides an overview of a talk at the recent QCon San Francisco on the future of data engineering. The talk covers six stages of data engineering and what it takes to evolve from one stage to the next, through the lens of the data architecture at WePay. The talk also covers what's ahead in the field. If you want to dive in more, the article links out to the slides for the presentation.

https://www.infoq.com/news/2019/11/data-engineering-future-qconsf/

Zeebe is a workflow engine that can be used to execute and/or monitor workflows that span multiple microservices. This post looks at how it can integrate with Apache Kafka, either as a source of data for monitoring or as a sink for publishing information about the state of the workflow. There are several good diagrams in the post to illustrate the key concepts.

https://blog.bernd-ruecker.com/zeebe-loves-kafka-d82516030f99

Zulily writes about how they have evolved their Apache Airflow architecture: moving from the Celery executor to the Kubernetes executor, leveraging AWS RDS for metadata, and using AWS EFS for the DAGs. The post also describes their CI/CD workflow, and more.

https://zulily-tech.com/2019/11/19/evolution-of-zulilys-airflow-infrastructure/

A good introduction to the Apache Kafka consumer client's PartitionAssignor strategies. The post covers the three built-in strategies (Range, RoundRobin, and Sticky), the StreamsPartitionAssignor from Kafka Streams, and how to implement a custom strategy. As an example, the post walks through building a FailoverAssignor that could be used for an active/passive setup.

https://medium.com/streamthoughts/understanding-kafka-partition-assignment-strategies-and-how-to-write-your-own-custom-assignor-ebeda1fc06f3
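To illustrate how the Range and RoundRobin strategies differ, here is a minimal Python sketch for a single topic. It is a simplified model and not the Kafka client's code: the function names and consumer ids are made up for illustration, and the real assignors also handle multiple topics and per-consumer subscriptions.

```python
def range_assign(partitions, consumers):
    """Give each consumer a contiguous block of partitions (single topic).

    Consumers are sorted, and the first few receive one extra partition
    when the count does not divide evenly.
    """
    consumers = sorted(consumers)
    per_consumer, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for index, consumer in enumerate(consumers):
        count = per_consumer + (1 if index < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


def round_robin_assign(partitions, consumers):
    """Deal partitions out to the sorted consumers one at a time."""
    consumers = sorted(consumers)
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


partitions = list(range(5))      # hypothetical topic with 5 partitions
consumers = ["c1", "c2"]         # hypothetical consumer ids

range_result = range_assign(partitions, consumers)
round_robin_result = round_robin_assign(partitions, consumers)
print(range_result)
print(round_robin_result)
```

With a single topic both strategies give the same partition counts per consumer; the difference is only in which partitions end up together.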

This article provides a good introduction to columnar file formats—describing how they physically store data (with an example of translating a CSV to a columnar CSV format), the benefits of columnar formats, and some of the trade-offs.

https://blog.matthewrathbone.com/2019/11/21/guide-to-columnar-file-formats.html
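To make the row-versus-column layout concrete, here is a minimal Python sketch that pivots a small CSV into one line per column. The sample data and the `to_columnar` function are made up for illustration and are not taken from the linked article; production formats add encoding, compression, and metadata on top of this basic idea.

```python
import csv
import io


def to_columnar(csv_text):
    """Pivot row-oriented CSV text into a dict of column name -> list of values."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    if not data:
        return {name: [] for name in header}
    # zip(*data) transposes the rows, so each tuple holds one column's values.
    return {name: list(values) for name, values in zip(header, zip(*data))}


sample = "id,city,temp\n1,Oslo,3\n2,Lima,19\n3,Oslo,4\n"
columnar = to_columnar(sample)

# Each column's values are now stored together, so a query that only needs
# "temp" can skip "id" and "city" entirely.
for name, values in columnar.items():
    print(name + ":" + ",".join(values))
```

Storing similar values next to each other (for example the repeated "Oslo") is also what makes columnar data compress well.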

Pinterest writes about the service they've built to support large amounts of offline updates to their sharded MySQL cluster. This service, which exposes APIs for batch write operations, groups writes/updates based on operation type and shard. It also uses Kafka as a buffer—consumers fetch batch operation details and write to MySQL at a configured rate limit to keep the load of offline operations from impacting user-interactive queries. The post dives into technical details, including how they handle hot shards, variation in write operations, and the improvements they've seen from this new system.

https://medium.com/pinterest-engineering/using-kafka-to-throttle-qps-on-mysql-shards-in-bulk-write-apis-a326ae0f1ac1
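The grouping and rate-limiting ideas described above can be sketched in a few lines of Python. This is an illustration under assumptions, not Pinterest's implementation: the `type` and `shard` fields, the function names, and the fixed-interval throttle are all hypothetical, and Kafka is left out so the example stays self-contained.

```python
import time
from collections import defaultdict


def group_operations(operations):
    """Group write operations by (operation type, shard)."""
    groups = defaultdict(list)
    for op in operations:
        groups[(op["type"], op["shard"])].append(op)
    return dict(groups)


def apply_throttled(batches, write_fn, max_per_second,
                    sleep=time.sleep, clock=time.monotonic):
    """Call write_fn once per batch, at most max_per_second times per second."""
    min_interval = 1.0 / max_per_second
    last_write = None
    for batch in batches:
        if last_write is not None:
            elapsed = clock() - last_write
            if elapsed < min_interval:
                # Wait out the rest of the interval before the next write.
                sleep(min_interval - elapsed)
        write_fn(batch)
        last_write = clock()


# Hypothetical operations; a real system would read these from a queue.
operations = [
    {"type": "insert", "shard": 1, "row": "a"},
    {"type": "update", "shard": 1, "row": "b"},
    {"type": "insert", "shard": 1, "row": "c"},
    {"type": "insert", "shard": 2, "row": "d"},
]
groups = group_operations(operations)

written = []
apply_throttled(groups.values(), written.append, max_per_second=50)
print(len(groups), "groups written in", len(written), "batches")
```

The `sleep` and `clock` parameters exist only so the throttle can be exercised without real waiting.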

A look at several techniques for monitoring Apache Kafka and related components. The post describes Quantyca's approach using Metricbeat with Burrow and ElastAlert, and it includes an example of sending an alert to Slack.

https://medium.com/quantyca/how-to-monitor-your-kafka-cluster-efficiently-d45ce37c02f1

Azkarra Streams is a new framework for building Apache Kafka Streams applications. It provides a library that eliminates a lot of the boilerplate of a typical streams application, and it has a built-in HTTP server to monitor the state of your application(s), a simple DAG visualizer, and a built-in HTTP endpoint to query your Kafka Streams stores (along with a web UI to look at results).

https://medium.com/streamthoughts/introducing-azkarra-streams-the-first-micro-framework-for-apache-kafka-streams-e13605f3a3a6

Cruise writes about Terra, their platform built on the Apache Beam SDK for data pipelines. Terra supplements Beam's features by adding permissions management, job submission (including pulling Python/C++ dependencies), lineage, and more. The post has some sample code that shows how it all fits together.

https://medium.com/cruise/introducing-terra-cruises-data-processing-platform-c6a476bb5b72

Airtunnel is a new open source project that provides blueprints for building Apache Airflow DAGs. The project is built around several design principles: consistency (e.g. in naming of data sets, scripts, and workflows), declarative first (Airtunnel uses YAML to declare data assets), and metadata driven. Airtunnel, which includes custom operators, metadata extensions to collect data asset lineage, and more, is available on GitHub.

https://medium.com/bcggamma/airtunnel-a-blueprint-for-workflow-orchestration-using-airflow-173054b458c3

Bloom filters are ubiquitous in distributed data stores because they can eliminate certain expensive operations, such as looking up a key that is not present. This post dives into the features of a bloom filter and how it works, and it includes a basic implementation in Python.

https://diogodanielsoaresferreira.github.io/bloom-filter/
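For a quick feel of the data structure, here is a minimal Python sketch. It is not the implementation from the linked post: the class name, the default sizes, and the choice of seeded SHA-256 hashes are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """A small bloom filter: no false negatives, but possible false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes          # must stay below 256 in this sketch
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by prefixing the item with a seed byte.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bloom = BloomFilter()
for key in ["user:1", "user:2", "user:3"]:
    bloom.add(key)

print(bloom.might_contain("user:2"))
print(bloom.might_contain("user:999"))
```

A data store would consult `might_contain` before an expensive lookup and skip the lookup whenever it returns False.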

Events

Curated by Datadog ( http://www.datadog.com )

California

Off the Ground w/ Apache Airflow + Ordinary People w/ Ability for Extraordinary (Santa Monica) - Thursday, December 12

https://www.meetup.com/LA-HUG/events/264629190/

Minnesota

Apache Kafka Committer & Co-Founder Jun Rao on Why Kafka Needs No Keeper (Minneapolis) - Monday, December 9

https://www.meetup.com/TwinCities-Apache-Kafka/events/266301147/

Illinois

HCSC Big Data Hadoop Meetup (Chicago) - Wednesday, December 11

https://www.meetup.com/HCSC-Technology-Group/events/266496994/

Georgia

Running Apache Airflow at Kabbage (Atlanta) - Tuesday, December 10

https://www.meetup.com/BigDataATL/events/266664587/

North Carolina

Building a Stream Processing Architecture with Apache Kafka (Charlotte) - Wednesday, December 11

https://www.meetup.com/Charlotte-Java-Developers-Meetup/events/266370651/

New York

An Introduction to Kafka Streams and KSQL (Webster) - Tuesday, December 10

https://www.meetup.com/RTG-Rochester-Technology-Group/events/266463037/

Massachusetts

December Apache Spark Meetup (Cambridge) - Tuesday, December 10

https://www.meetup.com/Boston-Apache-Spark-User-Group/events/265735754/

IRELAND

Data Science and Engineering Club (Dublin) - Wednesday, December 11

https://www.meetup.com/Data-Science-and-Engineering-Club/events/266969502/

PORTUGAL

Apache Kafka: Metamorphosis (Lisboa) - Thursday, December 12

https://www.meetup.com/Tech-Mate/events/266757230/

NETHERLANDS

From Hadoop to NoSQL & Graph to Translytical (Middelharnis) - Tuesday, December 10

https://www.meetup.com/Code-in-the-Middel/events/266260609/

GERMANY

The Learnings of Karate Kid Applied to Apache Kafka (Berlin) - Wednesday, December 11

https://www.meetup.com/Berlin-Apache-Kafka-Meetup-by-Confluent/events/266567425/

Apache NiFi + Hacking Around the IoTree (Frankfurt) - Wednesday, December 11

https://www.meetup.com/IoT-Hessen/events/265584374/

Kubernetes with Kafka Flavor (Berlin) - Thursday, December 12

https://www.meetup.com/DigitalOceanBerlin/events/266915196/

ITALY

Using PySpark with Google Colab + Spark 3.0 Preview (Milano) - Wednesday, December 11

https://www.meetup.com/Spark-More-Milano/events/266753286/

It's a Streamer World! A Journey Through Processing Flows of Data (Milano) - Wednesday, December 11

https://www.meetup.com/Milano-Kafka-meetup/events/266750779/

POLAND

First Warsaw Apache Airflow Workshop (Warsaw) - Friday, December 13

https://www.meetup.com/Warsaw-Airflow-Meetup/events/266996789/

INDIA

Building Consciousness on Real Time Events: ksqlDB Recipes (Chennai) - Wednesday, December 11

https://www.meetup.com/Chennai-Kafka/events/266140963/

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.

Data Eng Weekly