Data Eng Weekly #329

Some great stuff in this week’s issue: Lyft’s open source metadata and data discovery platform, how Netflix uses GraphQL to build search indexes, an open source backup tool for Apache Cassandra, coverage of the Ceph distributed file system's evolution, and several other posts about Apache Airflow, Apache Spark, CockroachDB, event-driven microservices, and more!

Lyft writes about Amundsen, their open-source tool for metadata management and data discovery. The post covers the architecture of the system: the metadata, search, and frontend services, as well as the databuilder (a data ingestion framework). Since open sourcing it earlier this year, they've had a number of contributions, such as one to make the datastore more extensible (supporting Apache Atlas in addition to Neo4j).

A look at how to use Docker Swarm to scale out Apache Airflow using a custom Airflow operator.

One data engineer's learnings after a year on the job. There are some good reflections on tools, automation (after a workflow of a certain size, a tool like Airflow is important), monitoring, and metadata/documentation housekeeping.

Netflix describes how they use GraphQL to build indexes from data stored across multiple services. The key idea is to issue a (batch) GraphQL query that returns a full denormalized record (e.g. a show and its episodes) and to store the results in Elasticsearch. Next, they listen to changes on Kafka and follow the relationships in the GraphQL schema to invalidate/reindex data.

Medusa is a new open source backup tool for Apache Cassandra. It stores backup data in cloud storage (e.g. Amazon S3 or Google Cloud Storage), and it creates smart incremental backups by taking advantage of the immutable nature of SSTables. There are many more details about the tool's features and how to use it in this post on The Last Pickle blog.
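
To illustrate why immutable SSTables make incremental backups cheap, here is a minimal Python sketch. It is not Medusa's implementation: the function name, the name-to-digest dictionaries, and the file names are all hypothetical. The idea it shows is that a file that never changes after it is written only needs to be uploaded once, so a later backup can simply reference the copy already in cloud storage.

```python
def plan_incremental_backup(local_sstables, already_backed_up):
    """Split local SSTable files into those to upload and those to reference.

    Both arguments are dicts mapping file name -> content digest
    (a hypothetical manifest format chosen for this sketch).
    """
    to_upload = set()
    for name, digest in local_sstables.items():
        # SSTables are immutable, so a matching name and digest means the
        # stored copy is still valid and does not need to be sent again.
        if already_backed_up.get(name) != digest:
            to_upload.add(name)
    to_reference = set(local_sstables) - to_upload
    return sorted(to_upload), sorted(to_reference)


# Illustrative file names and digests only.
previous = {"md-1-big-Data.db": "aaa", "md-2-big-Data.db": "bbb"}
current = {"md-2-big-Data.db": "bbb", "md-3-big-Data.db": "ccc"}
upload, reference = plan_incremental_backup(current, previous)
print(upload, reference)
```

In this example only the newly flushed file is uploaded; the file that was already backed up is referenced, and the file that compaction removed is simply absent from the new backup.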

An in-depth look at the Apache Kafka consumer rebalance protocol. The post describes the pieces of the protocol like JoinGroup, SyncGroup, Heartbeat, and LeaveGroup. It also looks at the recent additions of static membership and incremental cooperative rebalancing. There are lots of great diagrams to illustrate the key concepts.

A look at how to detect skew in your Apache Spark jobs, and several ways to fix a job with skew (hints, randomizing the join key, writing a custom partitioner). Which solution is best/fastest depends a bit on the inputs to your job.
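
As a rough illustration of the "randomizing the join key" approach (often called salting), here is a small pure-Python sketch rather than Spark code. The helper names and the CRC-based partitioner are stand-ins invented for this example. The skewed side gets a random salt appended to each key, and the other side is replicated once per salt value so that every salted key still finds its match.

```python
import random
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def salt_key(key, num_salts, rng):
    """Skewed side: add a random salt so one hot key spreads over many keys."""
    return (key, rng.randrange(num_salts))


def explode_key(key, num_salts):
    """Other side: emit one copy per salt so the join still matches."""
    return [(key, salt) for salt in range(num_salts)]


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition."""
    return Counter(partition_for(k, num_partitions) for k in keys)


rng = random.Random(0)
# 90% of the rows share a single hot key.
skewed = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]
before = partition_sizes(skewed, 8)
salted = [salt_key(k, 16, rng) for k in skewed]
after = partition_sizes(salted, 8)
print("largest partition before:", max(before.values()))
print("largest partition after: ", max(after.values()))
```

The trade-off is that the non-skewed side grows by a factor of the salt count, which is one reason the best fix depends on the inputs to the job.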

The Morning Paper covers a paper on Ceph, the open-source distributed file system. Over the past few years, Ceph has implemented a new store that bypasses the file system to take better advantage of SSDs and HDDs. The post describes the motivation for the changes, some of the other options they explored (including RocksDB), and the performance improvements they see with the new storage backend.

Cockroach Labs writes about how they've sped up distributed transactions with parallel commits, which avoid certain round trips across the WAN. The post describes the solution, including how failure handling works, and it shows experimentally that latency is cut in half.

This post describes why you should consider a relational database and take advantage of its advanced features (such as triggers and stored procedures). The author motivates the argument with experience working with Jupyter notebooks and a comparison with the complexity of a NoSQL database like MongoDB or Elasticsearch.

This article describes an event-driven architecture for maintaining a CRM and realtime database. An interesting part of the post details how to implement an audit system to ensure that all microservices consume the events. The basic idea is to tag each event with a unique ID and have each microservice generate an audit event with that ID as it processes the event. Alerts are generated when (after aggregating over a time window) the number of audit events for a particular ID is too low.
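
The audit check described above could be sketched roughly as follows in Python. This is an assumption-laden illustration, not the article's code: the function name, the tuple layout of audit events, and the service names are hypothetical, and it tracks which services reported rather than only counting audit events so that the alert can name the missing consumers.

```python
from collections import defaultdict


def find_missing_audits(published_ids, audit_events, expected_services,
                        window_start, window_end):
    """Return {event_id: [services with no audit event in the window]}.

    audit_events is an iterable of (event_id, service, timestamp) tuples,
    a hypothetical shape chosen for this sketch.
    """
    seen = defaultdict(set)
    for event_id, service, timestamp in audit_events:
        # Only audit events inside the aggregation window count.
        if window_start <= timestamp < window_end:
            seen[event_id].add(service)

    alerts = {}
    for event_id in published_ids:
        missing = set(expected_services) - seen[event_id]
        if missing:
            alerts[event_id] = sorted(missing)
    return alerts


audits = [
    ("e1", "billing", 10),
    ("e1", "crm", 12),
    ("e2", "billing", 11),
    ("e3", "crm", 999),  # arrives after the window has closed
]
alerts = find_missing_audits(
    published_ids=["e1", "e2", "e3"],
    audit_events=audits,
    expected_services={"billing", "crm"},
    window_start=0,
    window_end=60,
)
print(alerts)
```

Here `e1` was audited by both services and raises no alert, while `e2` and `e3` are reported along with the services that did not confirm them in time.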


Curated by Datadog


Data-Driven Development in Autonomous Driving + Spark Performance Tuning (Mountain View) - Tuesday, November 12

Data Engineering Meetup (San Diego) - Thursday, November 14


Mirror Maker 2.0 (Austin) - Tuesday, November 12


NOVA Data Engineering: First Meetup! (Herndon) - Thursday, November 14


Data Meetup (Sao Carlos) - Wednesday, November 13


Streaming Processing with Hazelcast Jet and Kafka (Stockholm) - Tuesday, November 12


Everything You Need to Know about Kafka Streams (A Coruna) - Thursday, November 14


Airflow @ SchoolMouv: Build, Schedule, and Monitor Pipelines at Scale (Toulouse) - Wednesday, November 13


Making Apache Spark Better with Delta Lake (Prague) - Thursday, November 14


QA in Beam + Beam Use Case + More! (Warsaw) - Thursday, November 14


Timeseries Forecasting as a Service + Run Spark and Flink Jobs on Kubernetes (Athens) - Thursday, November 14


Airflow Demystified + Big Data Demystified (Tel Aviv-Yafo) - Sunday, November 17


Kafka Beijing Meetup (Beijing) - Saturday, November 16


Kafka Is More ACID Than Your Database (Sydney) - Wednesday, November 13

K8s Meetup with Instaclustr & Google! (Pyrmont) - Wednesday, November 13

Sydney Data Engineering Meetup (Sydney) - Thursday, November 14

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.