Data Eng Weekly #330

Nov 18, 2019

A lot of breadth in this week's coverage of data engineering topics—from Delta to event processing at Spotify to the design goals of TileDB to understanding delivery guarantees in Apache Kafka. Also, a few articles on distributed systems—Yelp's autoscaling service, techniques for building reliable systems, and Facebook’s global routing infrastructure.

A look at how to use Delta's version tracking to "time travel" (read the contents of a table as it was at some previous point in time) and to produce audit details to track changes to a table. The post embeds several IPython notebooks with sample Scala code.

https://medium.com/@aravinthR/delta-time-travel-for-data-lake-part-2-b23879a4bc6d

Yelp writes about their newly open-sourced autoscaler for Kubernetes, Clusterman. Compared to the Kubernetes Autoscaler, there are some interesting features like support for spot instances and the ability to simulate production workloads.

https://engineeringblog.yelp.com/2019/11/open-source-clusterman.html

With the complexity that comes with a distributed system, it can be overwhelming to start the process of identifying and classifying failure modes. This post provides a great high-level blueprint for approaching this problem, including bucketing types of failure through Failure Modes and Effects Analysis and calculating a Risk Priority Number for each item in your analysis. Lots of great stuff (and spreadsheets to help!) for understanding how to build a reliable system.

https://medium.com/@adrianco/failure-modes-and-continuous-resilience-6553078caad5
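
To make the Risk Priority Number idea concrete, here is a minimal sketch assuming the conventional FMEA formulation, where each failure mode gets a 1 to 10 score for severity, occurrence, and detection difficulty, and the RPN is their product. The linked post's exact scales may differ, and the failure modes and scores below are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailureMode:
    """One row of a (hypothetical) FMEA spreadsheet."""
    name: str
    severity: int    # 1 = negligible impact, 10 = catastrophic
    occurrence: int  # 1 = very rare, 10 = very frequent
    detection: int   # 1 = almost always caught, 10 = almost never caught

    def rpn(self) -> int:
        # Conventional FMEA: RPN = severity * occurrence * detection.
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError(f"score {score} is outside the 1-10 scale")
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes: List[FailureMode]) -> List[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn(), reverse=True)


# Illustrative entries only; not taken from the linked post.
modes = [
    FailureMode("primary database unreachable", severity=9, occurrence=3, detection=2),
    FailureMode("silent data corruption in cache", severity=7, occurrence=2, detection=9),
    FailureMode("TLS certificate expiry", severity=8, occurrence=4, detection=3),
]
ranked = rank_by_risk(modes)

for mode in ranked:
    print(f"{mode.rpn():4d}  {mode.name}")
```

In this made-up example, the hard-to-detect failure ranks first even though it is neither the most severe nor the most frequent, which is the kind of prioritization the RPN is meant to surface.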

Spotify shares lessons learned from operating a large-scale event delivery system, which powers some key business functions like recording track plays to calculate royalties. They describe some of the principles (e.g. segregating data by event type and choosing liveness over lateness), how they think about Service Level Objectives for event delivery, and how they are exposing their event platform to other internal teams.

https://labs.spotify.com/2019/11/12/spotifys-event-delivery-life-in-the-cloud/

Coupang, an e-commerce company from South Korea, writes about the evolution of their data platform over the past several years. They've been through several phases: storing all data in a relational database, using Hadoop, Hive, & an MPP database, and migrating to the cloud. The post dives into some of the other important features of their pipeline, like how they track data quality, detect data abnormalities, and make data discoverable to users.

https://medium.com/coupang-tech/evolving-the-coupang-data-platform-308e305a9c45

TileDB is a new storage engine for multi-dimensional data like that commonly used for machine learning and genetics analysis. It supports both sparse and dense arrays, and it can use blob storage (such as S3) as the storage backend. The post describes the motivation for a new storage system, and how they've optimized the implementation for efficiency and cross-language support.

https://medium.com/tiledb/tiledb-a-database-for-data-scientists-ddf4ca122176

Criteo writes about how they enforce data quality for the 450 PB that they have in their Hadoop clusters. Since Hadoop is very flexible/lenient in the data it supports, they run over 7,000 data quality checks per day using statistics from Hive tables and custom queries.

https://medium.com/criteo-labs/big-data-quality-at-criteo-66c6bd0d42d8

The Morning Paper covers Taiji, Facebook's system for routing and managing global traffic. Taiji takes advantage of knowledge about a user and their social connections to efficiently route traffic from the edge to a data center. By better taking advantage of caches, changes in routing materially reduce load on the database systems.

https://blog.acolyer.org/2019/11/15/facebook-taiji/

An in-depth look at how the various Apache Kafka configurations for producers and consumers can be used to avoid duplicate data in your pipeline. This post describes the various types of delivery guarantees, how they map to Kafka settings, and discusses how to put them all together. The post has lots of great diagrams, and there's a bonus section on other strategies for deduplicating data in a stream.

https://medium.com/@andy.bryant/processing-guarantees-in-kafka-12dd2e30be0e
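
As a rough illustration of how delivery guarantees map to settings, the sketch below groups a few standard Kafka producer and consumer configuration keys by guarantee. The keys are real Kafka configs, but the grouping is a simplification rather than the linked post's exact recommendations, and the `settings_for` helper, the retry count, and the `transactional.id` value are hypothetical. Exactly-once here refers to Kafka-to-Kafka read-process-write flows using transactions.

```python
from typing import Dict

Config = Dict[str, str]

# Simplified mapping from delivery guarantee to Kafka client settings.
GUARANTEES: Dict[str, Dict[str, Config]] = {
    "at-most-once": {
        # Fire and forget: no broker acknowledgement and no retries,
        # so a message may be lost but is never resent.
        "producer": {"acks": "0", "retries": "0"},
        # Auto-commit can advance offsets before processing finishes.
        "consumer": {"enable.auto.commit": "true"},
    },
    "at-least-once": {
        # Wait for all in-sync replicas and retry on failure;
        # a retry can produce a duplicate.
        "producer": {"acks": "all", "retries": "3"},
        # The application commits offsets only after processing.
        "consumer": {"enable.auto.commit": "false"},
    },
    "exactly-once": {
        # Idempotence removes duplicates caused by producer retries;
        # a transactional.id enables atomic multi-partition writes.
        "producer": {
            "acks": "all",
            "enable.idempotence": "true",
            "transactional.id": "example-pipeline-1",  # hypothetical id
        },
        # Only read messages from committed transactions.
        "consumer": {
            "enable.auto.commit": "false",
            "isolation.level": "read_committed",
        },
    },
}


def settings_for(guarantee: str) -> Dict[str, Config]:
    """Return producer and consumer settings for a named guarantee."""
    try:
        return GUARANTEES[guarantee]
    except KeyError:
        raise ValueError(f"unknown guarantee: {guarantee!r}") from None


exactly_once = settings_for("exactly-once")
print(exactly_once["producer"])
print(exactly_once["consumer"])
```

These dictionaries only capture configuration. The application still has to commit offsets (or send them as part of the transaction) at the right point in its processing loop for the guarantee to hold.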