Data Eng Weekly #330

A lot of breadth in this week's coverage of data engineering topics—from Delta to event processing at Spotify to the design goals of TileDB to understanding delivery guarantees in Apache Kafka. Also, a few articles on distributed systems—Yelp's autoscaling service, techniques for building reliable systems, and Facebook’s global routing infrastructure.

A look at how to use Delta's version tracking to "time travel" (read the contents of a table as it was at some previous point in time) and to produce audit details to track changes to a table. The post embeds several IPython notebooks with sample Scala code.
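To make the idea of version-based "time travel" concrete, here is a minimal in-memory sketch in Python. It is a toy model, not Delta's implementation or API: the `Commit` and `VersionedTable` classes, their fields, and the sample rows are all hypothetical. It only illustrates that if every change is recorded as a numbered commit, an old version can be rebuilt by replaying the log up to that number, and the same log doubles as an audit trail.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Commit:
    """One entry in the (hypothetical) change log of a table."""
    version: int
    operation: str            # free-form label, e.g. "WRITE" or "DELETE"
    added: Dict[str, dict]    # row key -> row value written by this commit
    removed: List[str]        # row keys deleted by this commit


@dataclass
class VersionedTable:
    """Toy table that keeps every change as a numbered commit."""
    log: List[Commit] = field(default_factory=list)

    def commit(self, operation: str, added=None, removed=None) -> int:
        # Versions are assigned sequentially, starting at 0.
        version = len(self.log)
        self.log.append(
            Commit(version, operation, dict(added or {}), list(removed or []))
        )
        return version

    def read(self, version_as_of: Optional[int] = None) -> Dict[str, dict]:
        """Rebuild the table contents as of a version (latest by default)."""
        if not self.log:
            return {}
        if version_as_of is None:
            version_as_of = len(self.log) - 1
        if not 0 <= version_as_of < len(self.log):
            raise ValueError(f"unknown version {version_as_of}")
        rows: Dict[str, dict] = {}
        # Replay commits 0..version_as_of in order.
        for entry in self.log[: version_as_of + 1]:
            rows.update(entry.added)
            for key in entry.removed:
                rows.pop(key, None)
        return rows

    def history(self) -> List[Tuple[int, str, int, int]]:
        """Audit view: (version, operation, rows written, rows removed)."""
        return [
            (c.version, c.operation, len(c.added), len(c.removed))
            for c in self.log
        ]


# Hypothetical sample data.
table = VersionedTable()
table.commit("WRITE", added={"a": {"qty": 1}, "b": {"qty": 2}})  # version 0
table.commit("UPDATE", added={"a": {"qty": 5}})                  # version 1
table.commit("DELETE", removed=["b"])                            # version 2

first_version = table.read(version_as_of=0)
latest = table.read()
audit = table.history()
```

Replaying the whole log on every read is the simplest design to reason about; a real system would need something more efficient, and the linked post covers how Delta itself exposes these features.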

Yelp writes about their newly open-sourced autoscaler for Kubernetes, Clusterman. Compared to the Kubernetes Autoscaler, there are some interesting features like support for spot instances and the ability to simulate production workloads.

With the complexity that comes with a distributed system, it can be overwhelming to start identifying and classifying failure modes. This post provides a great high-level blueprint for how to approach the problem, including bucketing types of failure through Failure Modes and Effects Analysis and calculating a Risk Priority Number for each item in your analysis. Lots of great stuff (and spreadsheets to help!) for understanding how to build a reliable system.
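As a rough illustration of the Risk Priority Number, here is a small Python sketch. It assumes the conventional FMEA definition (severity × occurrence × detection, each scored from 1 to 10, where a higher detection score means the failure is harder to detect); the linked post may use a different scale. The `FailureMode` class, the `rank` helper, and the example failure modes and scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureMode:
    """One row of a (hypothetical) FMEA worksheet."""
    name: str
    severity: int    # 1 = negligible impact, 10 = catastrophic
    occurrence: int  # 1 = very unlikely, 10 = almost certain
    detection: int   # 1 = always caught, 10 = practically undetectable

    def __post_init__(self) -> None:
        # Assumed scale: every score must be an integer from 1 to 10.
        for label in ("severity", "occurrence", "detection"):
            score = getattr(self, label)
            if not 1 <= score <= 10:
                raise ValueError(f"{label} must be in 1..10, got {score}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: product of the three scores."""
        return self.severity * self.occurrence * self.detection


def rank(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Order failure modes from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Made-up failure modes and scores, for illustration only.
modes = [
    FailureMode("queue consumer falls behind", 6, 7, 3),
    FailureMode("primary database unavailable", 9, 2, 2),
    FailureMode("silent data corruption in batch job", 8, 3, 9),
]

ranked = rank(modes)
for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.name}")
```

With these made-up scores the silent corruption case ranks first even though the database outage is more severe, because it is both likelier and much harder to detect.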

Spotify shares lessons learned from operating a large-scale event delivery system, which powers some key business functions like tracking track plays to calculate royalties. They describe some of the principles (e.g. segregating data by event type and choosing liveness over lateness), how they think about Service Level Objectives for event delivery, and how they are exposing their event platform to other internal teams.

Coupang, an e-commerce company from South Korea, writes about the evolution of their data platform over the past several years. They've been through several phases: storing all data in a relational database, using Hadoop, Hive, & an MPP database, and migrating to the cloud. The post dives into some of the other important features of their pipeline, like how they track data quality, detect data abnormalities, and make data discoverable to users.

TileDB is a new storage engine for multi-dimensional data like that commonly used for machine learning and genetics analysis. It supports both sparse and dense arrays, and it can use blob storage (such as S3) as the storage backend. The post describes the motivation for a new storage system, and how they've optimized the implementation for efficiency and cross language support.

Criteo writes about how they enforce data quality for the 450 PB that they have in their Hadoop clusters. Since Hadoop is very flexible/lenient in the data it supports, they run over 7,000 data quality checks per day using statistics from Hive tables and custom queries.

The morning paper covers Taiji, Facebook's system for routing and managing global traffic. Taiji takes advantage of knowledge about a user and their social connections to efficiently route traffic from the edge to a data center. By better taking advantage of caches, changes in routing materially reduce load on the database systems.

An in-depth look at how the various Apache Kafka configurations for Producers and Consumers can be used to avoid duplicate data in your pipeline. This post describes the various types of delivery guarantees, how they map to Kafka settings, and discusses how to put them all together. The post has lots of great diagrams, and there's a bonus section on other strategies for deduplicating data in a stream.
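As a rough guide to how delivery guarantees can map onto settings, here is a Python sketch that builds plain configuration dictionaries. The keys are standard Kafka producer and consumer property names, but the grouping into three guarantees, the helper functions, and the `transactional.id` value are assumptions made for illustration; they are not taken from the post. Exactly-once here refers only to Kafka's transactional read-process-write pattern, and it also depends on application code that is not shown.

```python
from typing import Any, Dict

GUARANTEES = ("at_most_once", "at_least_once", "exactly_once")


def _check(guarantee: str) -> None:
    if guarantee not in GUARANTEES:
        raise ValueError(f"unknown guarantee: {guarantee!r}")


def producer_config(guarantee: str) -> Dict[str, Any]:
    """Producer properties for the requested guarantee (illustrative)."""
    _check(guarantee)
    if guarantee == "at_most_once":
        # No acknowledgement and no retries: a failed send is lost,
        # but it is never written twice.
        return {"acks": "0", "retries": 0, "enable.idempotence": False}
    # Wait for all in-sync replicas and retry on failure. Without
    # idempotence, a retry after a lost acknowledgement can duplicate data.
    config: Dict[str, Any] = {
        "acks": "all",
        "retries": 2147483647,
        "enable.idempotence": False,
    }
    if guarantee == "exactly_once":
        config.update({
            # Broker de-duplicates retried batches from this producer.
            "enable.idempotence": True,
            "max.in.flight.requests.per.connection": 5,
            # Hypothetical id; must be stable and unique per producer.
            "transactional.id": "example-pipeline-producer-1",
        })
    return config


def consumer_config(guarantee: str) -> Dict[str, Any]:
    """Consumer properties for the requested guarantee (illustrative)."""
    _check(guarantee)
    return {
        # Offsets are committed explicitly by the application, so the
        # commit can be ordered relative to processing.
        "enable.auto.commit": False,
        # Only read messages from committed transactions when the
        # pipeline relies on transactions.
        "isolation.level": (
            "read_committed" if guarantee == "exactly_once"
            else "read_uncommitted"
        ),
    }


def commit_timing(guarantee: str) -> str:
    """When the application should commit offsets (labels are made up)."""
    _check(guarantee)
    return {
        "at_most_once": "before_processing",
        "at_least_once": "after_processing",
        "exactly_once": "inside_producer_transaction",
    }[guarantee]


for name in GUARANTEES:
    print(name, producer_config(name), consumer_config(name), commit_timing(name))
```

The dictionaries use Python values for readability; most Kafka clients accept these properties as strings, so check your client's documentation before passing them through.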


Curated by Datadog


Apache Kafka and Immutable Infrastructure + An Introductory Kafka Talk! (Santa Monica) - Tuesday, November 19

Learning How to Build Event-Streaming Applications with Pac-Man (San Diego) - Tuesday, November 19

Bay Area Apache Flink Meetup @ Cloudera (Palo Alto) - Wednesday, November 20

Building Data Lineage, Data Orchestration, and Data Mesh (Mountain View) - Thursday, November 21


The Rise of Apache Flink and Stream Processing (Bellevue) - Wednesday, November 20


Foreign-Key Joins with Kafka Streams (Plano) - Thursday, November 21


Introduction to Apache Spark (Madison) - Monday, November 18


Data Science and Engineering Club @ Zalando (Dublin) - Thursday, November 21


Fine-Tuning Kafka: Let's Look under the Hood! (Paris) - Wednesday, November 20


Apache Beam Intro and Use Cases (Antwerpen) - Friday, November 22


Kafka Is More ACID Than Your Database (Kongens Lyngby) - Tuesday, November 19


Apache Beam @ + Portable Schema + BeamSQL (Zurich) - Tuesday, November 19


Apache Kafka as a Database (Brno-stred) - Wednesday, November 20


Event-Driven Microservices with CQRS Using Axon: Excitingly Boring (Singapore) - Wednesday, November 20


Melbourne Data Engineering Meetup (Docklands) - Wednesday, November 20