Data Eng Weekly #330
A lot of breadth in this week's coverage of data engineering topics—from Delta to event processing at Spotify to the design goals of TileDB to understanding delivery guarantees in Apache Kafka. Also, a few articles on distributed systems—Yelp's autoscaling service, techniques for building reliable systems, and Facebook’s global routing infrastructure.
A look at how to use Delta's version tracking to "time travel" (read the contents of a table as it was at some previous point in time) and to produce audit details to track changes to a table. The post embeds several IPython notebooks with sample Scala code.
https://medium.com/@aravinthR/delta-time-travel-for-data-lake-part-2-b23879a4bc6d
Yelp writes about their newly open-sourced autoscaler for Kubernetes, Clusterman. Compared to the Kubernetes Autoscaler, there are some interesting features like support for spot instances and the ability to simulate production workloads.
https://engineeringblog.yelp.com/2019/11/open-source-clusterman.html
With the complexity that comes with a distributed system, it can be overwhelming to start identifying and classifying failure modes. This post provides a great high-level blueprint for approaching the problem, including bucketing types of failure through Failure Modes and Effects Analysis and calculating a Risk Priority Number for each item in your analysis. Lots of great stuff (and spreadsheets to help!) for understanding how to build a reliable system.
https://medium.com/@adrianco/failure-modes-and-continuous-resilience-6553078caad5
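As a rough illustration of the Risk Priority Number idea: conventional FMEA multiplies three ratings (severity, likelihood of occurrence, and difficulty of detection), and higher products get attention first. The sketch below assumes 1 to 10 scales; the failure modes and their scores are made up for the example and are not taken from the linked post.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost always detected) to 10 (almost never detected)

    @property
    def rpn(self) -> int:
        # Risk Priority Number: the product of the three ratings.
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical entries, for illustration only.
modes = [
    FailureMode("cache node loss", severity=4, occurrence=6, detection=2),
    FailureMode("silent data corruption", severity=9, occurrence=2, detection=9),
    FailureMode("region-wide outage", severity=10, occurrence=1, detection=1),
]

for mode in rank_by_rpn(modes):
    print(f"{mode.rpn:4d}  {mode.name}")
```

Note how the hard-to-detect failure ranks first even though it is neither the most severe nor the most frequent; that is the main reason to score detection separately.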
Spotify shares lessons learned from operating a large-scale event delivery system, which powers some key business functions like counting track plays to calculate royalties. They describe some of the principles (e.g. segregating data by event type and choosing liveness over lateness), how they think about Service Level Objectives for event delivery, and how they are exposing their event platform to other internal teams.
https://labs.spotify.com/2019/11/12/spotifys-event-delivery-life-in-the-cloud/
Coupang, an e-commerce company from South Korea, writes about the evolution of their data platform over the past several years. They've been through several phases: storing all data in a relational database, using Hadoop, Hive, & an MPP database, and migrating to the cloud. The post dives into some of the other important features of their pipeline, like how they track data quality, detect data abnormalities, and make data discoverable to users.
https://medium.com/coupang-tech/evolving-the-coupang-data-platform-308e305a9c45
TileDB is a new storage engine for multi-dimensional data like that commonly used for machine learning and genetics analysis. It supports both sparse and dense arrays, and it can use blob storage (such as S3) as the storage backend. The post describes the motivation for a new storage system and how they've optimized the implementation for efficiency and cross-language support.
https://medium.com/tiledb/tiledb-a-database-for-data-scientists-ddf4ca122176
Criteo writes about how they enforce data quality for the 450 PB that they have in their Hadoop clusters. Since Hadoop is very flexible/lenient in the data it supports, they run over 7,000 data quality checks per day using statistics from Hive tables and custom queries.
https://medium.com/criteo-labs/big-data-quality-at-criteo-66c6bd0d42d8
The morning paper covers Taiji, Facebook's system for routing and managing global traffic. Taiji takes advantage of knowledge about a user and their social connections to efficiently route traffic from the edge to a data center. By better taking advantage of caches, changes in routing materially reduce load on the database systems.
https://blog.acolyer.org/2019/11/15/facebook-taiji/
An in-depth look at how the various Apache Kafka configurations for Producers and Consumers can be used to avoid duplicate data in your pipeline. This post describes the various types of delivery guarantees, how they map to Kafka settings, and how to put them all together. The post has lots of great diagrams, and there's a bonus section on other strategies for deduplicating data in a stream.
https://medium.com/@andy.bryant/processing-guarantees-in-kafka-12dd2e30be0e
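One common way to deduplicate a stream, sketched here as an assumption rather than as the linked post's method, is to have the consumer remember recently seen message IDs and skip repeats. This helps when at-least-once delivery causes a message to be redelivered after a retry. The `Deduplicator` class, the `process_stream` function, and the `(message_id, payload)` tuple shape are all hypothetical names for illustration; no Kafka client is involved.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers the most recent `capacity` message IDs and flags repeats."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_new(self, message_id):
        if message_id in self._seen:
            return False
        self._seen[message_id] = True
        if len(self._seen) > self.capacity:
            # Evict the oldest remembered ID to bound memory use.
            self._seen.popitem(last=False)
        return True


def process_stream(messages, dedup):
    """Yield the payload of each message whose ID has not been seen yet."""
    for message_id, payload in messages:
        if dedup.is_new(message_id):
            yield payload


# Hypothetical stream in which message "a1" is delivered twice.
stream = [("a1", "play"), ("a2", "pause"), ("a1", "play"), ("a3", "skip")]
result = list(process_stream(stream, Deduplicator(capacity=100)))
print(result)
```

The bounded window is a trade-off: a duplicate that arrives after its ID has been evicted will slip through, so the capacity should cover the longest redelivery delay you expect.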
Events
Curated by Datadog ( http://www.datadog.com )
California
Apache Kafka and Immutable Infrastructure + An Introductory Kafka Talk! (Santa Monica) - Tuesday, November 19
https://www.meetup.com/Los-Angeles-Apache-Kafka-Meetup/events/265903100/
Learning How to Build Event Streaming Applications with Pac-Man (San Diego) - Tuesday, November 19
https://www.meetup.com/San-Diego-Java-Users-Group/events/266167833/
Bay Area Apache Flink Meetup @ Cloudera (Palo Alto) - Wednesday, November 20
https://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/266226960/
Building Data Lineage, Data Orchestration, and Data Mesh (Mountain View) - Thursday, November 21
https://www.meetup.com/SF-Big-Analytics/events/265730920/
Washington
The Rise of Apache Flink and Stream Processing (Bellevue) - Wednesday, November 20
https://www.meetup.com/Big-Data-Bellevue-BDB/events/262650434/
Texas
Foreign-Key Joins with Kafka Streams (Plano) - Thursday, November 21
https://www.meetup.com/Dallas-Kafka/events/265880083/
Wisconsin
Introduction to Apache Spark (Madison) - Monday, November 18
https://www.meetup.com/Women-in-Big-Data-Wisconsin-Chapter/events/265857508/
IRELAND
Data Science and Engineering Club @ Zalando (Dublin) - Thursday, November 21
https://www.meetup.com/Data-Science-and-Engineering-Club/events/265937165/
FRANCE
Fine-Tuning Kafka: Let's Look under the Hood! (Paris) - Wednesday, November 20
https://www.meetup.com/PerfUG/events/254607871/
BELGIUM
Apache Beam Intro and Use Cases (Antwerpen) - Friday, November 22
https://www.meetup.com/Belgium-Apache-Beam-Meetup/events/264933301/
DENMARK
Kafka Is More ACID Than Your Database (Kongens Lyngby) - Tuesday, November 19
https://www.meetup.com/Copenhagen-Javagruppen-Meetup/events/266441150/
SWITZERLAND
Apache Beam @ Ricardo.ch + Portable Schema + BeamSQL (Zurich) - Tuesday, November 19
https://www.meetup.com/Zurich-Apache-Beam-Meetup/events/265529665/
CZECH REPUBLIC
Apache Kafka as a Database (Brno-stred) - Wednesday, November 20
https://www.meetup.com/Brno-Java-Meetup/events/265476406/
SINGAPORE
Event-Driven Microservices with CQRS Using Axon: Excitingly Boring (Singapore) - Wednesday, November 20
https://www.meetup.com/singajug/events/266381397/
AUSTRALIA
Melbourne Data Engineering Meetup (Docklands) - Wednesday, November 20
https://www.meetup.com/Melbourne-Data-Engineering-Meetup/events/265892291/