Data Eng Weekly #356

This week's issue has posts from Scribd and Slack on Apache Airflow, using Envoy with Apache Kafka, the open sourcing of LinkedIn's DataHub, Elasticsearch in production, and Apache Flink's new SQL DDL support. Also a post on the data infrastructure behind the "Spotify Wrapped Campaign" and an article with advice on running a data team.

Slack wrote about their experiences upgrading Apache Airflow from version 1.8 (which they had been running for two years) to version 1.10. The post describes the upgrade strategies that they considered, the steps they took (many around schema changes and backing up the metadata database), how they tested the upgrade, and some issues they found after the upgrade.

Spotify writes about their large-scale analysis of a decade of playback data to power their "Spotify Wrapped Campaign" at the end of 2019. They split the work into a number of intermediate jobs, which allowed them to iterate more quickly and verify the quality of outputs. They talk about some of the changes they made since 2018's campaign, including changing the way that they store data in order to avoid large amounts of shuffling (and thus higher processing costs).
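To illustrate why storage layout affects shuffling, here is a minimal Python sketch. It is not Spotify's actual design: the bucket count, the `bucket_for` helper, and the sample data are all hypothetical. It assumes records are grouped by a hash of the user ID at write time, so a later per-user aggregation can process each bucket independently without moving records between buckets.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    # Use a stable hash (crc32) so the same user always lands in the same bucket.
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


def write_bucketed(plays):
    # Group records by bucket at "storage" time.
    buckets = defaultdict(list)
    for user_id, minutes in plays:
        buckets[bucket_for(user_id)].append((user_id, minutes))
    return buckets


def minutes_per_user(buckets):
    # Each bucket is aggregated on its own. Because a user never spans
    # two buckets, no cross-bucket exchange (shuffle) is needed.
    totals = {}
    for records in buckets.values():
        local = defaultdict(int)
        for user_id, minutes in records:
            local[user_id] += minutes
        totals.update(local)
    return totals


plays = [("alice", 3), ("bob", 5), ("alice", 4), ("carol", 2), ("bob", 1)]
totals = minutes_per_user(write_bucketed(plays))
print(totals)
```

The trade-off in this sketch is that the grouping cost is paid once when the data is written, rather than every time a job reads it.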

Scribd writes about their journey from a homegrown workflow engine to Apache Airflow. Their main DAG has over 1,400 tasks (there's a fun visualization in the post), so it's a big undertaking to make the move. The post describes the main motivators and some of the high-level changes they've made to move to Airflow.

A mix of technical and managerial advice, this post shares lessons learned from running the data team at GitLab for a year. Technical topics include how to choose the right tools (including strategically buying some products) and investing in processes and tools for onboarding. And if you're on the manager side, there's a bunch of advice about how big your team should be, how to get executive buy-in, and more.

An introduction to using Envoy 1.13 as a reverse proxy for Apache Kafka traffic. The post describes how to configure an Envoy filter to gather metrics on requests and responses, either for client traffic only or for all traffic (including requests for inter-broker replication).

LinkedIn has open sourced DataHub, their tool for metadata management of data platforms. While the technical details of DataHub were covered in previous posts, this article describes how LinkedIn plans to maintain the code both internally and as an open source project, and also how the features of the two versions differ.

Apache Flink 1.10 adds new SQL DDL syntax for configuring data sources and sinks. The post has some examples for defining new tables (e.g. Kafka and Elasticsearch) and details on Flink's catalog system.
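As a rough idea of what such a table definition looks like, the Python sketch below renders a `CREATE TABLE ... WITH (...)` statement as a string. The table name, columns, topic, and broker address are made up, and the `connector.*` and `format.*` property keys are my recollection of the Flink 1.10 Kafka connector options, so check them against the Flink documentation before relying on them. The `render_create_table` helper is hypothetical and not part of Flink.

```python
def render_create_table(name, columns, options):
    """Render a CREATE TABLE ... WITH (...) statement as a string."""
    column_lines = ",\n".join(f"  {col} {typ}" for col, typ in columns)
    option_lines = ",\n".join(f"  '{key}' = '{value}'" for key, value in options.items())
    return f"CREATE TABLE {name} (\n{column_lines}\n) WITH (\n{option_lines}\n)"


# Hypothetical Kafka-backed table; property keys are assumed from Flink 1.10.
kafka_ddl = render_create_table(
    "page_views",
    [("user_id", "BIGINT"), ("url", "STRING"), ("view_time", "TIMESTAMP(3)")],
    {
        "connector.type": "kafka",
        "connector.version": "universal",
        "connector.topic": "page_views",
        "connector.properties.bootstrap.servers": "localhost:9092",
        "format.type": "json",
    },
)
print(kafka_ddl)
```

The resulting statement would then be submitted through Flink's SQL client or Table API, which registers the table in the active catalog.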

The Morning Paper has coverage of a paper on Microsoft's Raven, which embeds ML runtimes into SQL Server. The idea is to keep models as data in the database so that you can take advantage of features like transactions and improve performance. There are some pretty interesting details, including how it has been released as part of a public preview in Azure SQL Database Edge.

An in-depth look at the architecture of Elasticsearch, aimed at planning and monitoring a production deployment.

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.