Data Eng Weekly #356

Feb 24, 2020

This week's issue has posts from Scribd and Slack on Apache Airflow, using Envoy with Apache Kafka, the open sourcing of LinkedIn's DataHub, Elasticsearch in production, and Apache Flink's new SQL DDL support. Also a post on the data infrastructure behind the "Spotify Wrapped Campaign" and an article with advice on running a data team.

Slack wrote about their experiences upgrading Apache Airflow from version 1.8 (which they had been running for two years) to version 1.10. The post describes the upgrade strategies that they considered, the steps they took (many around schema changes and backing up the metadata database), how they tested the upgrade, and some issues they found after the upgrade.

https://slack.engineering/reliably-upgrading-apache-airflow-at-slacks-scale-2a31f3d03a06

Spotify writes about their large scale analysis of a decade of playback data to power their "Spotify Wrapped Campaign" at the end of 2019. They performed a number of intermediate jobs, which allowed them to more quickly iterate and verify the quality of outputs. They talk about some of the changes they made since 2018's campaign, including changing the way that they store data in order to avoid large amounts of shuffling (and thus higher processing costs).

https://labs.spotify.com/2020/02/18/wrapping-up-the-decade-a-data-story/

Scribd writes about their journey from a home grown workflow engine to Apache Airflow. Their main DAG has over 1400 tasks (there's a fun visualization in the post), so it's a big undertaking to make the move. The post describes the main motivators, and some of the high-level changes they've made to move to Airflow.

https://tech.scribd.com/blog/2020/modernizing-an-old-data-pipeline.html

A mix of technical and managerial advice, this post shares lessons learned from running the data team at Gitlab for a year. Technical topics include how to choose the right tools (including strategically buying some products) and investing in process/tools for onboarding. And if you're on the manager side, there's a bunch of advice about how big your team should be, how to get executive buy in, and more.

https://about.gitlab.com/blog/2020/02/10/lessons-learned-as-data-team-manager/

An introduction to using Envoy 1.13 as a reverse proxy for Apache Kafka traffic. The post describes how to configure an Envoy filter to gather metrics on requests/responses for client traffic-only or all traffic (including requests for inter-broker replication).

https://medium.com/@adam.kotwasinski/deploying-envoy-and-kafka-8aa7513ec0a0

LinkedIn has open sourced DataHub, their tool for metadata management of data platforms. While the technical details of DataHub were covered in previous posts, this article describes how LinkedIn plans to maintain the code both internally and as an open source project, and also how the features of the two versions differ.

https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p

Apache Flink 1.10 adds new SQL DDL syntax for configuring data sources and sinks. The post has some examples for defining new tables (e.g. Kafka and Elasticsearch) and details on Flink's catalog system.

https://flink.apache.org/news/2020/02/20/ddl.html

The morning paper has coverage of a paper on Microsoft's Raven, which embeds ML runtimes into SQL Server. The idea is to keep models as data in the database so that you can take advantage of features like transactions and improve performance. Pretty interesting details, including how they've released as part of public preview in Azure SQL Database Edge.

https://blog.acolyer.org/2020/02/21/extending-relational-query-processing/

An in depth look at the architecture of Elasticsearch towards the goal of planning and monitoring a production deployment.

https://facinating.tech/2020/02/22/in-depth-guide-to-running-elasticsearch-in-production/

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.

Data Eng Weekly #335

Joe Crobak

Feb 18, 2020

Lots of great stuff in this week's issue, including the release of MR3, a couple of posts on schema migration (including one on CI/CD for Redshift), how Bitbucket has scaled their databases, and more.

HERE Mobility has written about their CI/CD pipeline for Amazon Redshift. They've built a tool to apply database schema changes, validate the structure of the database (e.g. find broken views), verify redshift SQL syntax, and automatically deploy to Redshift. Lots of good tips in the post about automating validations and using version control for SQL scripts.

https://medium.com/big-data-engineering/redshift-cicd-how-we-did-it-and-why-you-should-do-it-to-e46ecf734eab

A look at the trade-offs between various types of data driven architectures, like event sourcing, change data capture, command query response segregation, and the outbox pattern. The post dives deep into the architectures, including clarifying the difference between domain events and change events. There are some useful diagrams that help to better understand how the various pieces fit together in each architecture pattern.

https://debezium.io/blog/2020/02/10/event-sourcing-vs-cdc/

The MR3 framework, which offers an alternative execution model to YARN and MapReduce for Hive/Spark/Hadoop workloads, has released version 1.0. This release includes a number of improvements for a deployment in the cloud, including improvements for S3 and Amazon EKS.

https://groups.google.com/forum/#!msg/hive-mr3/3VwpqBnZfT4/9emGzbZ9BQAJ

This post describes how to use the open source osquery project to collect data and send it to Apache Kafka as part of a security information and event management platform. The post describes the basics of osquery and how to build a custom extension, written in python, for producing data to Apache Kafka.

https://www.confluent.io/blog/siem-with-osquery-log-aggregation-and-confluent/

Github writes about how they've automated the process of applying schema migrations at scale (meaning number of tables, number of developers, and size of server fleet). Using Github Actions and the tools skeema and skeefree, they create "safe" (i.e. no direct DROP TABLE and also rewriting ALTER TABLEs to be efficient) schema migrations by comparing the current table definition in source code to what's defined in the database. The post describes both these tools (which are a bit specialized for MySQL) and the workflow (which includes chatops and github actions).

https://github.blog/2020-02-14-automating-mysql-schema-migrations-with-github-actions-and-more/

A look at performance with Ozone, the blob storage layer for HDFS. In comparison to tests run directly against HDFS, when using the TPC-DS benchmark most of the queries run faster with Ozone as the storage layer.

https://blog.cloudera.com/benchmarking-ozone-clouderas-next-generation-storage-for-cdp/

Bitbucket writes about how they've scaled their databases by moving reads to replicas. To ensure that queries go to read replicas containing up to date data, they keep track of a per-user log sequence number from Postgres (stored in Redis). They also share how their changes have improved the performance profile of their databases.

https://bitbucket.org/blog/scaling-bitbuckets-database

An intro to five data engineering projects that are worth checking out, if you're not already using them: dbt (managing SQL code), Prefect (a new workflow engine), Dask (distributed computing for Python), DVC (the "data version control"), and Great Expectations (for testing data). In addition to those, the post calls out a few other projects worth investigating.

https://medium.com/@squarecog/five-interesting-data-engineering-projects-48ffb9c9c501

Data Eng Weekly #334

Joe Crobak

Feb 10, 2020

Still catching up a bit on all the articles from my hiatus, and I pulled in posts from that time on Netflix's DBLog and the `xsv` CLI tool. In general, there’s lots of breadth in this week's issue—running Presto on Kubernetes, improvements to consumer flow control in Apache Kafka 2.4.0, building a queryable dataset with ksqlDB, Microsoft's automated analytics service for large scale deployments, and a look at migrating to CockroachDB.

Netflix writes about DBLog, their change data capture framework that integrates with MySQL, PostgreSQL and more. Probably its most novel feature is a mechanism for creating a dump of the entire database by chunking up the keyspace and processing data without a lock. DBLog also has a solution for high availability, and Netflix plans to open source it later this year.

https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b

A look at how to use the `xsv` command line tool to join data from multiple csv files. There are a few other tools recommended in the comments, too.

https://www.johndcook.com/blog/2019/12/31/sql-join-csv-files/

An overview of running Presto, its dependencies (like the Hive metastore and a relational database), and the redash query editor/visualizer on a Kubernetes cluster. The configuration, as well as code for some of the trickier bits like initializing/migrating the metastore database tables, is available on github.

https://medium.com/@joshua_robinson/presto-powered-s3-data-warehouse-on-kubernetes-aea89d2f40e8

Lightbend describes how the Alpakka framework relies on the Kafka Consumer's flow control (i.e. pause/resume) features, and the improvement in Apache Kafka 2.4.0 that drastically improves throughput and reduces bandwidth on brokers. At a high-level, the solution takes better advantage of data that is prefetched by the Kafka Consumer before `pause` is called.

https://www.lightbend.com/blog/alpakka-kafka-flow-control-optimizations

Presto added support for the SQL standard syntax for limiting query results, which is `FETCH FIRST n ROWS`. Vs the `LIMIT` syntax, this new one has some additional functionality for duplicates/ties.

https://prestosql.io/blog/2020/02/03/beyond-limit-presto-meets-offset-and-ties.html

An introduction to using ksqlDB to build a queryable dataset, including how load raw data into Apache Kafka, perform some transformations, and execute a SQL query to lookup individual results using an HTTP endpoint.

https://www.confluent.io/blog/build-materialized-cache-with-ksqldb/

This article describes what it's like to move from Postgres to CockroachDB. The main reason to switch was Cockroach's scalability, and there are some trade-offs (and new features!) that it comes with (e.g. no partial indexes or deferring of foreign key checks). Lots of good info if you're considering a similar migration.

https://www.openmymind.net/Migrating-To-CockroachDB/

A look at what the trade-offs are when it comes to sizing and density of disks for data nodes in a HDFS cluster. It seems like it's a good idea to choose more small disks, because this decreases how long it takes to detect bit rot and the time to recover from node failure/decommissioning.

https://blog.cloudera.com/disk-and-datanode-size-in-hdfs/

The Apache Flink blog has a post describing how to write unit tests for Flink operators. Stateless operators are pretty straightforward, and there are some examples for stateful and time operators.

https://flink.apache.org/news/2020/02/07/a-guide-for-unit-testing-in-apache-flink.html

Microsoft has published a paper on Gandalf, their tool for monitoring the status of deployments across Azure infrastructure. It correlates signals (logs, metrics, events) as the rollout of a new component version progresses and can stop a rollout based on what it observes. When it does, Gandalf describes which signals caused it to stop. The paper has lots of details and several case studies.

https://www.microsoft.com/en-us/research/publication/an-intelligent-end-to-end-analytics-service-for-safe-deployment-in-large-scale-cloud-infrastructure/

An overview of the architecture of Google Cloud Spanner, including details about the implementation of TrueTime, which is used to guarantee consistency across regions, and the lifecycle of reads and writes. There are lots of diagrams taken from public videos/talks on Spanner.

https://thedataguy.in/internals-of-google-cloud-spanner/

Data Eng Weekly #333

Joe Crobak

Feb 03, 2020

Hey all! it's been a while—hopefully everyone had a nice new year. I took the last ~6 weeks off, so there's quite a bit to catch up on. This week's issue has sixteen of the best articles from that time covering topics like Apache Kafka producers, distributed storage engines for Prometheus, Presto+Pinto, and Jepsen analysis of etcd. Also, Yelp writes about their Kafka architecture, and Teads writes about optimizing Spark applications using User Defined Aggregate Functions. Lots to read up on, whether you're looking for some tips to apply to your own system, new tools to try out, or learning more about how systems work under the hood.

A look back at the themes that have emerged from the last decade in technology, including a number of distributed systems items like the return of SQL and streaming. The post also predicts big areas for the next few years, like future of PaaS and Kubernetes (and also areas with no connection to distributed systems like retail, journalism, and social media).

https://medium.com/@copyconstruct/a-decade-in-review-in-tech-1cde76c9b43c

A look at the important configurations to improve throughput of your Kafka Producers, as well as the key metrics to monitor on the broker related to efficient producing.

https://www.jesseyates.com/2020/01/01/high-performance-kafka-producers.html

This post covers several solutions for running large scale prometheus deployments. These include Thanos, Cortex, M3DB, and VictoriaMetrics. There are a mix of architectures—some push and some pull as well as various trade-offs for things like cold storage backends.

https://monitoring2.substack.com/p/big-prometheus

This post describes using SQLite to replace a networked database, which is quite an interesting idea. By sharing SQLite as a datastore across containers, they're able to improve latency by over 20x for a system that's processing 500k messages per second. Several details to dig into here, and some commentary on the larger impacts of a change like this (which pushes complexity out of a data plane and into the control plane).

https://medium.com/@rbranson/sharing-sqlite-databases-across-containers-is-surprisingly-brilliant-bacb8d753054

A good look at how the Java Virtual Machine manages heap space, and how the garbage collector interacts with various memory regions in order to reclaim space. The post describes the various types of garbage collection strategy, the types of events that trigger a garbage collection, and more.

https://sematext.com/blog/java-garbage-collection/

A thorough introduction to connecting to databases from Java, covering JDBC, Hibernate, the Java Persistence API, lightweight libraries for Java SQL, and more. The post has lots of code examples, advice for when certain libraries are appropriate, and more.

https://www.marcobehler.com/guides/java-databases-jdbc-hibernate-spring-data

Presto's 2019 year in review—covering new syntax (e.g. adding comments and fetching just the first N rows), query optimizations (e.g. improvements to the cost based optimizer and lazy materialization), new connectors (elasticsearch, google sheets), and much more. The post also looks at what's next for Presto in 2020.

https://prestosql.io/blog/2020/01/01/2019-summary.html

Uber writes about how they've integrated Pinot, their real-time analytics system, with Presto for SQL queries. The article describes the architecture, and how they improved the connectors performance with predicate/limit/aggregate/more pushdown, and how it performs in practice.

https://eng.uber.com/engineering-sql-support-on-apache-pinot/

A look at instrumenting apps written in C, Java, and Golang for analysis using eBPF. The post is part of a larger series on eBPF, and there's an introduction to the main concepts at the top of the article.

https://sematext.com/blog/ebpf-userland-apps/

Quarkus, the java application framework, has a new extension to implement the outbox pattern (quite interesting to read about if you've missed earlier articles!) for change data capture. The Debezium blog has an introduction that walks through how to get started with Quarkus and the Outbox Quarkus Extension for generating events.

https://debezium.io/blog/2020/01/22/outbox-quarkus-extension/

Yelp writes about their Kafka infrastructure, which has several components. The post describes two of them in detail, the Stream Discovery and Allocation service (which enforces schemas and defines a stream as either fire and forget or acked) and "Monk Leaf" which is a service that runs locally and proxies to Kafka (implementing either the acked or fire and forget semantics). These components provide a platform that make it easy for developers to deploy applications and get data into the Kafka data pipeline.

https://engineeringblog.yelp.com/2020/01/streams-and-monk-how-yelp-approaches-kafka-in-2020.html

The third post in a series on Spark Job Optimization myths (the previous two looked at executor memory and number of executors), this post looks at why adding more memory to the Spark driver doesn't always improve performance. It has some tips that can improve the driver performance instead—like avoiding globals and avoiding expensive calls to functions like `collect()`. If you're looking at optimizing your Spark usage, this series is worth digging into.

https://www.davidmcginnis.net/post/spark-job-optimization-myth-3-i-need-more-driver-memory

A post from Teads describes both how to speed up a Spark application using a Spark User-Defined Aggregate Functions (UDAF) as well as how to optimize Spark applications in general. Their article walks through how they sped up one of their applications from 28 mins to 9 mins. Several of the optimizations are informed by the Spark execution DAG, several of which they analyze in the post.

https://medium.com/teads-engineering/apache-spark-udaf-could-be-an-option-c2bc25298276

An overview of the how the Apache Kafka idempotent producer works, including pointers to relevant pieces of the code that generate ids, complete batches, and more. The post details how the producer has been extended since the original implementation to support multiple in flight requests.

https://www.waitingforcode.com/apache-kafka/apache-kafka-idempotent-producer/read

Jepsen has a post on etcd, which is the key-value store used by Kubernetes and other distributed systems. The article describes several tests of correctness, which verified strict serializable operations and correct delivery of watches. They found some issues with locks, which the etcd team is addressing (more details in a companion blog post). As always, it's great to read about verification of distributed systems—the post brings together practice and theory in a way that builds extra context on both.

https://jepsen.io/analyses/etcd-3.4.3