Data Eng Weekly #334

Still catching up a bit on all the articles from my hiatus, and I pulled in posts from that time on Netflix's DBLog and the `xsv` CLI tool. In general, there’s lots of breadth in this week's issue—running Presto on Kubernetes, improvements to consumer flow control in Apache Kafka 2.4.0, building a queryable dataset with ksqlDB, Microsoft's automated analytics service for large scale deployments, and a look at migrating to CockroachDB.


Netflix writes about DBLog, their change data capture framework that integrates with MySQL, PostgreSQL, and more. Probably its most novel feature is a mechanism for dumping an entire database by chunking up the keyspace and processing the chunks without taking a lock. DBLog also has a high availability story, and Netflix plans to open source it later this year.

https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b
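The chunked-dump idea can be sketched as a keyset-paginated scan: read the table in primary-key order, a bounded chunk at a time, so no long-running lock or transaction spans the whole table. A minimal Python sketch against SQLite (the table, column, and chunk size are made up; DBLog's real implementation is more involved, interleaving chunk selects with ongoing log processing):

```python
import sqlite3

def dump_in_chunks(conn, table, key_col, chunk_size):
    """Yield rows in primary-key order one chunk at a time, so no
    long-running lock or transaction spans the whole table."""
    last_key = None
    while True:
        if last_key is None:
            query = f"SELECT * FROM {table} ORDER BY {key_col} LIMIT ?"
            rows = conn.execute(query, (chunk_size,)).fetchall()
        else:
            query = (f"SELECT * FROM {table} WHERE {key_col} > ? "
                     f"ORDER BY {key_col} LIMIT ?")
            rows = conn.execute(query, (last_key, chunk_size)).fetchall()
        if not rows:
            return
        yield rows
        last_key = rows[-1][0]
        # DBLog processes log events here between chunks, reconciling
        # chunk rows against concurrent writes.

# Demo against an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 11)])
chunks = list(dump_in_chunks(conn, "users", "id", 4))
print([len(c) for c in chunks])  # [4, 4, 2]
```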

A look at how to use the `xsv` command line tool to join data from multiple CSV files. There are a few other tools recommended in the comments, too.

https://www.johndcook.com/blog/2019/12/31/sql-join-csv-files/
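For comparison, the kind of inner join that `xsv join` performs can be sketched with Python's standard library (the file contents, column names, and join key here are invented for the demo):

```python
import csv
import io

# Two small CSV "files" sharing an id column, inlined for the demo.
people = io.StringIO("id,name\n1,Alice\n2,Bob\n3,Carol\n")
emails = io.StringIO("id,email\n1,alice@example.com\n3,carol@example.com\n")

# Index the right side by the join key, then stream the left side.
right = {row["id"]: row for row in csv.DictReader(emails)}
joined = [
    {**row, "email": right[row["id"]]["email"]}
    for row in csv.DictReader(people)
    if row["id"] in right  # inner join: drop unmatched rows
]

print([(r["name"], r["email"]) for r in joined])
# [('Alice', 'alice@example.com'), ('Carol', 'carol@example.com')]
```

With xsv itself, the equivalent is a one-liner along the lines of `xsv join id people.csv id emails.csv`.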

An overview of running Presto, its dependencies (like the Hive metastore and a relational database), and the Redash query editor/visualizer on a Kubernetes cluster. The configuration, as well as code for some of the trickier bits like initializing/migrating the metastore database tables, is available on GitHub.

https://medium.com/@joshua_robinson/presto-powered-s3-data-warehouse-on-kubernetes-aea89d2f40e8

Lightbend describes how the Alpakka framework relies on the Kafka Consumer's flow control (i.e. pause/resume) features, and the improvement in Apache Kafka 2.4.0 that drastically improves throughput and reduces bandwidth on brokers. At a high level, the solution takes better advantage of data that is prefetched by the Kafka Consumer before `pause` is called.

https://www.lightbend.com/blog/alpakka-kafka-flow-control-optimizations
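The gist of the change can be modeled in a few lines: previously, records already prefetched for a partition were thrown away when it was paused, so even a momentary pause/resume cycle (which Alpakka's flow control performs constantly) forced a refetch from the broker. A toy Python model of the two behaviors (the classes and numbers are invented for illustration; this is not the Kafka consumer API):

```python
# Toy model: keep, rather than discard, records already prefetched
# for a partition when it is paused.

class ToyConsumer:
    def __init__(self, retain_on_pause):
        self.retain_on_pause = retain_on_pause
        self.buffer = []   # records prefetched from the broker
        self.fetches = 0   # round trips to the broker

    def _fetch(self):
        self.fetches += 1
        self.buffer = [f"rec-{self.fetches}-{i}" for i in range(5)]

    def poll(self):
        if not self.buffer:
            self._fetch()
        return self.buffer.pop(0)

    def pause(self):
        if not self.retain_on_pause:
            self.buffer = []  # old behavior: prefetched records discarded

    def resume(self):
        pass

fetch_counts = {}
for retain in (False, True):
    c = ToyConsumer(retain)
    c.poll()            # prefetches 5 records, consumes 1
    c.pause()
    c.resume()          # a momentary pause, as Alpakka's flow control does
    for _ in range(4):
        c.poll()        # consume the remaining 4
    fetch_counts[retain] = c.fetches

print(fetch_counts)  # {False: 2, True: 1}
```

With the buffer retained across the pause, the second fetch (and its broker bandwidth) disappears entirely.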

Presto added support for the SQL standard syntax for limiting query results, `FETCH FIRST n ROWS`. Compared to the `LIMIT` syntax, the new one adds the `WITH TIES` option, which also returns any rows that tie with the last row in the result.

https://prestosql.io/blog/2020/02/03/beyond-limit-presto-meets-offset-and-ties.html
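The difference is easiest to see on tied values. A small Python sketch of the two semantics over rows already sorted by score (data invented for illustration):

```python
# Rows pre-sorted by score descending, as the query's ORDER BY would
# produce them.
rows = [("a", 90), ("b", 80), ("c", 80), ("d", 70)]
n = 2

# LIMIT 2 (same as FETCH FIRST 2 ROWS ONLY): exactly two rows.
limit_rows = rows[:n]

# FETCH FIRST 2 ROWS WITH TIES: also keep rows that tie with the
# last returned row on the ORDER BY key.
cutoff = rows[n - 1][1]
ties_rows = [r for i, r in enumerate(rows) if i < n or r[1] == cutoff]

print(limit_rows)  # [('a', 90), ('b', 80)]
print(ties_rows)   # [('a', 90), ('b', 80), ('c', 80)]
```

With `LIMIT`, which of the two 80-point rows you get is arbitrary; `WITH TIES` returns both, making the result deterministic with respect to the sort key.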

An introduction to using ksqlDB to build a queryable dataset, including how to load raw data into Apache Kafka, perform some transformations, and execute a SQL query to look up individual results via an HTTP endpoint.

https://www.confluent.io/blog/build-materialized-cache-with-ksqldb/

This article describes what it's like to move from Postgres to CockroachDB. The main reason to switch was CockroachDB's scalability, and the migration comes with some trade-offs (and new features!), such as no partial indexes or deferred foreign key checks. Lots of good info if you're considering a similar migration.

https://www.openmymind.net/Migrating-To-CockroachDB/

A look at the trade-offs in disk sizing and density for DataNodes in an HDFS cluster. The takeaway: prefer a larger number of smaller disks, since that shortens both the time to detect bit rot and the time to recover from node failure or decommissioning.

https://blog.cloudera.com/disk-and-datanode-size-in-hdfs/
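A back-of-the-envelope calculation shows why: the time to read a disk end to end, which bounds both a full bit-rot scan and draining the disk for decommissioning, grows linearly with its capacity. Illustrative numbers (the 100 MB/s scan rate is an assumption, not a figure from the article):

```python
# Time to read a disk end to end at a fixed sequential rate.
SCAN_MB_PER_S = 100  # assumed sustained sequential read rate

def full_read_hours(disk_tb):
    return disk_tb * 1024 * 1024 / SCAN_MB_PER_S / 3600

for tb in (2, 8, 16):
    print(f"{tb:>2} TB disk: {full_read_hours(tb):5.1f} h to read end to end")
```

So a DataNode holding 24 TB as twelve 2 TB disks can scan all of them in parallel in under six hours, while the same capacity as three 8 TB disks takes roughly four times as long per disk.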

The Apache Flink blog has a post describing how to write unit tests for Flink operators. Stateless operators are pretty straightforward, and there are some examples for stateful and time operators.

https://flink.apache.org/news/2020/02/07/a-guide-for-unit-testing-in-apache-flink.html

Microsoft has published a paper on Gandalf, their tool for monitoring the status of deployments across Azure infrastructure. It correlates signals (logs, metrics, events) as the rollout of a new component version progresses and can stop a rollout based on what it observes. When it does, Gandalf describes which signals caused it to stop. The paper has lots of details and several case studies.

https://www.microsoft.com/en-us/research/publication/an-intelligent-end-to-end-analytics-service-for-safe-deployment-in-large-scale-cloud-infrastructure/

An overview of the architecture of Google Cloud Spanner, including details about the implementation of TrueTime, which is used to guarantee consistency across regions, and the lifecycle of reads and writes. There are lots of diagrams taken from public videos/talks on Spanner.

https://thedataguy.in/internals-of-google-cloud-spanner/
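The core TrueTime trick, commit wait, is small enough to sketch: take the commit timestamp from the top of the clock-uncertainty interval, then block until the bottom of the interval has passed it, so the timestamp is definitely in the past on every replica's clock before the commit becomes visible. A toy Python model (the 2 ms uncertainty is invented; real TrueTime derives its bounds from GPS and atomic clocks):

```python
import time

EPSILON = 0.002  # assumed clock uncertainty (2 ms); TrueTime's is dynamic

def tt_now():
    """Toy TrueTime: an interval guaranteed to contain true time."""
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)

def commit_wait():
    """Pick s = TT.now().latest, then wait until TT.now().earliest > s."""
    s = tt_now()[1]
    while tt_now()[0] <= s:
        time.sleep(EPSILON / 4)
    return s

start = time.monotonic()
commit_wait()
waited = time.monotonic() - start
print(f"commit wait took {waited * 1000:.1f} ms (~2x the uncertainty)")
```

The wait costs roughly twice the clock uncertainty per commit, which is why Spanner invests so heavily in keeping that uncertainty small.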


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.