Data Eng Weekly #335
Lots of great stuff in this week's issue, including the release of MR3, a couple of posts on schema migration (including one on CI/CD for Redshift), how Bitbucket has scaled their databases, and more.
HERE Mobility has written about their CI/CD pipeline for Amazon Redshift. They've built a tool to apply database schema changes, validate the structure of the database (e.g. find broken views), verify Redshift SQL syntax, and automatically deploy changes to Redshift. Lots of good tips in the post about automating validations and keeping SQL scripts in version control.
https://medium.com/big-data-engineering/redshift-cicd-how-we-did-it-and-why-you-should-do-it-to-e46ecf734eab
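This isn't HERE Mobility's tool, but here's a minimal sketch of one of the validations the post describes: detecting broken views by asking Redshift to EXPLAIN each one. It assumes psycopg2 and a Redshift connection; the connection string is hypothetical.

```python
# A sketch of a "find broken views" check, assuming psycopg2 and Redshift's
# Postgres-compatible pg_views catalog. Connection details are placeholders.
import psycopg2

def find_broken_views(conn):
    """Return (schema, view) pairs whose definitions no longer compile."""
    broken = []
    with conn.cursor() as cur:
        cur.execute("""
            SELECT schemaname, viewname
            FROM pg_views
            WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
        """)
        views = cur.fetchall()
    for schema, view in views:
        try:
            with conn.cursor() as cur:
                # EXPLAIN forces Redshift to resolve every referenced object
                # without scanning any data.
                cur.execute(f'EXPLAIN SELECT * FROM "{schema}"."{view}"')
        except psycopg2.Error:
            conn.rollback()  # a failed statement aborts the transaction
            broken.append((schema, view))
    return broken

if __name__ == "__main__":
    conn = psycopg2.connect("host=my-cluster.example.com dbname=analytics user=ci password=...")
    for schema, view in find_broken_views(conn):
        print(f"Broken view: {schema}.{view}")
```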
A look at the trade-offs between various data-driven architectures: event sourcing, change data capture, command query responsibility segregation (CQRS), and the outbox pattern. The post dives deep into each architecture, including clarifying the difference between domain events and change events, and has some useful diagrams that show how the pieces fit together in each pattern.
https://debezium.io/blog/2020/02/10/event-sourcing-vs-cdc/
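As a quick illustration of the outbox pattern the post discusses (this is a sketch, not code from the post), the domain event is written to an "outbox" table in the same transaction as the state change, so a CDC tool such as Debezium can publish it to Kafka. Table and column names below are hypothetical.

```python
# Outbox pattern sketch: the event insert commits (or rolls back) together with
# the business write, so downstream CDC never sees one without the other.
import json
import uuid
import psycopg2

def place_order(conn, customer_id, total_cents):
    with conn:  # psycopg2 commits both inserts atomically, or rolls both back
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (customer_id, total_cents) VALUES (%s, %s) RETURNING id",
                (customer_id, total_cents),
            )
            order_id = cur.fetchone()[0]
            # The outbox row *is* the domain event; CDC turns it into a message.
            cur.execute(
                """INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)
                   VALUES (%s, %s, %s, %s, %s)""",
                (str(uuid.uuid4()), "order", str(order_id), "OrderPlaced",
                 json.dumps({"order_id": order_id, "total_cents": total_cents})),
            )
```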
The MR3 framework, which offers an alternative execution model to YARN and MapReduce for Hive/Spark/Hadoop workloads, has released version 1.0. The release includes a number of improvements for cloud deployments, including ones for S3 and Amazon EKS.
https://groups.google.com/forum/#!msg/hive-mr3/3VwpqBnZfT4/9emGzbZ9BQAJ
This post shows how to use the open source osquery project to collect data and send it to Apache Kafka as part of a security information and event management (SIEM) platform. It covers the basics of osquery and how to build a custom extension, written in Python, that produces data to Apache Kafka.
https://www.confluent.io/blog/siem-with-osquery-log-aggregation-and-confluent/
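This isn't the extension built in the post, but here's a minimal sketch of the same idea: run an osquery query and produce each result row to Kafka. It assumes osqueryi is installed locally and the confluent-kafka Python client; the topic name is hypothetical.

```python
# Sketch: query local system state with osqueryi and forward rows to Kafka.
import json
import subprocess
from confluent_kafka import Producer

QUERY = "SELECT pid, name, path FROM processes LIMIT 10;"
TOPIC = "osquery-results"  # hypothetical topic name

def run_osquery(query):
    # osqueryi --json prints the result set as a JSON array of row objects
    out = subprocess.run(["osqueryi", "--json", query],
                         capture_output=True, check=True, text=True)
    return json.loads(out.stdout)

def main():
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    for row in run_osquery(QUERY):
        producer.produce(TOPIC, value=json.dumps(row).encode("utf-8"))
    producer.flush()

if __name__ == "__main__":
    main()
```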
GitHub writes about how they've automated schema migrations at scale (in terms of number of tables, number of developers, and size of the server fleet). Using GitHub Actions and the tools skeema and skeefree, they generate "safe" schema migrations (no direct DROP TABLE, and ALTER TABLEs rewritten to run efficiently) by comparing the table definitions in source control to what's defined in the database. The post describes both tools (which are somewhat specialized for MySQL) and the workflow, which includes chatops and GitHub Actions.
https://github.blog/2020-02-14-automating-mysql-schema-migrations-with-github-actions-and-more/
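skeema and skeefree do far more than this, but the core idea of comparing a checked-in table definition with what the live database reports can be sketched in a few lines. This assumes PyMySQL and a hypothetical schemas/ directory containing one .sql file per table.

```python
# Sketch of schema-drift detection: diff CREATE TABLE statements in source
# control against SHOW CREATE TABLE output from the live MySQL database.
import pathlib
import pymysql

def live_definition(conn, table):
    with conn.cursor() as cur:
        cur.execute(f"SHOW CREATE TABLE `{table}`")
        return cur.fetchone()[1]  # second column is the CREATE TABLE statement

def normalize(sql):
    # Crude normalization for illustration only; the real tools parse the DDL.
    return " ".join(sql.lower().split())

def drifted_tables(conn, schema_dir="schemas"):
    drifted = []
    for path in pathlib.Path(schema_dir).glob("*.sql"):
        table = path.stem
        if normalize(path.read_text()) != normalize(live_definition(conn, table)):
            drifted.append(table)
    return drifted

if __name__ == "__main__":
    conn = pymysql.connect(host="127.0.0.1", user="ci", password="...", database="app")
    print(drifted_tables(conn))
```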
A look at the performance of Ozone, the object store for the Hadoop ecosystem. Compared to tests run directly against HDFS, most queries in the TPC-DS benchmark run faster with Ozone as the storage layer.
https://blog.cloudera.com/benchmarking-ozone-clouderas-next-generation-storage-for-cdp/
Bitbucket writes about how they've scaled their databases by moving reads to replicas. To ensure that queries only go to read replicas with up-to-date data, they track a per-user Postgres log sequence number (LSN), stored in Redis. They also share how the change improved the performance profile of their databases.
https://bitbucket.org/blog/scaling-bitbuckets-database
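This isn't Bitbucket's implementation, but a minimal sketch of the routing idea: after a write, record the primary's WAL position (LSN) per user in Redis; before a read, use the replica only if it has replayed past that LSN. Connection details and the Redis key format are hypothetical.

```python
# Sketch of LSN-based read routing with Postgres replicas and Redis.
import psycopg2
import redis

r = redis.Redis()

def record_write_lsn(primary_conn, user_id):
    with primary_conn.cursor() as cur:
        cur.execute("SELECT pg_current_wal_lsn()")
        r.set(f"user-lsn:{user_id}", cur.fetchone()[0])

def replica_is_fresh(replica_conn, user_id):
    target = r.get(f"user-lsn:{user_id}")
    if target is None:
        return True  # no recent write recorded for this user
    with replica_conn.cursor() as cur:
        # A non-negative diff means the replica has replayed at least up to
        # the LSN recorded at the user's last write.
        cur.execute("SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s) >= 0",
                    (target.decode(),))
        return cur.fetchone()[0]

def connection_for_read(primary_conn, replica_conn, user_id):
    return replica_conn if replica_is_fresh(replica_conn, user_id) else primary_conn

if __name__ == "__main__":
    primary = psycopg2.connect("host=pg-primary.example.com dbname=app")
    replica = psycopg2.connect("host=pg-replica.example.com dbname=app")
    conn = connection_for_read(primary, replica, user_id=42)
```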
An intro to five data engineering projects worth checking out if you're not already using them: dbt (for managing SQL code), Prefect (a new workflow engine), Dask (distributed computing for Python), DVC (data version control), and Great Expectations (for testing data). The post also calls out a few other projects worth investigating.
https://medium.com/@squarecog/five-interesting-data-engineering-projects-48ffb9c9c501
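For a quick taste of one of the five, here's a small sketch using Great Expectations to assert basic data-quality expectations against a CSV. The file name and column names are hypothetical, and the API details vary by version; see the project docs for the full workflow.

```python
# Sketch: basic data-quality checks with Great Expectations on a local CSV.
import great_expectations as ge

# Reads the CSV into a pandas DataFrame with expectation methods attached.
df = ge.read_csv("orders.csv")

# Each expectation returns a validation result indicating success or failure.
print(df.expect_column_values_to_not_be_null("order_id"))
print(df.expect_column_values_to_be_between("total_cents", min_value=0))
```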
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.