Data Eng Weekly #321

25 August 2019

If you're reading this, then the transition to the new email service is a success! You should expect the same great content with a slightly new look.

As for this week's issue, there's coverage of some tools (dbt, Debezium for MySQL), distributed systems architecture (the Databricks Delta Lake transaction log, Timescale's distributed time series DB, and an overview of consistency and isolation levels), and posts on RocksDB and Twitter's new open source telemetry agent. There should be something good for everyone!


Technical

This tutorial describes how to enable the MySQL binary log for streaming change data capture (i.e. producing a record for each insert, update, and delete on a MySQL table) to Apache Kafka using Debezium.

https://blog.clairvoyantsoft.com/mysql-cdc-with-apache-kafka-and-debezium-3d45c00762e4
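To give a flavor of the setup, a Debezium MySQL connector is registered with Kafka Connect via a JSON config roughly like the following (a hedged sketch: hostnames, names, and credentials are placeholders, and the exact property set varies by Debezium version; see the tutorial for the details):

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.com",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "placeholder",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.whitelist": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}
```

Note that the MySQL server itself needs row-level binary logging enabled, which is the part the tutorial walks through.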

Klarna writes about Diftong, their tool for validating changes to workflows by comparing the data sets produced before and after the changes. Diftong is a general-purpose tool that works with any two data sets sharing a schema, deduplicating the data and calculating row- and column-level statistics. There's a full paper on the tool and how it's used at Klarna if you want to go deeper than this post.

https://engineering.klarna.com/how-we-built-a-tool-for-validating-big-data-workflows-170c196a4493
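The column-statistics idea is simple enough to sketch in a few lines of toy Python (function names are mine, and this is a caricature of Diftong, not its actual implementation): compute per-column summaries for each data set and report the columns whose summaries differ.

```python
def column_stats(rows, columns):
    """Per-column summary: non-null count, distinct count, null count."""
    stats = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        stats[col] = {
            "count": len(non_null),
            "distinct": len(set(non_null)),
            "nulls": len(values) - len(non_null),
        }
    return stats

def diff_stats(before, after, columns):
    """Columns whose summary statistics changed between the two data sets."""
    a, b = column_stats(before, columns), column_stats(after, columns)
    return {col: (a[col], b[col]) for col in columns if a[col] != b[col]}

before = [{"id": 1, "amount": 10}, {"id": 2, "amount": None}]
after = [{"id": 1, "amount": 10}, {"id": 2, "amount": 12}]
print(diff_stats(before, after, ["id", "amount"]))
```

The appeal of this style of validation is that it needs nothing from the workflow beyond a shared schema, which is what makes Diftong general purpose.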

The Delta Lake framework maintains a transaction log alongside a data set to provide atomicity. The transaction log is stored as JSON with each file representing a commit. This post dives into the details of this implementation, including optimizations using checkpoints, optimistic concurrency control, and handling of conflicts.

https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
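The core mechanism is easy to illustrate: readers replay the ordered commit files to compute the current set of data files. Here's a simplified Python sketch that models only "add" and "remove" actions (the real Delta protocol has more action types, so treat this as an illustration):

```python
import json

def replay_log(commit_files):
    """Replay Delta-style commit files in order to compute the live file set.

    Each commit file is a JSON-lines string of actions; only 'add' and
    'remove' actions are modeled here (a simplification of the protocol).
    """
    live = set()
    for commit in commit_files:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

# Commit 0 adds two files; commit 1 compacts them into one.
commit0 = '{"add": {"path": "part-0.parquet"}}\n{"add": {"path": "part-1.parquet"}}'
commit1 = (
    '{"remove": {"path": "part-0.parquet"}}\n'
    '{"remove": {"path": "part-1.parquet"}}\n'
    '{"add": {"path": "part-2.parquet"}}'
)
print(replay_log([commit0, commit1]))  # {'part-2.parquet'}
```

The checkpoint optimization the post covers is essentially a periodic snapshot of this replayed state so readers don't have to start from commit zero.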

Timescale writes about their distributed time series database built on PostgreSQL, which is under development and in private beta. The post describes how they use "chunking" rather than "sharding" to distribute data across nodes in the cluster, presents the high level architecture (access and data nodes), and demonstrates how the system handles inserts and queries.

https://blog.timescale.com/blog/building-a-distributed-time-series-database-on-postgresql/
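The chunking idea can be caricatured in a few lines of Python: rows are partitioned into chunks by fixed time intervals (optionally combined with a hash of the series key for space partitioning), and each chunk is pinned to a data node. The names and the hash-based placement below are my simplification, not TimescaleDB's actual placement logic:

```python
import zlib
from datetime import datetime, timedelta, timezone

CHUNK_INTERVAL = timedelta(days=1)
DATA_NODES = ["data-node-0", "data-node-1", "data-node-2"]

def chunk_id(ts):
    """Time-based chunking: each fixed interval of time maps to one chunk."""
    return int(ts.timestamp() // CHUNK_INTERVAL.total_seconds())

def route(ts, series):
    """Place a row: pick the chunk by time, then pin that chunk (combined
    with the series key, for space partitioning) to a data node."""
    cid = chunk_id(ts)
    node = DATA_NODES[zlib.crc32(f"{cid}:{series}".encode()) % len(DATA_NODES)]
    return cid, node

t1 = datetime(2019, 8, 25, 1, tzinfo=timezone.utc)
t2 = datetime(2019, 8, 25, 23, tzinfo=timezone.utc)
print(route(t1, "cpu"), route(t2, "cpu"))  # same chunk, same node
```

The access node plays the role of `route` here, fanning inserts out to data nodes and pruning chunks at query time.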

The Dremio blog covers a relatively new feature of Apache Arrow, the Flight data transfer protocol. Flight is built on gRPC and aims to saturate networks while also having low CPU overhead by using the Arrow in-memory data representation (i.e. no deserialization or serialization).

https://www.dremio.com/understanding-apache-arrow-flight/

Rezolus is a new open source telemetry agent from Twitter. It's written in Rust, and it implements sophisticated data collection and sampling in order to detect short-lived (e.g. under 10 seconds) anomalous events.

https://blog.twitter.com/engineering/en_us/topics/open-source/2019/introducing-rezolus.html

Rockset writes about how they improved performance of bulk loading data into RocksDB. They parallelize writes, optimize compactions, and more. Overall, they get a 20x speedup over the original approach.

https://www.rockset.com/blog/optimizing-bulk-load-in-rocksdb/
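One of the ideas behind parallelizing a bulk load is that sorted input can be split into disjoint key ranges that are written independently. Here's a toy Python illustration of that partition-and-parallelize pattern (this is my sketch of the general technique, not Rockset's code; `write_range` stands in for building one SST file):

```python
from concurrent.futures import ThreadPoolExecutor

def partition_sorted(pairs, n):
    """Split a sorted list of (key, value) pairs into n contiguous,
    non-overlapping key ranges so each range can be written independently."""
    size = (len(pairs) + n - 1) // n
    return [pairs[i:i + size] for i in range(0, len(pairs), size)]

def write_range(part):
    # Stand-in for building one SST file from a sorted, disjoint key range.
    return {"min": part[0][0], "max": part[-1][0], "count": len(part)}

pairs = sorted((f"key{i:04d}", i) for i in range(1000))
with ThreadPoolExecutor(max_workers=4) as pool:
    files = list(pool.map(write_range, partition_sorted(pairs, 4)))
print(files)
```

Because the key ranges don't overlap, the resulting files can land directly in the bottom level of the LSM tree, which is what makes skipping compaction work during the load.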

The Telegraph Engineering blog writes about dbt (data build tool) for building data transformations. It describes dbt's major functionality, like its UI for viewing data sources and models, its framework for writing templated queries, and its support for data quality tests (e.g. guaranteeing that a column's values are unique or never null).

https://medium.com/the-telegraph-engineering/dbt-a-new-way-to-handle-data-transformation-at-the-telegraph-868ce3964eb4
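Those data quality tests are declared alongside models in a `schema.yml` file. A sketch of what that looks like (the model and column names here are made up for illustration):

```yaml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

Running `dbt test` then compiles each declaration into a SQL query against the warehouse and fails if any rows violate the constraint.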

This post provides an overview of both isolation levels and consistency levels, and it describes why in many cases you need guarantees for both. Many common terms actually describe a combination of an isolation level and a consistency level, so it's all a bit complicated. But these terms are definitely worth understanding if you're working with data systems that throw them around!

https://fauna.com/blog/demystifying-database-systems-part-4-isolation-levels-vs-consistency-levels
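To make the isolation side concrete, here's a toy Python sketch of the classic lost-update anomaly, which weak isolation levels permit and serializable isolation prevents (this is a simulation with dicts, not a real database):

```python
# Two "transactions" read the same balance, then both write back a value
# computed from their stale read. Under serializable isolation, one of
# them would be forced to re-run; here the second write silently wins.
balance = {"acct": 100}

def transfer(read_snapshot, amount):
    # Each transaction computes its write from the value it read.
    return read_snapshot["acct"] + amount

snap1 = dict(balance)  # txn 1 reads
snap2 = dict(balance)  # txn 2 reads, before txn 1 commits
balance["acct"] = transfer(snap1, 50)  # txn 1 commits: 150
balance["acct"] = transfer(snap2, 30)  # txn 2 commits: 130 (txn 1's update is lost)
print(balance["acct"])  # 130, not the 180 a serial execution would give
```

Consistency levels, by contrast, are about which writes a read is guaranteed to observe across replicas, which is why the post argues you need to reason about both.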


Events

Curated by Datadog ( http://www.datadog.com )

California

Apache Heron Hands-On (Sunnyvale) - Monday, August 26

https://www.meetup.com/Apache-Heron-Bay-Area/events/nglzdryzlbzb/

Apache Druid and YuniKorn: Universal Resource Scheduler for Both K8s and Yarn (San Francisco) - Wednesday, August 28

https://www.meetup.com/SF-Big-Analytics/events/263274680/

Arizona

Kafka Streams on Kubernetes with RocksDB & Ktables Plus Avro! (Scottsdale) - Tuesday, August 27

https://www.meetup.com/Kafka-Phoenix/events/263627979/

Virginia 

Kicking Your Database to the Curb (Reston) - Tuesday, August 27

https://www.meetup.com/Apache-Kafka-DC/events/263755964/

GERMANY

Apache Spark on Kubernetes (Frankfurt) - Tuesday, August 27

https://www.meetup.com/HSUG-Rhein-Main/events/263799697/

SWITZERLAND

Apache Kafka Meetup at Swiss Re (Zurich) - Monday, August 26

https://www.meetup.com/Zurich-Apache-Kafka-Meetup-by-Confluent/events/262386007/

FINLAND

Helsinki Apache Kafka Meetup (Helsinki) - Tuesday, August 27

https://www.meetup.com/Helsinki-Apache-Kafka-Meetup/events/263025817/

AUSTRALIA

Data Engineering Melbourne Meetup (Melbourne) - Thursday, August 29

https://www.meetup.com/Data-Engineering-Melbourne/events/258551305/


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.